Reward Models#
These models output a scalar reward score or a classification result instead of text; they are typically used in reinforcement learning from human feedback (RLHF) pipelines and content moderation tasks.
Important

They are executed with `--is-embedding`, and some may require `--trust-remote-code`.
Example Launch Command#
```bash
# --model-path accepts a HuggingFace identifier or a local checkpoint path.
# --tp-size sets the tensor-parallel degree (here, 4 GPUs).
python3 -m sglang.launch_server \
  --model-path Qwen/Qwen2.5-Math-RM-72B \
  --is-embedding \
  --host 0.0.0.0 \
  --tp-size 4 \
  --port 30000
```
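Once the server is up, reward scores can be requested over HTTP. The following is a minimal client sketch, assuming the server launched above is reachable on localhost:30000 and that the `requests` and `transformers` packages are installed; it wraps each prompt/response pair in the model's chat template and posts the result to the server's `/classify` endpoint, which returns the scalar score in the `embedding` field of each item.

```python
import requests
from transformers import AutoTokenizer

# The reward model scores full conversations, so each candidate answer is
# wrapped in the model's chat template before being sent to the server.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Math-RM-72B")

prompt = "What is 3 * 17?"  # hypothetical example inputs
candidates = ["3 * 17 = 51.", "3 * 17 = 41."]

convs = [
    tokenizer.apply_chat_template(
        [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": answer},
        ],
        tokenize=False,
    )
    for answer in candidates
]

# Query the /classify endpoint of the server launched above.
response = requests.post(
    "http://localhost:30000/classify",
    json={"model": "Qwen/Qwen2.5-Math-RM-72B", "text": convs},
)
response.raise_for_status()

# Each result carries its scalar reward in the "embedding" field.
scores = [item["embedding"][0] for item in response.json()]
print(scores)  # one score per candidate; higher means more preferred
```

Applying the chat template on the client side matters: reward models are typically trained on chat-formatted conversations, so scores computed over raw, unformatted text may not be meaningful.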
Support Matrix#
| Model Family (Reward) | Example HuggingFace Identifier | Description |
|---|---|---|
| Llama (3.1 Reward / Skywork) | Skywork/Skywork-Reward-Llama-3.1-8B-v0.2 | Reward model (preference classifier) based on Llama 3.1 (8B) for scoring and ranking responses in RLHF. |
| Gemma 2 (27B Reward / Skywork) | Skywork/Skywork-Reward-Gemma-2-27B-v0.2 | Derived from Gemma 2 (27B), this model provides human preference scoring for RLHF and multilingual tasks. |
| InternLM 2 (Reward / internlm2-7b-reward) | internlm/internlm2-7b-reward | InternLM 2 (7B)-based reward model used in alignment pipelines to guide outputs toward preferred behavior. |
| Qwen2.5 (Reward - Math / Qwen2.5-Math-RM-72B) | Qwen/Qwen2.5-Math-RM-72B | A 72B math-specialized RLHF reward model from the Qwen2.5 series, tuned for evaluating and refining responses. |
| Qwen2.5 (Reward - Sequence classification / Qwen2.5-1.5B-apeach) | jason9693/Qwen2.5-1.5B-apeach | A smaller Qwen2.5 variant used for sequence classification, offering an alternative RLHF scoring mechanism. |