Reasoning Parser#
SGLang can separate reasoning content from “normal” content for reasoning models such as DeepSeek R1.
Supported Models & Parsers#
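Model | Reasoning tags | Parser | Notes |
---|---|---|---|
DeepSeek-R1 family | `<think>` … `</think>` | `deepseek-r1` | Supports all variants (R1, R1-0528, R1-Distill) |
DeepSeek-V3 family | `<think>` … `</think>` | `deepseek-v3` | Supports `thinking` parameter |
Standard Qwen3 | `<think>` … `</think>` | `qwen3` | Supports `enable_thinking` parameter |
Qwen3-Thinking | `<think>` … `</think>` | `qwen3` or `qwen3-thinking` | Always generates thinking content |
Kimi | `◁think▷` … `◁/think▷` | `kimi` | Uses special thinking delimiters |
GPT OSS | `<\|channel\|>analysis<\|message\|>` … `<\|end\|>` | `gpt-oss` | N/A |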
Model-Specific Behaviors#
DeepSeek-R1 Family:
- DeepSeek-R1: No `<think>` start tag; jumps directly to thinking content.
- DeepSeek-R1-0528: Generates both `<think>` start and `</think>` end tags.
- Both are handled by the same `deepseek-r1` parser.
DeepSeek-V3 Family:
- DeepSeek-V3.1: Hybrid model supporting both thinking and non-thinking modes. Use the `deepseek-v3` parser and the `thinking` parameter (note: not `enable_thinking`).
Qwen3 Family:
- Standard Qwen3 (e.g., Qwen3-2507): Use the `qwen3` parser; supports `enable_thinking` in chat templates.
- Qwen3-Thinking (e.g., Qwen3-235B-A22B-Thinking-2507): Use the `qwen3` or `qwen3-thinking` parser; always thinks.
Kimi:
- Kimi: Uses special `◁think▷` and `◁/think▷` tags.
GPT OSS:
- GPT OSS: Uses special `<|channel|>analysis<|message|>` and `<|end|>` tags.
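To illustrate why one parser can cover both DeepSeek-R1 (no start tag) and DeepSeek-R1-0528 (both tags), here is a minimal sketch of tag-based splitting. It is a toy function, not SGLang's implementation: the name `split_reasoning` is hypothetical, and treating output without an end tag as pure reasoning is an assumption made for this example.

```python
from typing import Tuple


def split_reasoning(
    text: str, start_tag: str = "<think>", end_tag: str = "</think>"
) -> Tuple[str, str]:
    """Toy splitter returning (reasoning, answer); not SGLang's parser."""
    body = text.lstrip()
    # The start tag is optional: DeepSeek-R1 omits it, R1-0528 emits it.
    if body.startswith(start_tag):
        body = body[len(start_tag):]
    if end_tag not in body:
        # Assumption: without an end tag, everything so far is reasoning.
        return body.strip(), ""
    reasoning, answer = body.split(end_tag, 1)
    return reasoning.strip(), answer.strip()


# DeepSeek-R1 style: no start tag.
r1 = split_reasoning("I add 1 and 3.</think>The answer is 4.")
# DeepSeek-R1-0528 style: both tags.
r1_0528 = split_reasoning("<think>I add 1 and 3.</think>The answer is 4.")
# Kimi style: different delimiters, same logic.
kimi = split_reasoning(
    "◁think▷I add 1 and 3.◁/think▷The answer is 4.", "◁think▷", "◁/think▷"
)
print(r1, r1_0528, kimi, sep="\n")
```

All three calls yield the same `(reasoning, answer)` pair, which is the behavior the list above describes: only the delimiters differ between model families.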
Usage#
Launching the Server#
Specify the `--reasoning-parser` option.
[1]:
import requests
from openai import OpenAI
from sglang.test.doc_patch import launch_server_cmd
from sglang.utils import wait_for_server, print_highlight, terminate_process
server_process, port = launch_server_cmd(
    "python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-7B --host 0.0.0.0 --reasoning-parser deepseek-r1 --log-level warning"
)
wait_for_server(f"http://localhost:{port}")
NOTE: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are displayed in the original black color, while the notebook outputs are highlighted in blue.
To reduce the log length, we set the server log level to warning; the default log level is info.
We are running these notebooks in a CI environment, so the throughput is not representative of actual performance.
Note that `--reasoning-parser` defines the parser used to interpret responses.
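Because the parser is chosen at launch time, it can help to build the launch command from explicit arguments. The following is a small sketch only: the helper name `build_launch_cmd` is hypothetical, and it uses just the flags that appear in the command above.

```python
import shlex


def build_launch_cmd(
    model_path: str, reasoning_parser: str, log_level: str = "warning"
) -> str:
    """Assemble the server launch command as a single shell-safe string."""
    args = [
        "python3", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--host", "0.0.0.0",
        "--reasoning-parser", reasoning_parser,
        "--log-level", log_level,
    ]
    # shlex.join quotes any argument that needs it (Python 3.8+).
    return shlex.join(args)


cmd = build_launch_cmd("deepseek-ai/DeepSeek-R1-Distill-Qwen-7B", "deepseek-r1")
print(cmd)
```

The resulting string can be passed to `launch_server_cmd`; switching model families then means changing the model path and the parser name together.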
OpenAI Compatible API#
Using the OpenAI compatible API, the contract follows the DeepSeek API design established with the release of DeepSeek-R1:
- `reasoning_content`: The content of the CoT.
- `content`: The content of the final answer.
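As a rough picture of this contract, the sketch below reads the two fields from an abbreviated response body. The JSON payload is hypothetical and invented for illustration; real responses carry more fields.

```python
import json

# Hypothetical, abbreviated response body (values invented for illustration).
raw = """
{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "reasoning_content": "1 plus 3 is 4.",
        "content": "The answer is 4."
      }
    }
  ]
}
"""

message = json.loads(raw)["choices"][0]["message"]
# Either field may be missing or null, so fall back to an empty string.
reasoning = message.get("reasoning_content") or ""
answer = message.get("content") or ""
print("reasoning:", reasoning)
print("answer:", answer)
```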
[2]:
# Initialize OpenAI-like client
client = OpenAI(api_key="None", base_url=f"http://0.0.0.0:{port}/v1")
model_name = client.models.list().data[0].id
messages = [
    {
        "role": "user",
        "content": "What is 1+3?",
    }
]
Non-Streaming Request#
[3]:
response_non_stream = client.chat.completions.create(
    model=model_name,
    messages=messages,
    temperature=0.6,
    top_p=0.95,
    stream=False,  # Non-streaming
    extra_body={"separate_reasoning": True},
)
print_highlight("==== Reasoning ====")
print_highlight(response_non_stream.choices[0].message.reasoning_content)
print_highlight("==== Text ====")
print_highlight(response_non_stream.choices[0].message.content)
Next, I perform the addition of these two numbers to find their sum.
Finally, I conclude that the result of adding 1 and 3 is 4.
We need to find the sum of \(1\) and \(3\).
\[
1 + 3 = 4
\]
Therefore, the final answer is \(\boxed{4}\).
Streaming Request#
[4]:
response_stream = client.chat.completions.create(
    model=model_name,
    messages=messages,
    temperature=0.6,
    top_p=0.95,
    stream=True,  # Streaming
    extra_body={"separate_reasoning": True},
)

reasoning_content = ""
content = ""
# Accumulate reasoning and answer deltas separately
for chunk in response_stream:
    if chunk.choices[0].delta.content:
        content += chunk.choices[0].delta.content
    if chunk.choices[0].delta.reasoning_content:
        reasoning_content += chunk.choices[0].delta.reasoning_content
print_highlight("==== Reasoning ====")
print_highlight(reasoning_content)
print_highlight("==== Text ====")
print_highlight(content)
Next, I'll add the two numbers together.
Finally, I'll provide the result, which is 4.
**Solution:**
We are asked to find the sum of 1 and 3.
1. **Identify the numbers to add:**
\[
1 \quad \text{and} \quad 3
\]
2. **Perform the addition:**
\[
1 + 3 = 4
\]
3. **Present the final answer:**
\[
\boxed{4}
\]
**Answer:** \(\boxed{4}\)
Optionally, you can set `stream_reasoning` to `False` to buffer the reasoning content and emit it in the last reasoning chunk (or the first chunk after the reasoning content).
[5]:
response_stream = client.chat.completions.create(
    model=model_name,
    messages=messages,
    temperature=0.6,
    top_p=0.95,
    stream=True,  # Streaming, with buffered reasoning
    extra_body={"separate_reasoning": True, "stream_reasoning": False},
)

reasoning_content = ""
content = ""
# The same accumulation loop works whether reasoning arrives in one chunk or many
for chunk in response_stream:
    if chunk.choices[0].delta.content:
        content += chunk.choices[0].delta.content
    if chunk.choices[0].delta.reasoning_content:
        reasoning_content += chunk.choices[0].delta.reasoning_content
print_highlight("==== Reasoning ====")
print_highlight(reasoning_content)
print_highlight("==== Text ====")
print_highlight(content)
Next, I perform the addition of these two numbers.
Finally, I arrive at the sum of 4.
**Solution:**
We are asked to find the sum of 1 and 3.
1. **Identify the numbers to add:**
\[
1 \quad \text{and} \quad 3
\]
2. **Perform the addition:**
\[
1 + 3 = 4
\]
**Final Answer:**
\[
\boxed{4}
\]
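To make the difference between streamed and buffered reasoning concrete, the sketch below runs the same accumulation logic over two simulated delta sequences. The delta dictionaries are hypothetical stand-ins for `chunk.choices[0].delta`, invented for illustration; no server is involved.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Delta = Dict[str, Optional[str]]


def accumulate(deltas: Iterable[Delta]) -> Tuple[str, str, int]:
    """Return (reasoning, content, number of deltas that carried reasoning)."""
    reasoning, content, reasoning_chunks = "", "", 0
    for delta in deltas:
        if delta.get("content"):
            content += delta["content"]
        if delta.get("reasoning_content"):
            reasoning += delta["reasoning_content"]
            reasoning_chunks += 1
    return reasoning, content, reasoning_chunks


# Hypothetical sequence with reasoning streamed incrementally.
streamed: List[Delta] = [
    {"reasoning_content": "I add "},
    {"reasoning_content": "1 and 3."},
    {"content": "The answer "},
    {"content": "is 4."},
]
# Hypothetical sequence with reasoning buffered into a single delta.
buffered: List[Delta] = [
    {"reasoning_content": "I add 1 and 3."},
    {"content": "The answer "},
    {"content": "is 4."},
]

streamed_result = accumulate(streamed)
buffered_result = accumulate(buffered)
print(streamed_result)
print(buffered_result)
```

Both runs end with the same reasoning and answer text; only the number of reasoning-carrying deltas differs, which is why the client loop does not need to change when buffering is turned on.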
Reasoning separation is enabled by default when `--reasoning-parser` is specified. To disable it, set the `separate_reasoning` option to `False` in the request.
[6]:
response_non_stream = client.chat.completions.create(
    model=model_name,
    messages=messages,
    temperature=0.6,
    top_p=0.95,
    stream=False,  # Non-streaming
    extra_body={"separate_reasoning": False},
)
print_highlight("==== Original Output ====")
print_highlight(response_non_stream.choices[0].message.content)
Next, I'll add these two numbers together.
Finally, the sum of 1 and 3 is 4.
Sure! Let's solve the addition problem step by step.
**Question:** What is \(1 + 3\)?
**Solution:**
1. **Identify the numbers to add:**
\[
1 \quad \text{and} \quad 3
\]
2. **Add the numbers together:**
\[
1 + 3 = 4
\]
**Final Answer:**
\[
\boxed{4}
\]
SGLang Native API#
[7]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-7B")
# Render the chat messages into the model's prompt format
prompt_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

gen_url = f"http://localhost:{port}/generate"
gen_data = {
    "text": prompt_text,
    "sampling_params": {
        "skip_special_tokens": False,
        "max_new_tokens": 1024,
        "temperature": 0.6,
        "top_p": 0.95,
    },
}
gen_response = requests.post(gen_url, json=gen_data).json()["text"]

print_highlight("==== Original Output ====")
print_highlight(gen_response)

# Ask the server to split the raw output into reasoning and final text
parse_url = f"http://localhost:{port}/separate_reasoning"
separate_reasoning_data = {
    "text": gen_response,
    "reasoning_parser": "deepseek-r1",
}
separate_reasoning_response_json = requests.post(
    parse_url, json=separate_reasoning_data
).json()
print_highlight("==== Reasoning ====")
print_highlight(separate_reasoning_response_json["reasoning_text"])
print_highlight("==== Text ====")
print_highlight(separate_reasoning_response_json["text"])
Next, I perform the addition operation by combining these two numbers.
Finally, I calculate the sum, which is 4.
Sure! Let's solve the problem step by step.
**Question:** What is \(1 + 3\)?
**Solution:**
1. **Identify the numbers:**
- First number: \(1\)
- Second number: \(3\)
2. **Add the numbers:**
\[
1 + 3 = 4
\]
**Answer:** \(\boxed{4}\)
Next, I perform the addition operation by combining these two numbers.
Finally, I calculate the sum, which is 4.
**Question:** What is \(1 + 3\)?
**Solution:**
1. **Identify the numbers:**
- First number: \(1\)
- Second number: \(3\)
2. **Add the numbers:**
\[
1 + 3 = 4
\]
**Answer:** \(\boxed{4}\)
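The request and response shapes used with `/separate_reasoning` above can be summarized in two small helpers. This is a sketch only: the helper names are hypothetical, the port `30000` and the sample response are invented for illustration, and the field names are the ones used in the cell above.

```python
from typing import Dict, Tuple


def build_separate_reasoning_request(
    port: int, text: str, reasoning_parser: str
) -> Tuple[str, Dict[str, str]]:
    """Return the URL and JSON payload for a /separate_reasoning call."""
    url = f"http://localhost:{port}/separate_reasoning"
    payload = {"text": text, "reasoning_parser": reasoning_parser}
    return url, payload


def unpack_separate_reasoning(response_json: Dict[str, str]) -> Tuple[str, str]:
    """Return (reasoning_text, text) from the endpoint's JSON response."""
    return response_json["reasoning_text"], response_json["text"]


url, payload = build_separate_reasoning_request(
    30000, "I add 1 and 3.</think>The answer is 4.", "deepseek-r1"
)
# With a running server: response_json = requests.post(url, json=payload).json()
sample_response = {"reasoning_text": "I add 1 and 3.", "text": "The answer is 4."}
reasoning_text, final_text = unpack_separate_reasoning(sample_response)
print(url)
print(reasoning_text, "|", final_text)
```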
[8]:
terminate_process(server_process)
Offline Engine API#
[9]:
import sglang as sgl
from sglang.srt.parser.reasoning_parser import ReasoningParser
from sglang.utils import print_highlight
from transformers import AutoTokenizer

llm = sgl.Engine(model_path="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B")
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-7B")
# Render the chat messages into the model's prompt format
prompt_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
sampling_params = {
    "max_new_tokens": 1024,
    "skip_special_tokens": False,
    "temperature": 0.6,
    "top_p": 0.95,
}
result = llm.generate(prompt=prompt_text, sampling_params=sampling_params)
generated_text = result["text"]  # Assume there is only one prompt

print_highlight("==== Original Output ====")
print_highlight(generated_text)

# Split the raw output into reasoning and final text
parser = ReasoningParser("deepseek-r1")
reasoning_text, text = parser.parse_non_stream(generated_text)
print_highlight("==== Reasoning ====")
print_highlight(reasoning_text)
print_highlight("==== Text ====")
print_highlight(text)
Next, I perform the addition of these two numbers.
Finally, I calculate the sum to determine that 1 plus 3 equals 4.
**Solution:**
We need to calculate the sum of the numbers 1 and 3.
\[
1 + 3 = 4
\]
**Answer:** \(\boxed{4}\)
Next, I perform the addition of these two numbers.
Finally, I calculate the sum to determine that 1 plus 3 equals 4.
We need to calculate the sum of the numbers 1 and 3.
\[
1 + 3 = 4
\]
**Answer:** \(\boxed{4}\)
[10]:
llm.shutdown()
Supporting New Reasoning Model Schemas#
For future reasoning models, you can implement the reasoning parser as a subclass of `BaseReasoningFormatDetector` in `python/sglang/srt/parser/reasoning_parser.py` and specify the reasoning parser for the new model schema accordingly.
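The general pattern of subclassing a detector for a new tag schema can be sketched as follows. This example deliberately does not import SGLang: `TagDetector`, `MyModelDetector`, the `[reason]` tags, and the `DETECTORS` registry are all hypothetical stand-ins, and the real `BaseReasoningFormatDetector` interface should be checked in the source file before writing an actual parser.

```python
from dataclasses import dataclass
from typing import Dict, Tuple, Type


@dataclass
class TagDetector:
    """Simplified, hypothetical stand-in for a reasoning-format detector."""

    think_start_token: str
    think_end_token: str

    def parse_non_stream(self, text: str) -> Tuple[str, str]:
        """Return (reasoning_text, text) for a complete model output."""
        body = text.lstrip()
        if body.startswith(self.think_start_token):
            body = body[len(self.think_start_token):]
        if self.think_end_token not in body:
            # Assumption: no end tag means the output is all reasoning.
            return body.strip(), ""
        reasoning, answer = body.split(self.think_end_token, 1)
        return reasoning.strip(), answer.strip()


class MyModelDetector(TagDetector):
    """Detector for an imaginary model that wraps reasoning in [reason] tags."""

    def __init__(self) -> None:
        super().__init__("[reason]", "[/reason]")


# Hypothetical registry keyed by the name a user would pass as the parser.
DETECTORS: Dict[str, Type[TagDetector]] = {"my-model": MyModelDetector}

detector = DETECTORS["my-model"]()
parsed = detector.parse_non_stream("[reason]1 plus 3 is 4.[/reason]The answer is 4.")
print(parsed)
```

The design point is that a new schema usually only needs its own delimiters and a name to register under; the splitting logic lives in the base class.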