SGLang Frontend Language#
The SGLang frontend language can be used to define prompts in a convenient, structured way.
Launch A Server#
Launch the server in your terminal and wait for it to initialize.
[1]:
from sglang import assistant_begin, assistant_end
from sglang import assistant, function, gen, system, user
from sglang import image
from sglang import RuntimeEndpoint
from sglang.lang.api import set_default_backend
from sglang.srt.utils import load_image
from sglang.test.doc_patch import launch_server_cmd
from sglang.utils import print_highlight, terminate_process, wait_for_server
server_process, port = launch_server_cmd(
    "python -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct --host 0.0.0.0 --log-level warning"
)
wait_for_server(f"http://localhost:{port}")
print(f"Server started on http://localhost:{port}")
/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
`torch_dtype` is deprecated! Use `dtype` instead!
WARNING:sglang.srt.server_args:
########################################################################
# For contributors and developers: #
# Please move environment variable definitions to sglang.srt.environ #
# using the following pattern: #
# SGLANG_XXX = EnvBool(False) #
# #
########################################################################
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-09-30 08:09:04] MOE_RUNNER_BACKEND is not initialized, using triton backend
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00, 1.26it/s]
Capturing batches (bs=1 avail_mem=62.72 GB): 100%|██████████| 3/3 [00:00<00:00, 9.92it/s]
NOTE: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are displayed in the original black color, while the notebook outputs are highlighted in blue.
To reduce the log length, we set the server's log level to warning; the default log level is info.
These notebooks run in a CI environment, so the throughput is not representative of actual performance.
Server started on http://localhost:30366
Set the default backend. Note: besides the local server, you may also use OpenAI or other API endpoints.
[2]:
set_default_backend(RuntimeEndpoint(f"http://localhost:{port}"))
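For reference, a minimal sketch of routing the same programs to an OpenAI endpoint instead. This assumes the OpenAI backend is available in your sglang installation and that OPENAI_API_KEY is set; the model name is illustrative only.

from sglang import OpenAI

# Illustrative alternative backend; requires the openai extra and an API key.
set_default_backend(OpenAI("gpt-4o-mini"))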
Basic Usage#
The simplest way to use the SGLang frontend language is a question-and-answer dialog between a user and an assistant.
[3]:
@function
def basic_qa(s, question):
    s += system("You are a helpful assistant that can answer questions.")
    s += user(question)
    s += assistant(gen("answer", max_tokens=512))
[4]:
state = basic_qa("List 3 countries and their capitals.")
print_highlight(state["answer"])
1. France - Paris
2. Japan - Tokyo
3. Brazil - Brasília
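Calling the decorated function directly is shorthand for its .run() method, which also accepts sampling parameters (the same method is used with stream=True in the Streaming section below):

# Equivalent invocation via .run(), passing an explicit sampling parameter.
state = basic_qa.run(question="List 3 countries and their capitals.", temperature=0.3)
print_highlight(state["answer"])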
Multi-turn Dialog#
SGLang frontend language can also be used to define multi-turn dialogs.
[5]:
@function
def multi_turn_qa(s):
    s += system("You are a helpful assistant that can answer questions.")
    s += user("Please give me a list of 3 countries and their capitals.")
    s += assistant(gen("first_answer", max_tokens=512))
    s += user("Please give me another list of 3 countries and their capitals.")
    s += assistant(gen("second_answer", max_tokens=512))
    return s
state = multi_turn_qa()
print_highlight(state["first_answer"])
print_highlight(state["second_answer"])
1. **France - Paris**
2. **Japan - Tokyo**
3. **Australia - Canberra**
1. **India - New Delhi**
2. **Germany - Berlin**
3. **Canada - Ottawa**
Control Flow#
You may use any Python code within the function to define more complex control flows.
[6]:
@function
def tool_use(s, question):
    s += assistant(
        "To answer this question: "
        + question
        + ". I need to use a "
        + gen("tool", choices=["calculator", "search engine"])
        + ". "
    )
    if s["tool"] == "calculator":
        s += assistant("The math expression is: " + gen("expression"))
    elif s["tool"] == "search engine":
        s += assistant("The key word to search is: " + gen("word"))


state = tool_use("What is 2 * 2?")
print_highlight(state["tool"])
print_highlight(state["expression"])
Let's calculate it:
2 * 2 = 4
No calculator was necessary for this computation, as it's a basic multiplication problem that most people can solve mentally.
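A question that needs fresh information should steer generation down the other branch. A quick sketch exercising it (the question is illustrative; which branch the model picks is up to the constrained gen call):

# The choices constraint should select "search engine" here,
# so only the "word" variable is generated.
state = tool_use("Who won the Nobel Prize in Physics in 2024?")
print_highlight(state["tool"])
if state["tool"] == "search engine":
    print_highlight(state["word"])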
Parallelism#
Use fork to launch parallel prompts. Because sgl.gen is non-blocking, the for loop below issues two generation calls in parallel.
[7]:
@function
def tip_suggestion(s):
    s += assistant(
        "Here are two tips for staying healthy: "
        "1. Balanced Diet. 2. Regular Exercise.\n\n"
    )
    forks = s.fork(2)
    for i, f in enumerate(forks):
        f += assistant(
            f"Now, expand tip {i+1} into a paragraph:\n"
            + gen("detailed_tip", max_tokens=256, stop="\n\n")
        )
    s += assistant("Tip 1:" + forks[0]["detailed_tip"] + "\n")
    s += assistant("Tip 2:" + forks[1]["detailed_tip"] + "\n")
    s += assistant(
        "To summarize the above two tips, I can say:\n" + gen("summary", max_tokens=512)
    )


state = tip_suggestion()
print_highlight(state["summary"])
- A balanced diet includes a variety of nutrient-rich foods such as fruits, vegetables, whole grains, lean proteins, and healthy fats.
- It helps improve energy levels, boost the immune system, and reduce the risk of chronic diseases.
2. **Regular Exercise**:
- Regular physical activity improves cardiovascular health, boosts immune function, and helps maintain a healthy weight.
- It releases endorphins that can reduce stress, combat depression and anxiety, and enhance overall quality of life.
- Aim for at least 150 minutes of moderate aerobic activity or 75 minutes of vigorous activity each week, along with muscle-strengthening exercises on two or more days a week.
By following these tips, you can significantly enhance your physical and mental health.
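The fork count need not be hard-coded. The same pattern generalizes to any number of branches, as in this sketch (the function and tip list are illustrative):

@function
def n_tip_suggestion(s, tips):
    s += assistant("Here are tips for staying healthy: " + ", ".join(tips) + "\n\n")
    forks = s.fork(len(tips))  # one parallel branch per tip
    for i, f in enumerate(forks):
        f += assistant(
            f"Now, expand tip {i+1} ({tips[i]}) into a paragraph:\n"
            + gen("detailed_tip", max_tokens=256, stop="\n\n")
        )
    # Reading f["detailed_tip"] blocks until that branch finishes.
    for i, f in enumerate(forks):
        s += assistant(f"Tip {i+1}: " + f["detailed_tip"] + "\n")
    s += assistant("To summarize the above tips, I can say:\n" + gen("summary", max_tokens=512))


state = n_tip_suggestion(["Balanced Diet", "Regular Exercise", "Good Sleep"])
print_highlight(state["summary"])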
Constrained Decoding#
Use regex to specify a regular expression as a decoding constraint. This is only supported for local models.
[8]:
@function
def regular_expression_gen(s):
    s += user("What is the IP address of the Google DNS servers?")
    s += assistant(
        gen(
            "answer",
            temperature=0,
            regex=r"((25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)",
        )
    )


state = regular_expression_gen()
print_highlight(state["answer"])
Use regex to define a JSON decoding schema.
[9]:
character_regex = (
    r"""\{\n"""
    + r"""    "name": "[\w\d\s]{1,16}",\n"""
    + r"""    "house": "(Gryffindor|Slytherin|Ravenclaw|Hufflepuff)",\n"""
    + r"""    "blood status": "(Pure-blood|Half-blood|Muggle-born)",\n"""
    + r"""    "occupation": "(student|teacher|auror|ministry of magic|death eater|order of the phoenix)",\n"""
    + r"""    "wand": \{\n"""
    + r"""        "wood": "[\w\d\s]{1,16}",\n"""
    + r"""        "core": "[\w\d\s]{1,16}",\n"""
    + r"""        "length": [0-9]{1,2}\.[0-9]{0,2}\n"""
    + r"""    \},\n"""
    + r"""    "alive": "(Alive|Deceased)",\n"""
    + r"""    "patronus": "[\w\d\s]{1,16}",\n"""
    + r"""    "bogart": "[\w\d\s]{1,16}"\n"""
    + r"""\}"""
)


@function
def character_gen(s, name):
    s += user(
        f"{name} is a character in Harry Potter. Please fill in the following information about this character."
    )
    s += assistant(gen("json_output", max_tokens=256, regex=character_regex))


state = character_gen("Harry Potter")
print_highlight(state["json_output"])
"name": "Harry Potter",
"house": "Gryffindor",
"blood status": "Half-blood",
"occupation": "student",
"wand": {
"wood": "Alder",
"core": "Phoenix Feather",
"length": 10.5
},
"alive": "Alive",
"patronus": "Stag",
"bogart": "Death"
}
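Because the regex constrains the output to this JSON shape, the result can be parsed directly:

import json

# The constrained generation is guaranteed to be valid JSON.
character = json.loads(state["json_output"])
print_highlight(character["house"])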
Batching#
Use run_batch to run a batch of prompts.
[10]:
@function
def text_qa(s, question):
    s += user(question)
    s += assistant(gen("answer", stop="\n"))


states = text_qa.run_batch(
    [
        {"question": "What is the capital of the United Kingdom?"},
        {"question": "What is the capital of France?"},
        {"question": "What is the capital of Japan?"},
    ],
    progress_bar=True,
)

for i, state in enumerate(states):
    print_highlight(f"Answer {i+1}: {state['answer']}")
100%|██████████| 3/3 [00:00<00:00, 28.48it/s]
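The batch is a plain list of keyword-argument dicts, so it can also be built programmatically:

# Build the batch with a comprehension instead of writing each dict by hand.
countries = ["the United Kingdom", "France", "Japan"]
states = text_qa.run_batch(
    [{"question": f"What is the capital of {c}?"} for c in countries],
    progress_bar=True,
)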
Streaming#
Use stream to stream the output to the user.
[11]:
@function
def text_qa(s, question):
    s += user(question)
    s += assistant(gen("answer", stop="\n"))


state = text_qa.run(
    question="What is the capital of France?", temperature=0.1, stream=True
)

for out in state.text_iter():
    print(out, end="", flush=True)
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is the capital of France?<|im_end|>
<|im_start|>assistant
The capital of France is Paris.<|im_end|>
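text_iter streams the full templated conversation, as shown above. Some sglang versions also let it take a variable name so that only the tokens of a single gen call are streamed; a sketch under that assumption (check your version):

state = text_qa.run(
    question="What is the capital of Japan?", temperature=0.1, stream=True
)

# Stream only the "answer" variable rather than the whole prompt.
for out in state.text_iter("answer"):
    print(out, end="", flush=True)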
Complex Prompts#
You may use {system|user|assistant}_{begin|end} to define complex prompts.
[12]:
@function
def chat_example(s):
    s += system("You are a helpful assistant.")
    # Same as: s += s.system("You are a helpful assistant.")

    with s.user():
        s += "Question: What is the capital of France?"

    s += assistant_begin()
    s += "Answer: " + gen("answer", max_tokens=100, stop="\n")
    s += assistant_end()


state = chat_example()
print_highlight(state["answer"])
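The user role has the same begin/end helpers as the assistant role. A sketch equivalent to the example above, assuming user_begin/user_end are importable like the assistant variants at the top of this page:

from sglang import user_begin, user_end


@function
def chat_example_2(s):
    s += system("You are a helpful assistant.")
    s += user_begin()
    s += "Question: What is the capital of France?"
    s += user_end()
    s += assistant("Answer: " + gen("answer", max_tokens=100, stop="\n"))


state = chat_example_2()
print_highlight(state["answer"])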
[13]:
terminate_process(server_process)
Multi-modal Generation#
You may use SGLang frontend language to define multi-modal prompts. See here for supported models.
[14]:
server_process, port = launch_server_cmd(
    "python -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-7B-Instruct --host 0.0.0.0 --log-level warning"
)
wait_for_server(f"http://localhost:{port}")
print(f"Server started on http://localhost:{port}")
/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
`torch_dtype` is deprecated! Use `dtype` instead!
WARNING:sglang.srt.server_args:
########################################################################
# For contributors and developers: #
# Please move environment variable definitions to sglang.srt.environ #
# using the following pattern: #
# SGLANG_XXX = EnvBool(False) #
# #
########################################################################
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-09-30 08:09:42] MOE_RUNNER_BACKEND is not initialized, using triton backend
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:03<00:00, 1.31it/s]
Capturing batches (bs=1 avail_mem=60.80 GB): 100%|██████████| 3/3 [00:06<00:00, 2.08s/it]
NOTE: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are displayed in the original black color, while the notebook outputs are highlighted in blue.
To reduce the log length, we set the server's log level to warning; the default log level is info.
These notebooks run in a CI environment, so the throughput is not representative of actual performance.
Server started on http://localhost:35991
[15]:
set_default_backend(RuntimeEndpoint(f"http://localhost:{port}"))
Ask a question about an image.
[16]:
@function
def image_qa(s, image_file, question):
    s += user(image(image_file) + question)
    s += assistant(gen("answer", max_tokens=256))


image_url = "https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true"
image_bytes, _ = load_image(image_url)
state = image_qa(image_bytes, "What is in the image?")
print_highlight(state["answer"])
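image() also accepts a local file path instead of raw bytes. A sketch under that assumption (the path is hypothetical):

# Hypothetical local path; image() can take a file path as well as bytes.
state = image_qa("./example_image.png", "Describe the image in one sentence.")
print_highlight(state["answer"])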
[17]:
terminate_process(server_process)