# Bench Serving Guide

This guide explains how to benchmark online serving throughput and latency using `python -m sglang.bench_serving`. It supports multiple inference backends via OpenAI-compatible and native endpoints, and produces both console metrics and optional JSONL outputs.
## What it does

- Generates synthetic or dataset-driven prompts and submits them to a target serving endpoint
- Measures throughput, time-to-first-token (TTFT), inter-token latency (ITL), per-request end-to-end latency, and more
- Supports streaming or non-streaming modes, rate control, and concurrency limits
## Supported backends and endpoints

- `sglang` / `sglang-native`: `POST /generate`
- `sglang-oai`, `vllm`, `lmdeploy`: `POST /v1/completions`
- `sglang-oai-chat`, `vllm-chat`, `lmdeploy-chat`: `POST /v1/chat/completions`
- `trt` (TensorRT-LLM): `POST /v2/models/ensemble/generate_stream`
- `gserver`: custom server (not yet implemented in this script)
- `truss`: `POST /v1/models/model:predict`
If `--base-url` is provided, requests are sent to it; otherwise `--host` and `--port` are used. When `--model` is not provided, the script attempts to query `GET /v1/models` for an available model ID (OpenAI-compatible endpoints only).
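Before launching a long benchmark, it can help to sanity-check that the target endpoint responds at all. Below is a minimal sketch using `requests` against the sglang native `/generate` route, assuming a server at `127.0.0.1:30000` (the payload shape shown is for sglang; adapt it for other backends):

```python
import requests

# Quick smoke test of the sglang native /generate endpoint before benchmarking.
resp = requests.post(
    "http://127.0.0.1:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {"max_new_tokens": 16, "temperature": 0},
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```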
## Prerequisites

- Python 3.8+
- Dependencies typically used by this script: `aiohttp`, `numpy`, `requests`, `tqdm`, `transformers`, and for some datasets `datasets`, `pillow`, `pybase64`. Install as needed.
- An inference server running and reachable via one of the endpoints above
- If your server requires authentication, set the environment variable `OPENAI_API_KEY` (used as `Authorization: Bearer <key>`)
## Quick start

Run a basic benchmark against an sglang server exposing `/generate`:

```bash
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct

python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 --port 30000 \
  --num-prompts 1000 \
  --model meta-llama/Llama-3.1-8B-Instruct
```

Or, using an OpenAI-compatible endpoint (Completions):

```bash
python3 -m sglang.bench_serving \
  --backend vllm \
  --base-url http://127.0.0.1:8000 \
  --num-prompts 1000 \
  --model meta-llama/Llama-3.1-8B-Instruct
```
## Datasets

Select with `--dataset-name`:

- `sharegpt` (default): loads ShareGPT-style pairs; optionally restrict with `--sharegpt-context-len` and override output lengths with `--sharegpt-output-len`
- `random`: random text lengths, sampled from the ShareGPT token space
- `random-ids`: random token IDs (can lead to gibberish)
- `random-image`: generates random images and wraps them in chat messages; supports custom resolutions via a 'heightxwidth' format
- `generated-shared-prefix`: synthetic dataset with shared long system prompts and short questions
- `mmmu`: samples from MMMU (Math split) and includes images
Common dataset flags:

- `--num-prompts N`: number of requests
- `--random-input-len`, `--random-output-len`, `--random-range-ratio`: for random/random-ids/random-image
- `--random-image-num-images`, `--random-image-resolution`: for the random-image dataset (supports presets 1080p/720p/360p or a custom 'heightxwidth' format)
- `--apply-chat-template`: apply the tokenizer chat template when constructing prompts
- `--dataset-path PATH`: file path to the ShareGPT JSON; if left blank or the file is missing, it will be downloaded and cached
Generated Shared Prefix flags (for `generated-shared-prefix`):

- `--gsp-num-groups`
- `--gsp-prompts-per-group`
- `--gsp-system-prompt-len`
- `--gsp-question-len`
- `--gsp-output-len`

Random Image dataset flags (for `random-image`):

- `--random-image-num-images`: number of images per request
- `--random-image-resolution`: image resolution; supports presets (1080p, 720p, 360p) or a custom 'heightxwidth' format (e.g., 1080x1920, 512x768)
## Examples

To benchmark the random-image dataset with 3 images per request, 500 prompts, 512 input length, and 512 output length:

```bash
python -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-3B-Instruct --disable-radix-cache

python -m sglang.bench_serving \
  --backend sglang-oai-chat \
  --dataset-name random-image \
  --num-prompts 500 \
  --random-image-num-images 3 \
  --random-image-resolution 720p \
  --random-input-len 512 \
  --random-output-len 512
```
To benchmark the random dataset with 3000 prompts, 1024 input length, and 1024 output length:

```bash
python -m sglang.launch_server --model-path Qwen/Qwen2.5-3B-Instruct

python3 -m sglang.bench_serving \
  --backend sglang \
  --dataset-name random \
  --num-prompts 3000 \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --random-range-ratio 0.5
```
## Choosing model and tokenizer

- `--model` is required unless the backend exposes `GET /v1/models`, in which case the first model ID is auto-selected (a quick way to check this is shown below).
- `--tokenizer` defaults to `--model`. Both can be Hugging Face model IDs or local paths.
- For ModelScope workflows, setting `SGLANG_USE_MODELSCOPE=true` enables fetching via ModelScope (weights are skipped for speed).
- If your tokenizer lacks a chat template, the script warns you, because token counting can be less robust for gibberish outputs.
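To see which model ID an OpenAI-compatible server exposes (and therefore what would be auto-selected when `--model` is omitted), you can query the models route directly. A minimal sketch, assuming a server at `127.0.0.1:8000`:

```python
import requests

# List the model IDs an OpenAI-compatible server reports; the first entry is
# what bench_serving would pick when --model is omitted.
resp = requests.get("http://127.0.0.1:8000/v1/models", timeout=10)
resp.raise_for_status()
for item in resp.json().get("data", []):
    print(item["id"])
```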
## Rate, concurrency, and streaming

- `--request-rate`: requests per second. `inf` sends all requests immediately (burst). A non-infinite rate draws arrival times from a Poisson process (see the sketch below).
- `--max-concurrency`: caps concurrent in-flight requests regardless of arrival rate.
- `--disable-stream`: switches to non-streaming mode when supported; TTFT then equals total latency for chat completions.
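For intuition, here is a minimal sketch of how Poisson arrivals at a given request rate are commonly generated: the gaps between requests are drawn from an exponential distribution with mean `1 / request_rate`. This illustrates the idea rather than reproducing the script's exact implementation:

```python
import numpy as np

def poisson_arrival_gaps(num_requests: int, request_rate: float) -> list:
    """Return per-request sleep intervals (seconds) for a Poisson arrival process."""
    if request_rate == float("inf"):
        # Burst mode: send every request immediately.
        return [0.0] * num_requests
    # Exponential inter-arrival times with mean 1/rate yield a Poisson process.
    return np.random.exponential(1.0 / request_rate, size=num_requests).tolist()

print(poisson_arrival_gaps(5, request_rate=10.0))
```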
## Other key options

- `--output-file FILE.jsonl`: append JSONL results to a file; auto-named if unspecified
- `--output-details`: include per-request arrays (generated texts, errors, TTFTs, ITLs, input/output lengths)
- `--extra-request-body '{"top_p":0.9,"temperature":0.6}'`: merged into the request payload (sampling params, etc.); see the sketch after this list
- `--disable-ignore-eos`: pass through EOS behavior (varies by backend)
- `--warmup-requests N`: run warmup requests with short outputs first (default 1)
- `--flush-cache`: call `/flush_cache` (sglang) before the main run
- `--profile`: call `/start_profile` and `/stop_profile` (requires the server to enable profiling, e.g., `SGLANG_TORCH_PROFILER_DIR`)
- `--lora-name name1 name2 ...`: randomly pick one name per request and pass it to the backend (e.g., `lora_path` for sglang)
- `--tokenize-prompt`: send integer token IDs instead of text (currently supports `--backend sglang` only)
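When scripting benchmarks, one way to build the `--extra-request-body` value without shell-quoting mistakes is to serialize a dict with `json.dumps` and launch the benchmark via `subprocess`. A sketch with illustrative values (flags as documented above):

```python
import json
import subprocess

# Sampling parameters to merge into every request payload (illustrative values).
extra_body = json.dumps({"top_p": 0.9, "temperature": 0.6})

subprocess.run(
    [
        "python3", "-m", "sglang.bench_serving",
        "--backend", "sglang",
        "--host", "127.0.0.1", "--port", "30000",
        "--model", "meta-llama/Llama-3.1-8B-Instruct",
        "--num-prompts", "100",
        "--extra-request-body", extra_body,
    ],
    check=True,
)
```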
## Authentication

If your target endpoint requires OpenAI-style auth, set:

```bash
export OPENAI_API_KEY=sk-...yourkey...
```

The script adds `Authorization: Bearer $OPENAI_API_KEY` automatically for OpenAI-compatible routes.
## Metrics explained

Printed after each run:

- Request throughput (req/s)
- Input token throughput (tok/s)
- Output token throughput (tok/s)
- Total token throughput (tok/s)
- Concurrency: aggregate time of all requests divided by wall-clock time
- End-to-End Latency (ms): mean/median/std/p99 of per-request total latency
- Time to First Token (TTFT, ms): mean/median/std/p99, for streaming mode
- Inter-Token Latency (ITL, ms): mean/median/std/p95/p99/max between consecutive tokens
- TPOT (ms): time per output token after the first, i.e., `(latency - TTFT) / (output_tokens - 1)` (see the worked example below)
- Accept length (sglang only, if available): speculative decoding accept length

The script also retokenizes generated text with the configured tokenizer and reports "retokenized" counts.
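As a worked example of the TPOT formula with made-up numbers: a request with 2,000 ms end-to-end latency, 200 ms TTFT, and 101 output tokens yields (2000 - 200) / (101 - 1) = 18 ms per output token.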
## JSONL output format

When `--output-file` is set, one JSON object is appended per run. Base fields:

- Arguments summary: backend, dataset, request_rate, max_concurrency, etc.
- Duration and totals: completed, total_input_tokens, total_output_tokens, retokenized totals
- Throughputs and latency statistics, as printed in the console
- `accept_length` when available (sglang)

With `--output-details`, an extended object also includes per-request arrays:

- `input_lens`, `output_lens`
- `ttfts`, `itls` (one ITL array per request)
- `generated_texts`, `errors`
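A small sketch for post-processing the JSONL output. Exact key names depend on the script version, so the throughput field below (`request_throughput`) is an assumption; inspect your file and adjust:

```python
import json

# Print a few summary fields from each benchmark run recorded in the JSONL file.
# Key names are illustrative; confirm them against your actual output.
with open("sglang_random.jsonl") as f:
    for line in f:
        run = json.loads(line)
        print(
            run.get("backend"),
            run.get("completed"),
            run.get("total_input_tokens"),
            run.get("total_output_tokens"),
            run.get("request_throughput"),
        )
```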
## End-to-end examples

1) sglang native `/generate` (streaming):

```bash
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 --port 30000 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dataset-name random \
  --random-input-len 1024 --random-output-len 1024 --random-range-ratio 0.5 \
  --num-prompts 2000 \
  --request-rate 100 \
  --max-concurrency 512 \
  --output-file sglang_random.jsonl --output-details
```
2) OpenAI-compatible Completions (e.g., vLLM):

```bash
python3 -m sglang.bench_serving \
  --backend vllm \
  --base-url http://127.0.0.1:8000 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dataset-name sharegpt \
  --num-prompts 1000 \
  --sharegpt-output-len 256
```
3) OpenAI-compatible Chat Completions (streaming):

```bash
python3 -m sglang.bench_serving \
  --backend vllm-chat \
  --base-url http://127.0.0.1:8000 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dataset-name random \
  --num-prompts 500 \
  --apply-chat-template
```
4) Random images (VLM) with chat template:

```bash
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 --port 30000 \
  --model your-vlm-model \
  --dataset-name random-image \
  --random-image-num-images 2 \
  --random-image-resolution 720p \
  --random-input-len 128 --random-output-len 256 \
  --num-prompts 200 \
  --apply-chat-template
```
4a) Random images with custom resolution:

```bash
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 --port 30000 \
  --model your-vlm-model \
  --dataset-name random-image \
  --random-image-num-images 1 \
  --random-image-resolution 512x768 \
  --random-input-len 64 --random-output-len 128 \
  --num-prompts 100 \
  --apply-chat-template
```
5) Generated shared prefix (long system prompts + short questions):

```bash
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 --port 30000 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dataset-name generated-shared-prefix \
  --gsp-num-groups 64 --gsp-prompts-per-group 16 \
  --gsp-system-prompt-len 2048 --gsp-question-len 128 --gsp-output-len 256 \
  --num-prompts 1024
```
6) Tokenized prompts (token IDs) for strict length control (sglang only):

```bash
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 --port 30000 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dataset-name random \
  --tokenize-prompt \
  --random-input-len 2048 --random-output-len 256 --random-range-ratio 0.2
```
7) Profiling and cache flush (sglang):

```bash
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 --port 30000 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --profile \
  --flush-cache
```
8) TensorRT-LLM streaming endpoint:

```bash
python3 -m sglang.bench_serving \
  --backend trt \
  --base-url http://127.0.0.1:8000 \
  --model your-trt-llm-model \
  --dataset-name random \
  --num-prompts 100 \
  --disable-ignore-eos
```
## Troubleshooting

- All requests failed: verify `--backend`, the server URL/port, `--model`, and authentication. Check the warmup errors printed by the script.
- Throughput seems too low: adjust `--request-rate` and `--max-concurrency`; verify server batch size/scheduling; ensure streaming is enabled if appropriate.
- Token counts look odd: prefer chat/instruct models with proper chat templates; otherwise, tokenization of gibberish outputs can be inconsistent.
- Random-image/MMMU datasets: ensure you installed the extra dependencies (`pillow`, `datasets`, `pybase64`).
- Authentication errors (401/403): set `OPENAI_API_KEY` or disable auth on your server.
## Notes

- The script raises the file descriptor soft limit (`RLIMIT_NOFILE`) to help with many concurrent connections (see the sketch below).
- For sglang, `/get_server_info` is queried post-run to report the speculative decoding accept length when available.
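For reference, a minimal sketch of the kind of file-descriptor adjustment described above (not the script's exact code), using the standard `resource` module:

```python
import resource

# Raise the soft limit on open file descriptors to the hard limit so that many
# concurrent HTTP connections do not fail with "Too many open files".
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
print(f"RLIMIT_NOFILE soft limit raised from {soft} to {hard}")
```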