Deterministic Inference#
Why Deterministic Inference Matters#
Deterministic inference ensures consistent LLM outputs across runs, which is critical for:

- **Reinforcement Learning**: consistent logprobs across runs reduce stochastic noise, making RL training more stable, reproducible, and debuggable.
- **Testing & Debugging**: reproducible outputs enable reliable validation.
- **Production**: predictable behavior improves reliability and user experience.
Even with `temperature=0`, standard LLM inference can produce different outputs due to dynamic batching and varying reduction orders in GPU kernels.
The Root Cause of Non-Determinism#
The main source is varying batch sizes. Different batch sizes cause GPU kernels to split reduction operations differently, leading to different addition orders. Because floating-point addition is non-associative (`(a + b) + c ≠ a + (b + c)`), this produces different results even for identical inputs.
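The effect is easy to reproduce on the CPU. The sketch below first shows plain floating-point non-associativity, then mimics a split reduction: summing the same array with different chunk sizes (a stand-in for different batch splits) changes the order of partial-sum additions, so the results can disagree in the low bits. The `chunked_sum` helper is illustrative only, not SGLang code.

```python
import numpy as np

# Floating-point addition is non-associative: summing the same values in a
# different order can change the result.
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))  # False

# The same effect shows up in split reductions: different "batch" chunk
# sizes change the order in which partial sums are combined.
rng = np.random.default_rng(0)
x = rng.standard_normal(100_000).astype(np.float32)

def chunked_sum(arr, chunk_size):
    # Sum each chunk, then combine the partial sums (mimics a split reduction).
    partials = [arr[i:i + chunk_size].sum() for i in range(0, len(arr), chunk_size)]
    return np.sum(np.array(partials, dtype=np.float32))

# Identical inputs, different split strategies -> the low bits often differ.
print(chunked_sum(x, 128), chunked_sum(x, 999))
```

Batch-invariant operators fix the reduction split (independent of batch size), so the addition order, and hence the result, is the same on every run.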
SGLang’s Solution#
Building on Thinking Machines Lab’s batch-invariant operators, SGLang achieves fully deterministic inference while maintaining compatibility with chunked prefill, CUDA graphs, radix cache, and non-greedy sampling. The development roadmap for deterministic inference features can be found in this issue.
Supported Backends#
Deterministic inference is only supported with the following three attention backends: FlashInfer, FlashAttention 3 (FA3), and Triton.
The following table shows feature compatibility for deterministic inference across different attention backends:
| Attention Backend | CUDA Graph | Chunked Prefill | Radix Cache | Non-greedy Sampling (Temp > 0) |
|---|---|---|---|---|
| FlashInfer | ✅ Yes | ✅ Yes | ❌ No | ✅ Yes |
| FlashAttention 3 (FA3) | ✅ Yes | ✅ Yes | ✅ Yes | ✅ Yes |
| Triton | ✅ Yes | ✅ Yes | ✅ Yes | ✅ Yes |
Usage#
Basic Usage#
Enable deterministic inference by adding the `--enable-deterministic-inference` flag:
```bash
python3 -m sglang.launch_server \
    --model-path Qwen/Qwen3-8B \
    --attention-backend fa3 \
    --enable-deterministic-inference
```
Server Arguments#
| Argument | Type/Default | Description |
|---|---|---|
| `--enable-deterministic-inference` | flag; default: disabled | Enable deterministic inference with batch-invariant operations |
| `--attention-backend` | string; default: `fa3` | Choose attention backend (`flashinfer`, `fa3`, or `triton`) |
Example Configurations#
Qwen3-8B#
```bash
python3 -m sglang.launch_server \
    --model-path Qwen/Qwen3-8B \
    --attention-backend flashinfer \
    --enable-deterministic-inference
```
Llama Models#
```bash
python3 -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --attention-backend fa3 \
    --enable-deterministic-inference
```
Qwen3-30B-A3B (MoE Model)#
```bash
python3 -m sglang.launch_server \
    --model-path Qwen/Qwen3-30B-A3B \
    --attention-backend fa3 \
    --enable-deterministic-inference
```
Deterministic Inference with Non-Greedy Sampling (Temperature > 0)#
SGLang supports deterministic inference even with non-greedy sampling by using sampling seeds. This is particularly useful for reinforcement learning scenarios like GRPO (Group Relative Policy Optimization) where you need multiple diverse but reproducible responses.
Default Behavior#
By default, SGLang uses a sampling seed of `42` for reproducible sampling:
```python
import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Tell me a joke",
        "sampling_params": {
            "temperature": 0.8,  # Non-greedy sampling
            "max_new_tokens": 128,
        },
    },
)
print(response.json())
# This will always produce the same response across runs
```
Generating Multiple Reproducible Responses#
To sample different responses from the same prompt while maintaining reproducibility (e.g., for GRPO training), provide different sampling seeds in your requests:
```python
import requests

# Prepare a list of sampling seeds for different responses
sampling_seeds = [42, 43, 44, 45, 46]

responses = []
for seed in sampling_seeds:
    response = requests.post(
        "http://localhost:30000/generate",
        json={
            "text": "Tell me a joke",
            "sampling_params": {
                "temperature": 0.8,
                "max_new_tokens": 128,
                "sampling_seed": seed,  # Specify sampling seed
            },
        },
    )
    responses.append(response.json())

# Each seed produces a different but reproducible response;
# reusing a seed always produces the same response.
```
This approach ensures that:

- Different seeds produce diverse responses
- The same seed always produces the same response across runs
- Results are reproducible for debugging and evaluation
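The principle behind per-request sampling seeds can be sketched without a running server. The toy sampler below (not SGLang's actual implementation) gives each request its own seeded generator, so identical seeds replay identical draws while distinct seeds diversify them; the names `sample_tokens` and `vocab` are illustrative assumptions.

```python
import random

def sample_tokens(seed, vocab, n=5):
    # A toy "sampler": each request gets its own seeded generator,
    # analogous to passing a per-request sampling_seed.
    gen = random.Random(seed)
    return [gen.choice(vocab) for _ in range(n)]

vocab = ["the", "cat", "sat", "on", "mat"]

# Same seed -> identical draws across runs (reproducible).
assert sample_tokens(42, vocab) == sample_tokens(42, vocab)

# Different seeds -> independent draw sequences (diverse).
print(sample_tokens(42, vocab))
print(sample_tokens(43, vocab))
```

The key design point is that the generator state is tied to the request, not to global server state, so reproducibility survives batching with other requests.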
Verification#
Run deterministic tests to verify consistent outputs:
```bash
# Single test: same prompt, varying batch sizes
python3 -m sglang.test.test_deterministic --test-mode single --n-trials 50

# Prefix test: prompts with different prefix lengths
python3 -m sglang.test.test_deterministic --test-mode prefix --n-trials 50

# Radix cache consistency test: cached vs. uncached prefill
python3 -m sglang.test.test_deterministic --test-mode radix_cache
```
Expected result: all tests should report `Unique samples: 1` (perfectly deterministic).