Environment Variables
SGLang supports various environment variables that can be used to configure its runtime behavior. This document provides a comprehensive list and aims to stay updated over time.
Note: SGLang uses two prefixes for environment variables: SGL_ and SGLANG_. This is likely due to historical reasons. While both are currently supported for different settings, future versions might consolidate them.
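Because settings are split across the two prefixes, it can help to see which ones are actually set in the current shell before launching. The following is a minimal sketch that relies only on the two prefixes named above; the helper name `find_sglang_env_vars` is hypothetical and not part of SGLang.

```python
import os

# The two prefixes named in the note above.
SGLANG_PREFIXES = ("SGL_", "SGLANG_")


def find_sglang_env_vars(environ=None):
    """Return a name-sorted dict of variables whose names start with either prefix.

    Hypothetical helper for illustration; it is not an SGLang API.
    """
    if environ is None:
        environ = os.environ
    return {
        name: value
        for name, value in sorted(environ.items())
        if name.startswith(SGLANG_PREFIXES)
    }


if __name__ == "__main__":
    # Print every matching variable as NAME=value.
    for name, value in find_sglang_env_vars().items():
        print(f"{name}={value}")
```

Running the script prints nothing if no matching variables are set, which is a quick way to confirm that a launch uses defaults only.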
General Configuration
| Environment Variable | Description | Default Value |
|---|---|---|
|  | Enable using models from ModelScope |  |
|  | Host IP address for the server |  |
|  | Port for the server | auto-detected |
|  | Custom logging configuration path | Not set |
|  | Disable request logging |  |
|  | Timeout for health check in seconds |  |
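Several entries in these tables are on/off switches (for example, "Disable request logging"). This page does not state which string values count as enabled, so the sketch below assumes a common convention (`1`, `true`, `yes`, `on`, case-insensitive). The helper `env_flag` and the variable name `SGLANG_EXAMPLE_FLAG` are hypothetical, and SGLang's own parsing may differ.

```python
import os

# Assumption: these spellings count as "on"; SGLang's own parsing may differ.
TRUTHY = {"1", "true", "yes", "on"}


def env_flag(name, default=False, environ=None):
    """Read an on/off environment variable, falling back to `default` when unset."""
    if environ is None:
        environ = os.environ
    raw = environ.get(name)
    if raw is None:
        return default
    # Any value outside TRUTHY (including "0" and "false") is treated as off.
    return raw.strip().lower() in TRUTHY


if __name__ == "__main__":
    # SGLANG_EXAMPLE_FLAG is a placeholder name, not a real SGLang variable.
    print(env_flag("SGLANG_EXAMPLE_FLAG"))
```

Treating every unrecognized value as off is a deliberate choice in this sketch: a typo then leaves a feature disabled rather than silently enabling it.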
Performance Tuning
| Environment Variable | Description | Default Value |
|---|---|---|
|  | Control whether to use torch.inference_mode |  |
|  | Enable torch.compile |  |
|  | Enable CPU affinity setting (often set to |  |
|  | Allows the scheduler to overwrite longer context length requests (often set to |  |
|  | Control FlashInfer availability check |  |
|  | Skip P2P (peer-to-peer) access check |  |
|  | Sets the threshold for enabling chunked prefix caching |  |
|  | Enable RoPE fusion in Fused Multi-Layer Attention |  |
DeepGEMM Configuration (Advanced Optimization)
| Environment Variable | Description | Default Value |
|---|---|---|
|  | Enable Just-In-Time compilation of DeepGEMM kernels |  |
|  | Enable precompilation of DeepGEMM kernels |  |
|  | Number of workers for parallel DeepGEMM kernel compilation |  |
|  | Indicator flag used during the DeepGEMM precompile script |  |
|  | Directory for caching compiled DeepGEMM kernels |  |
|  | Use NVRTC (instead of Triton) for JIT compilation (Experimental) |  |
|  | Use DeepGEMM for Batched Matrix Multiplication (BMM) operations |  |
Memory Management
| Environment Variable | Description | Default Value |
|---|---|---|
|  | Enable memory pool debugging |  |
|  | Clip max new tokens estimation for memory planning | Not set |
|  | Maximum states for detokenizer | Default value based on system |
|  | Disable checks for memory imbalance across Tensor Parallel ranks | Not set (defaults to enabled check) |
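Some numeric settings in these tables default to "Not set", meaning no value applies unless the variable is defined. A minimal sketch of that pattern follows, where an unset or empty variable yields `None` and a malformed value raises a clear error. The helper `env_int` and the name `SGLANG_EXAMPLE_LIMIT` are hypothetical and do not describe how SGLang itself reads these values.

```python
import os


def env_int(name, default=None, environ=None):
    """Read an optional integer environment variable.

    Returns `default` (None unless overridden) when the variable is unset or empty.
    """
    if environ is None:
        environ = os.environ
    raw = environ.get(name)
    if raw is None or raw.strip() == "":
        return default
    try:
        return int(raw)
    except ValueError as exc:
        # Name the offending variable so the misconfiguration is easy to find.
        raise ValueError(f"{name} must be an integer, got {raw!r}") from exc


if __name__ == "__main__":
    # SGLANG_EXAMPLE_LIMIT is a placeholder name, not a real SGLang variable.
    limit = env_int("SGLANG_EXAMPLE_LIMIT")
    print("no limit configured" if limit is None else f"limit = {limit}")
```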
Model-Specific Options
| Environment Variable | Description | Default Value |
|---|---|---|
|  | Use AITER optimized implementation |  |
|  | Enable INT4 weight quantization |  |
|  | Enable MoE padding (sets padding size to 128 if value is |  |
|  | Force using FP8 MARLIN kernels even if other FP8 kernels are available |  |
|  | Use FlashInfer kernels when running blockwise FP8 GEMM on Blackwell GPUs |  |
|  | Use Cutlass kernels when running blockwise FP8 GEMM on Hopper or Blackwell GPUs |  |
|  | Use Cutlass FP8 MoE kernel on Blackwell GPUs |  |
Distributed Computing
| Environment Variable | Description | Default Value |
|---|---|---|
|  | Control blocking of non-zero rank children processes |  |
|  | Indicates if the current process is the first rank on its node |  |
|  | Pipeline parallel layer partition specification | Not set |
Testing & Debugging (Internal/CI)
These variables are primarily used for internal testing, continuous integration, or debugging.
| Environment Variable | Description | Default Value |
|---|---|---|
|  | Indicates if running in CI environment |  |
|  | Indicates running in AMD CI environment |  |
|  | Enable retract decode testing |  |
|  | Record step time for profiling |  |
|  | Test request time statistics |  |
|  | Use small KV cache size in CI | Not set |
Profiling & Benchmarking
| Environment Variable | Description | Default Value |
|---|---|---|
|  | Directory for PyTorch profiler output |  |
|  | Set |  |
Storage & Caching
| Environment Variable | Description | Default Value |
|---|---|---|
|  | Disable Outlines disk cache |  |