SGLang Router#
The SGLang Router is a high-performance request distribution system that routes inference requests across multiple SGLang runtime instances. It features cache-aware load balancing, fault tolerance, and support for advanced deployment patterns including data parallelism and prefill-decode disaggregation.
Key Features#
Cache-Aware Load Balancing: Optimizes cache utilization while maintaining balanced load distribution
Multiple Routing Policies: Choose from random, round-robin, cache-aware, or power-of-two policies
Fault Tolerance: Automatic retry and circuit breaker mechanisms for resilient operation
Dynamic Scaling: Add or remove workers at runtime without service interruption
Kubernetes Integration: Native service discovery and pod management
Prefill-Decode Disaggregation: Support for disaggregated serving load balancing
Prometheus Metrics: Built-in observability and monitoring
Installation#
pip install sglang-router
Quick Start#
To see all available options:
python -m sglang_router.launch_server --help # Co-launch router and workers
python -m sglang_router.launch_router --help # Launch router only
Deployment Modes#
The router supports three primary deployment patterns:
Co-launch Mode: Router and workers launch together (simplest for single-node deployments)
Separate Launch Mode: Router and workers launch independently (best for multi-node setups)
Prefill-Decode Disaggregation: Specialized mode for disaggregated serving
Mode 1: Co-launch Router and Workers#
This mode launches both the router and multiple worker instances in a single command. It's the simplest deployment option and replaces the --dp-size argument of SGLang Runtime.
# Launch router with 4 workers
python -m sglang_router.launch_server \
--model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--dp-size 4 \
--host 0.0.0.0 \
--port 30000
Sending Requests#
Once the server is ready, send requests to the router endpoint:
import requests
# Using the /generate endpoint
url = "http://localhost:30000/generate"
data = {
    "text": "What is the capital of France?",
    "sampling_params": {
        "temperature": 0.7,
        "max_new_tokens": 100
    }
}
response = requests.post(url, json=data)
print(response.json())
# OpenAI-compatible endpoint
url = "http://localhost:30000/v1/chat/completions"
data = {
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "What is the capital of France?"}]
}
response = requests.post(url, json=data)
print(response.json())
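Because the router exposes an OpenAI-compatible API, the official openai Python client also works against it. A minimal sketch, assuming the openai package is installed and that your deployment does not enforce an API key (pass your real key otherwise):
import openai
# Point the client at the router instead of api.openai.com
client = openai.OpenAI(base_url="http://localhost:30000/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    max_tokens=100,
)
print(response.choices[0].message.content)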
Mode 2: Separate Launch Mode#
This mode is ideal for multi-node deployments where workers run on different machines.
Step 1: Launch Workers#
On each worker node:
# Worker node 1
python -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 8000
# Worker node 2
python -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 8001
Step 2: Launch Router#
On the router node:
python -m sglang_router.launch_router \
--worker-urls http://worker1:8000 http://worker2:8001 \
--host 0.0.0.0 \
--port 30000 \
--policy cache_aware # or random, round_robin, power_of_two
Mode 3: Prefill-Decode Disaggregation#
This advanced mode separates prefill and decode operations for optimized performance:
python -m sglang_router.launch_router \
--pd-disaggregation \
--prefill http://prefill1:8000 9000 \
--prefill http://prefill2:8001 9001 \
--decode http://decode1:8002 \
--decode http://decode2:8003 \
--prefill-policy cache_aware \
--decode-policy round_robin
Understanding --prefill Arguments#
The --prefill flag accepts URLs with optional bootstrap ports:
--prefill http://server:8000 - No bootstrap port
--prefill http://server:8000 9000 - Bootstrap port 9000
--prefill http://server:8000 none - Explicitly no bootstrap port
Policy Inheritance in PD Mode#
The router intelligently handles policy configuration for prefill and decode nodes:
Only --policy specified: Both prefill and decode nodes use this policy
--policy and --prefill-policy specified: Prefill nodes use --prefill-policy, decode nodes use --policy
--policy and --decode-policy specified: Prefill nodes use --policy, decode nodes use --decode-policy
All three specified: Prefill nodes use --prefill-policy, decode nodes use --decode-policy (the main --policy is ignored)
Example with mixed policies:
python -m sglang_router.launch_router \
--pd-disaggregation \
--prefill http://prefill1:8000 \
--prefill http://prefill2:8000 \
--decode http://decode1:8001 \
--decode http://decode2:8001 \
--policy round_robin \
--prefill-policy cache_aware # Prefill uses cache_aware and decode uses round_robin from --policy
PD Mode with Service Discovery#
For Kubernetes deployments with separate prefill and decode server pools:
python -m sglang_router.launch_router \
--pd-disaggregation \
--service-discovery \
--prefill-selector app=prefill-server tier=gpu \
--decode-selector app=decode-server tier=cpu \
--service-discovery-namespace production \
--prefill-policy cache_aware \
--decode-policy round_robin
Dynamic Scaling#
The router supports runtime scaling through REST APIs:
Adding Workers#
# Launch a new worker
python -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--port 30001
# Add it to the router
curl -X POST "http://localhost:30000/add_worker?url=http://127.0.0.1:30001"
Removing Workers#
curl -X POST "http://localhost:30000/remove_worker?url=http://127.0.0.1:30001"
Note: When using cache-aware routing, removed workers are cleanly evicted from the routing tree and request queues.
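The same endpoints can be driven from Python. A minimal sketch using requests (the worker URL below is just an example):
import requests
router = "http://localhost:30000"
worker_url = "http://127.0.0.1:30001"  # example worker address
# Register the new worker with the router
print(requests.post(f"{router}/add_worker", params={"url": worker_url}).text)
# Remove it again once it should stop receiving traffic
print(requests.post(f"{router}/remove_worker", params={"url": worker_url}).text)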
Fault Tolerance#
The router includes comprehensive fault tolerance mechanisms:
Retry Configuration#
python -m sglang_router.launch_router \
--worker-urls http://worker1:8000 http://worker2:8001 \
--retry-max-retries 3 \
--retry-initial-backoff-ms 100 \
--retry-max-backoff-ms 10000 \
--retry-backoff-multiplier 2.0 \
--retry-jitter-factor 0.1
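To see how these parameters interact, the sketch below computes a typical exponential-backoff-with-jitter schedule from them. It illustrates the general technique only and is not the router's exact internal formula:
import random

def backoff_schedule(max_retries=3, initial_ms=100, max_ms=10000,
                     multiplier=2.0, jitter_factor=0.1):
    """Illustrative exponential backoff with jitter (not the router's exact code)."""
    delays, delay = [], initial_ms
    for _ in range(max_retries):
        # Apply +/- jitter_factor of random noise, then cap at the maximum backoff
        jitter = delay * jitter_factor * random.uniform(-1.0, 1.0)
        delays.append(min(delay + jitter, max_ms))
        delay = min(delay * multiplier, max_ms)
    return delays

print(backoff_schedule())  # roughly [100, 200, 400] milliseconds with small jitter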
Circuit Breaker#
Protects against cascading failures:
python -m sglang_router.launch_router \
--worker-urls http://worker1:8000 http://worker2:8001 \
--cb-failure-threshold 5 \
--cb-success-threshold 2 \
--cb-timeout-duration-secs 30 \
--cb-window-duration-secs 60
Behavior:
A worker is marked unhealthy after cb-failure-threshold consecutive failures
It returns to service after cb-success-threshold successful health checks
The circuit breaker can be disabled with --disable-circuit-breaker
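A simplified sketch of these state transitions (closed, open, half-open) is shown below. It illustrates the behavior described above only; the actual router additionally scopes failure counting to --cb-window-duration-secs:
import time

class CircuitBreaker:
    """Simplified illustration of closed/open/half-open transitions."""

    def __init__(self, failure_threshold=5, success_threshold=2, timeout_secs=30):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.timeout_secs = timeout_secs
        self.state = "closed"
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def allow_request(self):
        # After the timeout, let a probe request through in half-open state
        if self.state == "open" and time.time() - self.opened_at >= self.timeout_secs:
            self.state = "half_open"
        return self.state != "open"

    def record_failure(self):
        self.failures, self.successes = self.failures + 1, 0
        if self.failures >= self.failure_threshold:
            self.state, self.opened_at = "open", time.time()

    def record_success(self):
        self.successes, self.failures = self.successes + 1, 0
        if self.state == "half_open" and self.successes >= self.success_threshold:
            self.state = "closed"  # worker returns to service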
Routing Policies#
The router supports multiple routing strategies:
1. Random Routing#
Distributes requests randomly across workers.
--policy random
2. Round-Robin Routing#
Cycles through workers in order.
--policy round_robin
3. Power of Two Choices#
Samples two workers and routes to the less loaded one.
--policy power_of_two
4. Cache-Aware Load Balancing (Default)#
The most sophisticated policy that combines cache optimization with load balancing:
--policy cache_aware \
--cache-threshold 0.5 \
--balance-abs-threshold 32 \
--balance-rel-threshold 1.0001
How It Works#
Load Assessment: Checks if the system is balanced
Imbalanced if: (max_load - min_load) > balance_abs_threshold AND max_load > balance_rel_threshold * min_load
Routing Decision:
Balanced System: Uses cache-aware routing
Routes to the worker with the highest prefix match if the match ratio exceeds cache_threshold
Otherwise routes to the worker with the most available cache capacity
Imbalanced System: Uses shortest-queue routing to the least busy worker
Cache Management:
Maintains approximate radix trees per worker
Periodically evicts LRU entries based on --eviction-interval and --max-tree-size
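To make the decision flow concrete, here is a rough Python sketch of the logic above. It is illustrative only: the real router is not implemented this way, and the "most available cache capacity" fallback is simplified to least-loaded:
def pick_worker(loads, prefix_match, cache_threshold=0.5,
                balance_abs_threshold=32, balance_rel_threshold=1.0001):
    """Illustrative cache-aware routing decision (simplified).

    loads: worker URL -> number of pending requests
    prefix_match: worker URL -> prefix match ratio (0.0-1.0) for the incoming request
    """
    max_load, min_load = max(loads.values()), min(loads.values())

    # Load assessment: imbalanced only if BOTH conditions hold
    imbalanced = (max_load - min_load > balance_abs_threshold
                  and max_load > balance_rel_threshold * min_load)

    if imbalanced:
        # Shortest-queue routing: pick the least busy worker
        return min(loads, key=loads.get)

    # Balanced: prefer the worker with the best prefix match above the threshold
    best = max(prefix_match, key=prefix_match.get)
    if prefix_match[best] > cache_threshold:
        return best
    # Otherwise fall back (simplified here to the least loaded worker)
    return min(loads, key=loads.get)

print(pick_worker({"w1": 10, "w2": 60}, {"w1": 0.9, "w2": 0.1}))  # -> "w1" (imbalanced)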
Data Parallelism Aware Routing#
Enables fine-grained control over data parallel replicas:
--dp-aware \
--api-key your_api_key # Required for worker authentication
This mode coordinates with SGLang’s DP controller for optimized request distribution across data parallel ranks.
Configuration Reference#
Core Settings#
| Parameter | Type | Default | Description |
|---|---|---|---|
| --host | str | 127.0.0.1 | Router server host address |
| --port | int | 30000 | Router server port |
| --worker-urls | list | [] | Worker URLs for separate launch mode |
| --policy | str | cache_aware | Routing policy (random, round_robin, cache_aware, power_of_two) |
|  | int | 64 | Maximum concurrent requests (rate limiting) |
|  | int | 600 | Request timeout in seconds |
|  | int | 256MB | Maximum request payload size |
Cache-Aware Routing Parameters#
| Parameter | Type | Default | Description |
|---|---|---|---|
| --cache-threshold | float | 0.5 | Minimum prefix match ratio for cache routing (0.0-1.0) |
| --balance-abs-threshold | int | 32 | Absolute load difference threshold |
| --balance-rel-threshold | float | 1.0001 | Relative load ratio threshold |
| --eviction-interval | int | 60 | Seconds between cache eviction cycles |
| --max-tree-size | int | 16777216 | Maximum nodes in routing tree |
Fault Tolerance Parameters#
| Parameter | Type | Default | Description |
|---|---|---|---|
| --retry-max-retries | int | 3 | Maximum retry attempts per request |
| --retry-initial-backoff-ms | int | 100 | Initial retry backoff in milliseconds |
| --retry-max-backoff-ms | int | 10000 | Maximum retry backoff in milliseconds |
| --retry-backoff-multiplier | float | 2.0 | Backoff multiplier between retries |
| --retry-jitter-factor | float | 0.1 | Random jitter factor for retries |
|  | flag | False | Disable retry mechanism |
| --cb-failure-threshold | int | 5 | Failures before circuit opens |
| --cb-success-threshold | int | 2 | Successes to close circuit |
| --cb-timeout-duration-secs | int | 30 | Circuit breaker timeout duration in seconds |
| --cb-window-duration-secs | int | 60 | Circuit breaker window duration in seconds |
| --disable-circuit-breaker | flag | False | Disable circuit breaker |
Prefill-Decode Disaggregation Parameters#
| Parameter | Type | Default | Description |
|---|---|---|---|
| --pd-disaggregation | flag | False | Enable PD disaggregated mode |
| --prefill | list | [] | Prefill server URLs with optional bootstrap ports |
| --decode | list | [] | Decode server URLs |
| --prefill-policy | str | None | Routing policy for prefill nodes (overrides --policy) |
| --decode-policy | str | None | Routing policy for decode nodes (overrides --policy) |
| --worker-startup-timeout-secs | int | 300 | Timeout for worker startup in seconds |
|  | int | 10 | Interval between startup checks in seconds |
Kubernetes Integration#
| Parameter | Type | Default | Description |
|---|---|---|---|
| --service-discovery | flag | False | Enable Kubernetes service discovery |
| --selector | list | [] | Label selector for workers (key1=value1 key2=value2) |
| --prefill-selector | list | [] | Label selector for prefill servers in PD mode |
| --decode-selector | list | [] | Label selector for decode servers in PD mode |
| --service-discovery-port | int | 80 | Port for discovered pods |
| --service-discovery-namespace | str | None | Kubernetes namespace to watch |
| --bootstrap-port-annotation | str | sglang.ai/bootstrap-port | Annotation used for bootstrap ports |
Observability#
| Parameter | Type | Default | Description |
|---|---|---|---|
| --prometheus-port | int | 29000 | Prometheus metrics port |
| --prometheus-host | str | 127.0.0.1 | Prometheus metrics host |
| --log-dir | str | None | Directory for log files |
| --log-level | str | info | Logging level (debug, info, warning, error, critical) |
| --request-id-headers | list | None | Custom headers for request tracing |
CORS Configuration#
| Parameter | Type | Default | Description |
|---|---|---|---|
|  | list | [] | Allowed CORS origins |
Advanced Features#
Kubernetes Service Discovery#
Automatically discover and manage workers in Kubernetes:
Standard Mode#
python -m sglang_router.launch_router \
--service-discovery \
--selector app=sglang-worker env=prod \
--service-discovery-namespace production \
--service-discovery-port 8000
Prefill-Decode Disaggregation Mode#
python -m sglang_router.launch_router \
--pd-disaggregation \
--service-discovery \
--prefill-selector app=prefill-server env=prod \
--decode-selector app=decode-server env=prod \
--service-discovery-namespace production
Note: The --bootstrap-port-annotation (default: sglang.ai/bootstrap-port) is used to discover bootstrap ports for prefill servers in PD mode. Prefill pods should have this annotation set to their bootstrap port value.
Prometheus Metrics#
Expose metrics for monitoring:
python -m sglang_router.launch_router \
--worker-urls http://worker1:8000 http://worker2:8001 \
--prometheus-port 29000 \
--prometheus-host 0.0.0.0
Metrics available at http://localhost:29000/metrics
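A quick way to verify the exporter from Python, assuming the defaults above:
import requests
# Fetch the Prometheus text-format metrics exposed by the router
metrics = requests.get("http://localhost:29000/metrics").text
print("\n".join(metrics.splitlines()[:10]))  # show the first few metric lines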
Request Tracing#
Enable request ID tracking:
python -m sglang_router.launch_router \
--worker-urls http://worker1:8000 http://worker2:8001 \
--request-id-headers x-request-id x-trace-id
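Clients then attach one of the configured headers to each request so it can be correlated across the router and workers. A small example using requests (the header value is arbitrary):
import uuid
import requests

request_id = str(uuid.uuid4())
response = requests.post(
    "http://localhost:30000/generate",
    json={"text": "Hello", "sampling_params": {"max_new_tokens": 8}},
    headers={"x-request-id": request_id},  # must match a header listed in --request-id-headers
)
print(request_id, response.status_code)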
Troubleshooting#
Common Issues#
Workers not connecting: Ensure workers are fully initialized before starting the router. Use --worker-startup-timeout-secs to increase the wait time.
High latency: Check if cache-aware routing is causing imbalance. Try adjusting --balance-abs-threshold and --balance-rel-threshold.
Memory growth: Reduce --max-tree-size or decrease --eviction-interval for more aggressive cache cleanup.
Circuit breaker triggering frequently: Increase --cb-failure-threshold or extend --cb-window-duration-secs.
Debug Mode#
Enable detailed logging:
python -m sglang_router.launch_router \
--worker-urls http://worker1:8000 http://worker2:8001 \
--log-level debug \
--log-dir ./router_logs