SGLang Router#

The SGLang Router is a high-performance request distribution system that routes inference requests across multiple SGLang runtime instances. It features cache-aware load balancing, fault tolerance, and support for advanced deployment patterns including data parallelism and prefill-decode disaggregation.

Key Features#

  • Cache-Aware Load Balancing: Optimizes cache utilization while maintaining balanced load distribution

  • Multiple Routing Policies: Choose from random, round-robin, cache-aware, or power-of-two policies

  • Fault Tolerance: Automatic retry and circuit breaker mechanisms for resilient operation

  • Dynamic Scaling: Add or remove workers at runtime without service interruption

  • Kubernetes Integration: Native service discovery and pod management

  • Prefill-Decode Disaggregation: Support for disaggregated serving load balancing

  • Prometheus Metrics: Built-in observability and monitoring

Installation#

pip install sglang-router

Quick Start#

To see all available options:

python -m sglang_router.launch_server --help  # Co-launch router and workers
python -m sglang_router.launch_router --help  # Launch router only

Deployment Modes#

The router supports three primary deployment patterns:

  1. Co-launch Mode: Router and workers launch together (simplest for single-node deployments)

  2. Separate Launch Mode: Router and workers launch independently (best for multi-node setups)

  3. Prefill-Decode Disaggregation: Specialized mode for disaggregated serving

Mode 1: Co-launch Router and Workers#

This mode launches both the router and multiple worker instances in a single command. It’s the simplest deployment option and replaces the --dp-size argument of SGLang Runtime.

# Launch router with 4 workers
python -m sglang_router.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --dp-size 4 \
    --host 0.0.0.0 \
    --port 30000

Sending Requests#

Once the server is ready, send requests to the router endpoint:

import requests

# Using the /generate endpoint
url = "http://localhost:30000/generate"
data = {
    "text": "What is the capital of France?",
    "sampling_params": {
        "temperature": 0.7,
        "max_new_tokens": 100
    }
}

response = requests.post(url, json=data)
print(response.json())

# OpenAI-compatible endpoint
url = "http://localhost:30000/v1/chat/completions"
data = {
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "What is the capital of France?"}]
}

response = requests.post(url, json=data)
print(response.json())

Mode 2: Separate Launch Mode#

This mode is ideal for multi-node deployments where workers run on different machines.

Step 1: Launch Workers#

On each worker node:

# Worker node 1
python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 8000

# Worker node 2
python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 8001

Step 2: Launch Router#

On the router node:

python -m sglang_router.launch_router \
    --worker-urls http://worker1:8000 http://worker2:8001 \
    --host 0.0.0.0 \
    --port 30000 \
    --policy cache_aware  # or random, round_robin, power_of_two

Mode 3: Prefill-Decode Disaggregation#

This advanced mode runs prefill and decode on separate server pools, allowing each stage to be scaled and tuned independently:

python -m sglang_router.launch_router \
    --pd-disaggregation \
    --prefill http://prefill1:8000 9000 \
    --prefill http://prefill2:8001 9001 \
    --decode http://decode1:8002 \
    --decode http://decode2:8003 \
    --prefill-policy cache_aware \
    --decode-policy round_robin

Understanding --prefill Arguments#

The --prefill flag accepts URLs with optional bootstrap ports:

  • --prefill http://server:8000 - No bootstrap port

  • --prefill http://server:8000 9000 - Bootstrap port 9000

  • --prefill http://server:8000 none - Explicitly no bootstrap port

Policy Inheritance in PD Mode#

The router resolves the routing policy for prefill and decode nodes as follows:

  1. Only --policy specified: Both prefill and decode nodes use this policy

  2. --policy and --prefill-policy specified: Prefill nodes use --prefill-policy, decode nodes use --policy

  3. --policy and --decode-policy specified: Prefill nodes use --policy, decode nodes use --decode-policy

  4. All three specified: Prefill nodes use --prefill-policy, decode nodes use --decode-policy (main --policy is ignored)

Example with mixed policies:

python -m sglang_router.launch_router \
    --pd-disaggregation \
    --prefill http://prefill1:8000 \
    --prefill http://prefill2:8000 \
    --decode http://decode1:8001 \
    --decode http://decode2:8001 \
    --policy round_robin \
    --prefill-policy cache_aware  # Prefill uses cache_aware; decode inherits round_robin from --policy

PD Mode with Service Discovery#

For Kubernetes deployments with separate prefill and decode server pools:

python -m sglang_router.launch_router \
    --pd-disaggregation \
    --service-discovery \
    --prefill-selector app=prefill-server tier=gpu \
    --decode-selector app=decode-server tier=cpu \
    --service-discovery-namespace production \
    --prefill-policy cache_aware \
    --decode-policy round_robin

Dynamic Scaling#

The router supports runtime scaling through REST APIs:

Adding Workers#

# Launch a new worker
python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --port 30001

# Add it to the router
curl -X POST "http://localhost:30000/add_worker?url=http://127.0.0.1:30001"

Removing Workers#

curl -X POST "http://localhost:30000/remove_worker?url=http://127.0.0.1:30001"

Note: When using cache-aware routing, removed workers are cleanly evicted from the routing tree and request queues.
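The scaling endpoints above can also be driven from Python. A minimal sketch using the requests library, reusing the router and worker URLs from this example:

```python
import requests

ROUTER = "http://localhost:30000"

def scale_url(action: str, worker_url: str) -> str:
    """Build the router's scaling endpoint; action is add_worker or remove_worker."""
    return f"{ROUTER}/{action}?url={worker_url}"

if __name__ == "__main__":
    # Add a freshly launched worker to the pool, then later drain it.
    requests.post(scale_url("add_worker", "http://127.0.0.1:30001"))
    requests.post(scale_url("remove_worker", "http://127.0.0.1:30001"))
```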

Fault Tolerance#

The router includes comprehensive fault tolerance mechanisms:

Retry Configuration#

python -m sglang_router.launch_router \
    --worker-urls http://worker1:8000 http://worker2:8001 \
    --retry-max-retries 3 \
    --retry-initial-backoff-ms 100 \
    --retry-max-backoff-ms 10000 \
    --retry-backoff-multiplier 2.0 \
    --retry-jitter-factor 0.1
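To see how these flags interact, here is an illustrative sketch of an exponential backoff schedule with jitter. The router's exact formula may differ, but the flag semantics map onto these parameters:

```python
import random

def backoff_ms(attempt: int,
               initial_ms: int = 100,      # --retry-initial-backoff-ms
               max_ms: int = 10_000,       # --retry-max-backoff-ms
               multiplier: float = 2.0,    # --retry-backoff-multiplier
               jitter_factor: float = 0.1  # --retry-jitter-factor
               ) -> float:
    """Delay before retry number `attempt` (0-based): exponential growth,
    capped at max_ms, plus a random jitter to avoid synchronized retries."""
    base = min(initial_ms * multiplier ** attempt, max_ms)
    return base + base * jitter_factor * random.random()

# Base delays for the first three retries: 100 ms, 200 ms, 400 ms (plus jitter)
```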

Circuit Breaker#

Protects against cascading failures:

python -m sglang_router.launch_router \
    --worker-urls http://worker1:8000 http://worker2:8001 \
    --cb-failure-threshold 5 \
    --cb-success-threshold 2 \
    --cb-timeout-duration-secs 30 \
    --cb-window-duration-secs 60

Behavior:

  • Worker is marked unhealthy after cb-failure-threshold consecutive failures

  • Returns to service after cb-success-threshold successful health checks

  • Circuit breaker can be disabled with --disable-circuit-breaker
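The open/close transitions can be sketched as a small state machine. This is a simplified model for illustration; the real breaker also uses the timeout and window durations, which are omitted here:

```python
class CircuitBreaker:
    """Per-worker breaker sketch: opens after `failure_threshold` consecutive
    failures, closes again after `success_threshold` consecutive successes."""

    def __init__(self, failure_threshold: int = 5, success_threshold: int = 2):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.failures = 0
        self.successes = 0
        self.open = False  # open == worker is out of rotation

    def record(self, ok: bool) -> None:
        if ok:
            self.failures = 0
            self.successes += 1
            if self.open and self.successes >= self.success_threshold:
                self.open = False  # worker returns to service
        else:
            self.successes = 0
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.open = True  # stop routing to this worker
```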

Routing Policies#

The router supports multiple routing strategies:

1. Random Routing#

Distributes requests randomly across workers.

--policy random

2. Round-Robin Routing#

Cycles through workers in order.

--policy round_robin

3. Power of Two Choices#

Samples two workers and routes to the less loaded one.

--policy power_of_two
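This is the classic "power of two choices" technique: sampling just two workers captures most of the benefit of global shortest-queue routing at far lower cost. A minimal sketch, assuming per-worker load counts are available:

```python
import random

def power_of_two(loads: dict[str, int]) -> str:
    """Sample two distinct workers and route to the less loaded one."""
    a, b = random.sample(list(loads), 2)
    return a if loads[a] <= loads[b] else b
```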

4. Cache-Aware Load Balancing (Default)#

The most sophisticated policy that combines cache optimization with load balancing:

--policy cache_aware \
--cache-threshold 0.5 \
--balance-abs-threshold 32 \
--balance-rel-threshold 1.0001

How It Works#

  1. Load Assessment: Checks if the system is balanced

    • Imbalanced if: (max_load - min_load) > balance_abs_threshold AND max_load > balance_rel_threshold * min_load

  2. Routing Decision:

    • Balanced System: Uses cache-aware routing

      • Routes to worker with highest prefix match if match > cache_threshold

      • Otherwise routes to worker with most available cache capacity

    • Imbalanced System: Uses shortest queue routing to the least busy worker

  3. Cache Management:

    • Maintains approximate radix trees per worker

    • Periodically evicts LRU entries based on --eviction-interval and --max-tree-size
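The decision procedure above can be sketched as follows. This is an approximation for illustration: the real router maintains per-worker radix trees, whereas this sketch takes precomputed prefix-match ratios as input and uses the least-loaded worker as a stand-in for "most available cache capacity":

```python
def pick_worker(loads: dict, prefix_match: dict,
                cache_threshold: float = 0.5,
                abs_threshold: int = 32,
                rel_threshold: float = 1.0001) -> str:
    """loads: {worker: queue_len}; prefix_match: {worker: match ratio in 0..1}."""
    max_load, min_load = max(loads.values()), min(loads.values())
    imbalanced = ((max_load - min_load) > abs_threshold
                  and max_load > rel_threshold * min_load)
    if imbalanced:
        # Imbalanced system: fall back to shortest-queue routing
        return min(loads, key=loads.get)
    best = max(prefix_match, key=prefix_match.get)
    if prefix_match[best] > cache_threshold:
        return best  # strong prefix hit: reuse that worker's cache
    # No strong hit: approximate "most available cache" with least loaded
    return min(loads, key=loads.get)
```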

Data Parallelism Aware Routing#

Enables fine-grained control over data parallel replicas:

--dp-aware \
--api-key your_api_key  # Required for worker authentication

This mode coordinates with SGLang’s DP controller for optimized request distribution across data parallel ranks.

Configuration Reference#

Core Settings#

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| --host | str | 127.0.0.1 | Router server host address |
| --port | int | 30000 | Router server port |
| --worker-urls | list | [] | Worker URLs for separate launch mode |
| --policy | str | cache_aware | Routing policy (random, round_robin, cache_aware, power_of_two) |
| --max-concurrent-requests | int | 64 | Maximum concurrent requests (rate limiting) |
| --request-timeout-secs | int | 600 | Request timeout in seconds |
| --max-payload-size | int | 256MB | Maximum request payload size |

Cache-Aware Routing Parameters#

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| --cache-threshold | float | 0.5 | Minimum prefix match ratio for cache routing (0.0-1.0) |
| --balance-abs-threshold | int | 32 | Absolute load difference threshold |
| --balance-rel-threshold | float | 1.0001 | Relative load ratio threshold |
| --eviction-interval | int | 60 | Seconds between cache eviction cycles |
| --max-tree-size | int | 16777216 | Maximum nodes in routing tree |

Fault Tolerance Parameters#

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| --retry-max-retries | int | 3 | Maximum retry attempts per request |
| --retry-initial-backoff-ms | int | 100 | Initial retry backoff in milliseconds |
| --retry-max-backoff-ms | int | 10000 | Maximum retry backoff in milliseconds |
| --retry-backoff-multiplier | float | 2.0 | Backoff multiplier between retries |
| --retry-jitter-factor | float | 0.1 | Random jitter factor for retries |
| --disable-retries | flag | False | Disable retry mechanism |
| --cb-failure-threshold | int | 5 | Failures before circuit opens |
| --cb-success-threshold | int | 2 | Successes to close circuit |
| --cb-timeout-duration-secs | int | 30 | Circuit breaker timeout duration |
| --cb-window-duration-secs | int | 60 | Circuit breaker window duration |
| --disable-circuit-breaker | flag | False | Disable circuit breaker |

Prefill-Decode Disaggregation Parameters#

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| --pd-disaggregation | flag | False | Enable PD disaggregated mode |
| --prefill | list | [] | Prefill server URLs with optional bootstrap ports |
| --decode | list | [] | Decode server URLs |
| --prefill-policy | str | None | Routing policy for prefill nodes (overrides --policy) |
| --decode-policy | str | None | Routing policy for decode nodes (overrides --policy) |
| --worker-startup-timeout-secs | int | 300 | Timeout for worker startup |
| --worker-startup-check-interval | int | 10 | Interval between startup checks |

Kubernetes Integration#

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| --service-discovery | flag | False | Enable Kubernetes service discovery |
| --selector | list | [] | Label selector for workers (key1=value1 key2=value2) |
| --prefill-selector | list | [] | Label selector for prefill servers in PD mode |
| --decode-selector | list | [] | Label selector for decode servers in PD mode |
| --service-discovery-port | int | 80 | Port for discovered pods |
| --service-discovery-namespace | str | None | Kubernetes namespace to watch |
| --bootstrap-port-annotation | str | sglang.ai/bootstrap-port | Annotation for bootstrap ports |

Observability#

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| --prometheus-port | int | 29000 | Prometheus metrics port |
| --prometheus-host | str | 127.0.0.1 | Prometheus metrics host |
| --log-dir | str | None | Directory for log files |
| --log-level | str | info | Logging level (debug, info, warning, error, critical) |
| --request-id-headers | list | None | Custom headers for request tracing |

CORS Configuration#

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| --cors-allowed-origins | list | [] | Allowed CORS origins |

Advanced Features#

Kubernetes Service Discovery#

Automatically discover and manage workers in Kubernetes:

Standard Mode#

python -m sglang_router.launch_router \
    --service-discovery \
    --selector app=sglang-worker env=prod \
    --service-discovery-namespace production \
    --service-discovery-port 8000

Prefill-Decode Disaggregation Mode#

python -m sglang_router.launch_router \
    --pd-disaggregation \
    --service-discovery \
    --prefill-selector app=prefill-server env=prod \
    --decode-selector app=decode-server env=prod \
    --service-discovery-namespace production

Note: The --bootstrap-port-annotation (default: sglang.ai/bootstrap-port) is used to discover bootstrap ports for prefill servers in PD mode. Prefill pods should have this annotation set to their bootstrap port value.

Prometheus Metrics#

Expose metrics for monitoring:

python -m sglang_router.launch_router \
    --worker-urls http://worker1:8000 http://worker2:8001 \
    --prometheus-port 29000 \
    --prometheus-host 0.0.0.0

Metrics are then available at http://localhost:29000/metrics.
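A quick way to inspect the exposed metrics from Python; a minimal sketch using the requests library (metric names vary by router version, so check the raw output for what your build exports):

```python
import requests

def parse_metrics(text: str) -> list[str]:
    """Keep only sample lines, dropping # HELP / # TYPE comment lines."""
    return [line for line in text.splitlines()
            if line and not line.startswith("#")]

if __name__ == "__main__":
    body = requests.get("http://localhost:29000/metrics", timeout=5).text
    for sample in parse_metrics(body):
        print(sample)
```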

Request Tracing#

Enable request ID tracking:

python -m sglang_router.launch_router \
    --worker-urls http://worker1:8000 http://worker2:8001 \
    --request-id-headers x-request-id x-trace-id
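With the flags above, clients can supply their own IDs in either header. A minimal sketch that attaches a caller-generated ID (the req- prefix is just an illustrative convention):

```python
import uuid
import requests

def traced_headers() -> dict[str, str]:
    """Caller-generated request ID; the header name must match one of the
    values passed to --request-id-headers."""
    return {"x-request-id": f"req-{uuid.uuid4().hex}"}

if __name__ == "__main__":
    resp = requests.post(
        "http://localhost:30000/generate",
        json={"text": "ping", "sampling_params": {"max_new_tokens": 1}},
        headers=traced_headers(),
    )
    print(resp.status_code)
```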

Troubleshooting#

Common Issues#

  1. Workers not connecting: Ensure workers are fully initialized before starting the router. Use --worker-startup-timeout-secs to increase wait time.

  2. High latency: Check if cache-aware routing is causing imbalance. Try adjusting --balance-abs-threshold and --balance-rel-threshold.

  3. Memory growth: Reduce --max-tree-size or decrease --eviction-interval for more aggressive cache cleanup.

  4. Circuit breaker triggering frequently: Increase --cb-failure-threshold or extend --cb-window-duration-secs.

Debug Mode#

Enable detailed logging:

python -m sglang_router.launch_router \
    --worker-urls http://worker1:8000 http://worker2:8001 \
    --log-level debug \
    --log-dir ./router_logs