Checkpoint Engine Integration#

The SGLang checkpoint engine integration loads model weights through a distributed checkpoint-loading system. It significantly reduces model loading time, especially for large models and multi-node setups, by parallelizing weight loading across multiple processes and nodes.

Overview#

The checkpoint engine integration allows SGLang to:

  • Load model weights in parallel using multiple processes

  • Distribute weight loading across multiple nodes to increase effective disk bandwidth

  • Overlap weight loading with other initialization tasks like CUDA graph capture

  • Support both single-node and multi-node deployments

Installation#

First, install the checkpoint engine package:

pip install 'checkpoint-engine[p2p]'
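
To verify the installation, you can run a quick import check (this assumes the Python module is named checkpoint_engine, matching the package):

python -c "import checkpoint_engine; print('checkpoint-engine import OK')"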

Architecture#

The system consists of two main components:

  1. SGLang Server: Runs with the --wait-for-initial-weights flag to wait for weights before becoming ready

  2. Checkpoint Engine Workers: Separate processes (managed by torchrun) that load and distribute model weights

The checkpoint engine uses a parameter server architecture with support for the following update methods (see the example after this list):

  • Broadcast mode: Weights are broadcast from loading processes to inference processes

  • P2P mode: Direct peer-to-peer weight transfer between processes

  • All mode: Combination of both broadcast and P2P methods
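
The method is selected with the --update-method flag. The sketch below is a hypothetical p2p workflow that combines it with the metadata flags documented under Configuration Options (--save-metas-file / --load-metas-file); the exact sequencing depends on your deployment, so treat the paths and ordering as illustrative:

# Hypothetical p2p sketch; flag names come from the Configuration Options
# section below, but the paths and two-step ordering are assumptions.
python -m sglang.srt.checkpoint_engine.update \
    --update-method p2p \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 8 \
    --save-metas-file /tmp/ckpt-metas.json

# A later instance reuses the saved metadata for direct p2p transfer.
python -m sglang.srt.checkpoint_engine.update \
    --update-method p2p \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 8 \
    --load-metas-file /tmp/ckpt-metas.json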

Usage Examples#

Single Node Setup#

Terminal 1 - Launch SGLang Server:

python -m sglang.launch_server \
    --model-path Qwen/Qwen3-8B \
    --tp 8 \
    --load-format dummy \
    --wait-for-initial-weights

Terminal 2 - Run Checkpoint Engine:

Using the sglang entrypoint (recommended):

python -m sglang.srt.checkpoint_engine.update \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 8
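
Note: the update script talks to the server over HTTP (--endpoint, default http://localhost:19730). If your server listens on a different host or port (sglang.launch_server defaults to port 30000), pass --endpoint accordingly.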

Using torchrun directly:

torchrun --nproc-per-node 8 \
    examples/checkpoint_engine/update.py \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 8

Multi-Node Setup (2 Nodes)#

Node 0:

Launch SGLang server:

python -m sglang.launch_server \
    --model-path Qwen/Qwen3-8B \
    --tp 8 \
    --load-format dummy \
    --wait-for-initial-weights \
    --host [IP]

Run checkpoint engine:

Using the sglang entrypoint (recommended):

python -m sglang.srt.checkpoint_engine.update \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 8

Using torchrun directly:

torchrun --nproc-per-node 8 \
    --nnodes 2 \
    --node-rank 0 \
    --master-addr [IP] \
    --master-port 29500 \
    examples/checkpoint_engine/update.py \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 8

Node 1:

Launch SGLang server:

python -m sglang.launch_server \
    --model-path Qwen/Qwen3-8B \
    --tp 8 \
    --load-format dummy \
    --wait-for-initial-weights \
    --host [IP]

Run checkpoint engine:

Using the sglang entrypoint (recommended):

python -m sglang.srt.checkpoint_engine.update \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 8

Using torchrun directly:

torchrun --nproc-per-node 8 \
    --nnodes 2 \
    --node-rank 1 \
    --master-addr [IP] \
    --master-port 29500 \
    examples/checkpoint_engine/update.py \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 8
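
The two torchrun commands must join the same rendezvous: identical --master-addr and --master-port, with --node-rank 0 and 1 respectively. Together they form a single 16-process loading job (2 nodes × 8 processes), and each node reads only a portion of the checkpoint from disk (see Performance Benefits below).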

Multi-Node Setup with Tensor Parallelism (TP=16)#

Node 0:

Launch SGLang server:

python -m sglang.launch_server \
    --model-path Qwen/Qwen3-8B \
    --tp 16 \
    --load-format dummy \
    --wait-for-initial-weights \
    --host [IP] \
    --dist-init-addr [IP]:9120 \
    --nnodes 2 \
    --node-rank 0

Run checkpoint engine:

Using the sglang entrypoint (recommended):

python -m sglang.srt.checkpoint_engine.update \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 16

Using torchrun directly:

torchrun --nproc-per-node 8 \
    --nnodes 2 \
    --node-rank 0 \
    --master-addr [IP] \
    --master-port 29500 \
    examples/checkpoint_engine/update.py \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 16

Node 1:

Launch SGLang server:

python -m sglang.launch_server \
    --model-path Qwen/Qwen3-8B \
    --tp 16 \
    --load-format dummy \
    --wait-for-initial-weights \
    --host [IP] \
    --dist-init-addr [IP]:9120 \
    --nnodes 2 \
    --node-rank 1

Run checkpoint engine:

Using the sglang entrypoint (recommended):

python -m sglang.srt.checkpoint_engine.update \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 16

Using torchrun directly:

torchrun --nproc-per-node 8 \
    --nnodes 2 \
    --node-rank 1 \
    --master-addr [IP] \
    --master-port 29500 \
    examples/checkpoint_engine/update.py \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 16
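
Unlike the previous setup, which ran two independent TP=8 servers, here the two nodes form a single TP=16 engine: note --nnodes, --node-rank, and --dist-init-addr on the server command, and --inference-parallel-size 16 on the update command.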

Configuration Options#

SGLang Server Options#

  • --load-format dummy: Start the server with dummy weights so that real weight loading can overlap with other initialization tasks

  • --wait-for-initial-weights: Wait for checkpoint engine to provide weights before becoming ready

  • --host: Host address the server binds to (used in the multi-node setups above)

  • --dist-init-addr: Distributed initialization address for tensor parallelism

Checkpoint Engine Options#

  • --update-method: Weight update method (broadcast, p2p, or all)

  • --checkpoint-path: Path to model checkpoint directory

  • --inference-parallel-size: Number of inference parallel processes

  • --endpoint: SGLang server endpoint (default: http://localhost:19730)

  • --checkpoint-name: Name for the checkpoint (default: my-checkpoint-iter-0)

  • --save-metas-file: File to save checkpoint metadata

  • --load-metas-file: File to load checkpoint metadata from

  • --uds: Unix domain socket path for communication

  • --weight-version: Version identifier for weights
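
Putting several of these options together (an illustrative invocation; the values shown are placeholders, and the defaults listed above apply when a flag is omitted):

python -m sglang.srt.checkpoint_engine.update \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 8 \
    --endpoint http://localhost:19730 \
    --checkpoint-name my-checkpoint-iter-0 \
    --weight-version v1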

Performance Benefits#

The checkpoint engine provides significant time savings in two main aspects:

  1. Multi-node Loading: Each node loads only a portion of the weights from disk, which effectively multiplies the available disk bandwidth; the more nodes participate, the greater the speedup. Preliminary tests show roughly a 20-second reduction in loading time for DeepSeek-R1 on H20-3e with two nodes.

  2. Single-Process Optimization: Starting with the dummy load format lets the disk-to-CPU weight transfer overlap with CUDA graph capture and other initialization tasks, yielding additional time savings.
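
To quantify these savings in your own environment, you can time the update step directly; this is a rough end-to-end measurement that includes process startup:

time python -m sglang.srt.checkpoint_engine.update \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 8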

Troubleshooting#

  • Ensure the checkpoint-engine package is installed: pip install 'checkpoint-engine[p2p]'

  • Verify network connectivity between nodes in multi-node setups (see the check below)

  • Check that the checkpoint path contains valid model files

  • Monitor logs for connection errors between SGLang server and checkpoint engine

  • Use the --sleep-time parameter to add delays if needed for debugging
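
For the connectivity check above, a quick way to confirm that each worker node can reach the rendezvous port (assuming nc is available and 29500 is the port used in the examples):

# Run from each non-master node; [IP] is the master address.
nc -zv [IP] 29500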
