Install SGLang
You can install SGLang using any of the methods below.
For running DeepSeek V3/R1, refer to DeepSeek V3 Support. It is recommended to use the latest version and deploy it with Docker to avoid environment-related issues.
It is recommended to use uv to install the dependencies for faster installation:
Method 1: With pip or uv
pip install --upgrade pip
pip install uv
uv pip install "sglang[all]>=0.4.5" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python
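To confirm that the installed version satisfies the >=0.4.5 constraint used above, a check along these lines may help. This is a minimal sketch, not part of SGLang: version_ge is a hypothetical helper, and the snippet assumes GNU sort (for the -V flag) and that pip is reachable as python3 -m pip.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when version $1 >= version $2.
# Relies on GNU sort -V (version sort), which is an assumption about your system.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Read the installed sglang version from pip metadata (empty if not installed).
installed="$(python3 -m pip show sglang 2>/dev/null | awk '/^Version:/ {print $2}')"

if [ -z "$installed" ]; then
  echo "sglang is not installed in this environment"
elif version_ge "$installed" "0.4.5"; then
  echo "sglang $installed satisfies >=0.4.5"
else
  echo "sglang $installed is older than 0.4.5; re-run the install command"
fi
```

If you installed into a uv-managed virtual environment, run the check with that environment activated so that python3 resolves to the right interpreter.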
Quick Fixes to Common Problems
SGLang currently uses torch 2.5, so you need to install FlashInfer for torch 2.5. To install FlashInfer separately, refer to the FlashInfer installation doc. Note that the FlashInfer PyPI package is called flashinfer-python, not flashinfer.
If you encounter the error
OSError: CUDA_HOME environment variable is not set
set CUDA_HOME to your CUDA install root using either of the following solutions. Either set the environment variable directly with
export CUDA_HOME=/usr/local/cuda-<your-cuda-version>
or install FlashInfer first, following the FlashInfer installation doc, and then install SGLang as described above.
If you encounter the error
ImportError: cannot import name 'is_valid_list_of_images' from 'transformers.models.llama.image_processing_llama'
install the version of transformers specified in pyproject.toml. Currently, that means running
pip install transformers==4.48.3
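When the CUDA_HOME error appears and you are unsure where CUDA is installed, the root can often be derived from the location of nvcc. The following is a minimal sketch under that assumption: resolve_cuda_home is a hypothetical helper, and it assumes the usual layout in which nvcc lives at <cuda-root>/bin/nvcc.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a CUDA root.
# Keeps an existing CUDA_HOME; otherwise derives the root from nvcc on PATH.
resolve_cuda_home() {
  if [ -n "${CUDA_HOME:-}" ]; then
    printf '%s\n' "$CUDA_HOME"
    return 0
  fi
  local nvcc_path
  nvcc_path="$(command -v nvcc)" || return 1
  # Assumes nvcc sits in <cuda-root>/bin/nvcc, so strip two path components.
  dirname "$(dirname "$nvcc_path")"
}

# Export the result only when a root was found.
if cuda_root="$(resolve_cuda_home)"; then
  export CUDA_HOME="$cuda_root"
  echo "CUDA_HOME=$CUDA_HOME"
else
  echo "nvcc not found on PATH; set CUDA_HOME manually" >&2
fi
```

If nvcc is a symlink (for example under /usr/bin), the derived path may not be the real CUDA root; in that case set CUDA_HOME by hand as described above.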
Method 2: From source
# Use the last release branch
git clone -b v0.4.5 https://github.com/sgl-project/sglang.git
cd sglang
pip install --upgrade pip
pip install -e "python[all]" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python
Note: SGLang currently uses torch 2.5, so you need to install flashinfer for torch 2.5. If you want to install flashinfer separately, please refer to FlashInfer installation doc.
If you want to develop SGLang, it is recommended to use Docker. Refer to setup docker container for guidance. The Docker image is lmsysorg/sglang:dev.
Note: For AMD ROCm systems with Instinct/MI GPUs, do the following instead:
# Use the last release branch
git clone -b v0.4.5 https://github.com/sgl-project/sglang.git
cd sglang
pip install --upgrade pip
cd sgl-kernel
python setup_rocm.py install
cd ..
pip install -e "python[all_hip]"
Method 3: Using docker
The docker images are available on Docker Hub as lmsysorg/sglang, built from Dockerfile.
Replace <secret> below with your Hugging Face Hub token.
docker run --gpus all \
--shm-size 32g \
-p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<secret>" \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000
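The server can take a while to load the model after the container starts, so it may be useful to wait until it responds before sending requests. The following is a sketch only: wait_for_server is a hypothetical helper, and the /health path is an assumption about the server's HTTP API that you should verify against your SGLang version.

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll a URL until it returns an HTTP success status
# or the attempts run out.
#   $1 = URL, $2 = max attempts (default 30), $3 = seconds between attempts (default 2)
wait_for_server() {
  local url="$1" attempts="${2:-30}" delay="${3:-2}"
  local i
  for i in $(seq 1 "$attempts"); do
    # --fail makes curl exit non-zero on HTTP errors (4xx/5xx).
    if curl --silent --fail --max-time 5 --output /dev/null "$url"; then
      return 0
    fi
    sleep "$delay"
  done
  return 1
}

# Example (assumes the container above is running and that /health exists):
#   wait_for_server "http://localhost:30000/health" && echo "server is ready"
```

The port matches the -p 30000:30000 mapping in the docker run command; adjust the URL if you publish a different port.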
Note: For AMD ROCm systems with Instinct/MI GPUs, it is recommended to build images with docker/Dockerfile.rocm. An example build and its usage are shown below:
docker build --build-arg SGL_BRANCH=v0.4.5 -t v0.4.5-rocm630 -f Dockerfile.rocm .
alias drun='docker run -it --rm --network=host --device=/dev/kfd --device=/dev/dri --ipc=host \
--shm-size 16G --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
-v $HOME/dockerx:/dockerx -v /data:/data'
drun -p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<secret>" \
v0.4.5-rocm630 \
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000
# Until the flashinfer backend is available, --attention-backend triton --sampling-backend pytorch are set by default
drun v0.4.5-rocm630 python3 -m sglang.bench_one_batch --batch-size 32 --input 1024 --output 128 --model amd/Meta-Llama-3.1-8B-Instruct-FP8-KV --tp 8 --quantization fp8
Method 4: Using docker compose
This method is recommended if you plan to run SGLang as a service. A better approach is to use the k8s-sglang-service.yaml.
Copy the compose.yml to your local machine.
Then run the following command in your terminal:
docker compose up -d
Method 5: Using Kubernetes
Option 1: For single-node serving (typically when the model fits into the GPUs on one node), run
kubectl apply -f docker/k8s-sglang-service.yaml
to create a k8s deployment and service, with llama-31-8b as an example.
Option 2: For multi-node serving (usually when a large model such as DeepSeek-R1 requires more than one GPU node), modify the LLM model path and arguments as necessary, then run
kubectl apply -f docker/k8s-sglang-distributed-sts.yaml
to create a two-node k8s statefulset and serving service.
Method 6: Run on Kubernetes or Clouds with SkyPilot
To deploy on Kubernetes or 12+ clouds, you can use SkyPilot.
Install SkyPilot and set up Kubernetes cluster or cloud access: see SkyPilot’s documentation.
Deploy on your own infra with a single command and get the HTTP API endpoint:
SkyPilot YAML: sglang.yaml
# sglang.yaml
envs:
  HF_TOKEN: null
resources:
  image_id: docker:lmsysorg/sglang:latest
  accelerators: A100
  ports: 30000
run: |
  conda deactivate
  python3 -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 30000
# Deploy on any cloud or Kubernetes cluster. Use --cloud <cloud> to select a specific cloud provider.
HF_TOKEN=<secret> sky launch -c sglang --env HF_TOKEN sglang.yaml
# Get the HTTP API endpoint
sky status --endpoint 30000 sglang
To further scale up your deployment with autoscaling and failure recovery, check out the SkyServe + SGLang guide.
Common Notes
FlashInfer is the default attention kernel backend. It only supports sm75 and above. If you encounter any FlashInfer-related issues on sm75+ devices (e.g., T4, A10, A100, L4, L40S, H100), switch to other kernels by adding
--attention-backend triton --sampling-backend pytorch
and open an issue on GitHub.
If you only need to use OpenAI models with the frontend language, you can avoid installing other dependencies by using
pip install "sglang[openai]"
The language frontend operates independently of the backend runtime. You can install the frontend locally without needing a GPU, while the backend can be set up on a GPU-enabled machine. To install the frontend, run
pip install sglang
and for the backend, use
pip install "sglang[srt]"
Here srt is the abbreviation of SGLang runtime.
To reinstall flashinfer locally, use the following command:
pip install "flashinfer-python>=0.2.3" -i https://flashinfer.ai/whl/cu124/torch2.5 --force-reinstall --no-deps
and then delete the cache with
rm -rf ~/.cache/flashinfer
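To check whether the GPUs on a machine meet the sm75 requirement mentioned above, something like the following may help. It is a sketch under stated assumptions: is_sm75_or_newer is a hypothetical helper, and the compute_cap query field requires a reasonably recent NVIDIA driver.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when a compute capability such as "8.0"
# is 7.5 (sm75) or newer. Expects the "major.minor" form printed by nvidia-smi.
is_sm75_or_newer() {
  local major="${1%%.*}" minor="${1#*.}"
  [ "$major" -gt 7 ] || { [ "$major" -eq 7 ] && [ "$minor" -ge 5 ]; }
}

# Report each visible GPU; skipped entirely when nvidia-smi is not installed.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader |
  while IFS=, read -r name cap; do
    cap="${cap# }"  # drop the space that follows the comma in CSV output
    if is_sm75_or_newer "$cap"; then
      echo "$name (compute capability $cap): meets the sm75 requirement"
    else
      echo "$name (compute capability $cap): below sm75, FlashInfer is not supported"
    fi
  done
fi
```

On devices that meet the requirement but still hit FlashInfer issues, the --attention-backend triton --sampling-backend pytorch flags described above remain the fallback.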