Install SGLang#
You can install SGLang using any of the methods below.
Method 1: With pip#
pip install --upgrade pip
pip install "sglang[all]" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/
Note: Please check the FlashInfer installation doc to install the proper version according to your PyTorch and CUDA versions.
Method 2: From source#
# Use the last release branch
git clone -b v0.4.1.post6 https://github.com/sgl-project/sglang.git
cd sglang
pip install --upgrade pip
pip install -e "python[all]" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/
Note: Please check the FlashInfer installation doc to install the proper version according to your PyTorch and CUDA versions.
Note: To AMD ROCm system with Instinct/MI GPUs, do following instead:
# Use the last release branch
git clone -b v0.4.1.post6 https://github.com/sgl-project/sglang.git
cd sglang
pip install --upgrade pip
pip install -e "python[all_hip]"
Method 3: Using docker#
The docker images are available on Docker Hub as lmsysorg/sglang, built from Dockerfile.
Replace <secret>
below with your huggingface hub token.
docker run --gpus all \
--shm-size 32g \
-p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<secret>" \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000
Note: To AMD ROCm system with Instinct/MI GPUs, it is recommended to use docker/Dockerfile.rocm
to build images, example and usage as below:
docker build --build-arg SGL_BRANCH=v0.4.1.post6 -t v0.4.1.post6-rocm620 -f Dockerfile.rocm .
alias drun='docker run -it --rm --network=host --device=/dev/kfd --device=/dev/dri --ipc=host \
--shm-size 16G --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
-v $HOME/dockerx:/dockerx -v /data:/data'
drun -p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<secret>" \
v0.4.1.post6-rocm620 \
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000
# Till flashinfer backend available, --attention-backend triton --sampling-backend pytorch are set by default
drun v0.4.1.post6-rocm620 python3 -m sglang.bench_one_batch --batch-size 32 --input 1024 --output 128 --model amd/Meta-Llama-3.1-8B-Instruct-FP8-KV --tp 8 --quantization fp8
Method 4: Using docker compose#
More
This method is recommended if you plan to serve it as a service. A better approach is to use the k8s-sglang-service.yaml.
Copy the compose.yml to your local machine
Execute the command
docker compose up -d
in your terminal.
Method 5: Run on Kubernetes or Clouds with SkyPilot#
More
To deploy on Kubernetes or 12+ clouds, you can use SkyPilot.
Install SkyPilot and set up Kubernetes cluster or cloud access: see SkyPilot’s documentation.
Deploy on your own infra with a single command and get the HTTP API endpoint:
SkyPilot YAML: sglang.yaml
# sglang.yaml
envs:
HF_TOKEN: null
resources:
image_id: docker:lmsysorg/sglang:latest
accelerators: A100
ports: 30000
run: |
conda deactivate
python3 -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 30000
# Deploy on any cloud or Kubernetes cluster. Use --cloud <cloud> to select a specific cloud provider.
HF_TOKEN=<secret> sky launch -c sglang --env HF_TOKEN sglang.yaml
# Get the HTTP API endpoint
sky status --endpoint 30000 sglang
To further scale up your deployment with autoscaling and failure recovery, check out the SkyServe + SGLang guide.
Common Notes#
FlashInfer is the default attention kernel backend. It only supports sm75 and above. If you encounter any FlashInfer-related issues on sm75+ devices (e.g., T4, A10, A100, L4, L40S, H100), please switch to other kernels by adding
--attention-backend triton --sampling-backend pytorch
and open an issue on GitHub.If you only need to use OpenAI models with the frontend language, you can avoid installing other dependencies by using
pip install "sglang[openai]"
.The language frontend operates independently of the backend runtime. You can install the frontend locally without needing a GPU, while the backend can be set up on a GPU-enabled machine. To install the frontend, run
pip install sglang
, and for the backend, usepip install sglang[srt]
. This allows you to build SGLang programs locally and execute them by connecting to the remote backend.