Apply SGLang on NVIDIA Jetson Orin#
Prerequisites#
Before starting, ensure the following:
NVIDIA Jetson AGX Orin Devkit is set up with JetPack 6.1 or later.
CUDA Toolkit and cuDNN are installed.
Ensure the Jetson AGX Orin is running in high-performance (MAXN) mode; switch to it with:
sudo nvpmodel -m 0
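To check the active mode without changing it, nvpmodel can also be queried (on the AGX Orin Devkit, MAXN corresponds to mode 0):
sudo nvpmodel -q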
Installing and running SGLang with Jetson Containers#
Clone the jetson-containers GitHub repository:
git clone https://github.com/dusty-nv/jetson-containers.git
Run the installation script:
bash jetson-containers/install.sh
Build the container:
CUDA_VERSION=12.6 jetson-containers build sglang
Run the container (replace IMAGE_NAME with the image tag printed at the end of the build):
docker run --runtime nvidia -it --rm --network=host IMAGE_NAME
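Alternatively, jetson-containers provides a run helper that wraps docker run with the flags above and, together with autotag, selects a compatible image automatically (this assumes the sglang image was built in the previous step):
jetson-containers run $(autotag sglang)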
Running Inference#
Launch the server:
python -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
--device cuda \
--dtype half \
--attention-backend flashinfer \
--mem-fraction-static 0.8 \
--context-length 8192
The half-precision weights and reduced context length (--dtype half --context-length 8192) are chosen to fit the limited memory and compute of the NVIDIA Jetson devkit. A detailed explanation of these flags can be found in Server Arguments.
After launching the server, refer to Chat Completions to test that it is serving requests.
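For a quick smoke test, the OpenAI-compatible endpoint can also be queried directly (SGLang listens on port 30000 by default; adjust the host and port if you passed different values, and note the prompt here is just an example):
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "max_tokens": 64
  }'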
Running quantization with TorchAO#
TorchAO quantization is recommended on the NVIDIA Jetson Orin to further reduce memory usage.
python -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--device cuda \
--dtype bfloat16 \
--attention-backend flashinfer \
--mem-fraction-static 0.8 \
--context-length 8192 \
--torchao-config int4wo-128
This enables TorchAO's int4 weight-only quantization with a group size of 128. Like the half-precision setting above, --torchao-config int4wo-128 is chosen for memory efficiency.
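If int4 quantization degrades output quality too much, TorchAO's int8 weight-only mode may be worth trying instead (assuming the installed SGLang build supports it; see Server Arguments for the accepted values):
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
  --device cuda \
  --dtype bfloat16 \
  --attention-backend flashinfer \
  --mem-fraction-static 0.8 \
  --context-length 8192 \
  --torchao-config int8wo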
Structured output with XGrammar#
Please refer to the SGLang documentation on structured outputs.
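As a minimal sketch (assuming the default XGrammar backend and the OpenAI-compatible API; the linked documentation is authoritative, and the city_info schema here is purely illustrative), a JSON schema can be enforced through the response_format field of a chat completion request:
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Give the name and population of Paris as JSON."}],
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "city_info",
        "schema": {
          "type": "object",
          "properties": {
            "name": {"type": "string"},
            "population": {"type": "integer"}
          },
          "required": ["name", "population"]
        }
      }
    }
  }'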
Thanks to the support from shahizat.