SGLang Documentation
SGLang is a fast serving framework for large language models and vision language models. It makes your interaction with models faster and more controllable by co-designing the backend runtime and frontend language. The core features include:
- Fast Backend Runtime: Provides efficient serving with RadixAttention for prefix caching, jump-forward constrained decoding, overhead-free CPU scheduler, continuous batching, token attention (paged attention), tensor parallelism, FlashInfer kernels, chunked prefill, and quantization (FP8/INT4/AWQ/GPTQ). See the serving sketch after this list.
- Flexible Frontend Language: Offers an intuitive interface for programming LLM applications, including chained generation calls, advanced prompting, control flow, multi-modal inputs, parallelism, and external interactions. A short program sketch also follows this list.
- Extensive Model Support: Supports a wide range of generative models (Llama, Gemma, Mistral, Qwen, DeepSeek, LLaVA, etc.), embedding models (e5-mistral, gte), and reward models (Skywork), with easy extensibility for integrating new models.
- Active Community: SGLang is open-source and backed by an active community with industry adoption.
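To make the backend runtime concrete, here is a minimal serving sketch. It assumes an SGLang server has already been launched locally; the model path and port below are illustrative. Since the server exposes an OpenAI-compatible API, the standard openai client works:

```python
# Sketch: query a locally running SGLang server via its OpenAI-compatible API.
# Assumes the server was started first, e.g. (model path and port are illustrative):
#   python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --port 30000
import openai

client = openai.Client(base_url="http://localhost:30000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain prefix caching in one sentence."}],
    temperature=0.0,
    max_tokens=64,
)
print(response.choices[0].message.content)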
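And a minimal sketch of the frontend language: a small program that chains two generation calls against the same local endpoint. The function name, endpoint URL, and prompt text are illustrative, not canonical:

```python
# Sketch: chain two generation calls with the SGLang frontend language.
# Assumes an SGLang server is already running at http://localhost:30000.
import sglang as sgl

@sgl.function
def two_step(s, topic):
    # Each sgl.gen call appends model output to the running state `s`
    # and stores it under the given name for later retrieval.
    s += "Name one key idea behind " + topic + ". "
    s += sgl.gen("idea", max_tokens=32)
    s += "Explain it in one sentence: "
    s += sgl.gen("explanation", max_tokens=64)

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
state = two_step.run(topic="RadixAttention")
print(state["idea"])
print(state["explanation"])
```

Because the second call extends the same state, the runtime can reuse the cached KV prefix from the first call, which is the kind of frontend/backend co-design the feature list describes.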
Getting Started
Backend Tutorial
Frontend Tutorial
SGLang Router
References
- Supported Models
- Sampling Parameters in SGLang Runtime
- Guide on Hyperparameter Tuning
- Benchmark and Profiling
- Measuring Model Accuracy in SGLang
- Custom Chat Template in SGLang Runtime
- DeepSeek Model Optimizations
- Run Llama 3.1 405B
- Use Models From ModelScope
- Contribution Guide
- Troubleshooting
- Apply SGLang on NVIDIA Jetson Orin
- Frequently Asked Questions
- Learn more