Large Language Models#
These models accept text input and produce text output (e.g., chat completions). They are primarily large language models (LLMs), some with mixture-of-experts (MoE) architectures for scaling.
Example Launch Command#
# --model-path accepts a Hugging Face identifier or a local path
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.2-1B-Instruct \
  --host 0.0.0.0 \
  --port 30000
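Once the server is running, it exposes an OpenAI-compatible HTTP API. Below is a minimal sketch of a chat completion request with curl, assuming the host, port, and model from the launch command above:

# Send a chat completion request to the server started above.
curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.2-1B-Instruct",
        "messages": [{"role": "user", "content": "What is a mixture-of-experts model?"}],
        "max_tokens": 128
      }'

The response is a JSON object; the generated text is in choices[0].message.content.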
Supported Models#
The supported models are summarized in the table below.
If you are unsure whether a specific architecture is implemented, you can search for it on GitHub. For example, to search for Qwen3ForCausalLM, enter the expression

repo:sgl-project/sglang path:/^python\/sglang\/srt\/models\// Qwen3ForCausalLM

in the GitHub search bar.
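If you are not sure which architecture name a checkpoint declares, you can read it from the model's config.json before searching. A minimal sketch with curl (the Qwen/Qwen3-0.6B identifier is just an illustrative example):

# Print the architectures field from a model's config.json on Hugging Face.
curl -s https://huggingface.co/Qwen/Qwen3-0.6B/raw/main/config.json \
  | python3 -c "import json, sys; print(json.load(sys.stdin)['architectures'])"

This prints something like ['Qwen3ForCausalLM'], which is the string to search for.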
Model Family (Variants) | Example HuggingFace Identifier | Description
---|---|---
DeepSeek (v1, v2, v3/R1) | | Series of advanced reasoning-optimized models (including a 671B MoE) trained with reinforcement learning; top performance on complex reasoning, math, and code tasks. SGLang provides DeepSeek v3/R1 model-specific optimizations and a reasoning parser.
Qwen (3, 3MoE, 2.5, 2 series) | | Alibaba’s latest Qwen3 series for complex reasoning, language understanding, and generation tasks; supports MoE variants along with the previous-generation 2.5 and 2 series. SGLang provides a Qwen3-specific reasoning parser.
Llama (2, 3.x, 4 series) | | Meta’s open LLM series, spanning 7B to 400B parameters (Llama 2, 3, and the new Llama 4) with well-recognized performance. SGLang provides Llama-4 model-specific optimizations.
Mistral (Mixtral, NeMo, Small3) | | Open 7B LLM by Mistral AI with strong performance; extended into MoE (“Mixtral”) and NeMo Megatron variants for larger scale.
Gemma (v1, v2, v3) | | Google’s family of efficient multilingual models (1B–27B); Gemma 3 offers a 128K context window, and its larger (4B+) variants support vision input.
Phi (Phi-1.5, Phi-2, Phi-3, Phi-4, Phi-MoE series) | | Microsoft’s Phi family of small models (1.3B–5.6B); Phi-4-multimodal (5.6B) processes text, images, and speech; Phi-4-mini is a high-accuracy text model; and Phi-3.5-MoE is a mixture-of-experts model.
MiniCPM (v3, 4B) | | OpenBMB’s series of compact LLMs for edge devices; MiniCPM 3 (4B) achieves GPT-3.5-level results in text tasks.
OLMoE (Open MoE) | | Allen AI’s open Mixture-of-Experts model (7B total, 1B active parameters) delivering state-of-the-art results with sparse expert activation.
StableLM (3B, 7B) | | Stability AI’s early open-source LLM (3B and 7B) for general text generation; a demonstration model with basic instruction-following ability.
Command-R (Cohere) | | Cohere’s open conversational LLM (Command series) optimized for long context, retrieval-augmented generation, and tool use.
DBRX (Databricks) | | Databricks’ 132B-parameter MoE model (36B active) trained on 12T tokens; competes with GPT-3.5 quality as a fully open foundation model.
Grok (xAI) | | xAI’s Grok-1 model, known for its vast size (314B parameters) and high quality; integrated into SGLang for high-performance inference.
ChatGLM (GLM-130B family) | | Zhipu AI’s bilingual chat model (6B) excelling at Chinese-English dialogue; fine-tuned for conversational quality and alignment.
InternLM 2 (7B, 20B) | | Next-gen InternLM (7B and 20B) from SenseTime, offering strong reasoning and ultra-long context support (up to 200K tokens).
ExaONE 3 (Korean-English) | | LG AI Research’s Korean-English model (7.8B) trained on 8T tokens; provides high-quality bilingual understanding and generation.
Baichuan 2 (7B, 13B) | | Baichuan AI’s second-generation Chinese-English LLM (7B/13B) with improved performance and an open commercial license.
XVERSE (MoE) | | Yuanxiang’s open MoE LLM (XVERSE-MoE-A36B: 255B total, 36B active) supporting ~40 languages; delivers 100B+ dense-level performance via expert routing.
SmolLM (135M–1.7B) | | Hugging Face’s ultra-small LLM series (135M–1.7B parameters) offering surprisingly strong results, enabling advanced AI on mobile/edge devices.
GLM-4 (Multilingual 9B) | | Zhipu’s GLM-4 series (up to 9B parameters): open multilingual models with support for 1M-token context and a multimodal variant (GLM-4V).
MiMo (7B series) | | Xiaomi’s reasoning-optimized model series that leverages multi-token prediction for faster inference.
ERNIE-4.5 (4.5, 4.5MoE series) | | Baidu’s ERNIE-4.5 series, consisting of MoE models with 47B and 3B active parameters (the largest totaling 424B parameters) as well as a 0.3B dense model.
Arcee AFM-4.5B | | Arcee’s foundation model series built for real-world reliability and edge deployments.
Persimmon (8B) | | Adept’s open 8B model with a 16K context window and fast inference; trained for broad usability and licensed under Apache 2.0.
Ling (16.8B–290B) | | InclusionAI’s open MoE models: Ling-Lite has 16.8B total / 2.75B active parameters, and Ling-Plus has 290B total / 28.8B active parameters. Both are designed for high performance on NLP and complex reasoning tasks.
Granite 3.0, 3.1 (IBM) | | IBM’s open dense foundation models optimized for reasoning, code, and business AI use cases; integrated with Red Hat and watsonx systems.
Granite 3.0 MoE (IBM) | | IBM’s Mixture-of-Experts models offering strong performance with cost-efficiency, with expert routing designed for enterprise deployment at scale.
Llama Nemotron Super (v1, v1.5, NVIDIA) | | The NVIDIA Nemotron family builds on the strongest open models in the ecosystem, enhancing them for greater accuracy, efficiency, and transparency using NVIDIA open synthetic datasets, advanced techniques, and tools; this enables practical, right-sized, high-performing AI agents.
Llama Nemotron Ultra (v1, NVIDIA) | | Like Llama Nemotron Super above, part of the NVIDIA Nemotron family: strong open models enhanced for greater accuracy, efficiency, and transparency using NVIDIA open synthetic datasets, advanced techniques, and tools.
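Several of the families above (DeepSeek, Qwen3 MoE, XVERSE, etc.) are MoE models that typically need tensor parallelism and, where noted, a reasoning parser at launch time. A minimal sketch, assuming a Qwen3 MoE checkpoint served on two GPUs; the model path and the qwen3 parser name are illustrative assumptions, so check the SGLang server arguments for your model:

# Serve a Qwen3 MoE model across 2 GPUs with the Qwen3 reasoning parser enabled.
# The checkpoint and parser name below are illustrative; adjust --tp to your GPU count.
python3 -m sglang.launch_server \
  --model-path Qwen/Qwen3-30B-A3B \
  --tp 2 \
  --reasoning-parser qwen3 \
  --host 0.0.0.0 \
  --port 30000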