Large Language Models#
These models accept text input and produce text output (e.g., chat completions). They are primarily large language models (LLMs), some with mixture-of-experts (MoE) architectures for scaling.
Example Launch Command#
# --model-path accepts a HuggingFace identifier or a local checkpoint path.
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.2-1B-Instruct \
  --host 0.0.0.0 \
  --port 30000
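Once the server is up, it exposes an OpenAI-compatible API on the same port. The snippet below is a minimal sketch of a chat completion request against the server launched above; the host, port, and model name mirror that command and should be adjusted to your deployment.

# Query the server's OpenAI-compatible chat completions endpoint.
# The "model" field mirrors the --model-path used at launch.
curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.2-1B-Instruct",
        "messages": [{"role": "user", "content": "List three uses of LLMs."}],
        "max_tokens": 128,
        "temperature": 0
      }'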
Supporting Matrix#
| Model Family (Variants) | Example HuggingFace Identifier | Description |
|---|---|---|
| DeepSeek (v1, v2, v3/R1) | | Series of advanced reasoning-optimized models (including a 671B MoE) trained with reinforcement learning; top performance on complex reasoning, math, and code tasks. SGLang provides DeepSeek v3/R1 model-specific optimizations. |
| Qwen (2, 2.5 series, MoE) | | Alibaba's Qwen model family (7B to 72B) with SOTA performance; the Qwen2.5 series improves multilingual capability and includes base, instruct, MoE, and code-tuned variants. |
| Llama (2, 3.x, 4 series) | | Meta's open LLM series, spanning 7B to 400B parameters (Llama 2, 3, and the new Llama 4) with well-recognized performance. SGLang provides Llama-4 model-specific optimizations. |
| Mistral (Mixtral, NeMo, Small3) | | Open 7B LLM by Mistral AI with strong performance; extended into MoE ("Mixtral") and NeMo Megatron variants for larger scale. |
| Gemma (v1, v2, v3) | | Google's family of efficient multilingual models (1B–27B); Gemma 3 offers a 128K context window, and its larger (4B+) variants support vision input. |
| Phi (Phi-3, Phi-4 series) | | Microsoft's Phi family of small models (1.3B–5.6B); Phi-4-mini is a high-accuracy text model, and Phi-4-multimodal (5.6B) processes text, images, and speech in one compact model. |
| MiniCPM (v3, 4B) | | OpenBMB's series of compact LLMs for edge devices; MiniCPM 3 (4B) achieves GPT-3.5-level results on text tasks. |
| OLMoE (Open MoE) | | Allen AI's open Mixture-of-Experts model (7B total, 1B active parameters) delivering state-of-the-art results with sparse expert activation. |
| StableLM (3B, 7B) | | Stability AI's early open-source LLM (3B and 7B) for general text generation; a demonstration model with basic instruction-following ability. |
| Command-R (Cohere) | | Cohere's open conversational LLM (Command series) optimized for long context, retrieval-augmented generation, and tool use. |
| DBRX (Databricks) | | Databricks' 132B-parameter MoE model (36B active) trained on 12T tokens; competes with GPT-3.5 quality as a fully open foundation model. |
| Grok (xAI) | | xAI's Grok-1 model, known for its vast size (314B parameters) and high quality; integrated into SGLang for high-performance inference. |
| ChatGLM (GLM-130B family) | | Zhipu AI's bilingual chat model (6B) excelling at Chinese-English dialogue; fine-tuned for conversational quality and alignment. |
| InternLM 2 (7B, 20B) | | Next-generation InternLM (7B and 20B) from SenseTime, offering strong reasoning and ultra-long context support (up to 200K tokens). |
| ExaONE 3 (Korean-English) | | LG AI Research's Korean-English model (7.8B) trained on 8T tokens; provides high-quality bilingual understanding and generation. |
| Baichuan 2 (7B, 13B) | | Baichuan AI's second-generation Chinese-English LLM (7B/13B) with improved performance and an open commercial license. |
| XVERSE (MoE) | | Yuanxiang's open MoE LLM (XVERSE-MoE-A36B: 255B total, 36B active) supporting ~40 languages; delivers 100B+ dense-level performance via expert routing. |
| SmolLM (135M–1.7B) | | Hugging Face's ultra-small LLM series (135M–1.7B parameters) offering surprisingly strong results, enabling advanced AI on mobile/edge devices. |
| GLM-4 (Multilingual 9B) | | Zhipu's GLM-4 series (up to 9B parameters): open multilingual models with support for 1M-token context and a multimodal variant (GLM-4V). |