# Vision Language Models
These models accept multi-modal inputs (e.g., images and text) and generate text output. They augment language models with visual encoders and require a specific chat template for handling vision prompts.
> **Important:** You must specify `--chat-template` for VLMs because the chat template bundled with the HuggingFace tokenizer only supports text. If you do not pass a vision model's `--chat-template`, the server falls back to HuggingFace's default text-only template and images will not be passed to the model.
## Example Launch Command
```bash
# --model-path accepts a HuggingFace repo ID or a local path;
# --chat-template is required for vision models (see the note above).
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.2-11B-Vision-Instruct \
  --chat-template llama_3_vision \
  --host 0.0.0.0 \
  --port 30000
```
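Once the server is up, you can send an image-plus-text request through its OpenAI-compatible `/v1/chat/completions` endpoint. The sketch below uses the `openai` Python client and assumes the server launched above is reachable at `localhost:30000`; the image URL is a placeholder, not a real asset.

```python
# Minimal client sketch for the server launched above (assumes localhost:30000).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {
                    "type": "image_url",
                    # Placeholder URL; replace with a real image URL or a base64 data URI.
                    "image_url": {"url": "https://example.com/sample.jpg"},
                },
            ],
        }
    ],
    max_tokens=64,
)
print(response.choices[0].message.content)
```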
## Support Matrix
| Model Family (Variants) | Example HuggingFace Identifier | Chat Template | Description |
|---|---|---|---|
| Qwen-VL (Qwen2 series) |  |  | Alibaba's vision-language extension of Qwen; for example, Qwen2.5-VL (7B and larger variants) can analyze and converse about image content. |
| DeepSeek-VL2 |  |  | Vision-language variant of DeepSeek (with a dedicated image processor), enabling advanced multimodal reasoning on image and text inputs. |
| Janus-Pro (1B, 7B) |  |  | DeepSeek's open-source multimodal model capable of both image understanding and generation. Janus-Pro uses a decoupled architecture with separate visual encoding paths, improving performance on both tasks. |
| MiniCPM-V / MiniCPM-o |  |  | MiniCPM-V (2.6, ~8B) supports image inputs, and MiniCPM-o adds audio/video; these multimodal LLMs are optimized for end-side deployment on mobile/edge devices. |
| Llama 3.2 Vision (11B) | `meta-llama/Llama-3.2-11B-Vision-Instruct` | `llama_3_vision` | Vision-enabled variant of Llama 3 (11B) that accepts image inputs for visual question answering and other multimodal tasks. |
| LLaVA (v1.5 & v1.6) |  |  | Open vision-chat models that add an image encoder to LLaMA/Vicuna (e.g., LLaMA-2 13B) for following multimodal instruction prompts. |
| LLaVA-NeXT (8B, 72B) |  |  | Improved LLaVA models (with an 8B Llama 3 version and a 72B version) offering enhanced visual instruction following and accuracy on multimodal benchmarks. |
| LLaVA-OneVision |  |  | Enhanced LLaVA variant with a Qwen backbone; supports multiple images (and even video frames) as inputs via an OpenAI Vision API-compatible format (see the request sketch after this table). |
| Gemma 3 (Multimodal) |  |  | Gemma 3's larger models (4B, 12B, 27B) accept images (each image encoded as 256 tokens) alongside text in a combined 128K-token context. |
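For models that accept several images per prompt (for example, LLaVA-OneVision), the same OpenAI-compatible format lets you pass multiple `image_url` entries in a single message. The sketch below assumes such a server is already running on port 30000; the model identifier and image URLs are illustrative placeholders rather than values taken from the table above.

```python
# Multi-image request sketch (assumes a multi-image-capable server on localhost:30000;
# the model identifier and image URLs below are illustrative placeholders).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="lmms-lab/llava-onevision-qwen2-7b-ov",  # assumed identifier; use your server's model
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/frame_1.jpg"}},
                {"type": "image_url", "image_url": {"url": "https://example.com/frame_2.jpg"}},
                {"type": "text", "text": "What changed between these two frames?"},
            ],
        }
    ],
    max_tokens=128,
)
print(response.choices[0].message.content)
```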