OpenAI APIs - Vision#
SGLang provides OpenAI-compatible APIs to enable a smooth transition from OpenAI services to self-hosted local models. A complete reference for the API is available in the OpenAI API Reference. This tutorial covers the vision APIs for vision language models.
SGLang supports vision language models such as Llama 3.2, LLaVA-OneVision, Qwen2.5-VL, Gemma 3, and more.
As an alternative to the OpenAI API, you can also use the SGLang offline engine.
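For reference, below is a minimal offline-engine sketch. It is illustrative only: the prompt format (including the model-specific image placeholder tokens) and the exact generate parameters may differ across SGLang versions, so consult the offline engine documentation.

import sglang as sgl

# Load the model in-process instead of launching an HTTP server.
llm = sgl.Engine(model_path="Qwen/Qwen2.5-VL-7B-Instruct")

# NOTE: most VLMs expect the prompt to follow their chat template, including
# the model's image placeholder tokens (shown here for Qwen2.5-VL).
prompt = (
    "<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>"
    "Describe this image.<|im_end|>\n<|im_start|>assistant\n"
)
output = llm.generate(
    prompt=prompt,
    image_data="https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true",
    sampling_params={"max_new_tokens": 128},
)
print(output["text"])

llm.shutdown()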
Launch A Server#
Launch the server in your terminal and wait for it to initialize.
[1]:
from sglang.test.test_utils import is_in_ci

if is_in_ci():
    from patch import launch_server_cmd
else:
    from sglang.utils import launch_server_cmd

from sglang.utils import wait_for_server, print_highlight, terminate_process

vision_process, port = launch_server_cmd(
    """
python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-7B-Instruct
"""
)

wait_for_server(f"http://localhost:{port}")
[2025-06-15 07:31:07] server_args=ServerArgs(model_path='Qwen/Qwen2.5-VL-7B-Instruct', tokenizer_path='Qwen/Qwen2.5-VL-7B-Instruct', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, quantization_param_path=None, context_length=None, device='cuda', served_model_name='Qwen/Qwen2.5-VL-7B-Instruct', chat_template=None, completion_template=None, is_embedding=False, enable_multimodal=None, revision=None, impl='auto', host='127.0.0.1', port=35438, mem_fraction_static=0.874, max_running_requests=200, max_total_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, tp_size=1, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=719677801, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, show_time_cost=False, enable_metrics=False, bucket_time_to_first_token=None, bucket_e2e_request_latency=None, bucket_inter_token_latency=None, collect_tokens_histogram=False, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, api_key=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, speculative_algorithm=None, speculative_draft_model_path=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, ep_size=1, enable_ep_moe=False, enable_deepep_moe=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, cuda_graph_max_bs=None, cuda_graph_bs=None, disable_cuda_graph=True, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_nccl_nvls=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, disable_overlap_schedule=False, disable_overlap_cg_plan=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_torch_compile=False, torch_compile_max_bs=32, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, 
enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through_selective', flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, enable_return_hidden_states=False, warmups=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, debug_tensor_dump_prefill_only=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, num_reserved_decode_tokens=512, pdlb_url=None)
[2025-06-15 07:31:11] You have video processor config saved in `preprocessor.json` file which is deprecated. Video processor configs should be saved in their own `video_preprocessor.json` file. You can rename the file or load and save the processor back which renames it automatically. Loading from `preprocessor.json` will be removed in v5.0.
[2025-06-15 07:31:12] Infer the chat template name from the model path and obtain the result: qwen2-vl.
[2025-06-15 07:31:19] Attention backend not set. Use flashinfer backend by default.
[2025-06-15 07:31:19] Automatically reduce --mem-fraction-static to 0.787 because this is a multimodal model.
[2025-06-15 07:31:19] Init torch distributed begin.
[2025-06-15 07:31:19] Init torch distributed ends. mem usage=0.00 GB
[2025-06-15 07:31:19] Load weight begin. avail mem=46.68 GB
[2025-06-15 07:31:19] Multimodal attention backend not set. Use sdpa.
[2025-06-15 07:31:19] Using sdpa as multimodal attention backend.
[2025-06-15 07:31:19] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 20% Completed | 1/5 [00:00<00:03, 1.05it/s]
Loading safetensors checkpoint shards: 40% Completed | 2/5 [00:01<00:02, 1.17it/s]
Loading safetensors checkpoint shards: 60% Completed | 3/5 [00:02<00:01, 1.25it/s]
Loading safetensors checkpoint shards: 80% Completed | 4/5 [00:03<00:00, 1.31it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:03<00:00, 1.72it/s]
[2025-06-15 07:31:23] Load weight end. type=Qwen2_5_VLForConditionalGeneration, dtype=torch.bfloat16, avail mem=30.88 GB, mem usage=15.80 GB.
[2025-06-15 07:31:23] KV Cache is allocated. #tokens: 20480, K size: 0.55 GB, V size: 0.55 GB
[2025-06-15 07:31:23] Memory pool end. avail mem=29.51 GB
[2025-06-15 07:31:25] max_total_num_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=200, context_len=128000, available_gpu_mem=28.94 GB
[2025-06-15 07:31:25] INFO: Started server process [2041647]
[2025-06-15 07:31:25] INFO: Waiting for application startup.
[2025-06-15 07:31:25] INFO: Application startup complete.
[2025-06-15 07:31:25] INFO: Uvicorn running on http://127.0.0.1:35438 (Press CTRL+C to quit)
[2025-06-15 07:31:25] INFO: 127.0.0.1:33816 - "GET /v1/models HTTP/1.1" 200 OK
[2025-06-15 07:31:26] INFO: 127.0.0.1:33822 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-06-15 07:31:26] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-06-15 07:31:28] INFO: 127.0.0.1:33824 - "POST /generate HTTP/1.1" 200 OK
[2025-06-15 07:31:28] The server is fired up and ready to roll!
NOTE: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are displayed in the original black color, while the notebook outputs are highlighted in blue.
These notebooks run in a CI parallel environment, so the throughput is not representative of actual performance.
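As the logs above show, the server inferred the chat template name qwen2-vl from the model path. If the automatic inference picks the wrong template for your model, you can set it explicitly at launch with the --chat-template flag, for example:

python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-7B-Instruct --chat-template qwen2-vl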
Using cURL#
Once the server is up, you can send test requests using curl or Python's requests library.
[2]:
import subprocess

curl_command = f"""
curl -s http://localhost:{port}/v1/chat/completions \\
  -d '{{
    "model": "Qwen/Qwen2.5-VL-7B-Instruct",
    "messages": [
      {{
        "role": "user",
        "content": [
          {{
            "type": "text",
            "text": "What’s in this image?"
          }},
          {{
            "type": "image_url",
            "image_url": {{
              "url": "https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true"
            }}
          }}
        ]
      }}
    ],
    "max_tokens": 300
  }}'
"""

response = subprocess.check_output(curl_command, shell=True).decode()
print_highlight(response)
# Send the same request again; the second run reuses the prefix cache
# (note the #cached-token count in the server logs below).
response = subprocess.check_output(curl_command, shell=True).decode()
print_highlight(response)
[2025-06-15 07:31:31] Prefill batch. #new-seq: 1, #new-token: 307, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-06-15 07:31:32] Decode batch. #running-req: 1, #token: 340, token usage: 0.02, cuda graph: False, gen throughput (token/s): 5.72, #queue-req: 0
[2025-06-15 07:31:32] Decode batch. #running-req: 1, #token: 380, token usage: 0.02, cuda graph: False, gen throughput (token/s): 63.17, #queue-req: 0
[2025-06-15 07:31:33] Decode batch. #running-req: 1, #token: 420, token usage: 0.02, cuda graph: False, gen throughput (token/s): 63.47, #queue-req: 0
[2025-06-15 07:31:33] INFO: 127.0.0.1:59654 - "POST /v1/chat/completions HTTP/1.1" 200 OK
{"id":"5e2a6bcca54746c8a4bf2e9e9a989ff6","object":"chat.completion","created":1749972690,"model":"Qwen/Qwen2.5-VL-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The image shows a man standing behind the rear of a yellow taxi car that resembles a taxi, but there is a humorous twist—a makeshift \"stirrup\" connected to a pair of pants (presumably the man's own) is draped over the back of the car's rear bumper. The man appears to be balancing on this improvised device while trying to use it like a seesaw, with two poles extending downward. It’s a comedic setup indicating that something isn’t quite right—they are likely pretending to balance like a life-size toy. The surrounding urban street setting with taxis and buildings reinforces the city vibe, and there seems to be a playful or absurd undertone to the scene.","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":151645}],"usage":{"prompt_tokens":307,"total_tokens":445,"completion_tokens":138,"prompt_tokens_details":null}}
[2025-06-15 07:31:34] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 306, token usage: 0.01, #running-req: 0, #queue-req: 0
[2025-06-15 07:31:34] Decode batch. #running-req: 1, #token: 322, token usage: 0.02, cuda graph: False, gen throughput (token/s): 37.09, #queue-req: 0
[2025-06-15 07:31:35] Decode batch. #running-req: 1, #token: 362, token usage: 0.02, cuda graph: False, gen throughput (token/s): 63.80, #queue-req: 0
[2025-06-15 07:31:35] INFO: 127.0.0.1:59664 - "POST /v1/chat/completions HTTP/1.1" 200 OK
{"id":"8b5a0a0d692b44719c0cafb2d68a1ebe","object":"chat.completion","created":1749972693,"model":"Qwen/Qwen2.5-VL-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The image depicts a man performing laundry outdoors, ironing clothes on a small clothing steamer that rests directly on the trunk of a yellow vehicle, possibly a taxi, in a city street. The man appears focused on the task. There are other cars and a building in the background indicative of an urban setting.","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":151645}],"usage":{"prompt_tokens":307,"total_tokens":370,"completion_tokens":63,"prompt_tokens_details":null}}
Using Python Requests#
[3]:
import requests

url = f"http://localhost:{port}/v1/chat/completions"

data = {
    "model": "Qwen/Qwen2.5-VL-7B-Instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What’s in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true"
                    },
                },
            ],
        }
    ],
    "max_tokens": 300,
}

response = requests.post(url, json=data)
print_highlight(response.text)
[2025-06-15 07:31:35] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 306, token usage: 0.01, #running-req: 0, #queue-req: 0
[2025-06-15 07:31:36] Decode batch. #running-req: 1, #token: 339, token usage: 0.02, cuda graph: False, gen throughput (token/s): 37.82, #queue-req: 0
[2025-06-15 07:31:36] Decode batch. #running-req: 1, #token: 379, token usage: 0.02, cuda graph: False, gen throughput (token/s): 62.94, #queue-req: 0
[2025-06-15 07:31:37] INFO: 127.0.0.1:59670 - "POST /v1/chat/completions HTTP/1.1" 200 OK
{"id":"9f0d100bc15c4b4092ae5b851156a208","object":"chat.completion","created":1749972695,"model":"Qwen/Qwen2.5-VL-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"This image shows a person standing at the back of a taxicab on what appears to be a busy street. The individual, wearing a yellow shirt, is using the back window of the vehicle to act as a makeshift board for hanging and ironing laundry. The taxicab is yellow, typical of taxicabs in New York City. Other vehicles and a city street setting with buildings and traffic signs can be seen in the background.","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":151645}],"usage":{"prompt_tokens":307,"total_tokens":396,"completion_tokens":89,"prompt_tokens_details":null}}
Using OpenAI Python Client#
[4]:
from openai import OpenAI

client = OpenAI(base_url=f"http://localhost:{port}/v1", api_key="None")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-7B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What is in this image?",
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true"
                    },
                },
            ],
        }
    ],
    max_tokens=300,
)

print_highlight(response.choices[0].message.content)
[2025-06-15 07:31:37] Prefill batch. #new-seq: 1, #new-token: 292, #cached-token: 15, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-06-15 07:31:37] Decode batch. #running-req: 1, #token: 330, token usage: 0.02, cuda graph: False, gen throughput (token/s): 34.97, #queue-req: 0
[2025-06-15 07:31:38] Decode batch. #running-req: 1, #token: 370, token usage: 0.02, cuda graph: False, gen throughput (token/s): 64.34, #queue-req: 0
[2025-06-15 07:31:39] Decode batch. #running-req: 1, #token: 410, token usage: 0.02, cuda graph: False, gen throughput (token/s): 63.81, #queue-req: 0
[2025-06-15 07:31:39] INFO: 127.0.0.1:59678 - "POST /v1/chat/completions HTTP/1.1" 200 OK
This image shows a scene on a city street, likely a busy roadway with taxis in the background. At the center, a man dressed in a yellow shirt and jeans appears to be using a tabletop ironing board, holding a shirt, while leaning out of the open trunk of a taxi parked on the street. The situation looks unusual and somewhat amusing, suggesting the man may be multitasking or trying to get something ready temporarily during a busy day on the road, potentially involving a cab driver hooking up with a clothing service for ironing or folding.
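The chat completions endpoint also supports streaming, which returns tokens incrementally as they are generated. A minimal sketch using the same client; the chunk fields follow the OpenAI streaming schema:

from openai import OpenAI

client = OpenAI(base_url=f"http://localhost:{port}/v1", api_key="None")

# Stream the response token-by-token instead of waiting for the full completion.
stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-7B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image briefly."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true"
                    },
                },
            ],
        }
    ],
    max_tokens=300,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)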
Multiple-Image Inputs#
The server also supports multiple images and interleaved text and images, provided the underlying model supports them.
[5]:
from openai import OpenAI

client = OpenAI(base_url=f"http://localhost:{port}/v1", api_key="None")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-7B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true",
                    },
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://raw.githubusercontent.com/sgl-project/sglang/main/assets/logo.png",
                    },
                },
                {
                    "type": "text",
                    "text": "I have two very different images. They are not related at all. "
                    "Please describe the first image in one sentence, and then describe the second image in another sentence.",
                },
            ],
        }
    ],
    temperature=0,
)

print_highlight(response.choices[0].message.content)
[2025-06-15 07:31:40] Prefill batch. #new-seq: 1, #new-token: 2532, #cached-token: 14, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-06-15 07:31:41] Decode batch. #running-req: 1, #token: 2578, token usage: 0.13, cuda graph: False, gen throughput (token/s): 16.90, #queue-req: 0
[2025-06-15 07:31:41] INFO: 127.0.0.1:56026 - "POST /v1/chat/completions HTTP/1.1" 200 OK
The first image shows a man ironing clothes on the back of a taxi in a busy urban street. The second image is a stylized logo featuring the letters "SGL" with a book and a computer icon incorporated into the design.
[6]:
terminate_process(vision_process)
[2025-06-15 07:31:41] Child process unexpectedly failed with exitcode=9. pid=2042226