OpenAI APIs - Vision#

SGLang provides OpenAI-compatible APIs to enable a smooth transition from OpenAI services to self-hosted local models. A complete reference for the API is available in the OpenAI API Reference. This tutorial covers the vision APIs for vision language models.

SGLang supports various vision language models such as Llama 3.2, LLaVA-OneVision, Qwen2.5-VL, Gemma3 and more.

As an alternative to the OpenAI API, you can also use the SGLang offline engine.

Launch A Server#

Launch the server in your terminal and wait for it to initialize.

[1]:

from sglang.test.test_utils import is_in_ci

if is_in_ci():
    from patch import launch_server_cmd
else:
    from sglang.utils import launch_server_cmd

from sglang.utils import wait_for_server, print_highlight, terminate_process

vision_process, port = launch_server_cmd(
    """
python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-7B-Instruct
"""
)

wait_for_server(f"http://localhost:{port}")

[2025-07-05 11:08:26] server_args=ServerArgs(model_path='Qwen/Qwen2.5-VL-7B-Instruct', tokenizer_path='Qwen/Qwen2.5-VL-7B-Instruct', tokenizer_mode='auto', skip_tokenizer_init=False, skip_server_warmup=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, quantization_param_path=None, context_length=None, device='cuda', served_model_name='Qwen/Qwen2.5-VL-7B-Instruct', chat_template=None, completion_template=None, is_embedding=False, enable_multimodal=None, revision=None, hybrid_kvcache_ratio=None, impl='auto', host='127.0.0.1', port=35079, nccl_port=None, mem_fraction_static=0.7866, max_running_requests=200, max_total_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, tp_size=1, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=83603994, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, crash_dump_folder=None, show_time_cost=False, enable_metrics=False, bucket_time_to_first_token=None, bucket_e2e_request_latency=None, bucket_inter_token_latency=None, collect_tokens_histogram=False, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, api_key=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, speculative_algorithm=None, speculative_draft_model_path=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, ep_size=1, enable_ep_moe=False, enable_deepep_moe=False, enable_flashinfer_moe=False, enable_flashinfer_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, cuda_graph_max_bs=None, cuda_graph_bs=None, disable_cuda_graph=True, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_nccl_nvls=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, disable_overlap_schedule=False, disable_overlap_cg_plan=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_torch_compile=False, torch_compile_max_bs=32, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through_selective', flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, enable_return_hidden_states=False, warmups=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, debug_tensor_dump_prefill_only=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, num_reserved_decode_tokens=512, pdlb_url=None, custom_weight_loader=[], weight_loader_disable_mmap=False)
You have video processor config saved in `preprocessor.json` file which is deprecated. Video processor configs should be saved in their own `video_preprocessor.json` file. You can rename the file or load and save the processor back which renames it automatically. Loading from `preprocessor.json` will be removed in v5.0.
[2025-07-05 11:08:28] You have video processor config saved in `preprocessor.json` file which is deprecated. Video processor configs should be saved in their own `video_preprocessor.json` file. You can rename the file or load and save the processor back which renames it automatically. Loading from `preprocessor.json` will be removed in v5.0.
[2025-07-05 11:08:28] Inferred chat template from model path: qwen2-vl
You have video processor config saved in `preprocessor.json` file which is deprecated. Video processor configs should be saved in their own `video_preprocessor.json` file. You can rename the file or load and save the processor back which renames it automatically. Loading from `preprocessor.json` will be removed in v5.0.
[2025-07-05 11:08:37] You have video processor config saved in `preprocessor.json` file which is deprecated. Video processor configs should be saved in their own `video_preprocessor.json` file. You can rename the file or load and save the processor back which renames it automatically. Loading from `preprocessor.json` will be removed in v5.0.
[2025-07-05 11:08:38] Attention backend not set. Use flashinfer backend by default.
[2025-07-05 11:08:38] Init torch distributed begin.
[2025-07-05 11:08:38] Init torch distributed ends. mem usage=0.00 GB
[2025-07-05 11:08:39] Load weight begin. avail mem=60.49 GB
[2025-07-05 11:08:39] Multimodal attention backend not set. Use sdpa.
[2025-07-05 11:08:39] Using sdpa as multimodal attention backend.
[2025-07-05 11:08:41] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  20% Completed | 1/5 [00:00<00:02,  1.70it/s]
Loading safetensors checkpoint shards:  40% Completed | 2/5 [00:01<00:01,  1.54it/s]
Loading safetensors checkpoint shards:  60% Completed | 3/5 [00:01<00:01,  1.50it/s]
Loading safetensors checkpoint shards:  80% Completed | 4/5 [00:02<00:00,  1.48it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:02<00:00,  1.92it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:02<00:00,  1.72it/s]

[2025-07-05 11:08:44] Load weight end. type=Qwen2_5_VLForConditionalGeneration, dtype=torch.bfloat16, avail mem=44.79 GB, mem usage=15.71 GB.
[2025-07-05 11:08:44] KV Cache is allocated. #tokens: 20480, K size: 0.55 GB, V size: 0.55 GB
[2025-07-05 11:08:44] Memory pool end. avail mem=43.42 GB
[2025-07-05 11:08:45] max_total_num_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=200, context_len=128000, available_gpu_mem=42.85 GB
[2025-07-05 11:08:46] INFO:     Started server process [2528481]
[2025-07-05 11:08:46] INFO:     Waiting for application startup.
[2025-07-05 11:08:46] INFO:     Application startup complete.
[2025-07-05 11:08:46] INFO:     Uvicorn running on http://127.0.0.1:35079 (Press CTRL+C to quit)
[2025-07-05 11:08:47] INFO:     127.0.0.1:37616 - "GET /v1/models HTTP/1.1" 200 OK
[2025-07-05 11:08:47] INFO:     127.0.0.1:37620 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-07-05 11:08:47] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, #token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-07-05 11:08:48] INFO:     127.0.0.1:37632 - "POST /generate HTTP/1.1" 200 OK
[2025-07-05 11:08:48] The server is fired up and ready to roll!

NOTE: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are displayed in the original black color, while the notebook outputs are highlighted in blue.
We are running those notebooks in a CI parallel environment, so the throughput is not representative of the actual performance.

Using cURL#

Once the server is up, you can send test requests using curl or requests.

[2]:

import subprocess

curl_command = f"""
curl -s http://localhost:{port}/v1/chat/completions \\
  -H "Content-Type: application/json" \\
  -d '{{
    "model": "Qwen/Qwen2.5-VL-7B-Instruct",
    "messages": [
      {{
        "role": "user",
        "content": [
          {{
            "type": "text",
            "text": "What’s in this image?"
          }},
          {{
            "type": "image_url",
            "image_url": {{
              "url": "https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true"
            }}
          }}
        ]
      }}
    ],
    "max_tokens": 300
  }}'
"""

response = subprocess.check_output(curl_command, shell=True).decode()
print_highlight(response)


response = subprocess.check_output(curl_command, shell=True).decode()
print_highlight(response)

[2025-07-05 11:08:53] Prefill batch. #new-seq: 1, #new-token: 307, #cached-token: 0, #token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-07-05 11:08:54] Decode batch. #running-req: 1, #token: 340, token usage: 0.02, cuda graph: False, gen throughput (token/s): 4.89, #queue-req: 0
[2025-07-05 11:08:54] Decode batch. #running-req: 1, #token: 380, token usage: 0.02, cuda graph: False, gen throughput (token/s): 58.91, #queue-req: 0
[2025-07-05 11:08:55] INFO:     127.0.0.1:37638 - "POST /v1/chat/completions HTTP/1.1" 200 OK

{"id":"084f05ea51e347b78f8ab284eded70d5","object":"chat.completion","created":1751713735,"model":"Qwen/Qwen2.5-VL-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The image depicts a man outdoors, seemingly in a busy urban street scene, leaning backward into the back of a yellow taxi. He's holding a portable clothing iron, and an ironing board held open vertically appears to be attached between the taxi and the sidewalk. He is adjusting blue clothing items hanging vertically on the board. The surrounding area suggests a cityscape, with taxis, trees, and storefronts in the background.","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":151645}],"usage":{"prompt_tokens":307,"total_tokens":392,"completion_tokens":85,"prompt_tokens_details":null}}

[2025-07-05 11:08:55] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 306, #token: 306, token usage: 0.01, #running-req: 0, #queue-req: 0
[2025-07-05 11:08:55] Decode batch. #running-req: 1, #token: 335, token usage: 0.02, cuda graph: False, gen throughput (token/s): 36.49, #queue-req: 0
[2025-07-05 11:08:56] Decode batch. #running-req: 1, #token: 375, token usage: 0.02, cuda graph: False, gen throughput (token/s): 57.84, #queue-req: 0
[2025-07-05 11:08:57] Decode batch. #running-req: 1, #token: 415, token usage: 0.02, cuda graph: False, gen throughput (token/s): 58.92, #queue-req: 0
[2025-07-05 11:08:57] INFO:     127.0.0.1:37652 - "POST /v1/chat/completions HTTP/1.1" 200 OK

{"id":"abeb3cecd6074514a940fe51d8a7b14c","object":"chat.completion","created":1751713737,"model":"Qwen/Qwen2.5-VL-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The image depicts a yellow taxi parked on a city street, with a man leaning on the rear of the vehicle. The man appears to be attempting to iron some clothes, which are clipped onto an improvised ironing board stretched across two stands. The stands are positioned such that their frames connect to the car, and the man seems to be holding an iron while interacting with the clothes. The setting is urban, with storefronts, other taxis, and a visible \"Mint\" taxi advertisement in the background, suggesting a bustling city environment, possibly in New York due to the style of the taxis and advertisements.","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":151645}],"usage":{"prompt_tokens":307,"total_tokens":429,"completion_tokens":122,"prompt_tokens_details":null}}

Using Python Requests#

[3]:

import requests

url = f"http://localhost:{port}/v1/chat/completions"

data = {
    "model": "Qwen/Qwen2.5-VL-7B-Instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What’s in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true"
                    },
                },
            ],
        }
    ],
    "max_tokens": 300,
}

response = requests.post(url, json=data)
print_highlight(response.text)

[2025-07-05 11:08:57] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 306, #token: 306, token usage: 0.01, #running-req: 0, #queue-req: 0
[2025-07-05 11:08:58] Decode batch. #running-req: 1, #token: 333, token usage: 0.02, cuda graph: False, gen throughput (token/s): 42.93, #queue-req: 0
[2025-07-05 11:08:58] Decode batch. #running-req: 1, #token: 373, token usage: 0.02, cuda graph: False, gen throughput (token/s): 57.90, #queue-req: 0
[2025-07-05 11:08:59] INFO:     127.0.0.1:58650 - "POST /v1/chat/completions HTTP/1.1" 200 OK

{"id":"0e79b25499d643d8908b52db036df5d2","object":"chat.completion","created":1751713739,"model":"Qwen/Qwen2.5-VL-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The image depicts a person, dressed in a bright yellow shirt, ironing clothes on a makeshift rack. The rack appears to be attached externally to the back of a yellow taxi parked on a city street. The man is using an iron to press the clothes, ensuring their smoothness. The taxi and another taxi in the background are also parked next to what appears to be a storefront. The urban environment in the background includes tall buildings, trees, and street lamps.","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":151645}],"usage":{"prompt_tokens":307,"total_tokens":401,"completion_tokens":94,"prompt_tokens_details":null}}

Using OpenAI Python Client#

[4]:

from openai import OpenAI

client = OpenAI(base_url=f"http://localhost:{port}/v1", api_key="None")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-7B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What is in this image?",
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true"
                    },
                },
            ],
        }
    ],
    max_tokens=300,
)

print_highlight(response.choices[0].message.content)

[2025-07-05 11:08:59] Prefill batch. #new-seq: 1, #new-token: 292, #cached-token: 15, #token: 15, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-07-05 11:09:00] Decode batch. #running-req: 1, #token: 319, token usage: 0.02, cuda graph: False, gen throughput (token/s): 30.61, #queue-req: 0
[2025-07-05 11:09:00] Decode batch. #running-req: 1, #token: 359, token usage: 0.02, cuda graph: False, gen throughput (token/s): 56.50, #queue-req: 0
[2025-07-05 11:09:01] INFO:     127.0.0.1:58660 - "POST /v1/chat/completions HTTP/1.1" 200 OK

The image shows a man standing in the back of a taxi and hanging laundry to hang-dry them using a drying rack. The taxi, which is a yellow taxi cab, is parked in what appears to be a busy street environment, possibly in a city. There are other taxis visible in the background, suggesting it may be in a metropolitan area with a bustling taxi traffic. The man is actively ironing what seems to be garments or clothing.

Multiple-Image Inputs#

The server also supports multiple images and interleaved text and images if the model supports it.

[5]:

from openai import OpenAI

client = OpenAI(base_url=f"http://localhost:{port}/v1", api_key="None")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-7B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true",
                    },
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://raw.githubusercontent.com/sgl-project/sglang/main/assets/logo.png",
                    },
                },
                {
                    "type": "text",
                    "text": "I have two very different images. They are not related at all. "
                    "Please describe the first image in one sentence, and then describe the second image in another sentence.",
                },
            ],
        }
    ],
    temperature=0,
)

print_highlight(response.choices[0].message.content)

[2025-07-05 11:09:02] Prefill batch. #new-seq: 1, #new-token: 2532, #cached-token: 14, #token: 14, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-07-05 11:09:03] Decode batch. #running-req: 1, #token: 2548, token usage: 0.12, cuda graph: False, gen throughput (token/s): 17.81, #queue-req: 0
[2025-07-05 11:09:03] Decode batch. #running-req: 1, #token: 2588, token usage: 0.13, cuda graph: False, gen throughput (token/s): 57.88, #queue-req: 0
[2025-07-05 11:09:03] INFO:     127.0.0.1:58676 - "POST /v1/chat/completions HTTP/1.1" 200 OK

The first image shows a man ironing clothes on the back of a taxi in a busy urban street. The second image is a stylized logo featuring the letters "SGL" with a book and a computer icon incorporated into the design.

[6]:

terminate_process(vision_process)

[2025-07-05 11:09:03] Child process unexpectedly failed with exitcode=9. pid=2528908

OpenAI APIs - Vision

Contents