Querying Qwen-VL#

[1]:
import nest_asyncio

nest_asyncio.apply()  # Run this first: allow nested event loops so the Engine can run inside the notebook.

model_path = "Qwen/Qwen2.5-VL-3B-Instruct"
chat_template = "qwen2-vl"
[2]:
# Let's create a prompt.

from io import BytesIO
import requests
from PIL import Image

from sglang.srt.entrypoints.openai.protocol import ChatCompletionRequest
from sglang.srt.conversation import chat_templates

image = Image.open(
    BytesIO(
        requests.get(
            "https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true"
        ).content
    )
)

conv = chat_templates[chat_template].copy()
conv.append_message(conv.roles[0], f"What's shown here: {conv.image_token}?")
conv.append_message(conv.roles[1], "")
conv.image_data = [image]

print(conv.get_prompt())
image
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What's shown here: <|vision_start|><|image_pad|><|vision_end|>?<|im_end|>
<|im_start|>assistant

[2]:
(image output: example_image.png, the street scene with yellow taxis queried below)
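As an aside, the same prompt string can also be produced with the Hugging Face processor's chat template instead of SGLang's conversation registry. A minimal sketch (the exact rendered string may differ slightly between template versions):

from transformers import AutoProcessor

hf_processor = AutoProcessor.from_pretrained(model_path, use_fast=True)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What's shown here?"},
        ],
    }
]
# apply_chat_template renders the <|vision_start|><|image_pad|><|vision_end|>
# markup for the image slot, similar to conv.get_prompt() above.
print(hf_processor.apply_chat_template(messages, add_generation_prompt=True))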

Query via the offline Engine API#

[3]:
from sglang import Engine

llm = Engine(
    model_path=model_path, chat_template=chat_template, mem_fraction_static=0.8
)
You have video processor config saved in `preprocessor.json` file which is deprecated. Video processor configs should be saved in their own `video_preprocessor.json` file. You can rename the file or load and save the processor back which renames it automatically. Loading from `preprocessor.json` will be removed in v5.0.
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:02<00:00,  1.02s/it]

Capturing batches (bs=1 avail_mem=3.64 GB): 100%|██████████| 23/23 [00:09<00:00,  2.55it/s]
[4]:
out = llm.generate(prompt=conv.get_prompt(), image_data=[image])
print(out["text"])
The image shows a scene with several yellow taxis, some in double-decker configurations, driving or parked on a city street. The setting appears to be an urban area, with tall buildings and storefronts. The taxis are typical of the New York City cab fleet, known for their bright yellow color. Additionally, there are people in the foreground, and the roller retractable orange barrier on the street might indicate a construction zone or some form of road closure.
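The same generate call also accepts a sampling_params dictionary to control decoding. A small sketch reusing the prompt and image from above (the parameter values here are arbitrary):

sampling_params = {"temperature": 0.7, "top_p": 0.9, "max_new_tokens": 256}
out = llm.generate(
    prompt=conv.get_prompt(),
    image_data=[image],
    sampling_params=sampling_params,
)
print(out["text"])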

Query via the offline Engine API, sending precomputed embeddings#

[5]:
# Compute the image embeddings using Hugging Face.

from transformers import AutoProcessor
from transformers import Qwen2_5_VLForConditionalGeneration

processor = AutoProcessor.from_pretrained(model_path, use_fast=True)
vision = (
    Qwen2_5_VLForConditionalGeneration.from_pretrained(model_path).eval().visual.cuda()
)
[6]:
processed_prompt = processor(
    images=[image], text=conv.get_prompt(), return_tensors="pt"
)
input_ids = processed_prompt["input_ids"][0].detach().cpu().tolist()
# Run the vision tower to compute the image embeddings up front.
precomputed_features = vision(
    processed_prompt["pixel_values"].cuda(), processed_prompt["image_grid_thw"].cuda()
)

# Package the embeddings as a multimodal item; the Engine consumes them
# directly instead of re-encoding the image itself.
mm_item = dict(
    modality="IMAGE",
    image_grid_thw=processed_prompt["image_grid_thw"],
    precomputed_features=precomputed_features,
)
out = llm.generate(input_ids=input_ids, image_data=[mm_item])
print(out["text"])
The image shows a street scene featuring two yellow cabs. One cab is stationary, and the other, which is further away, appears to be moving. The scene includes a person wearing a yellow shirt standing next to one of the taxis. The person seems to be engaging with a stack of folded clothes or fabric. The street is lined with buildings, and there are some flags hanging from the windows of the buildings in the background.
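When the experiments are done, the Engine and the Hugging Face vision tower still hold GPU memory. A short cleanup sketch (shutdown() is the offline Engine's teardown method):

import torch

llm.shutdown()  # stop the offline Engine and release its GPU memory
del vision      # drop the Hugging Face vision tower
torch.cuda.empty_cache()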