OpenAI APIs - Completions#

SGLang provides OpenAI-compatible APIs to enable a smooth transition from OpenAI services to self-hosted local models. A complete reference for the API is available in the OpenAI API Reference.

This tutorial covers the following popular APIs:

  • chat/completions

  • completions

  • batches

Check out other tutorials to learn about vision APIs for vision-language models and embedding APIs for embedding models.

Launch A Server#

Launch the server in your terminal and wait for it to initialize.

[1]:
from sglang.test.test_utils import is_in_ci

if is_in_ci():
    from patch import launch_server_cmd
else:
    from sglang.utils import launch_server_cmd

from sglang.utils import wait_for_server, print_highlight, terminate_process


server_process, port = launch_server_cmd(
    "python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct --host 0.0.0.0 --mem-fraction-static 0.8"
)

wait_for_server(f"http://localhost:{port}")
print(f"Server started on http://localhost:{port}")
[2025-05-29 08:41:13] server_args=ServerArgs(model_path='qwen/qwen2.5-0.5b-instruct', tokenizer_path='qwen/qwen2.5-0.5b-instruct', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, quantization_param_path=None, context_length=None, device='cuda', served_model_name='qwen/qwen2.5-0.5b-instruct', chat_template=None, completion_template=None, is_embedding=False, enable_multimodal=None, revision=None, host='0.0.0.0', port=31171, mem_fraction_static=0.8, max_running_requests=200, max_total_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, tp_size=1, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=348104412, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, show_time_cost=False, enable_metrics=False, bucket_time_to_first_token=None, bucket_e2e_request_latency=None, bucket_inter_token_latency=None, collect_tokens_histogram=False, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, api_key=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', speculative_algorithm=None, speculative_draft_model_path=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, 
speculative_token_map=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_cuda_graph=True, disable_cuda_graph_padding=False, enable_nccl_nvls=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_ep_moe=False, enable_deepep_moe=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_rebalance_num_iterations=1000, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=None, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, tool_call_parser=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through_selective', flashinfer_mla_disable_ragged=False, warmups=None, moe_dense_tp_size=None, n_share_experts_fusion=0, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, mm_attention_backend=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_bootstrap_port=8998, disaggregation_transfer_backend='mooncake', disaggregation_ib_device=None, pdlb_url=None)
[2025-05-29 08:41:26] Attention backend not set. Use fa3 backend by default.
[2025-05-29 08:41:26] Init torch distributed begin.
[2025-05-29 08:41:26] Init torch distributed ends. mem usage=0.00 GB
[2025-05-29 08:41:26] init_expert_location from trivial
[2025-05-29 08:41:26] Load weight begin. avail mem=60.49 GB
[2025-05-29 08:41:27] Using model weights format ['*.safetensors']
[2025-05-29 08:41:27] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  4.75it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  4.74it/s]

[2025-05-29 08:41:27] Load weight end. type=Qwen2ForCausalLM, dtype=torch.bfloat16, avail mem=59.51 GB, mem usage=0.98 GB.
[2025-05-29 08:41:27] KV Cache is allocated. #tokens: 20480, K size: 0.12 GB, V size: 0.12 GB
[2025-05-29 08:41:27] Memory pool end. avail mem=59.11 GB
[2025-05-29 08:41:28] max_total_num_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=200, context_len=32768
[2025-05-29 08:41:28] INFO:     Started server process [2258093]
[2025-05-29 08:41:28] INFO:     Waiting for application startup.
[2025-05-29 08:41:28] INFO:     Application startup complete.
[2025-05-29 08:41:28] INFO:     Uvicorn running on http://0.0.0.0:31171 (Press CTRL+C to quit)
[2025-05-29 08:41:29] INFO:     127.0.0.1:39644 - "GET /v1/models HTTP/1.1" 200 OK
[2025-05-29 08:41:29] INFO:     127.0.0.1:39658 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-05-29 08:41:29] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-05-29 08:41:31] INFO:     127.0.0.1:39662 - "POST /generate HTTP/1.1" 200 OK
[2025-05-29 08:41:31] The server is fired up and ready to roll!


NOTE: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are displayed in the original black color, while the notebook outputs are highlighted in blue.
These notebooks run in a parallel CI environment, so the throughput shown is not representative of actual performance.
Server started on http://localhost:31171
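
If you launch the server from a script rather than typing the command by hand, it can help to assemble the command line programmatically. The following is a minimal sketch: `build_launch_command` is a hypothetical helper, not part of SGLang, and it only uses the flags that appear in this tutorial (`--model-path`, `--host`, `--mem-fraction-static`, `--chat-template`) plus `--port`, which corresponds to the `port` field in the server arguments logged above.

```python
import shlex


def build_launch_command(model_path, host="0.0.0.0", port=None,
                         mem_fraction_static=None, chat_template=None):
    """Assemble an sglang.launch_server command line as one shell-safe string.

    Hypothetical helper for illustration; it is not part of SGLang.
    """
    args = ["python3", "-m", "sglang.launch_server",
            "--model-path", model_path, "--host", host]
    if port is not None:
        args += ["--port", str(port)]
    if mem_fraction_static is not None:
        args += ["--mem-fraction-static", str(mem_fraction_static)]
    if chat_template is not None:
        args += ["--chat-template", chat_template]
    # shlex.join quotes any argument that contains shell-special characters.
    return shlex.join(args)


command = build_launch_command("qwen/qwen2.5-0.5b-instruct", mem_fraction_static=0.8)
print(command)
```

The resulting string can be passed to `launch_server_cmd` in the same way as the literal command used above.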

Chat Completions#

Usage#

The server implements the OpenAI-compatible chat completions endpoint. It automatically applies the chat template specified in the Hugging Face tokenizer, if one is available. You can also specify a custom chat template with --chat-template when launching the server.

[2]:
import openai

client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

response = client.chat.completions.create(
    model="qwen/qwen2.5-0.5b-instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)

print_highlight(f"Response: {response}")
[2025-05-29 08:41:34] Prefill batch. #new-seq: 1, #new-token: 37, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-05-29 08:41:34] Decode batch. #running-req: 1, #token: 70, token usage: 0.00, cuda graph: False, gen throughput (token/s): 6.01, #queue-req: 0
[2025-05-29 08:41:34] INFO:     127.0.0.1:39670 - "POST /v1/chat/completions HTTP/1.1" 200 OK
Response: ChatCompletion(id='6749e1d0a4d84aecb45ec2e75a613301', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Sure, here are three countries and their respective capitals:\n\n1. **United States** - Washington, D.C.\n2. **Canada** - Ottawa\n3. **Australia** - Canberra', refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=None, reasoning_content=None), matched_stop=151645)], created=1748508094, model='qwen/qwen2.5-0.5b-instruct', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=39, prompt_tokens=37, total_tokens=76, completion_tokens_details=None, prompt_tokens_details=None))
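
The printed `ChatCompletion` object is dense. In practice you usually need only a few fields, as in the sketch below. `summarize_chat_response` is a hypothetical helper, and the `SimpleNamespace` object is a stand-in that mimics the attribute layout shown in the output above so the snippet runs without a server.

```python
from types import SimpleNamespace


def summarize_chat_response(response):
    """Pull the most commonly used fields out of a chat completion response."""
    choice = response.choices[0]
    return {
        "text": choice.message.content,
        "finish_reason": choice.finish_reason,
        "prompt_tokens": response.usage.prompt_tokens,
        "completion_tokens": response.usage.completion_tokens,
    }


# Stand-in with the same attribute names as the response printed above.
fake_response = SimpleNamespace(
    choices=[
        SimpleNamespace(
            finish_reason="stop",
            message=SimpleNamespace(content="1. Canada - Ottawa"),
        )
    ],
    usage=SimpleNamespace(prompt_tokens=37, completion_tokens=39),
)

summary = summarize_chat_response(fake_response)
print(summary)
```

With a live server, pass the `response` returned by `client.chat.completions.create(...)` instead of the stand-in.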

Parameters#

The chat completions API accepts the same parameters as the OpenAI Chat Completions API. Refer to the OpenAI Chat Completions API for more details.

SGLang extends the standard API with the extra_body parameter, allowing for additional customization. One key option within extra_body is chat_template_kwargs, which can be used to pass arguments to the chat template processor.

Enabling Model Thinking/Reasoning#

You can use chat_template_kwargs to enable or disable the model’s internal thinking or reasoning process output. Set "enable_thinking": True within chat_template_kwargs to include the reasoning steps in the response. This requires launching the server with a compatible reasoning parser (e.g., --reasoning-parser qwen3 for Qwen3 models).

Here is an example that enables thinking and retrieves the reasoning content separately (by setting "separate_reasoning": True):

# Ensure the server is launched with a compatible reasoning parser, e.g.:
# python3 -m sglang.launch_server --model-path QwQ/Qwen3-32B-250415 --reasoning-parser qwen3 ...

from openai import OpenAI

# Modify OpenAI's API key and API base to use SGLang's API server.
openai_api_key = "EMPTY"
openai_api_base = f"http://127.0.0.1:{port}/v1" # Use the correct port

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

model = "QwQ/Qwen3-32B-250415" # Use the model loaded by the server
messages = [{"role": "user", "content": "9.11 and 9.8, which is greater?"}]

response = client.chat.completions.create(
    model=model,
    messages=messages,
    extra_body={
        "chat_template_kwargs": {"enable_thinking": True},
        "separate_reasoning": True
    }
)

print("response.choices[0].message.reasoning_content: \n", response.choices[0].message.reasoning_content)
print("response.choices[0].message.content: \n", response.choices[0].message.content)

Example Output:

response.choices[0].message.reasoning_content:
 Okay, so I need to figure out which number is greater between 9.11 and 9.8. Hmm, let me think. Both numbers start with 9, right? So the whole number part is the same. That means I need to look at the decimal parts to determine which one is bigger.
...
Therefore, after checking multiple methods—aligning decimals, subtracting, converting to fractions, and using a real-world analogy—it's clear that 9.8 is greater than 9.11.

response.choices[0].message.content:
 To determine which number is greater between **9.11** and **9.8**, follow these steps:
...
**Answer**:
9.8 is greater than 9.11.

Setting "enable_thinking": False (or omitting it) will result in reasoning_content being None.
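
If you switch thinking on and off across many requests, the `extra_body` structure can be built in one place. This is a small sketch; `thinking_request_kwargs` is a hypothetical helper that only arranges the keys already shown in the example above.

```python
def thinking_request_kwargs(model, messages, enable_thinking=True,
                            separate_reasoning=True):
    """Build keyword arguments for client.chat.completions.create().

    Hypothetical helper; it only packages the extra_body keys used above.
    """
    return {
        "model": model,
        "messages": messages,
        "extra_body": {
            "chat_template_kwargs": {"enable_thinking": enable_thinking},
            "separate_reasoning": separate_reasoning,
        },
    }


kwargs = thinking_request_kwargs(
    "QwQ/Qwen3-32B-250415",
    [{"role": "user", "content": "9.11 and 9.8, which is greater?"}],
    enable_thinking=False,
)
print(kwargs["extra_body"])
```

The result can be unpacked into the call as `client.chat.completions.create(**kwargs)`.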

Here is an example of a detailed chat completion request using standard OpenAI parameters:

[3]:
response = client.chat.completions.create(
    model="qwen/qwen2.5-0.5b-instruct",
    messages=[
        {
            "role": "system",
            "content": "You are a knowledgeable historian who provides concise responses.",
        },
        {"role": "user", "content": "Tell me about ancient Rome"},
        {
            "role": "assistant",
            "content": "Ancient Rome was a civilization centered in Italy.",
        },
        {"role": "user", "content": "What were their major achievements?"},
    ],
    temperature=0.3,  # Lower temperature for more focused responses
    max_tokens=128,  # Reasonable length for a concise response
    top_p=0.95,  # Slightly higher for better fluency
    presence_penalty=0.2,  # Mild penalty to avoid repetition
    frequency_penalty=0.2,  # Mild penalty for more natural language
    n=1,  # Single response is usually more stable
    seed=42,  # Keep for reproducibility
)

print_highlight(response.choices[0].message.content)
[2025-05-29 08:41:34] Prefill batch. #new-seq: 1, #new-token: 49, #cached-token: 5, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-05-29 08:41:35] Decode batch. #running-req: 1, #token: 88, token usage: 0.00, cuda graph: False, gen throughput (token/s): 124.02, #queue-req: 0
[2025-05-29 08:41:35] Decode batch. #running-req: 1, #token: 128, token usage: 0.01, cuda graph: False, gen throughput (token/s): 140.56, #queue-req: 0
[2025-05-29 08:41:35] Decode batch. #running-req: 1, #token: 168, token usage: 0.01, cuda graph: False, gen throughput (token/s): 143.00, #queue-req: 0
[2025-05-29 08:41:35] INFO:     127.0.0.1:39670 - "POST /v1/chat/completions HTTP/1.1" 200 OK
Ancient Rome was a civilization that flourished from the 8th to the 4th century BCE. It was known for its extensive network of roads, which facilitated trade and communication across the Mediterranean. The Romans built many impressive structures, including the Colosseum, the Pantheon, and the Forum of Connaught. They also developed a sophisticated system of law and legal codes, which were used to govern their empire. Additionally, Rome is known for its art and architecture, including the Colosseum, the Pantheon, and the Roman Forum.
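
To build intuition for `temperature` and `top_p`, the sketch below applies their textbook definitions to a toy three-token distribution. It illustrates the general technique only and is not SGLang's sampler. The function names are hypothetical, and `softmax_with_temperature` assumes a strictly positive temperature (a temperature of 0 is conventionally treated as greedy decoding and is not handled here).

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose total mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalize so the surviving probabilities sum to 1.
    return {i: probs[i] / mass for i in kept}


logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, 0.3)  # lower temperature, peakier
flat = softmax_with_temperature(logits, 1.0)   # unscaled softmax
nucleus = top_p_filter(flat, 0.9)              # drops the least likely token

print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
print(sorted(nucleus))
```

Lowering the temperature concentrates probability on the top token, while `top_p` removes the low-probability tail before sampling.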

Streaming mode is also supported.

[4]:
stream = client.chat.completions.create(
    model="qwen/qwen2.5-0.5b-instruct",
    messages=[{"role": "user", "content": "Say this is a test"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
[2025-05-29 08:41:35] INFO:     127.0.0.1:39670 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-05-29 08:41:35] Prefill batch. #new-seq: 1, #new-token: 10, #cached-token: 24, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-05-29 08:41:36] Decode batch. #running-req: 1, #token: 73, token usage: 0.00, cuda graph: False, gen throughput (token/s): 134.67, #queue-req: 0
I apologize, but I'm not sure what you're asking or what you're testing. Could you please provide more context or clarify your question? I'd be happy to help if you can give me a specific query or topic to discuss.
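
When you need the complete text as well as the incremental output, you can accumulate the deltas while iterating. The sketch below assumes the chunk layout used in the loop above (`chunk.choices[0].delta.content`). `collect_stream` and `fake_chunk` are hypothetical helpers, and the fake chunks stand in for a real stream so the snippet runs without a server.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Concatenate the text deltas of a streamed chat completion."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # skip chunks that carry no choices
            continue
        piece = chunk.choices[0].delta.content
        if piece is not None:  # some chunks carry no text
            parts.append(piece)
    return "".join(parts)


def fake_chunk(text):
    """Build a stand-in object with the same attribute layout as a stream chunk."""
    return SimpleNamespace(
        choices=[SimpleNamespace(delta=SimpleNamespace(content=text))]
    )


fake_stream = [
    fake_chunk("This is"),
    fake_chunk(None),
    fake_chunk(" a test"),
    SimpleNamespace(choices=[]),
]
full_text = collect_stream(fake_stream)
print(full_text)
```

With a live server, pass the `stream` object returned by `client.chat.completions.create(..., stream=True)`.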

Completions#

Usage#

The Completions API is similar to the Chat Completions API, but it takes a prompt instead of the messages parameter and does not apply a chat template.

[5]:
response = client.completions.create(
    model="qwen/qwen2.5-0.5b-instruct",
    prompt="List 3 countries and their capitals.",
    temperature=0,
    max_tokens=64,
    n=1,
    stop=None,
)

print_highlight(f"Response: {response}")
[2025-05-29 08:41:36] Prefill batch. #new-seq: 1, #new-token: 8, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-05-29 08:41:36] Decode batch. #running-req: 1, #token: 38, token usage: 0.00, cuda graph: False, gen throughput (token/s): 112.64, #queue-req: 0
[2025-05-29 08:41:36] INFO:     127.0.0.1:39670 - "POST /v1/completions HTTP/1.1" 200 OK
Response: Completion(id='27aa16b85dfc4f7ba754196cde63ea25', choices=[CompletionChoice(finish_reason='length', index=0, logprobs=None, text=' 1. United States - Washington, D.C.\n2. Canada - Ottawa\n3. France - Paris\n4. Germany - Berlin\n5. Japan - Tokyo\n6. Italy - Rome\n7. Spain - Madrid\n8. United Kingdom - London\n9. Australia - Canberra\n10. New', matched_stop=None)], created=1748508096, model='qwen/qwen2.5-0.5b-instruct', object='text_completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=64, prompt_tokens=8, total_tokens=72, completion_tokens_details=None, prompt_tokens_details=None))
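
The output above keeps listing countries until it hits `max_tokens` because the raw prompt is continued as plain text, with no chat template marking where the user turn ends. The sketch below contrasts the raw prompt with a chat-formatted one. It assumes a ChatML-style layout with `<|im_start|>` and `<|im_end|>` markers. `render_chatml` is a simplified, hypothetical illustration: the real template lives in the model's tokenizer and may add content such as a default system message.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render messages in a simplified ChatML-style layout (illustration only)."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        # Open the assistant turn so the model answers instead of continuing.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


raw_prompt = "List 3 countries and their capitals."
chat_prompt = render_chatml([{"role": "user", "content": raw_prompt}])

print(repr(raw_prompt))   # what the completions endpoint sends to the model
print(repr(chat_prompt))  # roughly what a chat template would produce
```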

Parameters#

The completions API accepts the same parameters as the OpenAI Completions API. Refer to the OpenAI Completions API for more details.

Here is an example of a detailed completions request:

[6]:
response = client.completions.create(
    model="qwen/qwen2.5-0.5b-instruct",
    prompt="Write a short story about a space explorer.",
    temperature=0.7,  # Moderate temperature for creative writing
    max_tokens=150,  # Longer response for a story
    top_p=0.9,  # Balanced diversity in word choice
    stop=["\n\n", "THE END"],  # Multiple stop sequences
    presence_penalty=0.3,  # Encourage novel elements
    frequency_penalty=0.3,  # Reduce repetitive phrases
    n=1,  # Generate one completion
    seed=123,  # For reproducible results
)

print_highlight(f"Response: {response}")
[2025-05-29 08:41:36] Prefill batch. #new-seq: 1, #new-token: 9, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-05-29 08:41:36] Decode batch. #running-req: 1, #token: 15, token usage: 0.00, cuda graph: False, gen throughput (token/s): 136.22, #queue-req: 0
[2025-05-29 08:41:37] Decode batch. #running-req: 1, #token: 55, token usage: 0.00, cuda graph: False, gen throughput (token/s): 141.54, #queue-req: 0
[2025-05-29 08:41:37] INFO:     127.0.0.1:39670 - "POST /v1/completions HTTP/1.1" 200 OK
Response: Completion(id='8e599c3d3f1f4a8895b9ad3efc12ab40', choices=[CompletionChoice(finish_reason='stop', index=0, logprobs=None, text=' Once upon a time, there was a young man named Jack who dreamed of exploring the vastness of space. He had always been fascinated by the stars and the mysteries of the universe. One day, while on a spacewalk, he stumbled upon a strange device that seemed to have been designed for something else entirely.', matched_stop='\n\n')], created=1748508096, model='qwen/qwen2.5-0.5b-instruct', object='text_completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=63, prompt_tokens=9, total_tokens=72, completion_tokens_details=None, prompt_tokens_details=None))
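
In the response above, `finish_reason='stop'` and `matched_stop='\n\n'` show that generation ended at the first stop sequence. The sketch below reproduces that behaviour on a finished string. It illustrates the semantics only and is not how the server implements it (the server stops during generation). `apply_stop_sequences` is a hypothetical helper.

```python
def apply_stop_sequences(text, stop):
    """Cut text at the earliest occurrence of any stop sequence.

    Returns (truncated_text, matched_stop); matched_stop is None if nothing matched.
    """
    cut, matched = len(text), None
    for sequence in stop:
        index = text.find(sequence)
        if index != -1 and index < cut:
            cut, matched = index, sequence
    return text[:cut], matched


story = "Once upon a time.\n\nTHE END of chapter one."
truncated, matched = apply_stop_sequences(story, ["\n\n", "THE END"])
print(repr(truncated), repr(matched))
```

Note that the stop sequence itself is not included in the returned text, which matches the response shown above.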

Structured Outputs (JSON, Regex, EBNF)#

For the OpenAI-compatible structured outputs API, refer to Structured Outputs.

Batches#

The Batches API is also supported for chat completions and completions. You can upload your requests in a jsonl file, create a batch job, and retrieve the results once the job completes (which takes longer but costs less).
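
Because each line of the jsonl file is an independent JSON request, it can be worth checking the file before uploading it. The sketch below is a hypothetical pre-flight check, not part of SGLang or the OpenAI SDK. It assumes every request carries the four keys used in this tutorial (`custom_id`, `method`, `url`, `body`) and that `custom_id` values should be unique so results can be matched back to requests.

```python
import json

REQUIRED_KEYS = {"custom_id", "method", "url", "body"}


def validate_batch_lines(lines):
    """Return a list of human-readable problems found in batch request lines."""
    problems, seen = [], set()
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            request = json.loads(line)
        except json.JSONDecodeError:
            problems.append(f"line {number}: not valid JSON")
            continue
        if not isinstance(request, dict):
            problems.append(f"line {number}: not a JSON object")
            continue
        missing = REQUIRED_KEYS - request.keys()
        if missing:
            problems.append(f"line {number}: missing {sorted(missing)}")
        custom_id = request.get("custom_id")
        if custom_id is not None:
            if custom_id in seen:
                problems.append(f"line {number}: duplicate custom_id {custom_id!r}")
            seen.add(custom_id)
    return problems


good = json.dumps(
    {"custom_id": "request-1", "method": "POST", "url": "/chat/completions", "body": {}}
)
incomplete = json.dumps({"custom_id": "request-2", "method": "POST"})
problems = validate_batch_lines([good, good, incomplete, "{oops"])
print(problems)
```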

The batches APIs are:

  • batches

  • batches/{batch_id}/cancel

  • batches/{batch_id}

Here is an example of a batch job for chat completions; completions work similarly.

[7]:
import json
import time
from openai import OpenAI

client = OpenAI(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

requests = [
    {
        "custom_id": "request-1",
        "method": "POST",
        "url": "/chat/completions",
        "body": {
            "model": "qwen/qwen2.5-0.5b-instruct",
            "messages": [
                {"role": "user", "content": "Tell me a joke about programming"}
            ],
            "max_tokens": 50,
        },
    },
    {
        "custom_id": "request-2",
        "method": "POST",
        "url": "/chat/completions",
        "body": {
            "model": "qwen/qwen2.5-0.5b-instruct",
            "messages": [{"role": "user", "content": "What is Python?"}],
            "max_tokens": 50,
        },
    },
]

input_file_path = "batch_requests.jsonl"

with open(input_file_path, "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

with open(input_file_path, "rb") as f:
    file_response = client.files.create(file=f, purpose="batch")

batch_response = client.batches.create(
    input_file_id=file_response.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

print_highlight(f"Batch job created with ID: {batch_response.id}")
[2025-05-29 08:41:37] INFO:     127.0.0.1:48630 - "POST /v1/files HTTP/1.1" 200 OK
[2025-05-29 08:41:37] INFO:     127.0.0.1:48630 - "POST /v1/batches HTTP/1.1" 200 OK
[2025-05-29 08:41:37] Prefill batch. #new-seq: 2, #new-token: 20, #cached-token: 48, token usage: 0.00, #running-req: 0, #queue-req: 0
Batch job created with ID: batch_3022d43d-b2b1-4595-a4d4-d58465c6a836
[8]:
while batch_response.status not in ["completed", "failed", "cancelled"]:
    time.sleep(3)
    print(f"Batch job status: {batch_response.status}...trying again in 3 seconds...")
    batch_response = client.batches.retrieve(batch_response.id)

if batch_response.status == "completed":
    print("Batch job completed successfully!")
    print(f"Request counts: {batch_response.request_counts}")

    result_file_id = batch_response.output_file_id
    file_response = client.files.content(result_file_id)
    result_content = file_response.read().decode("utf-8")

    results = [
        json.loads(line) for line in result_content.split("\n") if line.strip() != ""
    ]

    for result in results:
        print_highlight(f"Request {result['custom_id']}:")
        print_highlight(f"Response: {result['response']}")

    print_highlight("Cleaning up files...")
    # Only delete the result file ID since file_response is just content
    client.files.delete(result_file_id)
else:
    print_highlight(f"Batch job failed with status: {batch_response.status}")
    if hasattr(batch_response, "errors"):
        print_highlight(f"Errors: {batch_response.errors}")
[2025-05-29 08:41:37] Decode batch. #running-req: 2, #token: 90, token usage: 0.00, cuda graph: False, gen throughput (token/s): 88.08, #queue-req: 0
Batch job status: validating...trying again in 3 seconds...
[2025-05-29 08:41:40] INFO:     127.0.0.1:48630 - "GET /v1/batches/batch_3022d43d-b2b1-4595-a4d4-d58465c6a836 HTTP/1.1" 200 OK
Batch job completed successfully!
Request counts: BatchRequestCounts(completed=2, failed=0, total=2)
[2025-05-29 08:41:40] INFO:     127.0.0.1:48630 - "GET /v1/files/backend_result_file-479dfbe1-d51c-47aa-af88-2cb5436b7ba9/content HTTP/1.1" 200 OK
Request request-1:
Response: {'status_code': 200, 'request_id': 'batch_3022d43d-b2b1-4595-a4d4-d58465c6a836-req_0', 'body': {'id': 'batch_3022d43d-b2b1-4595-a4d4-d58465c6a836-req_0', 'object': 'chat.completion', 'created': 1748508097, 'model': 'qwen/qwen2.5-0.5b-instruct', 'choices': {'index': 0, 'message': {'role': 'assistant', 'content': "Sure, here's a programming-related joke for you:\n\nWhy did the programmer break up with the boss?\n\nBecause he wanted to make the code more readable!", 'tool_calls': None, 'reasoning_content': None}, 'logprobs': None, 'finish_reason': 'stop', 'matched_stop': 151645}, 'usage': {'prompt_tokens': 35, 'completion_tokens': 32, 'total_tokens': 67}, 'system_fingerprint': None}}
Request request-2:
Response: {'status_code': 200, 'request_id': 'batch_3022d43d-b2b1-4595-a4d4-d58465c6a836-req_1', 'body': {'id': 'batch_3022d43d-b2b1-4595-a4d4-d58465c6a836-req_1', 'object': 'chat.completion', 'created': 1748508097, 'model': 'qwen/qwen2.5-0.5b-instruct', 'choices': {'index': 0, 'message': {'role': 'assistant', 'content': 'Python is a high-level, interpreted programming language that is known for its readability and ease of use. Python was developed by Guido van Rossum and first released in 1991. It gained popularity in the late 1990', 'tool_calls': None, 'reasoning_content': None}, 'logprobs': None, 'finish_reason': 'length', 'matched_stop': None}, 'usage': {'prompt_tokens': 33, 'completion_tokens': 50, 'total_tokens': 83}, 'system_fingerprint': None}}
Cleaning up files...
[2025-05-29 08:41:40] INFO:     127.0.0.1:48630 - "DELETE /v1/files/backend_result_file-479dfbe1-d51c-47aa-af88-2cb5436b7ba9 HTTP/1.1" 200 OK
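
Result lines are easier to work with when indexed by `custom_id`. The sketch below assumes the result file has the shape printed above (one JSON object per line with `custom_id` and `response` keys). `index_batch_results` is a hypothetical helper, and the sample string stands in for the downloaded file content.

```python
import json


def index_batch_results(result_content):
    """Map each custom_id to its response from a batch result file's text."""
    results = {}
    for line in result_content.splitlines():
        if line.strip():  # skip blank lines, such as a trailing newline
            record = json.loads(line)
            results[record["custom_id"]] = record["response"]
    return results


sample = "\n".join(
    [
        json.dumps({"custom_id": "request-1", "response": {"status_code": 200}}),
        json.dumps({"custom_id": "request-2", "response": {"status_code": 200}}),
        "",
    ]
)
by_id = index_batch_results(sample)
print(by_id["request-2"]["status_code"])
```

With a live server, pass the `result_content` string read from the result file in the cell above.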

A batch job can take a while to complete. You can use the following two APIs to retrieve its status or to cancel it.

  1. batches/{batch_id}: Retrieve the batch job status.

  2. batches/{batch_id}/cancel: Cancel the batch job.

Here is an example of checking the batch job status.

[9]:
import json
import time
from openai import OpenAI

client = OpenAI(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

requests = []
for i in range(20):
    requests.append(
        {
            "custom_id": f"request-{i}",
            "method": "POST",
            "url": "/chat/completions",
            "body": {
                "model": "qwen/qwen2.5-0.5b-instruct",
                "messages": [
                    {
                        "role": "system",
                        "content": f"{i}: You are a helpful AI assistant",
                    },
                    {
                        "role": "user",
                        "content": "Write a detailed story about topic. Make it very long.",
                    },
                ],
                "max_tokens": 64,
            },
        }
    )

input_file_path = "batch_requests.jsonl"
with open(input_file_path, "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

with open(input_file_path, "rb") as f:
    uploaded_file = client.files.create(file=f, purpose="batch")

batch_job = client.batches.create(
    input_file_id=uploaded_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

print_highlight(f"Created batch job with ID: {batch_job.id}")
print_highlight(f"Initial status: {batch_job.status}")

time.sleep(10)

max_checks = 5
for i in range(max_checks):
    batch_details = client.batches.retrieve(batch_id=batch_job.id)

    print_highlight(
        f"Batch job details (check {i+1} / {max_checks}) // ID: {batch_details.id} // Status: {batch_details.status} // Created at: {batch_details.created_at} // Input file ID: {batch_details.input_file_id} // Output file ID: {batch_details.output_file_id}"
    )
    print_highlight(
        f"<strong>Request counts: Total: {batch_details.request_counts.total} // Completed: {batch_details.request_counts.completed} // Failed: {batch_details.request_counts.failed}</strong>"
    )

    time.sleep(3)
[2025-05-29 08:41:40] INFO:     127.0.0.1:48638 - "POST /v1/files HTTP/1.1" 200 OK
[2025-05-29 08:41:40] INFO:     127.0.0.1:48638 - "POST /v1/batches HTTP/1.1" 200 OK
Created batch job with ID: batch_bf03a75d-f177-4f86-a106-4a19b546c605
Initial status: validating
[2025-05-29 08:41:40] Prefill batch. #new-seq: 20, #new-token: 610, #cached-token: 60, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-05-29 08:41:40] Decode batch. #running-req: 20, #token: 863, token usage: 0.04, cuda graph: False, gen throughput (token/s): 97.61, #queue-req: 0
[2025-05-29 08:41:40] Decode batch. #running-req: 20, #token: 1663, token usage: 0.08, cuda graph: False, gen throughput (token/s): 2590.45, #queue-req: 0
[2025-05-29 08:41:50] INFO:     127.0.0.1:60764 - "GET /v1/batches/batch_bf03a75d-f177-4f86-a106-4a19b546c605 HTTP/1.1" 200 OK
Batch job details (check 1 / 5) // ID: batch_bf03a75d-f177-4f86-a106-4a19b546c605 // Status: completed // Created at: 1748508100 // Input file ID: backend_input_file-e598c6bd-3d00-4950-9ff1-ad9046f27533 // Output file ID: backend_result_file-01daac6b-6129-4acc-9766-5050c3b4b3df
Request counts: Total: 20 // Completed: 20 // Failed: 0
[2025-05-29 08:41:53] INFO:     127.0.0.1:60764 - "GET /v1/batches/batch_bf03a75d-f177-4f86-a106-4a19b546c605 HTTP/1.1" 200 OK
Batch job details (check 2 / 5) // ID: batch_bf03a75d-f177-4f86-a106-4a19b546c605 // Status: completed // Created at: 1748508100 // Input file ID: backend_input_file-e598c6bd-3d00-4950-9ff1-ad9046f27533 // Output file ID: backend_result_file-01daac6b-6129-4acc-9766-5050c3b4b3df
Request counts: Total: 20 // Completed: 20 // Failed: 0
[2025-05-29 08:41:56] INFO:     127.0.0.1:60764 - "GET /v1/batches/batch_bf03a75d-f177-4f86-a106-4a19b546c605 HTTP/1.1" 200 OK
Batch job details (check 3 / 5) // ID: batch_bf03a75d-f177-4f86-a106-4a19b546c605 // Status: completed // Created at: 1748508100 // Input file ID: backend_input_file-e598c6bd-3d00-4950-9ff1-ad9046f27533 // Output file ID: backend_result_file-01daac6b-6129-4acc-9766-5050c3b4b3df
Request counts: Total: 20 // Completed: 20 // Failed: 0
[2025-05-29 08:41:59] INFO:     127.0.0.1:60764 - "GET /v1/batches/batch_bf03a75d-f177-4f86-a106-4a19b546c605 HTTP/1.1" 200 OK
Batch job details (check 4 / 5) // ID: batch_bf03a75d-f177-4f86-a106-4a19b546c605 // Status: completed // Created at: 1748508100 // Input file ID: backend_input_file-e598c6bd-3d00-4950-9ff1-ad9046f27533 // Output file ID: backend_result_file-01daac6b-6129-4acc-9766-5050c3b4b3df
Request counts: Total: 20 // Completed: 20 // Failed: 0
[2025-05-29 08:42:02] INFO:     127.0.0.1:60764 - "GET /v1/batches/batch_bf03a75d-f177-4f86-a106-4a19b546c605 HTTP/1.1" 200 OK
Batch job details (check 5 / 5) // ID: batch_bf03a75d-f177-4f86-a106-4a19b546c605 // Status: completed // Created at: 1748508100 // Input file ID: backend_input_file-e598c6bd-3d00-4950-9ff1-ad9046f27533 // Output file ID: backend_result_file-01daac6b-6129-4acc-9766-5050c3b4b3df
Request counts: Total: 20 // Completed: 20 // Failed: 0
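
The fixed number of checks above can be generalized into a polling loop that stops at a terminal status or gives up after a timeout. This is a minimal sketch: `poll_until_terminal` is a hypothetical helper, the terminal statuses are the three used in this tutorial, and the scripted status sequence stands in for real calls so the snippet runs without a server.

```python
import time

TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def poll_until_terminal(fetch_status, interval=3.0, timeout=60.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Call fetch_status() until it returns a terminal status or time runs out."""
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"batch still {status!r} after {timeout} seconds")
        sleep(interval)


# Demonstration with scripted statuses and a no-op sleep, so nothing blocks.
scripted = iter(["validating", "in_progress", "completed"])
final_status = poll_until_terminal(lambda: next(scripted), sleep=lambda _: None)
print(final_status)
```

With a live server, the status source would be something like `lambda: client.batches.retrieve(batch_job.id).status`.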

Here is an example of cancelling a batch job.

[10]:
import json
import time
from openai import OpenAI
import os

client = OpenAI(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

requests = []
for i in range(5000):
    requests.append(
        {
            "custom_id": f"request-{i}",
            "method": "POST",
            "url": "/chat/completions",
            "body": {
                "model": "qwen/qwen2.5-0.5b-instruct",
                "messages": [
                    {
                        "role": "system",
                        "content": f"{i}: You are a helpful AI assistant",
                    },
                    {
                        "role": "user",
                        "content": "Write a detailed story about topic. Make it very long.",
                    },
                ],
                "max_tokens": 128,
            },
        }
    )

input_file_path = "batch_requests.jsonl"
with open(input_file_path, "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

with open(input_file_path, "rb") as f:
    uploaded_file = client.files.create(file=f, purpose="batch")

batch_job = client.batches.create(
    input_file_id=uploaded_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

print_highlight(f"Created batch job with ID: {batch_job.id}")
print_highlight(f"Initial status: {batch_job.status}")

time.sleep(10)

try:
    cancelled_job = client.batches.cancel(batch_id=batch_job.id)
    print_highlight(f"Cancellation initiated. Status: {cancelled_job.status}")
    assert cancelled_job.status == "cancelling"

    # Monitor the cancellation process
    while cancelled_job.status not in ["failed", "cancelled"]:
        time.sleep(3)
        cancelled_job = client.batches.retrieve(batch_job.id)
        print_highlight(f"Current status: {cancelled_job.status}")

    # Verify final status
    assert cancelled_job.status == "cancelled"
    print_highlight("Batch job successfully cancelled")

except Exception as e:
    print_highlight(f"Error during cancellation: {e}")
    raise  # re-raise with the original traceback

finally:
    try:
        del_response = client.files.delete(uploaded_file.id)
        if del_response.deleted:
            print_highlight("Successfully cleaned up input file")
        if os.path.exists(input_file_path):
            os.remove(input_file_path)
            print_highlight("Successfully deleted local batch_requests.jsonl file")
    except Exception as e:
        print_highlight(f"Error cleaning up: {e}")
        raise  # re-raise with the original traceback
[2025-05-29 08:42:05] INFO:     127.0.0.1:41810 - "POST /v1/files HTTP/1.1" 200 OK
[2025-05-29 08:42:05] INFO:     127.0.0.1:41810 - "POST /v1/batches HTTP/1.1" 200 OK
Created batch job with ID: batch_c3c9648c-8f7a-4400-a404-86c7fd24c305
Initial status: validating
[2025-05-29 08:42:06] Prefill batch. #new-seq: 23, #new-token: 110, #cached-token: 662, token usage: 0.03, #running-req: 0, #queue-req: 0
[2025-05-29 08:42:06] Prefill batch. #new-seq: 112, #new-token: 3360, #cached-token: 483, token usage: 0.03, #running-req: 23, #queue-req: 246
[2025-05-29 08:42:07] Prefill batch. #new-seq: 26, #new-token: 780, #cached-token: 130, token usage: 0.28, #running-req: 134, #queue-req: 4839
[2025-05-29 08:42:07] Decode batch. #running-req: 160, #token: 9105, token usage: 0.44, cuda graph: False, gen throughput (token/s): 164.66, #queue-req: 4839
[2025-05-29 08:42:07] Prefill batch. #new-seq: 3, #new-token: 90, #cached-token: 15, token usage: 0.46, #running-req: 159, #queue-req: 4836
[2025-05-29 08:42:07] Prefill batch. #new-seq: 3, #new-token: 90, #cached-token: 15, token usage: 0.48, #running-req: 158, #queue-req: 4833
[2025-05-29 08:42:07] Prefill batch. #new-seq: 1, #new-token: 30, #cached-token: 5, token usage: 0.51, #running-req: 160, #queue-req: 4832
[2025-05-29 08:42:07] Decode batch. #running-req: 158, #token: 15074, token usage: 0.74, cuda graph: False, gen throughput (token/s): 16323.98, #queue-req: 4832
[2025-05-29 08:42:08] Decode out of memory happened. #retracted_reqs: 24, #new_token_ratio: 0.5967 -> 0.9541
[2025-05-29 08:42:08] Decode batch. #running-req: 134, #token: 18514, token usage: 0.90, cuda graph: False, gen throughput (token/s): 17010.80, #queue-req: 4856
[2025-05-29 08:42:08] Decode out of memory happened. #retracted_reqs: 16, #new_token_ratio: 0.9351 -> 1.0000
[2025-05-29 08:42:08] Prefill batch. #new-seq: 11, #new-token: 332, #cached-token: 53, token usage: 0.89, #running-req: 118, #queue-req: 4861
[2025-05-29 08:42:08] Prefill batch. #new-seq: 118, #new-token: 3540, #cached-token: 590, token usage: 0.02, #running-req: 11, #queue-req: 4743
[2025-05-29 08:42:08] Decode batch. #running-req: 129, #token: 6474, token usage: 0.32, cuda graph: False, gen throughput (token/s): 12748.69, #queue-req: 4743
[2025-05-29 08:42:08] Prefill batch. #new-seq: 3, #new-token: 90, #cached-token: 15, token usage: 0.32, #running-req: 128, #queue-req: 4740
[2025-05-29 08:42:08] Prefill batch. #new-seq: 2, #new-token: 62, #cached-token: 8, token usage: 0.47, #running-req: 130, #queue-req: 4738
[2025-05-29 08:42:09] Prefill batch. #new-seq: 2, #new-token: 60, #cached-token: 10, token usage: 0.53, #running-req: 131, #queue-req: 4736
[2025-05-29 08:42:09] Decode batch. #running-req: 133, #token: 11792, token usage: 0.58, cuda graph: False, gen throughput (token/s): 13628.98, #queue-req: 4736
[2025-05-29 08:42:09] Decode batch. #running-req: 133, #token: 17112, token usage: 0.84, cuda graph: False, gen throughput (token/s): 14637.94, #queue-req: 4736
[2025-05-29 08:42:09] Prefill batch. #new-seq: 8, #new-token: 242, #cached-token: 38, token usage: 0.91, #running-req: 122, #queue-req: 4728
[2025-05-29 08:42:09] Prefill batch. #new-seq: 114, #new-token: 3552, #cached-token: 438, token usage: 0.06, #running-req: 15, #queue-req: 4614
[2025-05-29 08:42:09] Decode batch. #running-req: 129, #token: 5958, token usage: 0.29, cuda graph: False, gen throughput (token/s): 12839.37, #queue-req: 4614
[2025-05-29 08:42:09] Prefill batch. #new-seq: 14, #new-token: 430, #cached-token: 60, token usage: 0.30, #running-req: 128, #queue-req: 4600
[2025-05-29 08:42:09] Prefill batch. #new-seq: 5, #new-token: 155, #cached-token: 20, token usage: 0.35, #running-req: 138, #queue-req: 4595
[2025-05-29 08:42:10] Prefill batch. #new-seq: 1, #new-token: 30, #cached-token: 5, token usage: 0.51, #running-req: 142, #queue-req: 4594
[2025-05-29 08:42:10] Prefill batch. #new-seq: 2, #new-token: 60, #cached-token: 10, token usage: 0.50, #running-req: 140, #queue-req: 4592
[2025-05-29 08:42:10] Prefill batch. #new-seq: 2, #new-token: 60, #cached-token: 10, token usage: 0.50, #running-req: 141, #queue-req: 4590
[2025-05-29 08:42:10] Decode batch. #running-req: 143, #token: 11275, token usage: 0.55, cuda graph: False, gen throughput (token/s): 11335.73, #queue-req: 4590
[2025-05-29 08:42:10] Prefill batch. #new-seq: 2, #new-token: 62, #cached-token: 8, token usage: 0.54, #running-req: 142, #queue-req: 4588
[2025-05-29 08:42:10] Decode batch. #running-req: 141, #token: 16590, token usage: 0.81, cuda graph: False, gen throughput (token/s): 15327.58, #queue-req: 4588
[2025-05-29 08:42:11] Decode out of memory happened. #retracted_reqs: 20, #new_token_ratio: 0.7472 -> 1.0000
[2025-05-29 08:42:11] Prefill batch. #new-seq: 8, #new-token: 248, #cached-token: 32, token usage: 0.88, #running-req: 121, #queue-req: 4600
[2025-05-29 08:42:11] Prefill batch. #new-seq: 8, #new-token: 248, #cached-token: 32, token usage: 0.86, #running-req: 121, #queue-req: 4592
[2025-05-29 08:42:11] Prefill batch. #new-seq: 110, #new-token: 3486, #cached-token: 364, token usage: 0.04, #running-req: 18, #queue-req: 4482
[2025-05-29 08:42:11] Decode batch. #running-req: 128, #token: 4469, token usage: 0.22, cuda graph: False, gen throughput (token/s): 12586.21, #queue-req: 4482
[2025-05-29 08:42:11] Prefill batch. #new-seq: 5, #new-token: 153, #cached-token: 22, token usage: 0.26, #running-req: 126, #queue-req: 4477
[2025-05-29 08:42:11] Prefill batch. #new-seq: 1, #new-token: 30, #cached-token: 5, token usage: 0.28, #running-req: 130, #queue-req: 4476
[2025-05-29 08:42:11] Decode batch. #running-req: 131, #token: 9426, token usage: 0.46, cuda graph: False, gen throughput (token/s): 13988.87, #queue-req: 4476
[2025-05-29 08:42:11] Prefill batch. #new-seq: 3, #new-token: 90, #cached-token: 15, token usage: 0.46, #running-req: 130, #queue-req: 4473
[2025-05-29 08:42:11] Decode batch. #running-req: 133, #token: 14836, token usage: 0.72, cuda graph: False, gen throughput (token/s): 14606.13, #queue-req: 4473
[2025-05-29 08:42:12] Prefill batch. #new-seq: 6, #new-token: 183, #cached-token: 27, token usage: 0.90, #running-req: 124, #queue-req: 4467
[2025-05-29 08:42:12] Decode batch. #running-req: 130, #token: 18948, token usage: 0.93, cuda graph: False, gen throughput (token/s): 14152.48, #queue-req: 4467
[2025-05-29 08:42:12] Prefill batch. #new-seq: 8, #new-token: 242, #cached-token: 38, token usage: 0.88, #running-req: 122, #queue-req: 4459
[2025-05-29 08:42:12] Prefill batch. #new-seq: 108, #new-token: 3348, #cached-token: 432, token usage: 0.08, #running-req: 22, #queue-req: 4351
[2025-05-29 08:42:12] Prefill batch. #new-seq: 18, #new-token: 557, #cached-token: 73, token usage: 0.27, #running-req: 125, #queue-req: 4333
[2025-05-29 08:42:12] Prefill batch. #new-seq: 4, #new-token: 121, #cached-token: 19, token usage: 0.30, #running-req: 141, #queue-req: 4329
[2025-05-29 08:42:12] Decode batch. #running-req: 145, #token: 9101, token usage: 0.44, cuda graph: False, gen throughput (token/s): 12403.90, #queue-req: 4329
[2025-05-29 08:42:12] Prefill batch. #new-seq: 1, #new-token: 30, #cached-token: 5, token usage: 0.49, #running-req: 142, #queue-req: 4328
[2025-05-29 08:42:12] Prefill batch. #new-seq: 1, #new-token: 30, #cached-token: 5, token usage: 0.60, #running-req: 141, #queue-req: 4327
[2025-05-29 08:42:13] Decode batch. #running-req: 141, #token: 14301, token usage: 0.70, cuda graph: False, gen throughput (token/s): 14926.84, #queue-req: 4327
[2025-05-29 08:42:13] Decode batch. #running-req: 141, #token: 19941, token usage: 0.97, cuda graph: False, gen throughput (token/s): 15714.18, #queue-req: 4327
[2025-05-29 08:42:13] Decode out of memory happened. #retracted_reqs: 19, #new_token_ratio: 0.7441 -> 1.0000
[2025-05-29 08:42:13] Prefill batch. #new-seq: 7, #new-token: 217, #cached-token: 28, token usage: 0.88, #running-req: 122, #queue-req: 4339
[2025-05-29 08:42:13] Prefill batch. #new-seq: 6, #new-token: 186, #cached-token: 24, token usage: 0.85, #running-req: 123, #queue-req: 4333
[2025-05-29 08:42:13] Prefill batch. #new-seq: 8, #new-token: 244, #cached-token: 36, token usage: 0.83, #running-req: 121, #queue-req: 4325
[2025-05-29 08:42:13] Prefill batch. #new-seq: 104, #new-token: 3276, #cached-token: 364, token usage: 0.06, #running-req: 24, #queue-req: 4221
[2025-05-29 08:42:13] Prefill batch. #new-seq: 6, #new-token: 184, #cached-token: 26, token usage: 0.27, #running-req: 125, #queue-req: 4215
[2025-05-29 08:42:13] Decode batch. #running-req: 131, #token: 7211, token usage: 0.35, cuda graph: False, gen throughput (token/s): 11693.66, #queue-req: 4215
[2025-05-29 08:42:13] Prefill batch. #new-seq: 3, #new-token: 90, #cached-token: 15, token usage: 0.40, #running-req: 130, #queue-req: 4212
[2025-05-29 08:42:13] Prefill batch. #new-seq: 1, #new-token: 30, #cached-token: 5, token usage: 0.41, #running-req: 132, #queue-req: 4211
[2025-05-29 08:42:14] Prefill batch. #new-seq: 1, #new-token: 30, #cached-token: 5, token usage: 0.48, #running-req: 132, #queue-req: 4210
[2025-05-29 08:42:14] Decode batch. #running-req: 133, #token: 12461, token usage: 0.61, cuda graph: False, gen throughput (token/s): 13797.65, #queue-req: 4210
[2025-05-29 08:42:14] Prefill batch. #new-seq: 1, #new-token: 30, #cached-token: 5, token usage: 0.62, #running-req: 132, #queue-req: 4209
[2025-05-29 08:42:14] Prefill batch. #new-seq: 1, #new-token: 31, #cached-token: 4, token usage: 0.70, #running-req: 132, #queue-req: 4208
[2025-05-29 08:42:14] Decode batch. #running-req: 133, #token: 17671, token usage: 0.86, cuda graph: False, gen throughput (token/s): 14017.32, #queue-req: 4208
[2025-05-29 08:42:14] Prefill batch. #new-seq: 10, #new-token: 301, #cached-token: 49, token usage: 0.85, #running-req: 126, #queue-req: 4198
[2025-05-29 08:42:14] Prefill batch. #new-seq: 1, #new-token: 30, #cached-token: 5, token usage: 0.87, #running-req: 130, #queue-req: 4197
[2025-05-29 08:42:14] Prefill batch. #new-seq: 8, #new-token: 240, #cached-token: 40, token usage: 0.85, #running-req: 123, #queue-req: 4189
[2025-05-29 08:42:14] Prefill batch. #new-seq: 99, #new-token: 3158, #cached-token: 307, token usage: 0.11, #running-req: 31, #queue-req: 4090
[2025-05-29 08:42:15] Prefill batch. #new-seq: 19, #new-token: 604, #cached-token: 69, token usage: 0.28, #running-req: 124, #queue-req: 4071
[2025-05-29 08:42:15] Prefill batch. #new-seq: 3, #new-token: 91, #cached-token: 17, token usage: 0.32, #running-req: 141, #queue-req: 4068
[2025-05-29 08:42:15] Decode batch. #running-req: 144, #token: 6892, token usage: 0.34, cuda graph: False, gen throughput (token/s): 11426.18, #queue-req: 4068
[2025-05-29 08:42:15] Prefill batch. #new-seq: 1, #new-token: 30, #cached-token: 6, token usage: 0.41, #running-req: 143, #queue-req: 4067
[2025-05-29 08:42:15] Prefill batch. #new-seq: 3, #new-token: 90, #cached-token: 18, token usage: 0.44, #running-req: 141, #queue-req: 4064
[2025-05-29 08:42:15] Prefill batch. #new-seq: 2, #new-token: 60, #cached-token: 12, token usage: 0.51, #running-req: 142, #queue-req: 4062
[2025-05-29 08:42:15] Prefill batch. #new-seq: 1, #new-token: 30, #cached-token: 6, token usage: 0.54, #running-req: 142, #queue-req: 4061
[2025-05-29 08:42:15] Decode batch. #running-req: 143, #token: 11934, token usage: 0.58, cuda graph: False, gen throughput (token/s): 14242.58, #queue-req: 4061
[2025-05-29 08:42:15] INFO:     127.0.0.1:58660 - "POST /v1/batches/batch_c3c9648c-8f7a-4400-a404-86c7fd24c305/cancel HTTP/1.1" 200 OK
Cancellation initiated. Status: cancelling
[2025-05-29 08:42:18] INFO:     127.0.0.1:58660 - "GET /v1/batches/batch_c3c9648c-8f7a-4400-a404-86c7fd24c305 HTTP/1.1" 200 OK
Current status: cancelled
Batch job successfully cancelled
[2025-05-29 08:42:18] INFO:     127.0.0.1:58660 - "DELETE /v1/files/backend_input_file-34341a85-aeda-48b0-8f8c-919183ec9790 HTTP/1.1" 200 OK
Successfully cleaned up input file
Successfully deleted local batch_requests.jsonl file
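
The polling loop in the cell above waits without an upper bound. As a minimal sketch, the same wait can be wrapped in a helper that gives up after a deadline. The names `wait_for_terminal_status` and `TERMINAL_STATUSES` are hypothetical and not part of SGLang or the OpenAI client. The status set follows the OpenAI Batch object, and a given server may only ever report some of these values. The helper takes any callable that returns an object with a `status` attribute, so `client.batches.retrieve` can be passed in directly.

```python
import time
from types import SimpleNamespace

# Terminal statuses as listed for the OpenAI Batch object (assumption: the
# server you talk to may only ever emit a subset of these).
TERMINAL_STATUSES = {"completed", "failed", "expired", "cancelled"}


def wait_for_terminal_status(
    retrieve, batch_id, timeout_s=60.0, poll_interval_s=3.0, sleep=time.sleep
):
    """Poll retrieve(batch_id) until the job is terminal or the timeout passes.

    `retrieve` is any callable returning an object with a `.status` attribute,
    for example `client.batches.retrieve`.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        job = retrieve(batch_id)
        if job.status in TERMINAL_STATUSES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Batch {batch_id} still '{job.status}' after {timeout_s} seconds"
            )
        sleep(poll_interval_s)


# Offline demonstration with a stub in place of a real client.
_fake_statuses = iter(["cancelling", "cancelling", "cancelled"])


def _fake_retrieve(batch_id):
    return SimpleNamespace(id=batch_id, status=next(_fake_statuses))


final_job = wait_for_terminal_status(
    _fake_retrieve, "batch_demo", timeout_s=5.0, poll_interval_s=0.0
)
print(final_job.status)
```

Against a running server, the call would look like `wait_for_terminal_status(client.batches.retrieve, batch_job.id)`, followed by a check that the returned status is the one you expected.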
[11]:
terminate_process(server_process)
[2025-05-29 08:42:18] Child process unexpectedly failed with an exit code 9. pid=2258470
[2025-05-29 08:42:18] Child process unexpectedly failed with an exit code 9. pid=2258391