OpenAI APIs - Completions#
SGLang provides OpenAI-compatible APIs to enable a smooth transition from OpenAI services to self-hosted local models. A complete reference for the API is available in the OpenAI API Reference.
This tutorial covers the following popular APIs:
chat/completions
completions
batches
Check out other tutorials to learn about vision APIs for vision-language models and embedding APIs for embedding models.
Launch A Server#
Launch the server in your terminal and wait for it to initialize.
[1]:
from sglang.utils import (
    execute_shell_command,
    wait_for_server,
    terminate_process,
    print_highlight,
)
server_process = execute_shell_command(
    "python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --port 30000 --host 0.0.0.0"
)
wait_for_server("http://localhost:30000")
[2025-01-13 13:05:51] server_args=ServerArgs(model_path='meta-llama/Meta-Llama-3.1-8B-Instruct', tokenizer_path='meta-llama/Meta-Llama-3.1-8B-Instruct', tokenizer_mode='auto', load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, quantization=None, context_length=None, device='cuda', served_model_name='meta-llama/Meta-Llama-3.1-8B-Instruct', chat_template=None, is_embedding=False, revision=None, skip_tokenizer_init=False, host='0.0.0.0', port=30000, mem_fraction_static=0.88, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, cpu_offload_gb=0, prefill_only_one_req=False, tp_size=1, stream_interval=1, random_seed=130807188, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False, decode_log_interval=40, dump_requests_folder=40, api_key=None, file_storage_pth='SGLang_storage', enable_cache_report=False, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', lora_paths=None, max_loras_per_batch=8, attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='outlines', speculative_draft_model_path=None, speculative_algorithm=None, speculative_num_steps=5, speculative_num_draft_tokens=64, speculative_eagle_topk=8, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, 
enable_ep_moe=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False)
[2025-01-13 13:06:09 TP0] Init torch distributed begin.
[2025-01-13 13:06:09 TP0] Load weight begin. avail mem=78.81 GB
[2025-01-13 13:06:10 TP0] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:00<00:02, 1.15it/s]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:01<00:01, 1.08it/s]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:02<00:00, 1.07it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00, 1.45it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00, 1.29it/s]
[2025-01-13 13:06:14 TP0] Load weight end. type=LlamaForCausalLM, dtype=torch.bfloat16, avail mem=63.72 GB
[2025-01-13 13:06:14 TP0] KV Cache is allocated. K size: 27.13 GB, V size: 27.13 GB.
[2025-01-13 13:06:14 TP0] Memory pool end. avail mem=8.34 GB
[2025-01-13 13:06:14 TP0] Capture cuda graph begin. This can take up to several minutes.
100%|██████████| 23/23 [00:05<00:00, 4.57it/s]
[2025-01-13 13:06:19 TP0] Capture cuda graph end. Time elapsed: 5.04 s
[2025-01-13 13:06:19 TP0] max_total_num_tokens=444500, max_prefill_tokens=16384, max_running_requests=2049, context_len=131072
[2025-01-13 13:06:19] INFO: Started server process [208396]
[2025-01-13 13:06:19] INFO: Waiting for application startup.
[2025-01-13 13:06:19] INFO: Application startup complete.
[2025-01-13 13:06:19] INFO: Uvicorn running on http://0.0.0.0:30000 (Press CTRL+C to quit)
[2025-01-13 13:06:20] INFO: 127.0.0.1:42268 - "GET /v1/models HTTP/1.1" 200 OK
[2025-01-13 13:06:20] INFO: 127.0.0.1:42276 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-01-13 13:06:20 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-13 13:06:23] INFO: 127.0.0.1:42290 - "POST /generate HTTP/1.1" 200 OK
[2025-01-13 13:06:23] The server is fired up and ready to roll!
NOTE: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are displayed in the original black color, while the notebook outputs are highlighted in blue.
Chat Completions#
Usage#
The server implements the OpenAI Chat Completions API and automatically applies the chat template specified in the Hugging Face tokenizer, if one is available. You can also specify a custom chat template with --chat-template when launching the server.
[2]:
import openai
client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)
print_highlight(f"Response: {response}")
[2025-01-13 13:06:26 TP0] Prefill batch. #new-seq: 1, #new-token: 42, #cached-token: 1, cache hit rate: 2.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-13 13:06:26 TP0] Decode batch. #running-req: 1, #token: 76, token usage: 0.00, gen throughput (token/s): 5.72, #queue-req: 0
[2025-01-13 13:06:26] INFO: 127.0.0.1:57586 - "POST /v1/chat/completions HTTP/1.1" 200 OK
Parameters#
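The chat completions API accepts the same parameters as the OpenAI Chat Completions API; refer to the OpenAI Chat Completions API reference for details. Note that in this run the server log reports that frequency_penalty and presence_penalty are ignored under the default overlap scheduler.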
Here is an example of a detailed chat completion request:
[3]:
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {
            "role": "system",
            "content": "You are a knowledgeable historian who provides concise responses.",
        },
        {"role": "user", "content": "Tell me about ancient Rome"},
        {
            "role": "assistant",
            "content": "Ancient Rome was a civilization centered in Italy.",
        },
        {"role": "user", "content": "What were their major achievements?"},
    ],
    temperature=0.3,  # Lower temperature for more focused responses
    max_tokens=128,  # Reasonable length for a concise response
    top_p=0.95,  # Slightly higher for better fluency
    presence_penalty=0.2,  # Mild penalty to avoid repetition
    frequency_penalty=0.2,  # Mild penalty for more natural language
    n=1,  # Single response is usually more stable
    seed=42,  # Keep for reproducibility
)
print_highlight(response.choices[0].message.content)
[2025-01-13 13:06:26 TP0] Prefill batch. #new-seq: 1, #new-token: 51, #cached-token: 25, cache hit rate: 20.63%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-13 13:06:26 TP0] frequency_penalty, presence_penalty, and repetition_penalty are not supported when using the default overlap scheduler. They will be ignored. Please add `--disable-overlap` when launching the server if you need these features. The speed will be slower in that case.
[2025-01-13 13:06:27 TP0] Decode batch. #running-req: 1, #token: 106, token usage: 0.00, gen throughput (token/s): 126.25, #queue-req: 0
[2025-01-13 13:06:27 TP0] Decode batch. #running-req: 1, #token: 146, token usage: 0.00, gen throughput (token/s): 142.26, #queue-req: 0
[2025-01-13 13:06:27 TP0] Decode batch. #running-req: 1, #token: 186, token usage: 0.00, gen throughput (token/s): 141.80, #queue-req: 0
[2025-01-13 13:06:27] INFO: 127.0.0.1:57586 - "POST /v1/chat/completions HTTP/1.1" 200 OK
1. **Engineering and Architecture**: They built impressive structures like the Colosseum, Pantheon, and aqueducts, showcasing their engineering skills.
2. **Law and Governance**: The Romans developed the Twelve Tables, a foundation for modern law, and established a system of governance that included the Senate and the Assemblies.
3. **Military Conquests**: Rome expanded its territories through a series of military campaigns, creating a vast empire that lasted for centuries.
4. **Language and Literature**: Latin became the language of the empire, and Roman authors like Cicero, Virgil, and O
Streaming mode is also supported.
[4]:
stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Say this is a test"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
[2025-01-13 13:06:27] INFO: 127.0.0.1:57586 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-01-13 13:06:27 TP0] Prefill batch. #new-seq: 1, #new-token: 10, #cached-token: 30, cache hit rate: 33.73%, token usage: 0.00, #running-req: 0, #queue-req: 0
This is only a test.
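When the full reply is needed after streaming, the deltas can be accumulated as they arrive. The sketch below is a minimal illustration: `collect_stream` and `fake_chunk` are hypothetical helper names, and the `SimpleNamespace` objects only mimic the `choices[0].delta.content` attribute path used above so that the example runs without a server.

```python
from types import SimpleNamespace


def collect_stream(stream):
    """Join the text deltas of a chat completion stream into one string."""
    parts = []
    for chunk in stream:
        # Skip chunks that carry no choices at all.
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content
        # Chunks without text (content is None) are ignored.
        if delta is not None:
            parts.append(delta)
    return "".join(parts)


def fake_chunk(text):
    """Build a stand-in object with the same attribute path as a real chunk."""
    return SimpleNamespace(
        choices=[SimpleNamespace(delta=SimpleNamespace(content=text))]
    )


# Hand-written stand-in for a live stream.
fake_stream = [fake_chunk("This is "), fake_chunk("only a test."), fake_chunk(None)]
full_text = collect_stream(fake_stream)
print(full_text)
```

With a live server, the `stream` object returned by `client.chat.completions.create(..., stream=True)` would take the place of `fake_stream`.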
Completions#
Usage#
The Completions API is similar to the Chat Completions API, but without the messages parameter or chat templates.
[5]:
response = client.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    prompt="List 3 countries and their capitals.",
    temperature=0,
    max_tokens=64,
    n=1,
    stop=None,
)
print_highlight(f"Response: {response}")
[2025-01-13 13:06:27 TP0] Prefill batch. #new-seq: 1, #new-token: 8, #cached-token: 1, cache hit rate: 32.57%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-13 13:06:28 TP0] Decode batch. #running-req: 1, #token: 24, token usage: 0.00, gen throughput (token/s): 130.87, #queue-req: 0
[2025-01-13 13:06:28 TP0] Decode batch. #running-req: 1, #token: 64, token usage: 0.00, gen throughput (token/s): 146.65, #queue-req: 0
[2025-01-13 13:06:28] INFO: 127.0.0.1:57586 - "POST /v1/completions HTTP/1.1" 200 OK
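Because the completions endpoint applies no chat template, the caller is responsible for turning a conversation into a single prompt string. The sketch below illustrates that difference under stated assumptions: `to_completion_kwargs` and `naive_render` are hypothetical helpers, and the role-prefixed layout is a deliberately naive stand-in, not the chat template the model was trained with.

```python
def to_completion_kwargs(chat_kwargs, render):
    """Convert chat-style keyword arguments into completions-style ones.

    `render` is a caller-supplied function that turns the message list into
    a single prompt string; the completions endpoint does not do this.
    """
    kwargs = dict(chat_kwargs)  # shallow copy so the input is left untouched
    messages = kwargs.pop("messages")
    kwargs["prompt"] = render(messages)
    return kwargs


def naive_render(messages):
    # Simple "role: content" layout, NOT the model's real chat template.
    lines = [f"{m['role']}: {m['content']}" for m in messages]
    return "\n".join(lines) + "\nassistant:"


chat_kwargs = {
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [
        {"role": "user", "content": "List 3 countries and their capitals."}
    ],
    "temperature": 0,
    "max_tokens": 64,
}
completion_kwargs = to_completion_kwargs(chat_kwargs, naive_render)
print(completion_kwargs["prompt"])
```

The resulting dictionary could be passed as `client.completions.create(**completion_kwargs)`; for real use, the tokenizer's own chat template is likely the better choice for `render`.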
Parameters#
The completions API accepts the same parameters as the OpenAI Completions API. Refer to the OpenAI Completions API reference for more details.
Here is an example of a detailed completions request:
[6]:
response = client.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    prompt="Write a short story about a space explorer.",
    temperature=0.7,  # Moderate temperature for creative writing
    max_tokens=150,  # Longer response for a story
    top_p=0.9,  # Balanced diversity in word choice
    stop=["\n\n", "THE END"],  # Multiple stop sequences
    presence_penalty=0.3,  # Encourage novel elements
    frequency_penalty=0.3,  # Reduce repetitive phrases
    n=1,  # Generate one completion
    seed=123,  # For reproducible results
)
print_highlight(f"Response: {response}")
[2025-01-13 13:06:28 TP0] Prefill batch. #new-seq: 1, #new-token: 9, #cached-token: 1, cache hit rate: 31.35%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-13 13:06:28 TP0] frequency_penalty, presence_penalty, and repetition_penalty are not supported when using the default overlap scheduler. They will be ignored. Please add `--disable-overlap` when launching the server if you need these features. The speed will be slower in that case.
[2025-01-13 13:06:28 TP0] Decode batch. #running-req: 1, #token: 41, token usage: 0.00, gen throughput (token/s): 138.57, #queue-req: 0
[2025-01-13 13:06:28 TP0] Decode batch. #running-req: 1, #token: 81, token usage: 0.00, gen throughput (token/s): 145.36, #queue-req: 0
[2025-01-13 13:06:29 TP0] Decode batch. #running-req: 1, #token: 121, token usage: 0.00, gen throughput (token/s): 144.05, #queue-req: 0
[2025-01-13 13:06:29] INFO: 127.0.0.1:57586 - "POST /v1/completions HTTP/1.1" 200 OK
Structured Outputs (JSON, Regex, EBNF)#
You can specify a JSON schema, regular expression, or EBNF grammar to constrain the model output. The output is guaranteed to follow the given constraint. Only one constraint parameter (json_schema, regex, or ebnf) can be specified per request.
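The one-constraint-per-request rule can be checked on the client side before a request is sent. The following is a minimal sketch, not part of SGLang: `check_single_constraint` is a hypothetical helper, and it assumes the constraint parameters are collected in a plain dictionary.

```python
CONSTRAINT_KEYS = ("json_schema", "regex", "ebnf")


def check_single_constraint(params):
    """Return the name of the single constraint set in `params`, or None.

    Raises ValueError when more than one constraint parameter is present.
    """
    present = [key for key in CONSTRAINT_KEYS if params.get(key) is not None]
    if len(present) > 1:
        raise ValueError(f"Only one constraint is allowed, got: {present}")
    return present[0] if present else None


print(check_single_constraint({"regex": "(Paris|London)"}))
try:
    check_single_constraint({"regex": "(Paris|London)", "ebnf": 'root ::= "Hi"'})
except ValueError as err:
    print(err)
```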
SGLang supports two grammar backends:
Outlines (default): Supports JSON schema and regular expression constraints.
XGrammar: Supports JSON schema and EBNF constraints.
XGrammar currently uses the GGML BNF format.
Initialize the XGrammar backend with the --grammar-backend xgrammar flag:
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--port 30000 --host 0.0.0.0 --grammar-backend [xgrammar|outlines] # xgrammar or outlines (default: outlines)
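The backend choice follows from the constraint type you plan to use. The lookup below is a small sketch that encodes only the support list stated above; `BACKEND_SUPPORT` and `backends_for` are hypothetical names, and the table may not hold for other SGLang versions.

```python
# Constraint types per grammar backend, as listed in this tutorial.
BACKEND_SUPPORT = {
    "outlines": {"json_schema", "regex"},
    "xgrammar": {"json_schema", "ebnf"},
}


def backends_for(constraint):
    """Return the grammar backends that support the given constraint type."""
    return sorted(
        name for name, kinds in BACKEND_SUPPORT.items() if constraint in kinds
    )


for kind in ("json_schema", "regex", "ebnf"):
    print(kind, "->", backends_for(kind))
```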
JSON#
[7]:
import json
json_schema = json.dumps(
    {
        "type": "object",
        "properties": {
            "name": {"type": "string", "pattern": "^[\\w]+$"},
            "population": {"type": "integer"},
        },
        "required": ["name", "population"],
    }
)
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {
            "role": "user",
            "content": "Give me the information of the capital of France in the JSON format.",
        },
    ],
    temperature=0,
    max_tokens=128,
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "foo", "schema": json.loads(json_schema)},
    },
)
print_highlight(response.choices[0].message.content)
[2025-01-13 13:06:29 TP0] Decode batch. #running-req: 0, #token: 0, token usage: 0.00, gen throughput (token/s): 142.44, #queue-req: 0
[2025-01-13 13:06:29 TP0] Prefill batch. #new-seq: 1, #new-token: 19, #cached-token: 30, cache hit rate: 37.61%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-13 13:06:29] INFO: 127.0.0.1:57586 - "POST /v1/chat/completions HTTP/1.1" 200 OK
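To see what the schema above requires, the returned string can be checked on the client side. This is a rough sketch using only the standard library: `conforms` is a hypothetical helper that handles just the subset of JSON Schema used in this example (object, required keys, string pattern, integer), and the sample strings are hand-written rather than real server output.

```python
import json
import re

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string", "pattern": "^[\\w]+$"},
        "population": {"type": "integer"},
    },
    "required": ["name", "population"],
}


def conforms(text, schema):
    """Check a JSON string against the small schema subset used here."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    if any(key not in data for key in schema["required"]):
        return False
    for key, rule in schema["properties"].items():
        if key not in data:
            continue
        value = data[key]
        if rule["type"] == "string":
            if not isinstance(value, str):
                return False
            # JSON Schema patterns are unanchored searches, hence re.search.
            if "pattern" in rule and re.search(rule["pattern"], value) is None:
                return False
        elif rule["type"] == "integer":
            # bool is a subclass of int in Python, so exclude it explicitly.
            if isinstance(value, bool) or not isinstance(value, int):
                return False
    return True


sample = '{"name": "Paris", "population": 2147000}'  # illustrative values only
print(conforms(sample, schema))
print(conforms('{"name": "Paris"}', schema))
```

For full JSON Schema coverage, a dedicated validation library would be the more robust option.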
Regular expression (use default “outlines” backend)#
[8]:
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "What is the capital of France?"},
    ],
    temperature=0,
    max_tokens=128,
    extra_body={"regex": "(Paris|London)"},
)
print_highlight(response.choices[0].message.content)
[2025-01-13 13:06:29 TP0] Prefill batch. #new-seq: 1, #new-token: 12, #cached-token: 30, cache hit rate: 42.75%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-13 13:06:29] INFO: 127.0.0.1:57586 - "POST /v1/chat/completions HTTP/1.1" 200 OK
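Assuming the regex constrains the entire message, a conforming reply can be recognized with a full match on the client side. The helper name `matches_constraint` below is hypothetical, and the candidate strings are hand-written examples.

```python
import re

pattern = "(Paris|London)"


def matches_constraint(text, pattern):
    """Return True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


for candidate in ("Paris", "London", "Paris.", "Berlin"):
    print(candidate, matches_constraint(candidate, pattern))
```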
EBNF (use “xgrammar” backend)#
[9]:
# Terminate the existing server (which uses the default outlines backend) for this demo
terminate_process(server_process)
# Start a new server with the xgrammar backend
server_process = execute_shell_command(
    "python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --port 30000 --host 0.0.0.0 --grammar-backend xgrammar"
)
wait_for_server("http://localhost:30000")
# EBNF example
ebnf_grammar = r"""
root ::= "Hello" | "Hi" | "Hey"
"""
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful EBNF test bot."},
        {"role": "user", "content": "Say a greeting."},
    ],
    temperature=0,
    max_tokens=32,
    extra_body={"ebnf": ebnf_grammar},
)
print_highlight(response.choices[0].message.content)
[2025-01-13 13:06:44] server_args=ServerArgs(model_path='meta-llama/Meta-Llama-3.1-8B-Instruct', tokenizer_path='meta-llama/Meta-Llama-3.1-8B-Instruct', tokenizer_mode='auto', load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, quantization=None, context_length=None, device='cuda', served_model_name='meta-llama/Meta-Llama-3.1-8B-Instruct', chat_template=None, is_embedding=False, revision=None, skip_tokenizer_init=False, host='0.0.0.0', port=30000, mem_fraction_static=0.88, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, cpu_offload_gb=0, prefill_only_one_req=False, tp_size=1, stream_interval=1, random_seed=744821931, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False, decode_log_interval=40, dump_requests_folder=40, api_key=None, file_storage_pth='SGLang_storage', enable_cache_report=False, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', lora_paths=None, max_loras_per_batch=8, attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='xgrammar', speculative_draft_model_path=None, speculative_algorithm=None, speculative_num_steps=5, speculative_num_draft_tokens=64, speculative_eagle_topk=8, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, 
enable_ep_moe=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False)
[2025-01-13 13:07:02 TP0] Init torch distributed begin.
[2025-01-13 13:07:02 TP0] Load weight begin. avail mem=78.81 GB
[2025-01-13 13:07:04 TP0] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:00<00:02, 1.16it/s]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:01<00:01, 1.09it/s]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:02<00:00, 1.08it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00, 1.46it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00, 1.30it/s]
[2025-01-13 13:07:07 TP0] Load weight end. type=LlamaForCausalLM, dtype=torch.bfloat16, avail mem=63.72 GB
[2025-01-13 13:07:07 TP0] KV Cache is allocated. K size: 27.13 GB, V size: 27.13 GB.
[2025-01-13 13:07:07 TP0] Memory pool end. avail mem=8.34 GB
[2025-01-13 13:07:07 TP0] Capture cuda graph begin. This can take up to several minutes.
100%|██████████| 23/23 [00:04<00:00, 4.70it/s]
[2025-01-13 13:07:12 TP0] Capture cuda graph end. Time elapsed: 4.90 s
[2025-01-13 13:07:13 TP0] max_total_num_tokens=444500, max_prefill_tokens=16384, max_running_requests=2049, context_len=131072
[2025-01-13 13:07:13] INFO: Started server process [209345]
[2025-01-13 13:07:13] INFO: Waiting for application startup.
[2025-01-13 13:07:13] INFO: Application startup complete.
[2025-01-13 13:07:13] INFO: Uvicorn running on http://0.0.0.0:30000 (Press CTRL+C to quit)
[2025-01-13 13:07:14] INFO: 127.0.0.1:48182 - "GET /v1/models HTTP/1.1" 200 OK
[2025-01-13 13:07:14] INFO: 127.0.0.1:48194 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-01-13 13:07:14 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-13 13:07:16] INFO: 127.0.0.1:48196 - "POST /generate HTTP/1.1" 200 OK
[2025-01-13 13:07:16] The server is fired up and ready to roll!
[2025-01-13 13:07:19 TP0] Prefill batch. #new-seq: 1, #new-token: 48, #cached-token: 1, cache hit rate: 1.79%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-13 13:07:19] INFO: 127.0.0.1:48206 - "POST /v1/chat/completions HTTP/1.1" 200 OK
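The grammar above allows exactly three replies. As a toy illustration, the sketch below pulls the quoted literals out of a rule written as a plain alternation; `literal_alternatives` is a hypothetical helper and is in no way a general EBNF parser.

```python
import re

ebnf_grammar = r"""
root ::= "Hello" | "Hi" | "Hey"
"""


def literal_alternatives(grammar, rule="root"):
    """Extract quoted literals from a rule of the form: rule ::= "a" | "b"."""
    for line in grammar.strip().splitlines():
        name, sep, body = line.partition("::=")
        if sep and name.strip() == rule:
            return re.findall(r'"([^"]*)"', body)
    raise KeyError(rule)


allowed = literal_alternatives(ebnf_grammar)
print(allowed)
```

A reply from the constrained request could then be compared against `allowed` as a quick sanity check.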
Batches#
The Batches API is also supported for both chat completions and completions. You can upload your requests in jsonl files, create a batch job, and retrieve the results when the batch job is completed (which takes longer but costs less).
The batches APIs are:
batches
batches/{batch_id}/cancel
batches/{batch_id}
Here is an example of a batch job for chat completions; batch jobs for completions work similarly.
[10]:
import json
import time
from openai import OpenAI
client = OpenAI(base_url="http://127.0.0.1:30000/v1", api_key="None")
requests = [
    {
        "custom_id": "request-1",
        "method": "POST",
        "url": "/chat/completions",
        "body": {
            "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
            "messages": [
                {"role": "user", "content": "Tell me a joke about programming"}
            ],
            "max_tokens": 50,
        },
    },
    {
        "custom_id": "request-2",
        "method": "POST",
        "url": "/chat/completions",
        "body": {
            "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
            "messages": [{"role": "user", "content": "What is Python?"}],
            "max_tokens": 50,
        },
    },
]
input_file_path = "batch_requests.jsonl"
with open(input_file_path, "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")
with open(input_file_path, "rb") as f:
    file_response = client.files.create(file=f, purpose="batch")
batch_response = client.batches.create(
    input_file_id=file_response.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print_highlight(f"Batch job created with ID: {batch_response.id}")
[2025-01-13 13:07:19] INFO: 127.0.0.1:48214 - "POST /v1/files HTTP/1.1" 200 OK
[2025-01-13 13:07:19] INFO: 127.0.0.1:48214 - "POST /v1/batches HTTP/1.1" 200 OK
[2025-01-13 13:07:19 TP0] Prefill batch. #new-seq: 2, #new-token: 30, #cached-token: 50, cache hit rate: 37.50%, token usage: 0.00, #running-req: 0, #queue-req: 0
[11]:
while batch_response.status not in ["completed", "failed", "cancelled"]:
    time.sleep(3)
    print(f"Batch job status: {batch_response.status}...trying again in 3 seconds...")
    batch_response = client.batches.retrieve(batch_response.id)
if batch_response.status == "completed":
    print("Batch job completed successfully!")
    print(f"Request counts: {batch_response.request_counts}")
    result_file_id = batch_response.output_file_id
    file_response = client.files.content(result_file_id)
    result_content = file_response.read().decode("utf-8")
    results = [
        json.loads(line) for line in result_content.split("\n") if line.strip() != ""
    ]
    for result in results:
        print_highlight(f"Request {result['custom_id']}:")
        print_highlight(f"Response: {result['response']}")
    print_highlight("Cleaning up files...")
    # Only delete the result file ID since file_response is just content
    client.files.delete(result_file_id)
else:
    print_highlight(f"Batch job failed with status: {batch_response.status}")
    if hasattr(batch_response, "errors"):
        print_highlight(f"Errors: {batch_response.errors}")
[2025-01-13 13:07:19 TP0] Decode batch. #running-req: 2, #token: 112, token usage: 0.00, gen throughput (token/s): 10.18, #queue-req: 0
Batch job status: validating...trying again in 3 seconds...
[2025-01-13 13:07:22] INFO: 127.0.0.1:48214 - "GET /v1/batches/batch_c751d37f-370e-40fc-8323-d6282a366e83 HTTP/1.1" 200 OK
Batch job completed successfully!
Request counts: BatchRequestCounts(completed=2, failed=0, total=2)
[2025-01-13 13:07:22] INFO: 127.0.0.1:48214 - "GET /v1/files/backend_result_file-edb34d66-d90c-462b-8e24-82b968f49fc0/content HTTP/1.1" 200 OK
[2025-01-13 13:07:22] INFO: 127.0.0.1:48214 - "DELETE /v1/files/backend_result_file-edb34d66-d90c-462b-8e24-82b968f49fc0 HTTP/1.1" 200 OK
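Once the result file has been downloaded, it is often convenient to index the responses by custom_id, since that field is what ties each result back to its request. The sketch below assumes only the two fields read in the example above (custom_id and response); `parse_batch_results` is a hypothetical helper, and the response bodies in the sample are placeholders rather than real server output.

```python
import json


def parse_batch_results(result_content):
    """Map each custom_id to its response from a JSONL result string."""
    results = {}
    for line in result_content.splitlines():
        if not line.strip():
            continue  # tolerate blank lines, e.g. a trailing newline
        record = json.loads(line)
        results[record["custom_id"]] = record["response"]
    return results


# Hand-written stand-in for the downloaded file content.
sample_content = (
    '{"custom_id": "request-1", "response": {"placeholder": 1}}\n'
    '{"custom_id": "request-2", "response": {"placeholder": 2}}\n'
    "\n"
)
by_id = parse_batch_results(sample_content)
print(sorted(by_id))
```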
It takes a while to complete the batch job. You can use these two APIs to retrieve the batch job status or cancel the batch job.
batches/{batch_id}: Retrieve the batch job status.
batches/{batch_id}/cancel: Cancel the batch job.
Here is an example that checks the batch job status.
[12]:
import json
import time
from openai import OpenAI
client = OpenAI(base_url="http://127.0.0.1:30000/v1", api_key="None")
requests = []
for i in range(100):
    requests.append(
        {
            "custom_id": f"request-{i}",
            "method": "POST",
            "url": "/chat/completions",
            "body": {
                "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
                "messages": [
                    {
                        "role": "system",
                        "content": f"{i}: You are a helpful AI assistant",
                    },
                    {
                        "role": "user",
                        "content": "Write a detailed story about topic. Make it very long.",
                    },
                ],
                "max_tokens": 500,
            },
        }
    )
input_file_path = "batch_requests.jsonl"
with open(input_file_path, "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")
with open(input_file_path, "rb") as f:
    uploaded_file = client.files.create(file=f, purpose="batch")
batch_job = client.batches.create(
    input_file_id=uploaded_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print_highlight(f"Created batch job with ID: {batch_job.id}")
print_highlight(f"Initial status: {batch_job.status}")
time.sleep(10)
max_checks = 5
for i in range(max_checks):
    batch_details = client.batches.retrieve(batch_id=batch_job.id)
    print_highlight(
        f"Batch job details (check {i+1} / {max_checks}) // ID: {batch_details.id} // Status: {batch_details.status} // Created at: {batch_details.created_at} // Input file ID: {batch_details.input_file_id} // Output file ID: {batch_details.output_file_id}"
    )
    print_highlight(
        f"<strong>Request counts: Total: {batch_details.request_counts.total} // Completed: {batch_details.request_counts.completed} // Failed: {batch_details.request_counts.failed}</strong>"
    )
    time.sleep(3)
[2025-01-13 13:07:22] INFO: 127.0.0.1:48218 - "POST /v1/files HTTP/1.1" 200 OK
[2025-01-13 13:07:22] INFO: 127.0.0.1:48218 - "POST /v1/batches HTTP/1.1" 200 OK
[2025-01-13 13:07:22 TP0] Prefill batch. #new-seq: 6, #new-token: 180, #cached-token: 150, cache hit rate: 43.13%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-13 13:07:22 TP0] Prefill batch. #new-seq: 94, #new-token: 2820, #cached-token: 2350, cache hit rate: 45.26%, token usage: 0.00, #running-req: 6, #queue-req: 0
[2025-01-13 13:07:23 TP0] Decode batch. #running-req: 100, #token: 5125, token usage: 0.01, gen throughput (token/s): 676.44, #queue-req: 0
[2025-01-13 13:07:23 TP0] Decode batch. #running-req: 100, #token: 9125, token usage: 0.02, gen throughput (token/s): 11767.40, #queue-req: 0
[2025-01-13 13:07:23 TP0] Decode batch. #running-req: 100, #token: 13125, token usage: 0.03, gen throughput (token/s): 11519.25, #queue-req: 0
[2025-01-13 13:07:24 TP0] Decode batch. #running-req: 100, #token: 17125, token usage: 0.04, gen throughput (token/s): 11263.89, #queue-req: 0
[2025-01-13 13:07:24 TP0] Decode batch. #running-req: 100, #token: 21125, token usage: 0.05, gen throughput (token/s): 11011.44, #queue-req: 0
[2025-01-13 13:07:24 TP0] Decode batch. #running-req: 100, #token: 25125, token usage: 0.06, gen throughput (token/s): 10752.32, #queue-req: 0
[2025-01-13 13:07:25 TP0] Decode batch. #running-req: 100, #token: 29125, token usage: 0.07, gen throughput (token/s): 10516.25, #queue-req: 0
[2025-01-13 13:07:25 TP0] Decode batch. #running-req: 100, #token: 33125, token usage: 0.07, gen throughput (token/s): 10295.15, #queue-req: 0
[2025-01-13 13:07:25 TP0] Decode batch. #running-req: 100, #token: 37125, token usage: 0.08, gen throughput (token/s): 10073.97, #queue-req: 0
[2025-01-13 13:07:26 TP0] Decode batch. #running-req: 100, #token: 41125, token usage: 0.09, gen throughput (token/s): 9861.03, #queue-req: 0
[2025-01-13 13:07:26 TP0] Decode batch. #running-req: 100, #token: 45125, token usage: 0.10, gen throughput (token/s): 9664.78, #queue-req: 0
[2025-01-13 13:07:27 TP0] Decode batch. #running-req: 100, #token: 49125, token usage: 0.11, gen throughput (token/s): 9485.52, #queue-req: 0
[2025-01-13 13:07:27 TP0] Decode batch. #running-req: 0, #token: 0, token usage: 0.00, gen throughput (token/s): 9266.11, #queue-req: 0
[2025-01-13 13:07:32] INFO: 127.0.0.1:54286 - "GET /v1/batches/batch_01fc9722-333d-40ae-90b9-6c4c3a5de5cd HTTP/1.1" 200 OK
[2025-01-13 13:07:35] INFO: 127.0.0.1:54286 - "GET /v1/batches/batch_01fc9722-333d-40ae-90b9-6c4c3a5de5cd HTTP/1.1" 200 OK
[2025-01-13 13:07:38] INFO: 127.0.0.1:54286 - "GET /v1/batches/batch_01fc9722-333d-40ae-90b9-6c4c3a5de5cd HTTP/1.1" 200 OK
[2025-01-13 13:07:41] INFO: 127.0.0.1:54286 - "GET /v1/batches/batch_01fc9722-333d-40ae-90b9-6c4c3a5de5cd HTTP/1.1" 200 OK
[2025-01-13 13:07:44] INFO: 127.0.0.1:54286 - "GET /v1/batches/batch_01fc9722-333d-40ae-90b9-6c4c3a5de5cd HTTP/1.1" 200 OK
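The status checks above run a fixed number of times. A variation is to stop as soon as the job reaches a terminal status while still capping the number of polls. The sketch below is one way to do that under stated assumptions: `wait_for_batch` is a hypothetical helper, `retrieve` stands in for `client.batches.retrieve`, and the scripted status sequence replaces a live server.

```python
import time
from types import SimpleNamespace

TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def wait_for_batch(retrieve, batch_id, interval=3.0, max_checks=5, sleep=time.sleep):
    """Poll `retrieve(batch_id)` until a terminal status or `max_checks` polls.

    Returns the last batch object seen; the caller should inspect its status,
    because the job may still be running when the poll budget is exhausted.
    """
    batch = retrieve(batch_id)
    for _ in range(max_checks - 1):
        if batch.status in TERMINAL_STATUSES:
            break
        sleep(interval)
        batch = retrieve(batch_id)
    return batch


# Stand-in for client.batches.retrieve that reports a scripted status sequence.
statuses = iter(["validating", "validating", "completed"])


def fake_retrieve(batch_id):
    return SimpleNamespace(id=batch_id, status=next(statuses))


# Sleeping is disabled here so the demo finishes immediately.
final = wait_for_batch(fake_retrieve, "batch_demo", sleep=lambda seconds: None)
print(final.status)
```

With a live server, the call would look like `wait_for_batch(client.batches.retrieve, batch_job.id)`.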
Here is an example that cancels a batch job.
[13]:
import json
import time
from openai import OpenAI
import os
client = OpenAI(base_url="http://127.0.0.1:30000/v1", api_key="None")
requests = []
for i in range(500):
    requests.append(
        {
            "custom_id": f"request-{i}",
            "method": "POST",
            "url": "/chat/completions",
            "body": {
                "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
                "messages": [
                    {
                        "role": "system",
                        "content": f"{i}: You are a helpful AI assistant",
                    },
                    {
                        "role": "user",
                        "content": "Write a detailed story about topic. Make it very long.",
                    },
                ],
                "max_tokens": 500,
            },
        }
    )
input_file_path = "batch_requests.jsonl"
with open(input_file_path, "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")
with open(input_file_path, "rb") as f:
    uploaded_file = client.files.create(file=f, purpose="batch")
batch_job = client.batches.create(
    input_file_id=uploaded_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print_highlight(f"Created batch job with ID: {batch_job.id}")
print_highlight(f"Initial status: {batch_job.status}")
time.sleep(10)
try:
    cancelled_job = client.batches.cancel(batch_id=batch_job.id)
    print_highlight(f"Cancellation initiated. Status: {cancelled_job.status}")
    assert cancelled_job.status == "cancelling"
    # Monitor the cancellation process
    while cancelled_job.status not in ["failed", "cancelled"]:
        time.sleep(3)
        cancelled_job = client.batches.retrieve(batch_job.id)
        print_highlight(f"Current status: {cancelled_job.status}")
    # Verify final status
    assert cancelled_job.status == "cancelled"
    print_highlight("Batch job successfully cancelled")
except Exception as e:
    print_highlight(f"Error during cancellation: {e}")
    raise e
finally:
    try:
        del_response = client.files.delete(uploaded_file.id)
        if del_response.deleted:
            print_highlight("Successfully cleaned up input file")
        if os.path.exists(input_file_path):
            os.remove(input_file_path)
            print_highlight("Successfully deleted local batch_requests.jsonl file")
    except Exception as e:
        print_highlight(f"Error cleaning up: {e}")
        raise e
[2025-01-13 13:07:47] INFO: 127.0.0.1:41082 - "POST /v1/files HTTP/1.1" 200 OK
[2025-01-13 13:07:47] INFO: 127.0.0.1:41082 - "POST /v1/batches HTTP/1.1" 200 OK
[2025-01-13 13:07:47 TP0] Prefill batch. #new-seq: 7, #new-token: 7, #cached-token: 378, cache hit rate: 48.65%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-13 13:07:47 TP0] Prefill batch. #new-seq: 267, #new-token: 5313, #cached-token: 9372, cache hit rate: 59.41%, token usage: 0.01, #running-req: 7, #queue-req: 0
[2025-01-13 13:07:48 TP0] Prefill batch. #new-seq: 226, #new-token: 6780, #cached-token: 5650, cache hit rate: 54.17%, token usage: 0.02, #running-req: 274, #queue-req: 0
[2025-01-13 13:07:49 TP0] Decode batch. #running-req: 500, #token: 35525, token usage: 0.08, gen throughput (token/s): 934.01, #queue-req: 0
[2025-01-13 13:07:49 TP0] Decode batch. #running-req: 500, #token: 55525, token usage: 0.12, gen throughput (token/s): 26362.65, #queue-req: 0
[2025-01-13 13:07:50 TP0] Decode batch. #running-req: 500, #token: 75525, token usage: 0.17, gen throughput (token/s): 24778.30, #queue-req: 0
[2025-01-13 13:07:51 TP0] Decode batch. #running-req: 500, #token: 95525, token usage: 0.21, gen throughput (token/s): 23693.58, #queue-req: 0
[2025-01-13 13:07:52 TP0] Decode batch. #running-req: 500, #token: 115525, token usage: 0.26, gen throughput (token/s): 22592.69, #queue-req: 0
[2025-01-13 13:07:53 TP0] Decode batch. #running-req: 500, #token: 135525, token usage: 0.30, gen throughput (token/s): 21671.40, #queue-req: 0
[2025-01-13 13:07:54 TP0] Decode batch. #running-req: 500, #token: 155525, token usage: 0.35, gen throughput (token/s): 20805.78, #queue-req: 0
[2025-01-13 13:07:55 TP0] Decode batch. #running-req: 500, #token: 175525, token usage: 0.39, gen throughput (token/s): 19975.06, #queue-req: 0
[2025-01-13 13:07:56 TP0] Decode batch. #running-req: 500, #token: 195525, token usage: 0.44, gen throughput (token/s): 19243.23, #queue-req: 0
[2025-01-13 13:07:57 TP0] Decode batch. #running-req: 500, #token: 215525, token usage: 0.48, gen throughput (token/s): 18601.41, #queue-req: 0
[2025-01-13 13:07:57] INFO: 127.0.0.1:55676 - "POST /v1/batches/batch_f85adaf4-def2-4309-9c30-28d097c12e10/cancel HTTP/1.1" 200 OK
[2025-01-13 13:08:00] INFO: 127.0.0.1:55676 - "GET /v1/batches/batch_f85adaf4-def2-4309-9c30-28d097c12e10 HTTP/1.1" 200 OK
[2025-01-13 13:08:00] INFO: 127.0.0.1:55676 - "DELETE /v1/files/backend_input_file-f33e0a4f-69ab-499c-845c-da855615ddb5 HTTP/1.1" 200 OK
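The batch examples above build their input lines with the same loop. A small helper can factor that out, as in the sketch below; `make_chat_request` and `write_jsonl` are hypothetical names, the field layout mirrors the requests used above, and a temporary directory keeps the demo from leaving files behind.

```python
import json
import os
import tempfile


def make_chat_request(index, user_content, model, max_tokens=500):
    """Build one batch input line as a dict, mirroring the fields used above."""
    return {
        "custom_id": f"request-{index}",
        "method": "POST",
        "url": "/chat/completions",
        "body": {
            "model": model,
            "messages": [
                {"role": "system", "content": f"{index}: You are a helpful AI assistant"},
                {"role": "user", "content": user_content},
            ],
            "max_tokens": max_tokens,
        },
    }


def write_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


batch_requests = [
    make_chat_request(i, "What is Python?", "meta-llama/Meta-Llama-3.1-8B-Instruct")
    for i in range(3)
]
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "batch_requests.jsonl")
    write_jsonl(path, batch_requests)
    # Read the file back to confirm each line is a standalone JSON object.
    with open(path) as f:
        round_trip = [json.loads(line) for line in f]
print(len(round_trip), round_trip[0]["custom_id"])
```

The written file could then be uploaded with `client.files.create(file=f, purpose="batch")` exactly as in the examples above.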
[14]:
terminate_process(server_process)