Sending Requests#

This notebook provides a quick-start guide to use SGLang in chat completions after installation.

For Vision Language Models, see OpenAI APIs - Vision.
For Embedding Models, see OpenAI APIs - Embedding and Encode (embedding model).
For Reward Models, see Classify (reward model).

Launch A Server#

[1]:

from sglang.test.test_utils import is_in_ci
from sglang.utils import wait_for_server, print_highlight, terminate_process

if is_in_ci():
    from patch import launch_server_cmd
else:
    from sglang.utils import launch_server_cmd

# This is equivalent to running the following command in your terminal

# python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct --host 0.0.0.0

server_process, port = launch_server_cmd(
    """
python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct \
 --host 0.0.0.0
"""
)

wait_for_server(f"http://localhost:{port}")

[2025-06-30 02:31:33] server_args=ServerArgs(model_path='qwen/qwen2.5-0.5b-instruct', tokenizer_path='qwen/qwen2.5-0.5b-instruct', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, quantization_param_path=None, context_length=None, device='cuda', served_model_name='qwen/qwen2.5-0.5b-instruct', chat_template=None, completion_template=None, is_embedding=False, enable_multimodal=None, revision=None, hybrid_kvcache_ratio=None, impl='auto', host='0.0.0.0', port=36255, mem_fraction_static=0.874, max_running_requests=200, max_total_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, tp_size=1, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=583546336, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, show_time_cost=False, enable_metrics=False, bucket_time_to_first_token=None, bucket_e2e_request_latency=None, bucket_inter_token_latency=None, collect_tokens_histogram=False, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, api_key=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, speculative_algorithm=None, speculative_draft_model_path=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, ep_size=1, enable_ep_moe=False, enable_deepep_moe=False, enable_flashinfer_moe=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, cuda_graph_max_bs=None, cuda_graph_bs=None, disable_cuda_graph=True, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_nccl_nvls=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, disable_overlap_schedule=False, disable_overlap_cg_plan=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_torch_compile=False, torch_compile_max_bs=32, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through_selective', flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, enable_return_hidden_states=False, warmups=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, debug_tensor_dump_prefill_only=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, num_reserved_decode_tokens=512, pdlb_url=None, custom_weight_loader=[], weight_loader_disable_mmap=False)
[2025-06-30 02:31:45] Attention backend not set. Use fa3 backend by default.
[2025-06-30 02:31:45] Init torch distributed begin.
[2025-06-30 02:31:46] Init torch distributed ends. mem usage=0.00 GB
[2025-06-30 02:31:47] Load weight begin. avail mem=78.50 GB
[2025-06-30 02:31:47] Using model weights format ['*.safetensors']
[2025-06-30 02:31:47] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  6.18it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  6.17it/s]

[2025-06-30 02:31:48] Load weight end. type=Qwen2ForCausalLM, dtype=torch.bfloat16, avail mem=77.52 GB, mem usage=0.98 GB.
[2025-06-30 02:31:48] KV Cache is allocated. #tokens: 20480, K size: 0.12 GB, V size: 0.12 GB
[2025-06-30 02:31:48] Memory pool end. avail mem=77.11 GB
[2025-06-30 02:31:48] max_total_num_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=200, context_len=32768, available_gpu_mem=77.02 GB
[2025-06-30 02:31:49] INFO:     Started server process [1979243]
[2025-06-30 02:31:49] INFO:     Waiting for application startup.
[2025-06-30 02:31:49] INFO:     Application startup complete.
[2025-06-30 02:31:49] INFO:     Uvicorn running on http://0.0.0.0:36255 (Press CTRL+C to quit)
[2025-06-30 02:31:49] INFO:     127.0.0.1:60904 - "GET /v1/models HTTP/1.1" 200 OK
[2025-06-30 02:31:50] INFO:     127.0.0.1:60920 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-06-30 02:31:50] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, #token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-06-30 02:31:51] INFO:     127.0.0.1:60936 - "POST /generate HTTP/1.1" 200 OK
[2025-06-30 02:31:51] The server is fired up and ready to roll!

NOTE: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are displayed in the original black color, while the notebook outputs are highlighted in blue.
We are running those notebooks in a CI parallel environment, so the throughput is not representative of the actual performance.

Using cURL#

[2]:

import subprocess, json

curl_command = f"""
curl -s http://localhost:{port}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{{"model": "qwen/qwen2.5-0.5b-instruct", "messages": [{{"role": "user", "content": "What is the capital of France?"}}]}}'
"""

response = json.loads(subprocess.check_output(curl_command, shell=True))
print_highlight(response)

[2025-06-30 02:31:54] Prefill batch. #new-seq: 1, #new-token: 36, #cached-token: 0, #token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-06-30 02:31:54] INFO:     127.0.0.1:50642 - "POST /v1/chat/completions HTTP/1.1" 200 OK

{'id': '2685e057d57a4e3fb1f241180d3f72e2', 'object': 'chat.completion', 'created': 1751250714, 'model': 'qwen/qwen2.5-0.5b-instruct', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': 'The capital of France is Paris.', 'reasoning_content': None, 'tool_calls': None}, 'logprobs': None, 'finish_reason': 'stop', 'matched_stop': 151645}], 'usage': {'prompt_tokens': 36, 'total_tokens': 44, 'completion_tokens': 8, 'prompt_tokens_details': None}}

Using Python Requests#

[3]:

import requests

url = f"http://localhost:{port}/v1/chat/completions"

data = {
    "model": "qwen/qwen2.5-0.5b-instruct",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
}

response = requests.post(url, json=data)
print_highlight(response.json())

[2025-06-30 02:31:54] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 35, #token: 35, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-06-30 02:31:54] INFO:     127.0.0.1:50646 - "POST /v1/chat/completions HTTP/1.1" 200 OK

{'id': 'dec0cdd97b1b4b0da0c68141569da478', 'object': 'chat.completion', 'created': 1751250714, 'model': 'qwen/qwen2.5-0.5b-instruct', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': 'The capital of France is Paris.', 'reasoning_content': None, 'tool_calls': None}, 'logprobs': None, 'finish_reason': 'stop', 'matched_stop': 151645}], 'usage': {'prompt_tokens': 36, 'total_tokens': 44, 'completion_tokens': 8, 'prompt_tokens_details': None}}

Using OpenAI Python Client#

[4]:

import openai

client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

response = client.chat.completions.create(
    model="qwen/qwen2.5-0.5b-instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)
print_highlight(response)

[2025-06-30 02:31:55] Prefill batch. #new-seq: 1, #new-token: 13, #cached-token: 24, #token: 24, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-06-30 02:31:55] Decode batch. #running-req: 1, #token: 54, token usage: 0.00, cuda graph: False, gen throughput (token/s): 6.11, #queue-req: 0
[2025-06-30 02:31:55] INFO:     127.0.0.1:50648 - "POST /v1/chat/completions HTTP/1.1" 200 OK

ChatCompletion(id='d62d60374a694d85af0d7e48fb3589bb', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Sure, here are three countries and their respective capitals:\n\n1. **United States** - Washington, D.C.\n2. **Canada** - Ottawa\n3. **Australia** - Canberra', refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=None, reasoning_content=None), matched_stop=151645)], created=1751250715, model='qwen/qwen2.5-0.5b-instruct', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=39, prompt_tokens=37, total_tokens=76, completion_tokens_details=None, prompt_tokens_details=None))

Streaming#

[5]:

import openai

client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

# Use stream=True for streaming responses
response = client.chat.completions.create(
    model="qwen/qwen2.5-0.5b-instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
    stream=True,
)

# Handle the streaming output
for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

[2025-06-30 02:31:55] INFO:     127.0.0.1:50654 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-06-30 02:31:55] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 36, #token: 36, token usage: 0.00, #running-req: 0, #queue-req: 0
Sure, here are three countries and their respective capitals:

1. **United States**[2025-06-30 02:31:55] Decode batch. #running-req: 1, #token: 55, token usage: 0.00, cuda graph: False, gen throughput (token/s): 102.18, #queue-req: 0
 - Washington, D.C.
2. **Canada** - Ottawa
3. **Australia** - Canberra

Using Native Generation APIs#

You can also use the native /generate endpoint with requests, which provides more flexibility. An API reference is available at Sampling Parameters.

[6]:

import requests

response = requests.post(
    f"http://localhost:{port}/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
        },
    },
)

print_highlight(response.json())

[2025-06-30 02:31:55] Prefill batch. #new-seq: 1, #new-token: 3, #cached-token: 2, #token: 2, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-06-30 02:31:55] Decode batch. #running-req: 1, #token: 24, token usage: 0.00, cuda graph: False, gen throughput (token/s): 118.65, #queue-req: 0
[2025-06-30 02:31:56] INFO:     127.0.0.1:50668 - "POST /generate HTTP/1.1" 200 OK

{'text': ' Paris. It is the largest city in Europe and the second largest city in the world. It is located in the south of France, on the banks of the', 'meta_info': {'id': '903fc4bee0cd40d2afa969cdafda21d9', 'finish_reason': {'type': 'length', 'length': 32}, 'prompt_tokens': 5, 'completion_tokens': 32, 'cached_tokens': 2, 'e2e_latency': 0.26262784004211426}}

Streaming#

[7]:

import requests, json

response = requests.post(
    f"http://localhost:{port}/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
        },
        "stream": True,
    },
    stream=True,
)

prev = 0
for chunk in response.iter_lines(decode_unicode=False):
    chunk = chunk.decode("utf-8")
    if chunk and chunk.startswith("data:"):
        if chunk == "data: [DONE]":
            break
        data = json.loads(chunk[5:].strip("\n"))
        output = data["text"]
        print(output[prev:], end="", flush=True)
        prev = len(output)

[2025-06-30 02:31:56] INFO:     127.0.0.1:50678 - "POST /generate HTTP/1.1" 200 OK
[2025-06-30 02:31:56] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 4, #token: 4, token usage: 0.00, #running-req: 0, #queue-req: 0
 Paris. It is the largest city in Europe and the second largest city in the world. It is located in the south of France[2025-06-30 02:31:56] Decode batch. #running-req: 1, #token: 32, token usage: 0.00, cuda graph: False, gen throughput (token/s): 118.43, #queue-req: 0
, on the banks of the

[8]:

terminate_process(server_process)

[2025-06-30 02:31:56] Child process unexpectedly failed with exitcode=9. pid=1979496
[2025-06-30 02:31:56] Child process unexpectedly failed with exitcode=9. pid=1979423

Sending Requests

Contents

Sending Requests#

Launch A Server#

Using cURL#

Using Python Requests#

Using OpenAI Python Client#

Streaming#

Using Native Generation APIs#

Streaming#