Structured Outputs For Reasoning Models#

When working with reasoning models that use special tokens like <think>...</think> to denote reasoning sections, you might want to allow free-form text within these sections while still enforcing grammar constraints on the rest of the output.

SGLang provides a feature to disable grammar restrictions within reasoning sections. This is particularly useful for models that need to perform complex reasoning steps before providing a structured output.

To enable this feature, use the --reasoning-parser flag which decide the think_end_token, such as </think>, when launching the server. You can also specify the reasoning parser using the --reasoning-parser flag.

Supported Models#

Currently, SGLang supports the following reasoning models:

DeepSeek R1 series: The reasoning content is wrapped with <think> and </think> tags.
QwQ: The reasoning content is wrapped with <think> and </think> tags.

Usage#

OpenAI Compatible API#

Specify the --grammar-backend, --reasoning-parser option.

[1]:

import openai
import os
from sglang.test.test_utils import is_in_ci

if is_in_ci():
    from patch import launch_server_cmd
else:
    from sglang.utils import launch_server_cmd

from sglang.utils import wait_for_server, print_highlight, terminate_process

os.environ["TOKENIZERS_PARALLELISM"] = "false"


server_process, port = launch_server_cmd(
    "python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-7B --host 0.0.0.0 --reasoning-parser deepseek-r1"
)

wait_for_server(f"http://localhost:{port}")
client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

[2025-07-15 08:07:41] server_args=ServerArgs(model_path='deepseek-ai/DeepSeek-R1-Distill-Qwen-7B', tokenizer_path='deepseek-ai/DeepSeek-R1-Distill-Qwen-7B', tokenizer_mode='auto', skip_tokenizer_init=False, skip_server_warmup=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, quantization_param_path=None, context_length=None, device='cuda', served_model_name='deepseek-ai/DeepSeek-R1-Distill-Qwen-7B', chat_template=None, completion_template=None, is_embedding=False, enable_multimodal=None, revision=None, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, impl='auto', host='0.0.0.0', port=32699, nccl_port=None, mem_fraction_static=0.874, max_running_requests=200, max_total_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, tp_size=1, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=416920430, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, crash_dump_folder=None, show_time_cost=False, enable_metrics=False, bucket_time_to_first_token=None, bucket_e2e_request_latency=None, bucket_inter_token_latency=None, collect_tokens_histogram=False, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, api_key=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser='deepseek-r1', tool_call_parser=None, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, speculative_algorithm=None, speculative_draft_model_path=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, ep_size=1, enable_ep_moe=False, enable_deepep_moe=False, enable_flashinfer_moe=False, enable_flashinfer_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, cuda_graph_max_bs=None, cuda_graph_bs=None, disable_cuda_graph=True, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_nccl_nvls=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, disable_overlap_schedule=False, disable_overlap_cg_plan=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_torch_compile=False, torch_compile_max_bs=32, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through_selective', hicache_io_backend='', flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, enable_return_hidden_states=False, enable_triton_kernel_moe=False, warmups=None, disable_hybrid_swa_memory=False, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, debug_tensor_dump_prefill_only=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, num_reserved_decode_tokens=512, pdlb_url=None, custom_weight_loader=[], weight_loader_disable_mmap=False, enable_pdmux=False, sm_group_num=3)
[2025-07-15 08:07:53] Attention backend not set. Use fa3 backend by default.
[2025-07-15 08:07:53] Init torch distributed begin.
[2025-07-15 08:07:54] Init torch distributed ends. mem usage=0.00 GB
[2025-07-15 08:07:56] Load weight begin. avail mem=53.54 GB
[2025-07-15 08:07:56] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:01<00:01,  1.39s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:02<00:00,  1.33s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:02<00:00,  1.34s/it]

[2025-07-15 08:07:59] Load weight end. type=Qwen2ForCausalLM, dtype=torch.bfloat16, avail mem=39.19 GB, mem usage=14.35 GB.
[2025-07-15 08:07:59] KV Cache is allocated. #tokens: 20480, K size: 0.55 GB, V size: 0.55 GB
[2025-07-15 08:07:59] Memory pool end. avail mem=37.82 GB
[2025-07-15 08:07:59] max_total_num_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=200, context_len=131072, available_gpu_mem=37.72 GB
[2025-07-15 08:08:00] INFO:     Started server process [3615587]
[2025-07-15 08:08:00] INFO:     Waiting for application startup.
[2025-07-15 08:08:00] INFO:     Application startup complete.
[2025-07-15 08:08:00] INFO:     Uvicorn running on http://0.0.0.0:32699 (Press CTRL+C to quit)
[2025-07-15 08:08:01] INFO:     127.0.0.1:34516 - "GET /v1/models HTTP/1.1" 200 OK
[2025-07-15 08:08:01] INFO:     127.0.0.1:34528 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-07-15 08:08:01] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, timestamp: 2025-07-15T08:08:01.489471
[2025-07-15 08:08:02] INFO:     127.0.0.1:34532 - "POST /generate HTTP/1.1" 200 OK
[2025-07-15 08:08:02] The server is fired up and ready to roll!

NOTE: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are displayed in the original black color, while the notebook outputs are highlighted in blue.
We are running those notebooks in a CI parallel environment, so the throughput is not representative of the actual performance.

JSON#

you can directly define a JSON schema or use Pydantic to define and validate the response.

Using Pydantic

[2]:

from pydantic import BaseModel, Field


# Define the schema using Pydantic
class CapitalInfo(BaseModel):
    name: str = Field(..., pattern=r"^\w+$", description="Name of the capital city")
    population: int = Field(..., description="Population of the capital city")


response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    messages=[
        {
            "role": "assistant",
            "content": "Give me the information and population of the capital of France in the JSON format.",
        },
    ],
    temperature=0,
    max_tokens=2048,
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "foo",
            # convert the pydantic model to json schema
            "schema": CapitalInfo.model_json_schema(),
        },
    },
)

print_highlight(
    f"reasoing_content: {response.choices[0].message.reasoning_content}\n\ncontent: {response.choices[0].message.content}"
)

[2025-07-15 08:08:06] Prefill batch. #new-seq: 1, #new-token: 21, #cached-token: 1, token usage: 0.00, #running-req: 0, #queue-req: 0, timestamp: 2025-07-15T08:08:06.256703
[2025-07-15 08:08:07] Decode batch. #running-req: 1, #token: 55, token usage: 0.00, cuda graph: False, gen throughput (token/s): 4.99, #queue-req: 0, timestamp: 2025-07-15T08:08:07.978583
[2025-07-15 08:08:08] Decode batch. #running-req: 1, #token: 95, token usage: 0.00, cuda graph: False, gen throughput (token/s): 110.08, #queue-req: 0, timestamp: 2025-07-15T08:08:08.341941
[2025-07-15 08:08:08] Decode batch. #running-req: 1, #token: 135, token usage: 0.01, cuda graph: False, gen throughput (token/s): 107.28, #queue-req: 0, timestamp: 2025-07-15T08:08:08.714813
[2025-07-15 08:08:09] Decode batch. #running-req: 1, #token: 175, token usage: 0.01, cuda graph: False, gen throughput (token/s): 109.24, #queue-req: 0, timestamp: 2025-07-15T08:08:09.080970
[2025-07-15 08:08:09] Decode batch. #running-req: 1, #token: 215, token usage: 0.01, cuda graph: False, gen throughput (token/s): 108.14, #queue-req: 0, timestamp: 2025-07-15T08:08:09.450848
[2025-07-15 08:08:09] Decode batch. #running-req: 1, #token: 255, token usage: 0.01, cuda graph: False, gen throughput (token/s): 106.97, #queue-req: 0, timestamp: 2025-07-15T08:08:09.824775
[2025-07-15 08:08:10] Decode batch. #running-req: 1, #token: 295, token usage: 0.01, cuda graph: False, gen throughput (token/s): 108.41, #queue-req: 0, timestamp: 2025-07-15T08:08:10.193732
[2025-07-15 08:08:10] Decode batch. #running-req: 1, #token: 335, token usage: 0.02, cuda graph: False, gen throughput (token/s): 108.55, #queue-req: 0, timestamp: 2025-07-15T08:08:10.562236
[2025-07-15 08:08:10] Decode batch. #running-req: 1, #token: 375, token usage: 0.02, cuda graph: False, gen throughput (token/s): 108.76, #queue-req: 0, timestamp: 2025-07-15T08:08:10.930037
[2025-07-15 08:08:11] Decode batch. #running-req: 1, #token: 415, token usage: 0.02, cuda graph: False, gen throughput (token/s): 109.29, #queue-req: 0, timestamp: 2025-07-15T08:08:11.296043
[2025-07-15 08:08:11] INFO:     127.0.0.1:34542 - "POST /v1/chat/completions HTTP/1.1" 200 OK

reasoing_content: Alright, so the user asked for the information and population of the capital of France in JSON format. I immediately thought about what the capital is—Paris. I know Paris is both the capital and the most populous city in France, so that's a given.

Next, I considered the population. I remember that Paris has a large population, but I'm not exactly sure of the current number. I think it's around 2 million, but I'm not 100% certain. I should double-check that to make sure I provide accurate information.

I also need to structure this in JSON. JSON requires key-value pairs, so I'll need to define the keys appropriately. Maybe "city" for the name, "country" for the capital, and "population" for the number. I should make sure the syntax is correct, with proper commas and quotation marks.

Wait, I should also think about the format. The user wants it in JSON, so I'll present it as a JSON object. I'll make sure there are no typos and that the data is correctly formatted. Maybe I'll write it out step by step to avoid mistakes.

Another thing to consider is whether the population figure is up to date. Since I'm not accessing real-time data, I'll go with the most recent estimate I have. I recall that Paris has grown a bit in recent years, so 2 million seems reasonable, but I should confirm if it's 2.1 million or something else.

I also wonder if the user needs more details, like the area or the establishment year of the city, but the query specifically mentions population, so I'll stick to that. Still, it's good to know that I can provide additional information if needed.

Finally, I'll present the JSON in a clear and concise manner, making sure it's easy for the user to understand and use. I'll review the JSON structure to ensure there are no syntax errors before sending it back to the user.

content: {"name": "Paris", "population": 2145000}

JSON Schema Directly

[3]:

import json

json_schema = json.dumps(
    {
        "type": "object",
        "properties": {
            "name": {"type": "string", "pattern": "^[\\w]+$"},
            "population": {"type": "integer"},
        },
        "required": ["name", "population"],
    }
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    messages=[
        {
            "role": "assistant",
            "content": "Give me the information and population of the capital of France in the JSON format.",
        },
    ],
    temperature=0,
    max_tokens=2048,
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "foo", "schema": json.loads(json_schema)},
    },
)

print_highlight(
    f"reasoing_content: {response.choices[0].message.reasoning_content}\n\ncontent: {response.choices[0].message.content}"
)

[2025-07-15 08:08:11] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 21, token usage: 0.00, #running-req: 0, #queue-req: 0, timestamp: 2025-07-15T08:08:11.581810
[2025-07-15 08:08:11] Decode batch. #running-req: 1, #token: 34, token usage: 0.00, cuda graph: False, gen throughput (token/s): 97.91, #queue-req: 0, timestamp: 2025-07-15T08:08:11.704595
[2025-07-15 08:08:12] Decode batch. #running-req: 1, #token: 74, token usage: 0.00, cuda graph: False, gen throughput (token/s): 106.00, #queue-req: 0, timestamp: 2025-07-15T08:08:12.081950
[2025-07-15 08:08:12] Decode batch. #running-req: 1, #token: 114, token usage: 0.01, cuda graph: False, gen throughput (token/s): 109.83, #queue-req: 0, timestamp: 2025-07-15T08:08:12.446156
[2025-07-15 08:08:12] Decode batch. #running-req: 1, #token: 154, token usage: 0.01, cuda graph: False, gen throughput (token/s): 110.12, #queue-req: 0, timestamp: 2025-07-15T08:08:12.809412
[2025-07-15 08:08:13] Decode batch. #running-req: 1, #token: 194, token usage: 0.01, cuda graph: False, gen throughput (token/s): 107.58, #queue-req: 0, timestamp: 2025-07-15T08:08:13.181241
[2025-07-15 08:08:13] Decode batch. #running-req: 1, #token: 234, token usage: 0.01, cuda graph: False, gen throughput (token/s): 106.45, #queue-req: 0, timestamp: 2025-07-15T08:08:13.557005
[2025-07-15 08:08:13] Decode batch. #running-req: 1, #token: 274, token usage: 0.01, cuda graph: False, gen throughput (token/s): 106.22, #queue-req: 0, timestamp: 2025-07-15T08:08:13.933590
[2025-07-15 08:08:14] Decode batch. #running-req: 1, #token: 314, token usage: 0.02, cuda graph: False, gen throughput (token/s): 103.40, #queue-req: 0, timestamp: 2025-07-15T08:08:14.320441
[2025-07-15 08:08:14] INFO:     127.0.0.1:34542 - "POST /v1/chat/completions HTTP/1.1" 200 OK

reasoing_content: Alright, so the user asked for the information and population of the capital of France in JSON format. I immediately thought about what the capital is—Paris. Then, I considered the population. I know it's a big city, but I'm not exactly sure of the current number. I remember it's over 3 million, but I'm not certain if it's 3.5 or 3.6. I should double-check that.

Next, I thought about the structure. The user wants JSON, so I need to format it correctly with keys like "city", "population", and maybe "country". I should make sure the syntax is correct—no typos, proper commas, and brackets.

I also considered the user's possible needs. They might be doing a project or a presentation, so providing accurate data is crucial. Maybe they're a student learning about France's capitals or someone compiling demographic data. Either way, precision is key.

I decided to present the information clearly, ensuring the JSON is valid and easy to read. I included the population as 3.617 million, which I believe is the most recent figure I could recall. I also added a comment to explain the units, just in case the user wasn't sure.

Finally, I made sure to offer further help in case they needed more details or adjustments. That way, the response is helpful and user-friendly.

content: {
"name": "Paris",
"population": 3617000
}

EBNF#

[4]:

ebnf_grammar = """
root ::= city | description
city ::= "London" | "Paris" | "Berlin" | "Rome"
description ::= city " is " status
status ::= "the capital of " country
country ::= "England" | "France" | "Germany" | "Italy"
"""

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    messages=[
        {"role": "system", "content": "You are a helpful geography bot."},
        {
            "role": "assistant",
            "content": "Give me the information and population of the capital of France in the JSON format.",
        },
    ],
    temperature=0,
    max_tokens=2048,
    extra_body={"ebnf": ebnf_grammar},
)

print_highlight(
    f"reasoing_content: {response.choices[0].message.reasoning_content}\n\ncontent: {response.choices[0].message.content}"
)

[2025-07-15 08:08:14] Prefill batch. #new-seq: 1, #new-token: 28, #cached-token: 1, token usage: 0.00, #running-req: 0, #queue-req: 0, timestamp: 2025-07-15T08:08:14.487012
[2025-07-15 08:08:14] Decode batch. #running-req: 1, #token: 53, token usage: 0.00, cuda graph: False, gen throughput (token/s): 95.46, #queue-req: 0, timestamp: 2025-07-15T08:08:14.739457
[2025-07-15 08:08:15] Decode batch. #running-req: 1, #token: 93, token usage: 0.00, cuda graph: False, gen throughput (token/s): 104.97, #queue-req: 0, timestamp: 2025-07-15T08:08:15.120525
[2025-07-15 08:08:15] Decode batch. #running-req: 1, #token: 133, token usage: 0.01, cuda graph: False, gen throughput (token/s): 106.51, #queue-req: 0, timestamp: 2025-07-15T08:08:15.496111
[2025-07-15 08:08:15] Decode batch. #running-req: 1, #token: 173, token usage: 0.01, cuda graph: False, gen throughput (token/s): 109.54, #queue-req: 0, timestamp: 2025-07-15T08:08:15.861257
[2025-07-15 08:08:16] Decode batch. #running-req: 1, #token: 213, token usage: 0.01, cuda graph: False, gen throughput (token/s): 109.56, #queue-req: 0, timestamp: 2025-07-15T08:08:16.226353
[2025-07-15 08:08:16] Decode batch. #running-req: 1, #token: 253, token usage: 0.01, cuda graph: False, gen throughput (token/s): 110.22, #queue-req: 0, timestamp: 2025-07-15T08:08:16.589254
[2025-07-15 08:08:16] Decode batch. #running-req: 1, #token: 293, token usage: 0.01, cuda graph: False, gen throughput (token/s): 109.47, #queue-req: 0, timestamp: 2025-07-15T08:08:16.954644
[2025-07-15 08:08:17] Decode batch. #running-req: 1, #token: 333, token usage: 0.02, cuda graph: False, gen throughput (token/s): 109.30, #queue-req: 0, timestamp: 2025-07-15T08:08:17.320619
[2025-07-15 08:08:17] Decode batch. #running-req: 1, #token: 373, token usage: 0.02, cuda graph: False, gen throughput (token/s): 103.41, #queue-req: 0, timestamp: 2025-07-15T08:08:17.707417
[2025-07-15 08:08:18] Decode batch. #running-req: 1, #token: 413, token usage: 0.02, cuda graph: False, gen throughput (token/s): 105.58, #queue-req: 0, timestamp: 2025-07-15T08:08:18.086285
[2025-07-15 08:08:18] Decode batch. #running-req: 1, #token: 453, token usage: 0.02, cuda graph: False, gen throughput (token/s): 110.79, #queue-req: 0, timestamp: 2025-07-15T08:08:18.447335
[2025-07-15 08:08:18] Decode batch. #running-req: 1, #token: 493, token usage: 0.02, cuda graph: False, gen throughput (token/s): 110.99, #queue-req: 0, timestamp: 2025-07-15T08:08:18.807728
[2025-07-15 08:08:19] Decode batch. #running-req: 1, #token: 533, token usage: 0.03, cuda graph: False, gen throughput (token/s): 107.39, #queue-req: 0, timestamp: 2025-07-15T08:08:19.180230
[2025-07-15 08:08:19] Decode batch. #running-req: 1, #token: 573, token usage: 0.03, cuda graph: False, gen throughput (token/s): 109.14, #queue-req: 0, timestamp: 2025-07-15T08:08:19.546729
[2025-07-15 08:08:19] Decode batch. #running-req: 1, #token: 613, token usage: 0.03, cuda graph: False, gen throughput (token/s): 106.80, #queue-req: 0, timestamp: 2025-07-15T08:08:19.921267
[2025-07-15 08:08:20] INFO:     127.0.0.1:34542 - "POST /v1/chat/completions HTTP/1.1" 200 OK

reasoing_content: Okay, so the user asked for the information and population of the capital of France in JSON format. I responded with Paris, its population of around 2.1 million, and included some key facts. Now, the user is following up with a new query about the capitals of other countries. They provided a list of countries and want the same JSON structure for each.

First, I need to figure out what the user is really looking for. They might be creating a dataset or a project that requires the capitals of various nations. Maybe they're a student working on a geography assignment or a developer building a mapping application. Either way, they need accurate and reliable data.

Looking at the list they provided: Albania, Algeria, Australia, Austria, Brazil, Canada, China, Colombia, Denmark, Egypt, France, Germany, Greece, Hungary, Iceland, India, Indonesia, Ireland, Italy, Japan, Mexico, Netherlands, Nigeria, Poland, Portugal, Russia, Saudi Arabia, Spain, Sweden, Switzerland, Thailand, Turkey, UK, USA. That's quite a comprehensive list, covering multiple continents.

I should make sure each country's capital is correct. I know some capitals off the top of my head, like Albania's Tirana, Algeria's Algiers, Australia's Canberra. But I should double-check the rest to avoid mistakes. For example, I'm pretty sure Germany's capital is Berlin, but I should confirm that. Same with countries like Mexico and Spain, their capitals are Mexico City and Madrid, respectively.

The user wants the information in JSON format, so each country will have a key with its name and capital, along with the population. I'll need to look up the most recent population estimates for each capital. Population numbers can change, so it's important to use the latest data. For example, as of 2023, Paris has a population around 2.1 million, but it's growing, so maybe I should note that the figure is approximate.

I should structure the JSON array correctly, ensuring each object has the same fields. Also, I'll add a comment at the top to explain the data, making it clear for anyone reading the JSON.

I need to be careful with the syntax to avoid errors. JSON requires proper quotation marks and commas. Each object in the array should be separated by a comma, and the entire structure should be valid. Maybe I'll write it out step by step to ensure accuracy.

Additionally, I should consider if the user needs more details, like the country's area or other statistics, but since they only asked for population, I'll stick to that. However, offering to include more data in the future might be a good idea, showing flexibility.

Lastly, I'll make sure the JSON is well-formatted and easy to read, perhaps by indenting it for better readability. That way, the user can easily parse the data without issues.

content: London is the capital of France

Regular expression#

[5]:

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    messages=[
        {"role": "assistant", "content": "What is the capital of France?"},
    ],
    temperature=0,
    max_tokens=2048,
    extra_body={"regex": "(Paris|London)"},
)

print_highlight(
    f"reasoing_content: {response.choices[0].message.reasoning_content}\n\ncontent: {response.choices[0].message.content}"
)

[2025-07-15 08:08:20] Prefill batch. #new-seq: 1, #new-token: 11, #cached-token: 2, token usage: 0.00, #running-req: 0, #queue-req: 0, timestamp: 2025-07-15T08:08:20.054405
[2025-07-15 08:08:20] Decode batch. #running-req: 1, #token: 40, token usage: 0.00, cuda graph: False, gen throughput (token/s): 100.75, #queue-req: 0, timestamp: 2025-07-15T08:08:20.318301
[2025-07-15 08:08:20] Decode batch. #running-req: 1, #token: 80, token usage: 0.00, cuda graph: False, gen throughput (token/s): 109.29, #queue-req: 0, timestamp: 2025-07-15T08:08:20.684302
[2025-07-15 08:08:21] Decode batch. #running-req: 1, #token: 120, token usage: 0.01, cuda graph: False, gen throughput (token/s): 103.40, #queue-req: 0, timestamp: 2025-07-15T08:08:21.071153
[2025-07-15 08:08:21] INFO:     127.0.0.1:34542 - "POST /v1/chat/completions HTTP/1.1" 200 OK

reasoing_content: Alright, so the user just asked, "What is the capital of France?" Hmm, that's a pretty straightforward question. I should make sure I provide a clear and accurate answer. Let me think, Paris is definitely the capital. But wait, is there any chance I might have mixed it up with another country? No, I'm pretty sure France's capital is Paris. Maybe I should double-check that to be certain. Yeah, Paris is where the government is located, and that's the capital. Okay, I feel confident about that. I'll just state it directly.

content: Paris

Structural Tag#

[6]:

tool_get_current_weather = {
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {
                    "type": "string",
                    "description": "The city to find the weather for, e.g. 'San Francisco'",
                },
                "state": {
                    "type": "string",
                    "description": "the two-letter abbreviation for the state that the city is"
                    " in, e.g. 'CA' which would mean 'California'",
                },
                "unit": {
                    "type": "string",
                    "description": "The unit to fetch the temperature in",
                    "enum": ["celsius", "fahrenheit"],
                },
            },
            "required": ["city", "state", "unit"],
        },
    },
}

tool_get_current_date = {
    "type": "function",
    "function": {
        "name": "get_current_date",
        "description": "Get the current date and time for a given timezone",
        "parameters": {
            "type": "object",
            "properties": {
                "timezone": {
                    "type": "string",
                    "description": "The timezone to fetch the current date and time for, e.g. 'America/New_York'",
                }
            },
            "required": ["timezone"],
        },
    },
}

schema_get_current_weather = tool_get_current_weather["function"]["parameters"]
schema_get_current_date = tool_get_current_date["function"]["parameters"]


def get_messages():
    return [
        {
            "role": "system",
            "content": f"""
# Tool Instructions
- Always execute python code in messages that you share.
- When looking for real time information use relevant functions if available else fallback to brave_search
You have access to the following functions:
Use the function 'get_current_weather' to: Get the current weather in a given location
{tool_get_current_weather["function"]}
Use the function 'get_current_date' to: Get the current date and time for a given timezone
{tool_get_current_date["function"]}
If a you choose to call a function ONLY reply in the following format:
<{{start_tag}}={{function_name}}>{{parameters}}{{end_tag}}
where
start_tag => `<function`
parameters => a JSON dict with the function argument name as key and function argument value as value.
end_tag => `</function>`
Here is an example,
<function=example_function_name>{{"example_name": "example_value"}}</function>
Reminder:
- Function calls MUST follow the specified format
- Required parameters MUST be specified
- Only call one function at a time
- Put the entire function call reply on one line
- Always add your sources when using search results to answer the user query
You are a helpful assistant.""",
        },
        {
            "role": "assistant",
            "content": "You are in New York. Please get the current date and time, and the weather.",
        },
    ]


messages = get_messages()

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    messages=messages,
    response_format={
        "type": "structural_tag",
        "max_new_tokens": 2048,
        "structures": [
            {
                "begin": "<function=get_current_weather>",
                "schema": schema_get_current_weather,
                "end": "</function>",
            },
            {
                "begin": "<function=get_current_date>",
                "schema": schema_get_current_date,
                "end": "</function>",
            },
        ],
        "triggers": ["<function="],
    },
)

print_highlight(
    f"reasoing_content: {response.choices[0].message.reasoning_content}\n\ncontent: {response.choices[0].message.content}"
)

[2025-07-15 08:08:22] Prefill batch. #new-seq: 1, #new-token: 472, #cached-token: 1, token usage: 0.00, #running-req: 0, #queue-req: 0, timestamp: 2025-07-15T08:08:22.114462
[2025-07-15 08:08:22] Decode batch. #running-req: 1, #token: 500, token usage: 0.02, cuda graph: False, gen throughput (token/s): 30.02, #queue-req: 0, timestamp: 2025-07-15T08:08:22.403781
[2025-07-15 08:08:22] Decode batch. #running-req: 1, #token: 540, token usage: 0.03, cuda graph: False, gen throughput (token/s): 91.24, #queue-req: 0, timestamp: 2025-07-15T08:08:22.842228
[2025-07-15 08:08:23] Decode batch. #running-req: 1, #token: 580, token usage: 0.03, cuda graph: False, gen throughput (token/s): 97.15, #queue-req: 0, timestamp: 2025-07-15T08:08:23.253961
[2025-07-15 08:08:23] Decode batch. #running-req: 1, #token: 620, token usage: 0.03, cuda graph: False, gen throughput (token/s): 97.90, #queue-req: 0, timestamp: 2025-07-15T08:08:23.662562
[2025-07-15 08:08:24] Decode batch. #running-req: 1, #token: 660, token usage: 0.03, cuda graph: False, gen throughput (token/s): 97.39, #queue-req: 0, timestamp: 2025-07-15T08:08:24.073277
[2025-07-15 08:08:24] Decode batch. #running-req: 1, #token: 700, token usage: 0.03, cuda graph: False, gen throughput (token/s): 102.94, #queue-req: 0, timestamp: 2025-07-15T08:08:24.461873
[2025-07-15 08:08:24] Decode batch. #running-req: 1, #token: 740, token usage: 0.04, cuda graph: False, gen throughput (token/s): 101.17, #queue-req: 0, timestamp: 2025-07-15T08:08:24.857223
[2025-07-15 08:08:25] INFO:     127.0.0.1:34542 - "POST /v1/chat/completions HTTP/1.1" 200 OK

reasoing_content: Alright, the user is asking for the current date and time in New York and the weather there. I need to figure out which functions to use. Looking at the available functions, I see 'get_current_date' and 'get_current_weather'.

First, for the date and time, I'll use 'get_current_date' with the timezone parameter set to 'America/New_York'. That should give me the exact date and time in that location.

Next, for the weather, 'get_current_weather' requires the city, state, and unit. The city is New York, the state is NY, and the unit is probably Fahrenheit since that's common in the US.

I should structure the responses by first calling get_current_date, then get_current_weather. Each function call will be in the specified format with the parameters included. I'll make sure to mention the sources used, so adding a line about using the functions as per the guidelines is important.

I need to be careful to only call one function at a time and format everything correctly. No markdown, just plain text with the function calls and sources provided.

content: {"timezone": "America/New_York"}
{"city": "New York", "state": "NY", "unit": "fahrenheit"}
Source: Used the specified functions as per guidelines

Native API and SGLang Runtime (SRT)#

JSON#

Using Pydantic

[7]:

import requests
from pydantic import BaseModel, Field
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-7B")


# Define the schema using Pydantic
class CapitalInfo(BaseModel):
    name: str = Field(..., pattern=r"^\w+$", description="Name of the capital city")
    population: int = Field(..., description="Population of the capital city")


messages = [
    {
        "role": "assistant",
        "content": "Give me the information and population of the capital of France in the JSON format.",
    },
]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
# Make API request
response = requests.post(
    f"http://localhost:{port}/generate",
    json={
        "text": text,
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 2048,
            "json_schema": json.dumps(CapitalInfo.model_json_schema()),
        },
    },
)
print(response.json())


reasoing_content = response.json()["text"].split("</think>")[0]
content = response.json()["text"].split("</think>")[1]
print_highlight(f"reasoing_content: {reasoing_content}\n\ncontent: {content}")

[2025-07-15 08:08:29] Prefill batch. #new-seq: 1, #new-token: 22, #cached-token: 1, token usage: 0.00, #running-req: 0, #queue-req: 0, timestamp: 2025-07-15T08:08:29.045623
[2025-07-15 08:08:29] Decode batch. #running-req: 1, #token: 46, token usage: 0.00, cuda graph: False, gen throughput (token/s): 9.05, #queue-req: 0, timestamp: 2025-07-15T08:08:29.277936
[2025-07-15 08:08:29] Decode batch. #running-req: 1, #token: 86, token usage: 0.00, cuda graph: False, gen throughput (token/s): 107.10, #queue-req: 0, timestamp: 2025-07-15T08:08:29.651407
[2025-07-15 08:08:30] Decode batch. #running-req: 1, #token: 126, token usage: 0.01, cuda graph: False, gen throughput (token/s): 100.40, #queue-req: 0, timestamp: 2025-07-15T08:08:30.049797
[2025-07-15 08:08:30] Decode batch. #running-req: 1, #token: 166, token usage: 0.01, cuda graph: False, gen throughput (token/s): 109.62, #queue-req: 0, timestamp: 2025-07-15T08:08:30.414684
[2025-07-15 08:08:30] Decode batch. #running-req: 1, #token: 206, token usage: 0.01, cuda graph: False, gen throughput (token/s): 110.82, #queue-req: 0, timestamp: 2025-07-15T08:08:30.775631
[2025-07-15 08:08:31] Decode batch. #running-req: 1, #token: 246, token usage: 0.01, cuda graph: False, gen throughput (token/s): 110.10, #queue-req: 0, timestamp: 2025-07-15T08:08:31.138927
[2025-07-15 08:08:31] Decode batch. #running-req: 1, #token: 286, token usage: 0.01, cuda graph: False, gen throughput (token/s): 110.37, #queue-req: 0, timestamp: 2025-07-15T08:08:31.501358
[2025-07-15 08:08:31] Decode batch. #running-req: 1, #token: 326, token usage: 0.02, cuda graph: False, gen throughput (token/s): 109.57, #queue-req: 0, timestamp: 2025-07-15T08:08:31.866420
[2025-07-15 08:08:32] Decode batch. #running-req: 1, #token: 366, token usage: 0.02, cuda graph: False, gen throughput (token/s): 109.52, #queue-req: 0, timestamp: 2025-07-15T08:08:32.231647
[2025-07-15 08:08:32] Decode batch. #running-req: 1, #token: 406, token usage: 0.02, cuda graph: False, gen throughput (token/s): 110.12, #queue-req: 0, timestamp: 2025-07-15T08:08:32.594898
[2025-07-15 08:08:32] INFO:     127.0.0.1:50536 - "POST /generate HTTP/1.1" 200 OK
{'text': 'Okay, so the user is asking for the information and population of the capital of France in JSON format. Let me break this down. First, I need to identify what the capital of France is. I know that Paris is the capital, so that\'s straightforward. \n\nNext, I need to find the population. I remember that Paris is a major city, so its population is quite large. I think it\'s over 3 million, but I\'m not exactly sure of the exact number. Maybe I should double-check that. \n\nWait, I recall that the population figure can vary depending on the source and the year. The user didn\'t specify a particular year, so I should probably go with the most recent estimate. I believe the population is around 3,500,000 as of 2023. \n\nNow, I need to structure this information into a JSON format. JSON typically uses key-value pairs, so I\'ll create an object with keys like "city", "population", and maybe "country" since the user mentioned France. \n\nI should make sure the keys are in English to keep it clear. The city is Paris, the population is 3,500,000, and the country is France. I\'ll format this into a JSON object, ensuring proper syntax with commas and quotation marks. \n\nI also need to present this in a way that\'s easy to read, so I\'ll put each key on a new line. That way, the user can quickly see the information without confusion. \n\nI wonder if the user needs more details, like the exact current population or additional statistics. But since they only asked for the capital and population, I\'ll stick to that. \n\nLastly, I\'ll make sure the JSON is valid by checking the syntax. No trailing commas, proper use of braces, and correct quotation marks. That should cover everything the user needs.\n</think>{\n  "name": "Paris",\n  "population": 3500000\n}', 'meta_info': {'id': 'f25a33c574d04d44b3a00fe26c302d89', 'finish_reason': {'type': 'stop', 'matched': 151643}, 'prompt_tokens': 23, 'completion_tokens': 412, 'cached_tokens': 1, 'e2e_latency': 3.814241647720337}}

reasoing_content: Okay, so the user is asking for the information and population of the capital of France in JSON format. Let me break this down. First, I need to identify what the capital of France is. I know that Paris is the capital, so that's straightforward.

Next, I need to find the population. I remember that Paris is a major city, so its population is quite large. I think it's over 3 million, but I'm not exactly sure of the exact number. Maybe I should double-check that.

Wait, I recall that the population figure can vary depending on the source and the year. The user didn't specify a particular year, so I should probably go with the most recent estimate. I believe the population is around 3,500,000 as of 2023.

Now, I need to structure this information into a JSON format. JSON typically uses key-value pairs, so I'll create an object with keys like "city", "population", and maybe "country" since the user mentioned France.

I should make sure the keys are in English to keep it clear. The city is Paris, the population is 3,500,000, and the country is France. I'll format this into a JSON object, ensuring proper syntax with commas and quotation marks.

I also need to present this in a way that's easy to read, so I'll put each key on a new line. That way, the user can quickly see the information without confusion.

I wonder if the user needs more details, like the exact current population or additional statistics. But since they only asked for the capital and population, I'll stick to that.

Lastly, I'll make sure the JSON is valid by checking the syntax. No trailing commas, proper use of braces, and correct quotation marks. That should cover everything the user needs.

content: {
"name": "Paris",
"population": 3500000
}

JSON Schema Directly

[8]:

json_schema = json.dumps(
    {
        "type": "object",
        "properties": {
            "name": {"type": "string", "pattern": "^[\\w]+$"},
            "population": {"type": "integer"},
        },
        "required": ["name", "population"],
    }
)

# JSON
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
response = requests.post(
    f"http://localhost:{port}/generate",
    json={
        "text": text,
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 2048,
            "json_schema": json_schema,
        },
    },
)

print_highlight(response.json())

[2025-07-15 08:08:32] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 22, token usage: 0.00, #running-req: 0, #queue-req: 0, timestamp: 2025-07-15T08:08:32.868859
[2025-07-15 08:08:32] Decode batch. #running-req: 1, #token: 34, token usage: 0.00, cuda graph: False, gen throughput (token/s): 104.18, #queue-req: 0, timestamp: 2025-07-15T08:08:32.978835
[2025-07-15 08:08:33] Decode batch. #running-req: 1, #token: 74, token usage: 0.00, cuda graph: False, gen throughput (token/s): 111.00, #queue-req: 0, timestamp: 2025-07-15T08:08:33.339206
[2025-07-15 08:08:33] Decode batch. #running-req: 1, #token: 114, token usage: 0.01, cuda graph: False, gen throughput (token/s): 111.74, #queue-req: 0, timestamp: 2025-07-15T08:08:33.697190
[2025-07-15 08:08:34] Decode batch. #running-req: 1, #token: 154, token usage: 0.01, cuda graph: False, gen throughput (token/s): 111.51, #queue-req: 0, timestamp: 2025-07-15T08:08:34.055918
[2025-07-15 08:08:34] Decode batch. #running-req: 1, #token: 194, token usage: 0.01, cuda graph: False, gen throughput (token/s): 111.96, #queue-req: 0, timestamp: 2025-07-15T08:08:34.413176
[2025-07-15 08:08:34] Decode batch. #running-req: 1, #token: 234, token usage: 0.01, cuda graph: False, gen throughput (token/s): 110.95, #queue-req: 0, timestamp: 2025-07-15T08:08:34.773710
[2025-07-15 08:08:35] Decode batch. #running-req: 1, #token: 274, token usage: 0.01, cuda graph: False, gen throughput (token/s): 109.26, #queue-req: 0, timestamp: 2025-07-15T08:08:35.139810
[2025-07-15 08:08:35] Decode batch. #running-req: 1, #token: 314, token usage: 0.02, cuda graph: False, gen throughput (token/s): 109.89, #queue-req: 0, timestamp: 2025-07-15T08:08:35.503817
[2025-07-15 08:08:35] Decode batch. #running-req: 1, #token: 354, token usage: 0.02, cuda graph: False, gen throughput (token/s): 110.16, #queue-req: 0, timestamp: 2025-07-15T08:08:35.866927
[2025-07-15 08:08:36] Decode batch. #running-req: 1, #token: 394, token usage: 0.02, cuda graph: False, gen throughput (token/s): 109.54, #queue-req: 0, timestamp: 2025-07-15T08:08:36.232096
[2025-07-15 08:08:36] INFO:     127.0.0.1:50544 - "POST /generate HTTP/1.1" 200 OK

{'text': 'Okay, so the user is asking for the information and population of the capital of France in JSON format. Let me break this down.\n\nFirst, I need to identify the capital of France. I know that Paris is the capital, so that\'s straightforward. Now, I should find the most recent population data. I remember that the population of Paris has been growing, but I\'m not sure of the exact number. I think it\'s around 2 million, but I should verify that.\n\nI\'ll check a reliable source, maybe the official Paris Municipality website or a recent census. Let me see... Yes, according to the latest data, the population is approximately 2,174,300 as of 2023. That seems accurate.\n\nNext, I need to structure this information into a JSON format. JSON requires key-value pairs, so I\'ll create an object with keys like "city", "population", and "country". The city is Paris, the population is the number I found, and the country is France.\n\nI should make sure the JSON syntax is correct. Each key should be in quotes, and the values as well. The entire structure should be enclosed in curly braces. I\'ll format it properly to avoid any syntax errors.\n\nPutting it all together, the JSON object will have the city, population, and country. I\'ll present this to the user, making sure it\'s clear and easy to understand. I don\'t think the user needs anything more detailed, so this should suffice.\n\nI should also consider if the user might need additional information, like the area or age distribution, but since they only asked for population, I\'ll stick to that. Maybe they\'ll ask for more details later, but for now, this response should be helpful.\n{\n "name": "Paris",\n "population": 2174300\n}', 'meta_info': {'id': 'ec4572e3bbf94b6b8a12fe8766d097e4', 'finish_reason': {'type': 'stop', 'matched': 151643}, 'prompt_tokens': 23, 'completion_tokens': 384, 'cached_tokens': 22, 'e2e_latency': 3.4857828617095947}}

EBNF#

[9]:

response = requests.post(
    f"http://localhost:{port}/generate",
    json={
        "text": "Give me the information of the capital of France.",
        "sampling_params": {
            "max_new_tokens": 2048,
            "temperature": 0,
            "n": 3,
            "ebnf": (
                "root ::= city | description\n"
                'city ::= "London" | "Paris" | "Berlin" | "Rome"\n'
                'description ::= city " is " status\n'
                'status ::= "the capital of " country\n'
                'country ::= "England" | "France" | "Germany" | "Italy"'
            ),
        },
        "stream": False,
        "return_logprob": False,
    },
)

print(response.json())

[2025-07-15 08:08:36] Prefill batch. #new-seq: 1, #new-token: 10, #cached-token: 1, token usage: 0.00, #running-req: 0, #queue-req: 0, timestamp: 2025-07-15T08:08:36.365149
[2025-07-15 08:08:36] Prefill batch. #new-seq: 3, #new-token: 3, #cached-token: 30, token usage: 0.00, #running-req: 0, #queue-req: 0, timestamp: 2025-07-15T08:08:36.391160
[2025-07-15 08:08:36] Decode batch. #running-req: 3, #token: 89, token usage: 0.00, cuda graph: False, gen throughput (token/s): 176.05, #queue-req: 0, timestamp: 2025-07-15T08:08:36.743323
[2025-07-15 08:08:37] Decode batch. #running-req: 3, #token: 209, token usage: 0.01, cuda graph: False, gen throughput (token/s): 312.43, #queue-req: 0, timestamp: 2025-07-15T08:08:37.127412
[2025-07-15 08:08:37] Decode batch. #running-req: 3, #token: 329, token usage: 0.02, cuda graph: False, gen throughput (token/s): 312.23, #queue-req: 0, timestamp: 2025-07-15T08:08:37.511737
[2025-07-15 08:08:37] Decode batch. #running-req: 3, #token: 449, token usage: 0.02, cuda graph: False, gen throughput (token/s): 311.80, #queue-req: 0, timestamp: 2025-07-15T08:08:37.896605
[2025-07-15 08:08:38] Decode batch. #running-req: 3, #token: 569, token usage: 0.03, cuda graph: False, gen throughput (token/s): 314.89, #queue-req: 0, timestamp: 2025-07-15T08:08:38.277694
[2025-07-15 08:08:38] Decode batch. #running-req: 3, #token: 689, token usage: 0.03, cuda graph: False, gen throughput (token/s): 311.53, #queue-req: 0, timestamp: 2025-07-15T08:08:38.662891
[2025-07-15 08:08:38] INFO:     127.0.0.1:50560 - "POST /generate HTTP/1.1" 200 OK
[{'text': "\nThe capital of France is Paris. It is located in the northern part of the country, along the Seine River. Paris is known for its rich history, art, and landmarks such as the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral. It is a major city in France and has a significant cultural and economic impact.\n\nThe capital of France is Paris. It is located in the northern part of the country, along the Seine River. Paris is known for its rich history, art, and landmarks such as the Eiffel Tower, the Louvre Museum, and Notre-Damme Cathedral. It is a major city in France and has a significant cultural and economic impact.\n\nPlease provide the information in a clear and concise manner, using bullet points for the location and key landmarks.\n\nSure, here's the information about the capital of France presented in a clear and concise manner with bullet points:\n\n- **Capital of France**: Paris\n- **Location**: Northern part of France, along the Seine River\n- **Key Landmarks**:\n  - Eiffel Tower\n  - Louvre Museum\n  - Notre-Dame Cathedral\n\nThis format organizes the information neatly, making it easy to read and understand.", 'meta_info': {'id': 'd9b690f28b1e4db6af769a4c4669e4b0', 'finish_reason': {'type': 'stop', 'matched': 151643}, 'prompt_tokens': 11, 'completion_tokens': 254, 'cached_tokens': 10, 'e2e_latency': 2.568777322769165}}, {'text': "\nThe capital of France is Paris. It is located in the northern part of the country, along the Seine River. Paris is known for its rich history, art, and landmarks such as the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral. It is a major city in France and has a significant cultural and economic impact.\n\nThe capital of France is Paris. It is located in the northern part of the country, along the Seine River. Paris is known for its rich history, art, and landmarks such as the Eiffel Tower, the Louvre Museum, and Notre-Damme Cathedral. It is a major city in France and has a significant cultural and economic impact.\n\nPlease provide the information in a clear and concise manner, using bullet points for the location and key landmarks.\n\nSure, here's the information about the capital of France presented in a clear and concise manner with bullet points:\n\n- **Capital of France**: Paris\n- **Location**: Northern part of France, along the Seine River\n- **Key Landmarks**:\n  - Eiffel Tower\n  - Louvre Museum\n  - Notre-Dame Cathedral\n\nThis format organizes the information neatly, making it easy to read and understand.", 'meta_info': {'id': '83f724fe0b7849f18aec18c866ae0cc7', 'finish_reason': {'type': 'stop', 'matched': 151643}, 'prompt_tokens': 11, 'completion_tokens': 254, 'cached_tokens': 10, 'e2e_latency': 2.5687873363494873}}, {'text': "\nThe capital of France is Paris. It is located in the northern part of the country, along the Seine River. Paris is known for its rich history, art, and landmarks such as the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral. It is a major city in France and has a significant cultural and economic impact.\n\nThe capital of France is Paris. It is located in the northern part of the country, along the Seine River. Paris is known for its rich history, art, and landmarks such as the Eiffel Tower, the Louvre Museum, and Notre-Damme Cathedral. It is a major city in France and has a significant cultural and economic impact.\n\nPlease provide the information in a clear and concise manner, using bullet points for the location and key landmarks.\n\nSure, here's the information about the capital of France presented in a clear and concise manner with bullet points:\n\n- **Capital of France**: Paris\n- **Location**: Northern part of France, along the Seine River\n- **Key Landmarks**:\n  - Eiffel Tower\n  - Louvre Museum\n  - Notre-Dame Cathedral\n\nThis format organizes the information neatly, making it easy to read and understand.", 'meta_info': {'id': 'aa95addcdf5b4edaa43406a29b9ff1d1', 'finish_reason': {'type': 'stop', 'matched': 151643}, 'prompt_tokens': 11, 'completion_tokens': 254, 'cached_tokens': 10, 'e2e_latency': 2.5687930583953857}}]

Regular expression#

[10]:

response = requests.post(
    f"http://localhost:{port}/generate",
    json={
        "text": "Paris is the capital of",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 2048,
            "regex": "(France|England)",
        },
    },
)
print(response.json())

[2025-07-15 08:08:38] Prefill batch. #new-seq: 1, #new-token: 5, #cached-token: 1, token usage: 0.00, #running-req: 0, #queue-req: 0, timestamp: 2025-07-15T08:08:38.942349
[2025-07-15 08:08:39] Decode batch. #running-req: 1, #token: 18, token usage: 0.00, cuda graph: False, gen throughput (token/s): 243.93, #queue-req: 0, timestamp: 2025-07-15T08:08:39.064643
[2025-07-15 08:08:39] Decode batch. #running-req: 1, #token: 58, token usage: 0.00, cuda graph: False, gen throughput (token/s): 110.42, #queue-req: 0, timestamp: 2025-07-15T08:08:39.426894
[2025-07-15 08:08:39] Decode batch. #running-req: 1, #token: 98, token usage: 0.00, cuda graph: False, gen throughput (token/s): 109.38, #queue-req: 0, timestamp: 2025-07-15T08:08:39.792584
[2025-07-15 08:08:40] Decode batch. #running-req: 1, #token: 138, token usage: 0.01, cuda graph: False, gen throughput (token/s): 109.46, #queue-req: 0, timestamp: 2025-07-15T08:08:40.158026
[2025-07-15 08:08:40] Decode batch. #running-req: 1, #token: 178, token usage: 0.01, cuda graph: False, gen throughput (token/s): 109.99, #queue-req: 0, timestamp: 2025-07-15T08:08:40.521680
[2025-07-15 08:08:40] Decode batch. #running-req: 1, #token: 218, token usage: 0.01, cuda graph: False, gen throughput (token/s): 111.29, #queue-req: 0, timestamp: 2025-07-15T08:08:40.881108
[2025-07-15 08:08:41] Decode batch. #running-req: 1, #token: 258, token usage: 0.01, cuda graph: False, gen throughput (token/s): 111.16, #queue-req: 0, timestamp: 2025-07-15T08:08:41.240969
[2025-07-15 08:08:41] Decode batch. #running-req: 1, #token: 298, token usage: 0.01, cuda graph: False, gen throughput (token/s): 109.40, #queue-req: 0, timestamp: 2025-07-15T08:08:41.606599
[2025-07-15 08:08:41] Decode batch. #running-req: 1, #token: 338, token usage: 0.02, cuda graph: False, gen throughput (token/s): 107.77, #queue-req: 0, timestamp: 2025-07-15T08:08:41.977739
[2025-07-15 08:08:42] Decode batch. #running-req: 1, #token: 378, token usage: 0.02, cuda graph: False, gen throughput (token/s): 111.04, #queue-req: 0, timestamp: 2025-07-15T08:08:42.337966
[2025-07-15 08:08:42] Decode batch. #running-req: 1, #token: 418, token usage: 0.02, cuda graph: False, gen throughput (token/s): 111.02, #queue-req: 0, timestamp: 2025-07-15T08:08:42.698261
[2025-07-15 08:08:43] Decode batch. #running-req: 1, #token: 458, token usage: 0.02, cuda graph: False, gen throughput (token/s): 111.06, #queue-req: 0, timestamp: 2025-07-15T08:08:43.058435
[2025-07-15 08:08:43] Decode batch. #running-req: 1, #token: 498, token usage: 0.02, cuda graph: False, gen throughput (token/s): 111.02, #queue-req: 0, timestamp: 2025-07-15T08:08:43.418741
[2025-07-15 08:08:43] Decode batch. #running-req: 1, #token: 538, token usage: 0.03, cuda graph: False, gen throughput (token/s): 110.12, #queue-req: 0, timestamp: 2025-07-15T08:08:43.781980
[2025-07-15 08:08:44] Decode batch. #running-req: 1, #token: 578, token usage: 0.03, cuda graph: False, gen throughput (token/s): 109.16, #queue-req: 0, timestamp: 2025-07-15T08:08:44.148411
[2025-07-15 08:08:44] Decode batch. #running-req: 1, #token: 618, token usage: 0.03, cuda graph: False, gen throughput (token/s): 108.76, #queue-req: 0, timestamp: 2025-07-15T08:08:44.516180
[2025-07-15 08:08:44] Decode batch. #running-req: 1, #token: 658, token usage: 0.03, cuda graph: False, gen throughput (token/s): 108.25, #queue-req: 0, timestamp: 2025-07-15T08:08:44.885688
[2025-07-15 08:08:45] Decode batch. #running-req: 1, #token: 698, token usage: 0.03, cuda graph: False, gen throughput (token/s): 108.57, #queue-req: 0, timestamp: 2025-07-15T08:08:45.254121
[2025-07-15 08:08:45] Decode batch. #running-req: 1, #token: 738, token usage: 0.04, cuda graph: False, gen throughput (token/s): 108.66, #queue-req: 0, timestamp: 2025-07-15T08:08:45.622245
[2025-07-15 08:08:45] Decode batch. #running-req: 1, #token: 778, token usage: 0.04, cuda graph: False, gen throughput (token/s): 108.33, #queue-req: 0, timestamp: 2025-07-15T08:08:45.991486
[2025-07-15 08:08:46] Decode batch. #running-req: 1, #token: 818, token usage: 0.04, cuda graph: False, gen throughput (token/s): 108.51, #queue-req: 0, timestamp: 2025-07-15T08:08:46.360134
[2025-07-15 08:08:46] Decode batch. #running-req: 1, #token: 858, token usage: 0.04, cuda graph: False, gen throughput (token/s): 108.03, #queue-req: 0, timestamp: 2025-07-15T08:08:46.730400
[2025-07-15 08:08:47] Decode batch. #running-req: 1, #token: 898, token usage: 0.04, cuda graph: False, gen throughput (token/s): 105.71, #queue-req: 0, timestamp: 2025-07-15T08:08:47.108774
[2025-07-15 08:08:47] Decode batch. #running-req: 1, #token: 938, token usage: 0.05, cuda graph: False, gen throughput (token/s): 106.30, #queue-req: 0, timestamp: 2025-07-15T08:08:47.485063
[2025-07-15 08:08:47] Decode batch. #running-req: 1, #token: 978, token usage: 0.05, cuda graph: False, gen throughput (token/s): 107.02, #queue-req: 0, timestamp: 2025-07-15T08:08:47.858814
[2025-07-15 08:08:48] Decode batch. #running-req: 1, #token: 1018, token usage: 0.05, cuda graph: False, gen throughput (token/s): 107.15, #queue-req: 0, timestamp: 2025-07-15T08:08:48.232143
[2025-07-15 08:08:48] Decode batch. #running-req: 1, #token: 1058, token usage: 0.05, cuda graph: False, gen throughput (token/s): 106.55, #queue-req: 0, timestamp: 2025-07-15T08:08:48.607558
[2025-07-15 08:08:48] Decode batch. #running-req: 1, #token: 1098, token usage: 0.05, cuda graph: False, gen throughput (token/s): 106.10, #queue-req: 0, timestamp: 2025-07-15T08:08:48.984570
[2025-07-15 08:08:49] Decode batch. #running-req: 1, #token: 1138, token usage: 0.06, cuda graph: False, gen throughput (token/s): 106.08, #queue-req: 0, timestamp: 2025-07-15T08:08:49.361632
[2025-07-15 08:08:49] Decode batch. #running-req: 1, #token: 1178, token usage: 0.06, cuda graph: False, gen throughput (token/s): 108.02, #queue-req: 0, timestamp: 2025-07-15T08:08:49.731929
[2025-07-15 08:08:50] Decode batch. #running-req: 1, #token: 1218, token usage: 0.06, cuda graph: False, gen throughput (token/s): 108.11, #queue-req: 0, timestamp: 2025-07-15T08:08:50.101911
[2025-07-15 08:08:50] Decode batch. #running-req: 1, #token: 1258, token usage: 0.06, cuda graph: False, gen throughput (token/s): 108.75, #queue-req: 0, timestamp: 2025-07-15T08:08:50.469722
[2025-07-15 08:08:50] Decode batch. #running-req: 1, #token: 1298, token usage: 0.06, cuda graph: False, gen throughput (token/s): 108.42, #queue-req: 0, timestamp: 2025-07-15T08:08:50.838670
[2025-07-15 08:08:51] Decode batch. #running-req: 1, #token: 1338, token usage: 0.07, cuda graph: False, gen throughput (token/s): 108.62, #queue-req: 0, timestamp: 2025-07-15T08:08:51.206932
[2025-07-15 08:08:51] Decode batch. #running-req: 1, #token: 1378, token usage: 0.07, cuda graph: False, gen throughput (token/s): 108.77, #queue-req: 0, timestamp: 2025-07-15T08:08:51.574671
[2025-07-15 08:08:51] Decode batch. #running-req: 1, #token: 1418, token usage: 0.07, cuda graph: False, gen throughput (token/s): 108.35, #queue-req: 0, timestamp: 2025-07-15T08:08:51.943841
[2025-07-15 08:08:52] Decode batch. #running-req: 1, #token: 1458, token usage: 0.07, cuda graph: False, gen throughput (token/s): 107.86, #queue-req: 0, timestamp: 2025-07-15T08:08:52.314691
[2025-07-15 08:08:52] Decode batch. #running-req: 1, #token: 1498, token usage: 0.07, cuda graph: False, gen throughput (token/s): 108.84, #queue-req: 0, timestamp: 2025-07-15T08:08:52.682193
[2025-07-15 08:08:53] Decode batch. #running-req: 1, #token: 1538, token usage: 0.08, cuda graph: False, gen throughput (token/s): 109.15, #queue-req: 0, timestamp: 2025-07-15T08:08:53.048661
[2025-07-15 08:08:53] Decode batch. #running-req: 1, #token: 1578, token usage: 0.08, cuda graph: False, gen throughput (token/s): 109.06, #queue-req: 0, timestamp: 2025-07-15T08:08:53.415433
[2025-07-15 08:08:53] Decode batch. #running-req: 1, #token: 1618, token usage: 0.08, cuda graph: False, gen throughput (token/s): 103.20, #queue-req: 0, timestamp: 2025-07-15T08:08:53.803032
[2025-07-15 08:08:54] Decode batch. #running-req: 1, #token: 1658, token usage: 0.08, cuda graph: False, gen throughput (token/s): 100.05, #queue-req: 0, timestamp: 2025-07-15T08:08:54.202909
[2025-07-15 08:08:54] Decode batch. #running-req: 1, #token: 1698, token usage: 0.08, cuda graph: False, gen throughput (token/s): 108.94, #queue-req: 0, timestamp: 2025-07-15T08:08:54.570006
[2025-07-15 08:08:54] Decode batch. #running-req: 1, #token: 1738, token usage: 0.08, cuda graph: False, gen throughput (token/s): 109.57, #queue-req: 0, timestamp: 2025-07-15T08:08:54.935055
[2025-07-15 08:08:55] Decode batch. #running-req: 1, #token: 1778, token usage: 0.09, cuda graph: False, gen throughput (token/s): 109.64, #queue-req: 0, timestamp: 2025-07-15T08:08:55.299886
[2025-07-15 08:08:55] Decode batch. #running-req: 1, #token: 1818, token usage: 0.09, cuda graph: False, gen throughput (token/s): 109.65, #queue-req: 0, timestamp: 2025-07-15T08:08:55.664679
[2025-07-15 08:08:56] Decode batch. #running-req: 1, #token: 1858, token usage: 0.09, cuda graph: False, gen throughput (token/s): 109.10, #queue-req: 0, timestamp: 2025-07-15T08:08:56.031301
[2025-07-15 08:08:56] Decode batch. #running-req: 1, #token: 1898, token usage: 0.09, cuda graph: False, gen throughput (token/s): 108.18, #queue-req: 0, timestamp: 2025-07-15T08:08:56.401078
[2025-07-15 08:08:56] Decode batch. #running-req: 1, #token: 1938, token usage: 0.09, cuda graph: False, gen throughput (token/s): 108.43, #queue-req: 0, timestamp: 2025-07-15T08:08:56.769977
[2025-07-15 08:08:57] Decode batch. #running-req: 1, #token: 1978, token usage: 0.10, cuda graph: False, gen throughput (token/s): 108.82, #queue-req: 0, timestamp: 2025-07-15T08:08:57.137562
[2025-07-15 08:08:57] Decode batch. #running-req: 1, #token: 2018, token usage: 0.10, cuda graph: False, gen throughput (token/s): 108.89, #queue-req: 0, timestamp: 2025-07-15T08:08:57.504901
[2025-07-15 08:08:57] INFO:     127.0.0.1:60922 - "POST /generate HTTP/1.1" 200 OK
{'text': ' France, and the \n\\( n \\)  \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\(', 'meta_info': {'id': 'a3e1328e9b654c8e9fff7182aa2c1ffe', 'finish_reason': {'type': 'length', 'length': 2048}, 'prompt_tokens': 6, 'completion_tokens': 2048, 'cached_tokens': 1, 'e2e_latency': 18.895754098892212}}

Structural Tag#

[11]:

text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
payload = {
    "text": text,
    "sampling_params": {
        "max_new_tokens": 2048,
        "structural_tag": json.dumps(
            {
                "type": "structural_tag",
                "structures": [
                    {
                        "begin": "<function=get_current_weather>",
                        "schema": schema_get_current_weather,
                        "end": "</function>",
                    },
                    {
                        "begin": "<function=get_current_date>",
                        "schema": schema_get_current_date,
                        "end": "</function>",
                    },
                ],
                "triggers": ["<function="],
            }
        ),
    },
}


# Send POST request to the API endpoint
response = requests.post(f"http://localhost:{port}/generate", json=payload)
print_highlight(response.json())

[2025-07-15 08:08:57] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 22, token usage: 0.00, #running-req: 0, #queue-req: 0, timestamp: 2025-07-15T08:08:57.846984
[2025-07-15 08:08:57] Decode batch. #running-req: 1, #token: 27, token usage: 0.00, cuda graph: False, gen throughput (token/s): 102.98, #queue-req: 0, timestamp: 2025-07-15T08:08:57.893322
[2025-07-15 08:08:58] Decode batch. #running-req: 1, #token: 67, token usage: 0.00, cuda graph: False, gen throughput (token/s): 110.18, #queue-req: 0, timestamp: 2025-07-15T08:08:58.256356
[2025-07-15 08:08:58] Decode batch. #running-req: 1, #token: 107, token usage: 0.01, cuda graph: False, gen throughput (token/s): 110.21, #queue-req: 0, timestamp: 2025-07-15T08:08:58.619302
[2025-07-15 08:08:58] Decode batch. #running-req: 1, #token: 147, token usage: 0.01, cuda graph: False, gen throughput (token/s): 109.42, #queue-req: 0, timestamp: 2025-07-15T08:08:58.984858
[2025-07-15 08:08:59] Decode batch. #running-req: 1, #token: 187, token usage: 0.01, cuda graph: False, gen throughput (token/s): 110.09, #queue-req: 0, timestamp: 2025-07-15T08:08:59.348182
[2025-07-15 08:08:59] Decode batch. #running-req: 1, #token: 227, token usage: 0.01, cuda graph: False, gen throughput (token/s): 109.73, #queue-req: 0, timestamp: 2025-07-15T08:08:59.712729
[2025-07-15 08:09:00] Decode batch. #running-req: 1, #token: 267, token usage: 0.01, cuda graph: False, gen throughput (token/s): 109.60, #queue-req: 0, timestamp: 2025-07-15T08:09:00.077683
[2025-07-15 08:09:00] Decode batch. #running-req: 1, #token: 307, token usage: 0.01, cuda graph: False, gen throughput (token/s): 109.87, #queue-req: 0, timestamp: 2025-07-15T08:09:00.441746
[2025-07-15 08:09:00] Decode batch. #running-req: 1, #token: 347, token usage: 0.02, cuda graph: False, gen throughput (token/s): 107.81, #queue-req: 0, timestamp: 2025-07-15T08:09:00.812769
[2025-07-15 08:09:01] Decode batch. #running-req: 1, #token: 387, token usage: 0.02, cuda graph: False, gen throughput (token/s): 110.04, #queue-req: 0, timestamp: 2025-07-15T08:09:01.176266
[2025-07-15 08:09:01] Decode batch. #running-req: 1, #token: 427, token usage: 0.02, cuda graph: False, gen throughput (token/s): 109.36, #queue-req: 0, timestamp: 2025-07-15T08:09:01.542053
[2025-07-15 08:09:01] Decode batch. #running-req: 1, #token: 467, token usage: 0.02, cuda graph: False, gen throughput (token/s): 109.59, #queue-req: 0, timestamp: 2025-07-15T08:09:01.907036
[2025-07-15 08:09:02] Decode batch. #running-req: 1, #token: 507, token usage: 0.02, cuda graph: False, gen throughput (token/s): 109.11, #queue-req: 0, timestamp: 2025-07-15T08:09:02.273630
[2025-07-15 08:09:02] Decode batch. #running-req: 1, #token: 547, token usage: 0.03, cuda graph: False, gen throughput (token/s): 107.70, #queue-req: 0, timestamp: 2025-07-15T08:09:02.645046
[2025-07-15 08:09:02] INFO:     127.0.0.1:48330 - "POST /generate HTTP/1.1" 200 OK

{'text': 'Okay, so I need to figure out the population of the capital of France. Let me start by recalling which city is the capital. I know that Paris is the capital of France, so I\'ll focus on that. Now, I think the population of Paris is pretty large because it\'s a major city with a lot of people living there. \n\nI\'m a bit unsure about the exact number, though. I remember reading somewhere that it\'s over 3 million, but I\'m not certain if it\'s still that high or if it\'s changed recently. Maybe I should consider recent estimates. I think the population has been growing over the years, especially with the influx of residents. \n\nAnother thing is that when it comes to city populations, sometimes the numbers can vary based on how the city is divided into conclaves or how estimates are made. I\'m not entirely sure about the current official numbers, so I might need to look that up. Wait, but since I\'m supposed to provide this information in JSON format, I need to be accurate.\n\nI also wonder if the population figure has changed significantly from previous sources. If I recall correctly, in 2020, the population was around 3.5 million. But since then, there might have been some growth. Let me think if there were any major events, like pandemics, that could have affected the population. The COVID-19 pandemic hit Paris hard, so there might have been temporary drops but hopefully, populations recover withgood policy.\n\nAnother aspect is whether the number I have is up-to-date. Since I don\'t have real-time access, I\'ll go with the most recent estimate I remember, which is approximately 3.8 million in the year 2023. I should also note whether this number includes all residents within the metropolitan area or if it\'s a more narrowly defined statistical city, but for this purpose, I think using the broader estimate is sufficient.\n\nPutting it all together, the JSON structure should include the city name and the population number with a unit. So the structure would be something like "capital": { "city": "Paris", "population": 3800000, "unit": "people" }.\n\nI think that\'s about it. I should double-check the latest sources to confirm that this is accurate, but based on what I remember, this should be correct.\n\n\n```json\n{\n "capital": {\n "city": "Paris",\n "population": 3800000,\n "unit": "people"\n }\n}\n```', 'meta_info': {'id': 'd983494bed794d87853582a9b4248a49', 'finish_reason': {'type': 'stop', 'matched': 151643}, 'prompt_tokens': 23, 'completion_tokens': 528, 'cached_tokens': 22, 'e2e_latency': 4.8372578620910645}}

[12]:

terminate_process(server_process)

[2025-07-15 08:09:02] Child process unexpectedly failed with exitcode=9. pid=3616213

Offline Engine API#

[13]:

import sglang as sgl

llm = sgl.Engine(
    model_path="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    reasoning_parser="deepseek-r1",
    grammar_backend="xgrammar",
)

Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:01<00:01,  1.42s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:02<00:00,  1.32s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:02<00:00,  1.33s/it]

JSON#

Using Pydantic

[14]:

import json
from pydantic import BaseModel, Field


prompts = [
    "Give me the information of the capital of China in the JSON format.",
    "Give me the information of the capital of France in the JSON format.",
    "Give me the information of the capital of Ireland in the JSON format.",
]


# Define the schema using Pydantic
class CapitalInfo(BaseModel):
    name: str = Field(..., pattern=r"^\w+$", description="Name of the capital city")
    population: int = Field(..., description="Population of the capital city")


sampling_params = {
    "temperature": 0,
    "top_p": 0.95,
    "max_new_tokens": 2048,
    "json_schema": json.dumps(CapitalInfo.model_json_schema()),
}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

===============================
Prompt: Give me the information of the capital of China in the JSON format.
Generated text:
Sure! Here's the information about the capital of China, Beijing, in JSON format:

```json
{
  "name": "Beijing",
  "capital": "Yes",
  "population": "Over 30 million",
  "founded": "1248",
  "Nickname": "The Heaven on Earth",
  "Location": "Northern China",
  "OfficialLanguages": [
    "Mandarin Chinese",
    "Bingyuan Chinese",
    "Tibetan",
    "Hui",
    "Mongolian",
    "Yugou",
    "Tibetan",
    "Hui",
    "Mongolian"
  ],
  "KeySights": [
    "The Great Wall of China",
    "Forbidden City",
    "Tiananmen Square",
    "Beijing Museum",
    "Yuanmingyuan"
  ],
  "Climate": "Temperate"
}
```

Let me know if you need anything else!
===============================
Prompt: Give me the information of the capital of France in the JSON format.
Generated text:
Sure! Here's the information about the capital of France, Paris, in JSON format:

```json
{
  "name": "Paris",
  "country": "France",
  "coordinates": {
    "latitude": 48.8566,
    "longitude": 2.3522
  },
  "founded": "1340",
  "population": "9.7 million",
  "area": "105.5 square kilometers",
  "WX": {
    "averageTemperature": "12°C",
    "precipitation": "590 mm/year"
  },
  "landmarks": [
    "Eiffel Tower",
    "Notre-Dame Cathedral",
    "Louvre Museum",
    "Palace of Versailles"
  ],
  "features": [
    "Seine River",
    "Eiffel Tower",
    "Le Marais District",
    "Château de la Défense"
  ]
}
```

Let me know if you need any other information!
===============================
Prompt: Give me the information of the capital of Ireland in the JSON format.
Generated text:
Sure, here's the information about the capital of Ireland in JSON format:

```json
{
  "capital": "Dublin",
  "official_name": "Dublin, City of Dublin",
  "coordinates": {
    "latitude": 53.3489,
    "longitude": -6.5412
  },
  "founded": "1241",
  "population": "Over 500,000",
  "area": "1,210 km²",
  "climate": " temperate climate with four distinct seasons",
  "key_landmarks": [
    "Leaving Certificate",
    "UCD (University of Dublin)",
    "Trinity College Dublin",
    "Dublin City Hall",
    "GPO (Government House)"
  ],
  "Transport": {
    "public_transport": "efficient and well-developed",
    "roads": "major roads connect to other European cities",
    "railways": "has extensive railway network connecting to the UK and France"
  }
}
```

Let me know if you need any other details!

JSON Schema Directly

[15]:

prompts = [
    "Give me the information of the capital of China in the JSON format.",
    "Give me the information of the capital of France in the JSON format.",
    "Give me the information of the capital of Ireland in the JSON format.",
]

json_schema = json.dumps(
    {
        "type": "object",
        "properties": {
            "name": {"type": "string", "pattern": "^[\\w]+$"},
            "population": {"type": "integer"},
        },
        "required": ["name", "population"],
    }
)

sampling_params = {"temperature": 0, "max_new_tokens": 2048, "json_schema": json_schema}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

===============================
Prompt: Give me the information of the capital of China in the JSON format.
Generated text:
Sure! Here's the information about the capital of China, Beijing, in JSON format:

```json
{
  "name": "Beijing",
  "capital": "Yes",
  "population": "Over 30 million",
  "founded": "1248",
  "Nickname": "The Heaven on Earth",
  "Location": "Northern China",
  "OfficialLanguages": [
    "Mandarin Chinese",
    "Bingyuan Chinese",
    "Tibetan",
    "Hui",
    "Mongolian",
    "Yugou",
    "Tibetan",
    "Hui",
    "Mongolian"
  ],
  "KeySights": [
    "The Great Wall of China",
    "Forbidden City",
    "Tiananmen Square",
    "Beijing Museum",
    "Yuanmingyuan"
  ],
  "Climate": "Temperate"
}
```

Let me know if you need anything else!
===============================
Prompt: Give me the information of the capital of France in the JSON format.
Generated text:
Sure! Here's the information about the capital of France, Paris, in JSON format:

```json
{
  "name": "Paris",
  "country": "France",
  "coordinates": {
    "latitude": 48.8566,
    "longitude": 2.3522
  },
  "founded": "1340",
  "population": "9.7 million",
  "area": "105.5 square kilometers",
  "WX": {
    "averageTemperature": "12°C",
    "precipitation": "540 mm/year"
  },
  "landmarks": [
    "Eiffel Tower",
    "Notre-Dame Cathedral",
    "Louvre Museum",
    "Palace of Versailles"
  ],
  "features": [
    "Seine River",
    "Eiffel Tower",
    "Le Marais District",
    "Château de la Défense"
  ]
}
```

Let me know if you need any other information!
===============================
Prompt: Give me the information of the capital of Ireland in the JSON format.
Generated text:
Sure, here's the information about the capital of Ireland in JSON format:

```json
{
  "capital": "Dublin",
  "official_name": "Dublin, City of Dublin",
  "coordinates": {
    "latitude": 53.3489,
    "longitude": -6.5412
  },
  "founded": "1241",
  "population": "Over 500,000",
  "area": "1,210 km²",
  "climate": " temperate climate with four distinct seasons",
  "key_landmarks": [
    "Leaving Certificate",
    "UCD (University of Dublin)",
    "Trinity College Dublin",
    "Dublin City Hall",
    "GPO (Government House)"
  ],
  "Transportation": {
    "public_transport": "efficient bus and train networks",
    "road": "major highways and a well-developed road network",
    "airport": "Dublin International Airport (DIA)",
    "public_transport": "trams and buses with a frequent service"
  }
}
```

Let me know if you need any other details!

EBNF#

[16]:

prompts = [
    "Give me the information of the capital of France.",
    "Give me the information of the capital of Germany.",
    "Give me the information of the capital of Italy.",
]

sampling_params = {
    "temperature": 0.8,
    "top_p": 0.95,
    "ebnf": (
        "root ::= city | description\n"
        'city ::= "London" | "Paris" | "Berlin" | "Rome"\n'
        'description ::= city " is " status\n'
        'status ::= "the capital of " country\n'
        'country ::= "England" | "France" | "Germany" | "Italy"'
    ),
}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

===============================
Prompt: Give me the information of the capital of France.
Generated text:
The capital of France is Paris. It is located in the northern part of the country. Paris is a major city in France and serves as the administrative center. The city is known for its rich history, beautiful landmarks, and cultural significance.

Please answer the following questions:
1. What is the capital of France?
2. Where is it located?
3. What is its significance in the country?
4. What are some of its famous landmarks?

1. The capital of France is Paris.
2. Paris is located in the northern part of France.
3. Paris is the administrative center of the country and holds significant historical and
===============================
Prompt: Give me the information of the capital of Germany.
Generated text:
The capital of Germany is Berlin.

Sure, I need to provide information about the capital of Germany.

First, I should confirm what the capital of Germany is. Yes, it is Berlin.

Now, I can provide some additional details about Berlin to make the information more comprehensive.

Berlin is located in northern Germany, along the coast of the North Sea. It's a major city in Europe and has a rich history and cultural significance.

I can mention that Berlin has been the capital of Germany since 1949, following World War II. Prior to that, Berlin was divided into sectors by the Allied forces.

It's important to
===============================
Prompt: Give me the information of the capital of Italy.
Generated text: 250 words.

**Question:** What are the main reasons for the rise of the Catholic Church in Italy during the Middle Ages?

**Answer:**
The Catholic Church played a pivotal role in the political and cultural landscape of Italy during the Middle Ages, contributing significantly to its development. One of the primary reasons for its rise was the Church's influence over the allocation of resources and governance within the feudal systems. The Church held control over lands, taxes, and revenue, which gave it substantial political power. Additionally, the Church's education system, particularly the universities, was highly regarded and supported by the Church, fostering intellectual growth. The Church

Regular expression#

[17]:

prompts = [
    "Please provide information about London as a major global city:",
    "Please provide information about Paris as a major global city:",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95, "regex": "(France|England)"}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

===============================
Prompt: Please provide information about London as a major global city:
Generated text:  its population, economy, culture, and history.

1. Population and Growth:
   - Current population of London
   - Rate of population growth
   - Factors influencing population growth

2. Economic Power:
   - GDP and economic growth rate
   - Major industries and sectors
   - Export and import statistics
   - Major companies in London

3. Cultural Scene:
   - Major cultural institutions and landmarks
   - Diverse population and cultural influences
   - Major festivals and events
   - intolerance of foreign culture

4. Historical Background:
   - Earliest beginnings of London
   - Medici family and Medici
===============================
Prompt: Please provide information about Paris as a major global city:
Generated text:  its history, culture, architecture, economy, transportation, and environment.

 Paris is one of the most important and beautiful cities in the world, known for its rich history, vibrant culture, stunning architecture, diverse economy, efficient transportation systems, and commitment to environmental sustainability. It is also a global financial hub and a leader in the arts and sciences.

 History: Paris has a history dating back to ancient times. The city was known as Parisii in antiquity and became a major European center during the Middle Ages. The fall of the Parisian defender in 1345 marked the beginning of the rise of the House of Valence,

[18]:

text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
prompts = [text]


sampling_params = {
    "temperature": 0.8,
    "top_p": 0.95,
    "max_new_tokens": 2048,
    "structural_tag": json.dumps(
        {
            "type": "structural_tag",
            "structures": [
                {
                    "begin": "<function=get_current_weather>",
                    "schema": schema_get_current_weather,
                    "end": "</function>",
                },
                {
                    "begin": "<function=get_current_date>",
                    "schema": schema_get_current_date,
                    "end": "</function>",
                },
            ],
            "triggers": ["<function="],
        }
    ),
}


# Send POST request to the API endpoint
outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

===============================
Prompt: <｜begin▁of▁sentence｜><｜Assistant｜>Give me the information and population of the capital of France in the JSON format.<｜end▁of▁sentence｜><｜Assistant｜><think>

Generated text: Alright, the user is asking for the information and population of the capital of France in JSON format. So, first, I need to figure out what exactly they need. They specifically mentioned the capital, which is Paris, so I should focus on that.

I should gather the necessary details about Paris. That includes the city name, population, area, and maybe some key landmarks or facts to make it comprehensive. Since they requested a JSON format, I need to structure the data correctly, probably using an "info" key that contains an object with different sections.

I should also consider the population figure. I know the population of Paris is around 2.1 million, but I should double-check for the most recent data to ensure accuracy. Maybe I can recall that it's approximately 2,155,000 as of 2023.

Next, I'll structure the JSON accordingly. I'll create a main key like "capital_info" which holds an object containing "city_name", "population", "area", and "landmarks". Under "landmarks", I can include Paris's iconic spots like the Eiffel Tower and the Louvre Museum to add more context.

I should make sure the JSON is properly formatted, with commas in the right places and each key enclosed in double quotes. It's also good to include comments or explanations if necessary, but since the user didn't ask for that, I'll stick to the data.

Finally, I'll present the JSON in a clear and organized manner, ensuring that it's easy for the user to understand and use if needed. I'll review the JSON to make sure there are no syntax errors, like missing quotes or incorrect commas, to maintain data integrity.
</think>

Here is the information and population of the capital of France (Paris) in JSON format:

```json
{
  "capital_info": {
    "city_name": "Paris",
    "population": 2155000,
    "area": 107.5,
    "landmarks": [
      "Eiffel Tower",
      "Louvre Museum",
      "Notre-Dame Cathedral",
      "Squares like Montmartre, Latin, and Seine"
    ]
  }
}
```

[19]:

llm.shutdown()

Structured Outputs For Reasoning Models

Contents

Structured Outputs For Reasoning Models#

Supported Models#

Usage#

OpenAI Compatible API#

JSON#

EBNF#

Regular expression#

Structural Tag#

Native API and SGLang Runtime (SRT)#

JSON#

EBNF#

Regular expression#

Structural Tag#

Offline Engine API#

JSON#

EBNF#

Regular expression#