Structured Outputs For Reasoning Models#

When working with reasoning models that use special tokens like <think>...</think> to denote reasoning sections, you might want to allow free-form text within these sections while still enforcing grammar constraints on the rest of the output.

SGLang provides a feature to disable grammar restrictions within reasoning sections. This is particularly useful for models that need to perform complex reasoning steps before providing a structured output.

To enable this feature, specify the --reasoning-parser flag when launching the server. The reasoning parser determines the end-of-reasoning token (e.g., </think>), and grammar constraints are applied only after that token has been generated.
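
For example, the command below (the same one used in this notebook) launches a DeepSeek-R1 distilled model with the matching reasoning parser:

python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-7B --host 0.0.0.0 --reasoning-parser deepseek-r1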

Supported Models#

Currently, SGLang supports the following reasoning models:

  • DeepSeek R1 series: The reasoning content is wrapped with <think> and </think> tags.

  • QwQ: The reasoning content is wrapped with <think> and </think> tags.

Usage#

OpenAI Compatible API#

Specify the --grammar-backend and --reasoning-parser options.

[1]:
import openai
import os
from sglang.test.test_utils import is_in_ci

if is_in_ci():
    from patch import launch_server_cmd
else:
    from sglang.utils import launch_server_cmd

from sglang.utils import wait_for_server, print_highlight, terminate_process

os.environ["TOKENIZERS_PARALLELISM"] = "false"


server_process, port = launch_server_cmd(
    "python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-7B --host 0.0.0.0 --reasoning-parser deepseek-r1"
)

wait_for_server(f"http://localhost:{port}")
client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")
[2025-04-25 07:49:11] server_args=ServerArgs(model_path='deepseek-ai/DeepSeek-R1-Distill-Qwen-7B', tokenizer_path='deepseek-ai/DeepSeek-R1-Distill-Qwen-7B', tokenizer_mode='auto', skip_tokenizer_init=False, enable_tokenizer_batch_encode=False, load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, quantization_param_path=None, context_length=None, device='cuda', served_model_name='deepseek-ai/DeepSeek-R1-Distill-Qwen-7B', chat_template=None, completion_template=None, is_embedding=False, revision=None, host='0.0.0.0', port=38822, mem_fraction_static=0.88, max_running_requests=200, max_total_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, tp_size=1, stream_interval=1, stream_output=False, random_seed=626545175, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser='deepseek-r1', dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', speculative_algorithm=None, speculative_draft_model_path=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_cuda_graph=True, disable_cuda_graph_padding=False, enable_nccl_nvls=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_multimodal=None, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_ep_moe=False, enable_deepep_moe=False, deepep_mode='auto', enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=None, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, tool_call_parser=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through_selective', flashinfer_mla_disable_ragged=False, warmups=None, moe_dense_tp_size=None, n_share_experts_fusion=0, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_bootstrap_port=8998, disaggregation_transfer_backend='mooncake', disaggregation_ib_device=None)
[2025-04-25 07:49:22 TP0] Attention backend not set. Use fa3 backend by default.
[2025-04-25 07:49:22 TP0] Init torch distributed begin.
[2025-04-25 07:49:23 TP0] Init torch distributed ends. mem usage=0.00 GB
[2025-04-25 07:49:23 TP0] Load weight begin. avail mem=65.41 GB
[2025-04-25 07:49:23 TP0] Ignore import error when loading sglang.srt.models.arctic. No module named 'sglang.srt.layers.fused_moe'
[2025-04-25 07:49:23 TP0] Ignore import error when loading sglang.srt.models.llama4.
[2025-04-25 07:49:23 TP0] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:01<00:01,  1.41s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:02<00:00,  1.34s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:02<00:00,  1.35s/it]

[2025-04-25 07:49:26 TP0] Load weight end. type=Qwen2ForCausalLM, dtype=torch.bfloat16, avail mem=39.29 GB, mem usage=26.13 GB.
[2025-04-25 07:49:26 TP0] KV Cache is allocated. #tokens: 20480, K size: 0.55 GB, V size: 0.55 GB
[2025-04-25 07:49:26 TP0] Memory pool end. avail mem=37.91 GB
[2025-04-25 07:49:27 TP0]

CUDA Graph is DISABLED.
This will cause significant performance degradation.
CUDA Graph should almost never be disabled in most usage scenarios.
If you encounter OOM issues, please try setting --mem-fraction-static to a lower value (such as 0.8 or 0.7) instead of disabling CUDA Graph.

[2025-04-25 07:49:27 TP0] max_total_num_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=200, context_len=131072
[2025-04-25 07:49:27] INFO:     Started server process [183237]
[2025-04-25 07:49:27] INFO:     Waiting for application startup.
[2025-04-25 07:49:27] INFO:     Application startup complete.
[2025-04-25 07:49:27] INFO:     Uvicorn running on http://0.0.0.0:38822 (Press CTRL+C to quit)
[2025-04-25 07:49:28] INFO:     127.0.0.1:51214 - "GET /v1/models HTTP/1.1" 200 OK
[2025-04-25 07:49:28] INFO:     127.0.0.1:51220 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-04-25 07:49:28 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-04-25 07:49:31] INFO:     127.0.0.1:51228 - "POST /generate HTTP/1.1" 200 OK
[2025-04-25 07:49:31] The server is fired up and ready to roll!


NOTE: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are displayed in the original black color, while the notebook outputs are highlighted in blue.
We run these notebooks in a CI parallel environment, so the throughput is not representative of the actual performance.

JSON#

You can directly define a JSON schema or use Pydantic to define and validate the response.

Using Pydantic

[2]:
from pydantic import BaseModel, Field


# Define the schema using Pydantic
class CapitalInfo(BaseModel):
    name: str = Field(..., pattern=r"^\w+$", description="Name of the capital city")
    population: int = Field(..., description="Population of the capital city")


response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    messages=[
        {
            "role": "user",
            "content": "Please generate the information of the capital of France in the JSON format.",
        },
    ],
    temperature=0,
    max_tokens=2048,
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "foo",
            # convert the pydantic model to json schema
            "schema": CapitalInfo.model_json_schema(),
        },
    },
)

print_highlight(
    f"reasoing_content: {response.choices[0].message.reasoning_content}\n\ncontent: {response.choices[0].message.content}"
)
[2025-04-25 07:49:34 TP0] Prefill batch. #new-seq: 1, #new-token: 18, #cached-token: 1, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-04-25 07:49:35 TP0] Decode batch. #running-req: 1, #token: 52, token usage: 0.00, gen throughput (token/s): 5.09, #queue-req: 0,
...
[2025-04-25 07:49:55 TP0] Decode batch. #running-req: 1, #token: 2052, token usage: 0.10, gen throughput (token/s): 96.95, #queue-req: 0,
[2025-04-25 07:49:55] INFO:     127.0.0.1:51230 - "POST /v1/chat/completions HTTP/1.1" 200 OK
reasoning_content: Okay, so I need to generate information about the capital of France, Paris, in JSON format. Let me think about how to approach this. First, I should recall what Paris is known for. It's the main city in France, right? I know it's a major cultural, economic, and political center.

I should probably start by listing the basic facts. The capital of France is Paris, so that's straightforward. The country it's the capital of is France, which I can confirm. The location is in the northern part of the country, near the Seine River. I remember that Paris is located in the Île-de-France region, which is a large area including other cities like Lyon and Marseille.

Next, I should think about the population. I think Paris is the second-largest city in France, after metropolitan Paris, which includes a much larger area. The population numbers might be around 2 million for the city proper and 8 million for the metropolitan area. I should double-check that, but I'm pretty sure that's correct.

Moving on to landmarks, the Eiffel Tower is a must. It's a symbol of the city and the country. The Louvre Museum is another famous landmark, one of the largest art museums in the world. The Paris Opera House is also iconic, especially for its architecture. The Arc de Triomphe is a significant historical monument, and Notre-Dame, despite the recent issues, is still a major attraction, though it's currently undergoing renovations.

I should include some key facts about Paris. It's known for its rich history, being the birthplace of many famous people like Victor Hugo, Ernest Hemingway, and others. It's also a global city with a vibrant cultural scene, hosting events like the French Open and the Tour de France. The cuisine is a big part of its identity, with famous dishes like croissant and boeuf bourguignon.

Transportation is another area. Paris has an extensive public transportation system, including the Métro, which is a large subway network. The RER is another rail network that connects to other cities. Taxis are also a common mode of transportation, and there are bike lanes throughout the city, especially in the Île-de-France region.

I should structure this information into a JSON format. The JSON should have a key for the capital, which is "Paris", and then an object containing the details. I'll list each piece of information as a key-value pair under the "capital" key. I need to make sure the JSON is properly formatted with commas and brackets, and that strings are enclosed in quotes.

Wait, I should also consider the population numbers. I think the population of Paris itself is around 2.1 million, while the metropolitan area is about 8.5 million. I should include that. Also, the area of Paris is approximately 105 square kilometers, and the metropolitan area is about 12,500 square kilometers.

I should also mention the time zone. Paris is in Central European Time (CET) during standard time and Central European Summer Time (CEST) in summer. That's important for international users.

Let me organize all this information into a JSON structure. I'll start with the capital key, then include the population, location, landmarks, key facts, transportation, and area. I'll make sure each key is descriptive and the values are accurate.

I think I've covered all the main points. Now, I'll format it correctly, ensuring that the JSON syntax is correct with proper commas and brackets. I'll avoid any markdown formatting as per the instructions and just present the JSON.


content: {

"name": "Paris",
"population": 214300000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

JSON Schema Directly

[3]:
import json

json_schema = json.dumps(
    {
        "type": "object",
        "properties": {
            "name": {"type": "string", "pattern": "^[\\w]+$"},
            "population": {"type": "integer"},
        },
        "required": ["name", "population"],
    }
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    messages=[
        {
            "role": "user",
            "content": "Give me the information of the capital of France in the JSON format.",
        },
    ],
    temperature=0,
    max_tokens=2048,
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "foo", "schema": json.loads(json_schema)},
    },
)

print_highlight(
    f"reasoing_content: {response.choices[0].message.reasoning_content}\n\ncontent: {response.choices[0].message.content}"
)
[2025-04-25 07:49:55 TP0] Prefill batch. #new-seq: 1, #new-token: 17, #cached-token: 2, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-04-25 07:49:55 TP0] Decode batch. #running-req: 1, #token: 44, token usage: 0.00, gen throughput (token/s): 74.33, #queue-req: 0,
...
[2025-04-25 07:50:15 TP0] Decode batch. #running-req: 1, #token: 2044, token usage: 0.10, gen throughput (token/s): 99.43, #queue-req: 0,
[2025-04-25 07:50:16] INFO:     127.0.0.1:51230 - "POST /v1/chat/completions HTTP/1.1" 200 OK
reasoning_content: Okay, so I need to figure out the information about the capital of France and present it in JSON format. Let me start by recalling what I know about Paris. I know it's the capital, but I should probably double-check that. Yeah, I'm pretty sure Paris is the administrative capital, but sometimes people might confuse it with other cities like Lyon or Marseille. But no, Paris is definitely the official capital.

Now, moving on to the population. I think Paris is a very large city, one of the biggest in the world. I remember reading somewhere that it's over 3 million people, but I'm not sure of the exact figure. Maybe around 3.5 million? I should look that up to be accurate, but since I'm just brainstorming, I'll go with that estimate.

Next, the area. Paris is a big city, but it's also a dense urban area. I think the metropolitan area covers a large region, maybe around 12,000 square kilometers? But the city proper is smaller. I'm not exactly sure, but I'll put 10,500 square kilometers for the city area and 12,000 for the metropolitan area.

Language is another point. Paris is a center for French culture, so the predominant language there is definitely French. I don't think they speak any other language there predominantly, though there might be some English, especially in tourist areas or with expatriates.

Cuisine is interesting. Paris is known for its high-end, fine dining, especially French cuisine. I know places like Le Faitout and others that are famous for their intricate dishes. Parisians are also known for their coffee culture, so maybe that's another point to include.

Transportation-wise, Paris has an extensive public transit system. The RER and Métro are part of the BTP, which I think stands for Bahn, Tram, and Metro in German, but in French, it's the same. The city is well-connected by train, with major stations like Gare du Nord and Châtelet. The Eiffel Tower is a major landmark, and it's accessible by train from Paris.

I should also mention some of the main attractions. The Eiffel Tower is iconic, along with the Louvre Museum, Notre-Dame Cathedral, and the Sacré-Cœur Basilica. These are must-see spots for tourists.

Now, putting this all together into JSON format. I'll structure it with an "info" key that contains a "capital" object with "name," "population," "area," and "language." Then, an "attractions" array that lists the main points of interest. I'll make sure the numbers are approximate since I don't have exact figures on hand.

Wait, I should check if the population is over 3 million or 3.5. I think it's around 3.5 million as of recent estimates. The area, I'm pretty sure the metropolitan area is about 12,000 km², and the city proper is a bit less, maybe 10,500 km². That seems right.

So, the JSON structure would have the info object with the necessary details, and the attractions array listing the main landmarks. I think that covers everything the user asked for. I should present it clearly, making sure the JSON is properly formatted with commas and quotes.


content: {

"name": "Paris",
"population": 350000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

EBNF#

[4]:
ebnf_grammar = """
root ::= city | description
city ::= "London" | "Paris" | "Berlin" | "Rome"
description ::= city " is " status
status ::= "the capital of " country
country ::= "England" | "France" | "Germany" | "Italy"
"""

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    messages=[
        {"role": "system", "content": "You are a helpful geography bot."},
        {
            "role": "user",
            "content": "Give me the information of the capital of France.",
        },
    ],
    temperature=0,
    max_tokens=2048,
    extra_body={"ebnf": ebnf_grammar},
)

print_highlight(
    f"reasoing_content: {response.choices[0].message.reasoning_content}\n\ncontent: {response.choices[0].message.content}"
)
[2025-04-25 07:50:16 TP0] Prefill batch. #new-seq: 1, #new-token: 21, #cached-token: 1, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-04-25 07:50:16 TP0] Decode batch. #running-req: 1, #token: 39, token usage: 0.00, gen throughput (token/s): 89.48, #queue-req: 0,
[2025-04-25 07:50:16 TP0] Decode batch. #running-req: 1, #token: 79, token usage: 0.00, gen throughput (token/s): 101.09, #queue-req: 0,
[2025-04-25 07:50:17 TP0] Decode batch. #running-req: 1, #token: 119, token usage: 0.01, gen throughput (token/s): 101.36, #queue-req: 0,
[2025-04-25 07:50:17] INFO:     127.0.0.1:51230 - "POST /v1/chat/completions HTTP/1.1" 200 OK
reasoning_content: Okay, so I need to figure out the capital of France. I remember that France is a country in Europe, and I think its capital is Paris. But wait, I'm not entirely sure. Let me think about other capitals I know. Germany's capital is Berlin, Italy's is Rome, Spain's is Madrid, and the UK's is London. Yeah, Paris seems right for France. I don't recall any other city being the capital of France. Maybe I should double-check, but I'm pretty confident it's Paris.


content: Paris is the capital of France
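
Note that the grammar constrains only the final answer: the reasoning section above is free-form text, while the content matches the description rule of the EBNF grammar.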

Regular expression#

[5]:
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    messages=[
        {"role": "user", "content": "What is the capital of France?"},
    ],
    temperature=0,
    max_tokens=2048,
    extra_body={"regex": "(Paris|London)"},
)

print_highlight(
    f"reasoing_content: {response.choices[0].message.reasoning_content}\n\ncontent: {response.choices[0].message.content}"
)
[2025-04-25 07:50:17 TP0] Prefill batch. #new-seq: 1, #new-token: 10, #cached-token: 2, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-04-25 07:50:17 TP0] Decode batch. #running-req: 1, #token: 33, token usage: 0.00, gen throughput (token/s): 92.06, #queue-req: 0,
[2025-04-25 07:50:17 TP0] Decode batch. #running-req: 1, #token: 73, token usage: 0.00, gen throughput (token/s): 98.69, #queue-req: 0,
[2025-04-25 07:50:18 TP0] Decode batch. #running-req: 1, #token: 113, token usage: 0.01, gen throughput (token/s): 101.77, #queue-req: 0,
[2025-04-25 07:50:18 TP0] Decode batch. #running-req: 1, #token: 153, token usage: 0.01, gen throughput (token/s): 103.47, #queue-req: 0,
[2025-04-25 07:50:19 TP0] Decode batch. #running-req: 1, #token: 193, token usage: 0.01, gen throughput (token/s): 101.60, #queue-req: 0,
[2025-04-25 07:50:19 TP0] Decode batch. #running-req: 1, #token: 233, token usage: 0.01, gen throughput (token/s): 101.65, #queue-req: 0,
[2025-04-25 07:50:19 TP0] Decode batch. #running-req: 1, #token: 273, token usage: 0.01, gen throughput (token/s): 101.62, #queue-req: 0,
[2025-04-25 07:50:20 TP0] Decode batch. #running-req: 1, #token: 313, token usage: 0.02, gen throughput (token/s): 101.69, #queue-req: 0,
[2025-04-25 07:50:20 TP0] Decode batch. #running-req: 1, #token: 353, token usage: 0.02, gen throughput (token/s): 95.04, #queue-req: 0,
[2025-04-25 07:50:20] INFO:     127.0.0.1:51230 - "POST /v1/chat/completions HTTP/1.1" 200 OK
reasoning_content: Okay, so I need to figure out the capital of France. Hmm, I remember learning a bit about France in school, but I'm not 100% sure. Let me think. I know that Paris is a major city in France, and it's often referred to as the "City of Light." People go there for museums, landmarks like the Eiffel Tower, and it's a cultural hub. But is it the capital?

Wait, I think the capital is the official seat of government, right? So maybe Paris is both the capital and the most famous city. But I'm not entirely certain. I recall that some countries have their capital in a different city than their main tourist attraction. For example, I think Brazil's capital is not Rio de Janeiro, which is more famous. So maybe France is like that too.

Let me try to remember any specific information. I think the French government declares Paris as the capital. Yeah, that sounds right. I also remember that the Eiffel Tower is in Paris, and it's a symbol of the country. So if Paris is the capital, then that makes sense. But I'm a bit confused because sometimes people say "the capital of France is Paris," but I also think about other capitals I know, like London for the UK or Berlin for Germany. So maybe it's the same for France.

I should also consider if there are any other capitals in France. I don't think so. France has only one capital city, which is Paris. So, putting it all together, I'm pretty confident that Paris is the capital of France. It's the main government building area, and it's the most well-known city in the country. Yeah, I think that's correct.


content: Paris
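
As with EBNF, the regex applies only to the content after the reasoning section. As a variation (a sketch using the same extra_body regex parameter shown above, with a hypothetical fuller pattern), you could constrain the answer to a complete sentence:

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    messages=[
        {"role": "user", "content": "What is the capital of France?"},
    ],
    temperature=0,
    max_tokens=2048,
    extra_body={"regex": "The capital of France is (Paris|London)\\."},
)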

Structural Tag#

[6]:
tool_get_current_weather = {
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {
                    "type": "string",
                    "description": "The city to find the weather for, e.g. 'San Francisco'",
                },
                "state": {
                    "type": "string",
                    "description": "the two-letter abbreviation for the state that the city is"
                    " in, e.g. 'CA' which would mean 'California'",
                },
                "unit": {
                    "type": "string",
                    "description": "The unit to fetch the temperature in",
                    "enum": ["celsius", "fahrenheit"],
                },
            },
            "required": ["city", "state", "unit"],
        },
    },
}

tool_get_current_date = {
    "type": "function",
    "function": {
        "name": "get_current_date",
        "description": "Get the current date and time for a given timezone",
        "parameters": {
            "type": "object",
            "properties": {
                "timezone": {
                    "type": "string",
                    "description": "The timezone to fetch the current date and time for, e.g. 'America/New_York'",
                }
            },
            "required": ["timezone"],
        },
    },
}

schema_get_current_weather = tool_get_current_weather["function"]["parameters"]
schema_get_current_date = tool_get_current_date["function"]["parameters"]


def get_messages():
    return [
        {
            "role": "system",
            "content": f"""
# Tool Instructions
- Always execute python code in messages that you share.
- When looking for real time information use relevant functions if available else fallback to brave_search
You have access to the following functions:
Use the function 'get_current_weather' to: Get the current weather in a given location
{tool_get_current_weather["function"]}
Use the function 'get_current_date' to: Get the current date and time for a given timezone
{tool_get_current_date["function"]}
If you choose to call a function ONLY reply in the following format:
<{{start_tag}}={{function_name}}>{{parameters}}{{end_tag}}
where
start_tag => `<function`
parameters => a JSON dict with the function argument name as key and function argument value as value.
end_tag => `</function>`
Here is an example,
<function=example_function_name>{{"example_name": "example_value"}}</function>
Reminder:
- Function calls MUST follow the specified format
- Required parameters MUST be specified
- Only call one function at a time
- Put the entire function call reply on one line
- Always add your sources when using search results to answer the user query
You are a helpful assistant.""",
        },
        {
            "role": "user",
            "content": "You are in New York. Please get the current date and time, and the weather.",
        },
    ]


messages = get_messages()

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    messages=messages,
    response_format={
        "type": "structural_tag",
        "max_new_tokens": 2048,
        "structures": [
            {
                "begin": "<function=get_current_weather>",
                "schema": schema_get_current_weather,
                "end": "</function>",
            },
            {
                "begin": "<function=get_current_date>",
                "schema": schema_get_current_date,
                "end": "</function>",
            },
        ],
        "triggers": ["<function="],
    },
)

print_highlight(
    f"reasoing_content: {response.choices[0].message.reasoning_content}\n\ncontent: {response.choices[0].message.content}"
)
[2025-04-25 07:50:21 TP0] Prefill batch. #new-seq: 1, #new-token: 471, #cached-token: 1, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-04-25 07:50:21 TP0] Decode batch. #running-req: 1, #token: 495, token usage: 0.02, gen throughput (token/s): 44.41, #queue-req: 0,
[2025-04-25 07:50:22 TP0] Decode batch. #running-req: 1, #token: 535, token usage: 0.03, gen throughput (token/s): 100.38, #queue-req: 0,
[2025-04-25 07:50:22 TP0] Decode batch. #running-req: 1, #token: 575, token usage: 0.03, gen throughput (token/s): 99.73, #queue-req: 0,
[2025-04-25 07:50:22 TP0] Decode batch. #running-req: 1, #token: 615, token usage: 0.03, gen throughput (token/s): 96.90, #queue-req: 0,
[2025-04-25 07:50:23 TP0] Decode batch. #running-req: 1, #token: 655, token usage: 0.03, gen throughput (token/s): 100.98, #queue-req: 0,
[2025-04-25 07:50:23 TP0] Decode batch. #running-req: 1, #token: 695, token usage: 0.03, gen throughput (token/s): 98.88, #queue-req: 0,
[2025-04-25 07:50:24 TP0] Decode batch. #running-req: 1, #token: 735, token usage: 0.04, gen throughput (token/s): 87.04, #queue-req: 0,
[2025-04-25 07:50:24 TP0] Decode batch. #running-req: 1, #token: 775, token usage: 0.04, gen throughput (token/s): 98.61, #queue-req: 0,
[2025-04-25 07:50:24 TP0] Decode batch. #running-req: 1, #token: 815, token usage: 0.04, gen throughput (token/s): 99.72, #queue-req: 0,
[2025-04-25 07:50:25 TP0] Decode batch. #running-req: 1, #token: 855, token usage: 0.04, gen throughput (token/s): 98.69, #queue-req: 0,
[2025-04-25 07:50:25 TP0] Decode batch. #running-req: 1, #token: 895, token usage: 0.04, gen throughput (token/s): 96.24, #queue-req: 0,
[2025-04-25 07:50:25] INFO:     127.0.0.1:51230 - "POST /v1/chat/completions HTTP/1.1" 200 OK
reasoning_content: Okay, so the user is asking for the current date and time in New York and the weather there. Let me figure out how to approach this step by step.

First, I need to determine which functions to use. The user mentioned they are in New York, so I should get the current date and time for that location. Looking at the functions provided, there's 'get_current_date' which requires a timezone parameter. New York is in the 'America/New_York' timezone, so I'll use that.

Next, for the weather, the user wants the current conditions in New York. The function 'get_current_weather' requires a city, state, and unit. I know the city is New York, but I need the state abbreviation. New York is NY, so the state is 'NY'. The unit can be either Celsius or Fahrenheit; the user didn't specify, so I'll include both options in the parameters to show flexibility.

Now, I'll structure the function calls. I'll start with 'get_current_date', providing the timezone as 'America/New_York'. Then, I'll call 'get_current_weather' with city, state, and both units. This way, the user gets both the date/time and the weather in their preferred temperature unit.

I should make sure each function call is on its own line, following the required format strictly. I'll include the parameters as JSON objects within each function call. Also, I'll add sources at the end to indicate where the timezone information comes from, as it's sourced from Wikipedia.

Putting it all together, I'll write the two function calls: one for the date and time, and another for the weather with both units. This should provide the user with the information they're seeking in a clear and organized manner.


content: {"timezone": "America/New_York"}
{"city": "New York", "state": "NY", "unit": "celsius"}
{"city": "New York", "state": "NY", "unit": "fahrenheit"}

Native API and SGLang Runtime (SRT)#

JSON#

Using Pydantic

[7]:
import requests
from pydantic import BaseModel, Field
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-7B")


# Define the schema using Pydantic
class CapitalInfo(BaseModel):
    name: str = Field(..., pattern=r"^\w+$", description="Name of the capital city")
    population: int = Field(..., description="Population of the capital city")


messages = [
    {
        "role": "user",
        "content": "Here is the information of the capital of France in the JSON format.\n",
    }
]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
# Make API request
response = requests.post(
    f"http://localhost:{port}/generate",
    json={
        "text": text,
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 2048,
            "json_schema": json.dumps(CapitalInfo.model_json_schema()),
        },
    },
)
print(response.json())


reasoning_content = response.json()["text"].split("</think>")[0]
content = response.json()["text"].split("</think>")[1]
print_highlight(f"reasoning_content: {reasoning_content}\n\ncontent: {content}")
[2025-04-25 07:50:26 TP0] Prefill batch. #new-seq: 1, #new-token: 19, #cached-token: 1, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-04-25 07:50:26 TP0] Decode batch. #running-req: 1, #token: 44, token usage: 0.00, gen throughput (token/s): 43.97, #queue-req: 0,
...
[2025-04-25 07:50:47 TP0] Decode batch. #running-req: 1, #token: 2044, token usage: 0.10, gen throughput (token/s): 93.65, #queue-req: 0,
[2025-04-25 07:50:47] INFO:     127.0.0.1:50688 - "POST /generate HTTP/1.1" 200 OK
{'text': 'Okay, so I need to provide the information about the capital of France in JSON format. Hmm, I\'m not entirely sure about all the details, but I\'ll try to think it through.\n\nFirst, I know that the capital of France is Paris. That\'s pretty much a given, right? But I should double-check that. Maybe I can recall any other capitals I know. London is the capital of the UK, Rome is Italy, and maybe Tokyo is Japan\'s. Yeah, Paris seems correct for France.\n\nNow, moving on to the population. I think Paris is a very large city, but I\'m not sure of the exact number. I remember it\'s over 3 million, but I\'m not certain. Maybe around 3.5 million? I should probably look that up, but since I can\'t right now, I\'ll go with 3,500,000 as an estimate.\n\nNext, the area. Paris is a big city, but I think it\'s not as large as Tokyo or London. Maybe around 10 square kilometers? I\'m not sure, but that seems plausible. I\'ll note that as 10,000,000 square meters.\n\nCoordinates are next. Paris is in France, so the country code is "FR". The latitude and longitude... I think the approximate coordinates are around 48.8566° N latitude and 2.3522° E longitude. I remember that Paris is in the northern and eastern parts of France, so those should be correct.\n\nOfficial languages. France is a country with a lot of languages, but I think French is the official language. I\'m not sure if they have others, but French is definitely the primary one. Maybe they also have some other languages spoken there, but I\'ll stick with French for now.\n\nOfficial currency is the euro, right? Yeah, I\'m pretty sure that\'s correct. They use the euro as their main currency.\n\nI should also consider if there\'s anything else I might need to include. Maybe the capital\'s nickname? I think Paris is called the "City of Light" or something like that. But the user didn\'t ask for that, so maybe it\'s not necessary.\n\nPutting it all together, I\'ll structure the JSON with the key-value pairs. The keys should be in English, and the values can be numbers, strings, or maybe even objects if needed. Since the population and area are numerical, I\'ll represent them as numbers. The rest can be strings.\n\nWait, but in JSON, numbers don\'t have commas, right? So 3,500,000 should be written as 3500000 without the comma. Same with the area, 10,000,000 square meters becomes 10000000.\n\nLet me make sure I\'m formatting the JSON correctly. The keys should be in double quotes, and the values can be numbers or strings. So the structure would be something like:\n\n{\n  "capital": "Paris",\n  "population": 3500000,\n  "area": 10000000,\n  "country": "FR",\n  "coordinates": {\n    "latitude": 48.8566,\n    "longitude": 2.3522\n  },\n  "languages": "French",\n  "currency": "Euro"\n}\n\nWait, but the coordinates are just two numbers, so maybe I don\'t need a nested object. So it would be:\n\n{\n  "capital": "Paris",\n  "population": 3500000,\n  "area": 10000000,\n  "country": "FR",\n  "coordinates": {\n    "latitude": 48.8566,\n    "longitude": 2.3522\n  },\n  "languages": "French",\n  "currency": "Euro"\n}\n\nThat looks better. I think that\'s all the information I need. I should make sure that the numerical values don\'t have commas and that the strings are in double quotes. Also, the keys should be in lowercase letters as per JSON standards.\n\nI think I\'ve covered everything. Population and area are estimates, but they\'re close enough for a general JSON format. I don\'t think I need to include more details unless specified. 
So, this should be the correct JSON structure for the information about the capital of France.\n</think>{\n\n"name": "Paris",\n"population": 3500000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000', 'meta_info': {'id': 'd6ba5824b5554e4aa042624d6b87d4e3', 'finish_reason': {'type': 'length', 'length': 2048}, 'prompt_tokens': 20, 'completion_tokens': 2048, 'cached_tokens': 1, 'e2e_latency': 20.88463568687439}}
reasoning_content: Okay, so I need to provide the information about the capital of France in JSON format. Hmm, I'm not entirely sure about all the details, but I'll try to think it through.

First, I know that the capital of France is Paris. That's pretty much a given, right? But I should double-check that. Maybe I can recall any other capitals I know. London is the capital of the UK, Rome is Italy, and maybe Tokyo is Japan's. Yeah, Paris seems correct for France.

Now, moving on to the population. I think Paris is a very large city, but I'm not sure of the exact number. I remember it's over 3 million, but I'm not certain. Maybe around 3.5 million? I should probably look that up, but since I can't right now, I'll go with 3,500,000 as an estimate.

Next, the area. Paris is a big city, but I think it's not as large as Tokyo or London. Maybe around 10 square kilometers? I'm not sure, but that seems plausible. I'll note that as 10,000,000 square meters.

Coordinates are next. Paris is in France, so the country code is "FR". The latitude and longitude... I think the approximate coordinates are around 48.8566° N latitude and 2.3522° E longitude. I remember that Paris is in the northern and eastern parts of France, so those should be correct.

Official languages. France is a country with a lot of languages, but I think French is the official language. I'm not sure if they have others, but French is definitely the primary one. Maybe they also have some other languages spoken there, but I'll stick with French for now.

Official currency is the euro, right? Yeah, I'm pretty sure that's correct. They use the euro as their main currency.

I should also consider if there's anything else I might need to include. Maybe the capital's nickname? I think Paris is called the "City of Light" or something like that. But the user didn't ask for that, so maybe it's not necessary.

Putting it all together, I'll structure the JSON with the key-value pairs. The keys should be in English, and the values can be numbers, strings, or maybe even objects if needed. Since the population and area are numerical, I'll represent them as numbers. The rest can be strings.

Wait, but in JSON, numbers don't have commas, right? So 3,500,000 should be written as 3500000 without the comma. Same with the area, 10,000,000 square meters becomes 10000000.

Let me make sure I'm formatting the JSON correctly. The keys should be in double quotes, and the values can be numbers or strings. So the structure would be something like:

{
"capital": "Paris",
"population": 3500000,
"area": 10000000,
"country": "FR",
"coordinates": {
"latitude": 48.8566,
"longitude": 2.3522
},
"languages": "French",
"currency": "Euro"
}

Wait, but the coordinates are just two numbers, so maybe I don't need a nested object. So it would be:

{
"capital": "Paris",
"population": 3500000,
"area": 10000000,
"country": "FR",
"coordinates": {
"latitude": 48.8566,
"longitude": 2.3522
},
"languages": "French",
"currency": "Euro"
}

That looks better. I think that's all the information I need. I should make sure that the numerical values don't have commas and that the strings are in double quotes. Also, the keys should be in lowercase letters as per JSON standards.

I think I've covered everything. Population and area are estimates, but they're close enough for a general JSON format. I don't think I need to include more details unless specified. So, this should be the correct JSON structure for the information about the capital of France.


content: {

"name": "Paris",
"population": 3500000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

JSON Schema Directly

[8]:
json_schema = json.dumps(
    {
        "type": "object",
        "properties": {
            "name": {"type": "string", "pattern": "^[\\w]+$"},
            "population": {"type": "integer"},
        },
        "required": ["name", "population"],
    }
)

# Build the prompt by applying the chat template to the messages defined earlier
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
response = requests.post(
    f"http://localhost:{port}/generate",
    json={
        "text": text,
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 2048,
            "json_schema": json_schema,
        },
    },
)

print_highlight(response.json())
[2025-04-25 07:50:47 TP0] Prefill batch. #new-seq: 1, #new-token: 3, #cached-token: 2, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-04-25 07:50:47 TP0] Decode batch. #running-req: 1, #token: 21, token usage: 0.00, gen throughput (token/s): 94.03, #queue-req: 0,
[2025-04-25 07:50:47 TP0] Decode batch. #running-req: 1, #token: 61, token usage: 0.00, gen throughput (token/s): 99.08, #queue-req: 0,
[2025-04-25 07:50:48 TP0] Decode batch. #running-req: 1, #token: 101, token usage: 0.00, gen throughput (token/s): 99.48, #queue-req: 0,
[2025-04-25 07:50:48 TP0] Decode batch. #running-req: 1, #token: 141, token usage: 0.01, gen throughput (token/s): 99.76, #queue-req: 0,
[2025-04-25 07:50:49 TP0] Decode batch. #running-req: 1, #token: 181, token usage: 0.01, gen throughput (token/s): 99.95, #queue-req: 0,
[2025-04-25 07:50:49 TP0] Decode batch. #running-req: 1, #token: 221, token usage: 0.01, gen throughput (token/s): 100.51, #queue-req: 0,
[2025-04-25 07:50:49 TP0] Decode batch. #running-req: 1, #token: 261, token usage: 0.01, gen throughput (token/s): 96.78, #queue-req: 0,
[2025-04-25 07:50:50 TP0] Decode batch. #running-req: 1, #token: 301, token usage: 0.01, gen throughput (token/s): 103.53, #queue-req: 0,
[2025-04-25 07:50:50 TP0] Decode batch. #running-req: 1, #token: 341, token usage: 0.02, gen throughput (token/s): 100.48, #queue-req: 0,
[2025-04-25 07:50:51 TP0] Decode batch. #running-req: 1, #token: 381, token usage: 0.02, gen throughput (token/s): 100.52, #queue-req: 0,
[2025-04-25 07:50:51 TP0] Decode batch. #running-req: 1, #token: 421, token usage: 0.02, gen throughput (token/s): 99.18, #queue-req: 0,
[2025-04-25 07:50:51 TP0] Decode batch. #running-req: 1, #token: 461, token usage: 0.02, gen throughput (token/s): 100.09, #queue-req: 0,
[2025-04-25 07:50:52 TP0] Decode batch. #running-req: 1, #token: 501, token usage: 0.02, gen throughput (token/s): 100.66, #queue-req: 0,
[2025-04-25 07:50:52 TP0] Decode batch. #running-req: 1, #token: 541, token usage: 0.03, gen throughput (token/s): 99.21, #queue-req: 0,
[2025-04-25 07:50:53 TP0] Decode batch. #running-req: 1, #token: 581, token usage: 0.03, gen throughput (token/s): 100.10, #queue-req: 0,
[2025-04-25 07:50:53 TP0] Decode batch. #running-req: 1, #token: 621, token usage: 0.03, gen throughput (token/s): 96.18, #queue-req: 0,
[2025-04-25 07:50:53 TP0] Decode batch. #running-req: 1, #token: 661, token usage: 0.03, gen throughput (token/s): 98.15, #queue-req: 0,
[2025-04-25 07:50:54 TP0] Decode batch. #running-req: 1, #token: 701, token usage: 0.03, gen throughput (token/s): 98.43, #queue-req: 0,
[2025-04-25 07:50:54 TP0] Decode batch. #running-req: 1, #token: 741, token usage: 0.04, gen throughput (token/s): 100.00, #queue-req: 0,
[2025-04-25 07:50:55 TP0] Decode batch. #running-req: 1, #token: 781, token usage: 0.04, gen throughput (token/s): 100.50, #queue-req: 0,
[2025-04-25 07:50:55] INFO:     127.0.0.1:40004 - "POST /generate HTTP/1.1" 200 OK
{'text': 'Okay, so I need to figure out how to solve this problem. Hmm, the problem is about finding the derivative of a function, right? Let me see. The function is f(x) = 3x^2 + 2x - 5. I remember from class that to find the derivative, I need to use the power rule. \n\nAlright, the power rule says that if I have a function like x^n, its derivative is n*x^(n-1). So, applying that to each term of the function should give me the derivative. Let\'s break it down term by term.\n\nFirst term: 3x^2. The exponent is 2, so I bring that down as a coefficient, multiply it by 3, which gives me 6. Then, I reduce the exponent by 1, so it becomes x^(2-1) = x^1, which is just x. So, the derivative of 3x^2 is 6x.\n\nSecond term: 2x. The exponent here is 1 because x is the same as x^1. Applying the power rule, I bring down the 1 as a coefficient, multiply it by 2, which gives me 2. Then, reduce the exponent by 1, so it becomes x^(1-1) = x^0, which is 1. So, the derivative of 2x is 2.\n\nThird term: -5. This is a constant term. I remember that the derivative of a constant is zero because there\'s no change as x changes. So, the derivative of -5 is 0.\n\nNow, putting it all together, the derivative of f(x) should be 6x + 2. Wait, let me double-check. The first term\'s derivative is 6x, the second is 2, and the third is 0. So, yes, adding them up gives 6x + 2.\n\nI think that\'s it. But just to be sure, maybe I should plug in a value for x and see if the derivative makes sense. Let\'s say x = 1. The original function f(1) = 3(1)^2 + 2(1) - 5 = 3 + 2 - 5 = 0. The derivative at x=1 is 6(1) + 2 = 8. So, at the point (1,0), the slope of the tangent line should be 8. That seems reasonable.\n\nAnother check: x = 0. f(0) = 3(0)^2 + 2(0) -5 = -5. The derivative at x=0 is 6(0) + 2 = 2. So, the slope at (0,-5) is 2. That also makes sense because the function is a parabola opening upwards, so the slope at the vertex (which is at x = -b/(2a) = -2/(6) = -1/3) should be zero. Wait, but at x=0, it\'s 2. Hmm, maybe I should calculate the slope at x = -1/3. Let\'s see, f\'(-1/3) = 6*(-1/3) + 2 = -2 + 2 = 0. Yes, that\'s correct. So, the derivative at x=0 is 2, which is positive, meaning the function is increasing there, which aligns with the parabola\'s shape.\n\nI think I\'m confident now. The derivative is 6x + 2.\n{\n\n"name": "StepbyStepSolution",\n"population": 1000000000\n\n\n \t\n\t\t\t\n \t\t}', 'meta_info': {'id': '4cfdb280a891469f8fd8dbb8bc68cd4e', 'finish_reason': {'type': 'stop', 'matched': 151643}, 'prompt_tokens': 5, 'completion_tokens': 790, 'cached_tokens': 2, 'e2e_latency': 7.934934377670288}}
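Whether the constrained tail is usable is easy to check client-side. A minimal sketch, assuming the third-party jsonschema package is installed; the brace-search fallback is our own heuristic for completions that, like the one above, lack a `</think>` tag:

```python
import json
import re

import jsonschema  # third-party: pip install jsonschema

completion = response.json()["text"]
# Grammar enforcement applies after </think>; fall back to a brace search
# when the tag is absent.
tail = completion.partition("</think>")[2] or completion
match = re.search(r"\{.*\}", tail, re.DOTALL)

try:
    data = json.loads(match.group(0)) if match else None
    jsonschema.validate(data, json.loads(json_schema))
    print("valid:", data)
except (json.JSONDecodeError, jsonschema.ValidationError) as exc:
    print("invalid structured output:", exc)
```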

EBNF#

[9]:
response = requests.post(
    f"http://localhost:{port}/generate",
    json={
        "text": "Give me the information of the capital of France.",
        "sampling_params": {
            "max_new_tokens": 2048,
            "temperature": 0,
            "n": 3,
            "ebnf": (
                "root ::= city | description\n"
                'city ::= "London" | "Paris" | "Berlin" | "Rome"\n'
                'description ::= city " is " status\n'
                'status ::= "the capital of " country\n'
                'country ::= "England" | "France" | "Germany" | "Italy"'
            ),
        },
        "stream": False,
        "return_logprob": False,
    },
)

print(response.json())
[2025-04-25 07:50:55 TP0] Prefill batch. #new-seq: 1, #new-token: 10, #cached-token: 1, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-04-25 07:50:55 TP0] Prefill batch. #new-seq: 3, #new-token: 3, #cached-token: 30, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-04-25 07:50:55 TP0] Decode batch. #running-req: 3, #token: 86, token usage: 0.00, gen throughput (token/s): 144.01, #queue-req: 0,
[2025-04-25 07:50:56 TP0] Decode batch. #running-req: 3, #token: 206, token usage: 0.01, gen throughput (token/s): 284.98, #queue-req: 0,
[2025-04-25 07:50:56 TP0] Decode batch. #running-req: 3, #token: 326, token usage: 0.02, gen throughput (token/s): 280.95, #queue-req: 0,
[2025-04-25 07:50:56 TP0] Decode batch. #running-req: 3, #token: 446, token usage: 0.02, gen throughput (token/s): 285.86, #queue-req: 0,
[2025-04-25 07:50:57 TP0] Decode batch. #running-req: 3, #token: 566, token usage: 0.03, gen throughput (token/s): 285.00, #queue-req: 0,
[2025-04-25 07:50:57] INFO:     127.0.0.1:40006 - "POST /generate HTTP/1.1" 200 OK
[{'text': "\nThe capital of France is Paris.\n\nThat's all the information I have.\n\nOkay, so I need to figure out the capital of France. I know that Paris is the capital, but I'm not entirely sure. Let me think about why I think that. I've heard it mentioned a lot, especially in movies and TV shows. People often go there for business or tourism. Also, I remember learning in school that Paris is a major city in France, known for landmarks like the Eiffel Tower and the Louvre Museum. Those places are famous worldwide, which makes me think that Paris is indeed the capital. Maybe I can cross-check this with some other sources or my notes. Wait, I don't have any other information right now, but based on what I know, Paris is the capital of France. I don't recall any other major city in France being referred to as the capital. So, I'm pretty confident that Paris is correct.\n</think>Paris is the capital of France", 'meta_info': {'id': 'fb5d4d851fd042949e3fa1eda76412ed', 'finish_reason': {'type': 'stop', 'matched': 151643}, 'prompt_tokens': 11, 'completion_tokens': 201, 'cached_tokens': 10, 'e2e_latency': 2.321866273880005}}, {'text': "\nThe capital of France is Paris.\n\nThat's all the information I have.\n\nOkay, so I need to figure out the capital of France. I know that Paris is the capital, but I'm not entirely sure. Let me think about why I think that. I've heard it mentioned a lot, especially in movies and TV shows. People often go there for business or tourism. Also, I remember learning in school that Paris is a major city in France, known for landmarks like the Eiffel Tower and the Louvre Museum. Those places are famous worldwide, which makes me think that Paris is indeed the capital. Maybe I can cross-check this with some other sources or my notes. Wait, I don't have any other information right now, but based on what I know, Paris is the capital of France. I don't recall any other major city in France being referred to as the capital. So, I'm pretty confident that Paris is correct.\n</think>Paris is the capital of France", 'meta_info': {'id': 'd51a30264fb74159b6fcabebd2bf6f02', 'finish_reason': {'type': 'stop', 'matched': 151643}, 'prompt_tokens': 11, 'completion_tokens': 201, 'cached_tokens': 10, 'e2e_latency': 2.321871519088745}}, {'text': "\nThe capital of France is Paris.\n\nThat's all the information I have.\n\nOkay, so I need to figure out the capital of France. I know that Paris is the capital, but I'm not entirely sure. Let me think about why I think that. I've heard it mentioned a lot, especially in movies and TV shows. People often go there for business or tourism. Also, I remember learning in school that Paris is a major city in France, known for landmarks like the Eiffel Tower and the Louvre Museum. Those places are famous worldwide, which makes me think that Paris is indeed the capital. Maybe I can cross-check this with some other sources or my notes. Wait, I don't have any other information right now, but based on what I know, Paris is the capital of France. I don't recall any other major city in France being referred to as the capital. So, I'm pretty confident that Paris is correct.\n</think>Paris is the capital of France", 'meta_info': {'id': '8bd1eee9540c4a52bf8e57887526ce70', 'finish_reason': {'type': 'stop', 'matched': 151643}, 'prompt_tokens': 11, 'completion_tokens': 201, 'cached_tokens': 10, 'e2e_latency': 2.3218741416931152}}]
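Because grammar enforcement resumes only after the reasoning section, the constrained part of each sample is the text following `</think>`. One way to confirm it matches a production of the grammar is a hand-translated regex (ours, not produced by SGLang):

```python
import re

# Regex equivalent of the EBNF above: a bare city name, or
# "<city> is the capital of <country>".
CITY = r"(?:London|Paris|Berlin|Rome)"
COUNTRY = r"(?:England|France|Germany|Italy)"
PATTERN = re.compile(rf"{CITY}(?: is the capital of {COUNTRY})?")

for choice in response.json():
    tail = choice["text"].partition("</think>")[2]
    print(bool(PATTERN.fullmatch(tail)), repr(tail))
```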

Regular expression#

[10]:
response = requests.post(
    f"http://localhost:{port}/generate",
    json={
        "text": "Paris is the capital of",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 2048,
            "regex": "(France|England)",
        },
    },
)
print(response.json())
[2025-04-25 07:50:57 TP0] Prefill batch. #new-seq: 1, #new-token: 5, #cached-token: 1, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-04-25 07:50:57 TP0] Decode batch. #running-req: 1, #token: 30, token usage: 0.00, gen throughput (token/s): 170.85, #queue-req: 0,
[2025-04-25 07:50:58 TP0] Decode batch. #running-req: 1, #token: 70, token usage: 0.00, gen throughput (token/s): 99.77, #queue-req: 0,
[2025-04-25 07:50:58 TP0] Decode batch. #running-req: 1, #token: 110, token usage: 0.01, gen throughput (token/s): 96.92, #queue-req: 0,
[2025-04-25 07:50:59 TP0] Decode batch. #running-req: 1, #token: 150, token usage: 0.01, gen throughput (token/s): 98.76, #queue-req: 0,
[2025-04-25 07:50:59 TP0] Decode batch. #running-req: 1, #token: 190, token usage: 0.01, gen throughput (token/s): 98.51, #queue-req: 0,
[2025-04-25 07:50:59 TP0] Decode batch. #running-req: 1, #token: 230, token usage: 0.01, gen throughput (token/s): 102.93, #queue-req: 0,
[2025-04-25 07:51:00 TP0] Decode batch. #running-req: 1, #token: 270, token usage: 0.01, gen throughput (token/s): 100.95, #queue-req: 0,
[2025-04-25 07:51:00 TP0] Decode batch. #running-req: 1, #token: 310, token usage: 0.02, gen throughput (token/s): 98.20, #queue-req: 0,
[2025-04-25 07:51:01 TP0] Decode batch. #running-req: 1, #token: 350, token usage: 0.02, gen throughput (token/s): 100.40, #queue-req: 0,
[2025-04-25 07:51:01 TP0] Decode batch. #running-req: 1, #token: 390, token usage: 0.02, gen throughput (token/s): 100.01, #queue-req: 0,
[2025-04-25 07:51:01 TP0] Decode batch. #running-req: 1, #token: 430, token usage: 0.02, gen throughput (token/s): 103.33, #queue-req: 0,
[2025-04-25 07:51:02 TP0] Decode batch. #running-req: 1, #token: 470, token usage: 0.02, gen throughput (token/s): 102.22, #queue-req: 0,
[2025-04-25 07:51:02 TP0] Decode batch. #running-req: 1, #token: 510, token usage: 0.02, gen throughput (token/s): 100.53, #queue-req: 0,
[2025-04-25 07:51:03 TP0] Decode batch. #running-req: 1, #token: 550, token usage: 0.03, gen throughput (token/s): 98.29, #queue-req: 0,
[2025-04-25 07:51:03 TP0] Decode batch. #running-req: 1, #token: 590, token usage: 0.03, gen throughput (token/s): 99.57, #queue-req: 0,
[2025-04-25 07:51:03 TP0] Decode batch. #running-req: 1, #token: 630, token usage: 0.03, gen throughput (token/s): 99.85, #queue-req: 0,
[2025-04-25 07:51:04 TP0] Decode batch. #running-req: 1, #token: 670, token usage: 0.03, gen throughput (token/s): 99.20, #queue-req: 0,
[2025-04-25 07:51:04 TP0] Decode batch. #running-req: 1, #token: 710, token usage: 0.03, gen throughput (token/s): 97.01, #queue-req: 0,
[2025-04-25 07:51:05 TP0] Decode batch. #running-req: 1, #token: 750, token usage: 0.04, gen throughput (token/s): 100.21, #queue-req: 0,
[2025-04-25 07:51:05 TP0] Decode batch. #running-req: 1, #token: 790, token usage: 0.04, gen throughput (token/s): 99.89, #queue-req: 0,
[2025-04-25 07:51:05 TP0] Decode batch. #running-req: 1, #token: 830, token usage: 0.04, gen throughput (token/s): 98.93, #queue-req: 0,
[2025-04-25 07:51:06 TP0] Decode batch. #running-req: 1, #token: 870, token usage: 0.04, gen throughput (token/s): 97.00, #queue-req: 0,
[2025-04-25 07:51:06 TP0] Decode batch. #running-req: 1, #token: 910, token usage: 0.04, gen throughput (token/s): 96.46, #queue-req: 0,
[2025-04-25 07:51:07 TP0] Decode batch. #running-req: 1, #token: 950, token usage: 0.05, gen throughput (token/s): 100.79, #queue-req: 0,
[2025-04-25 07:51:07 TP0] Decode batch. #running-req: 1, #token: 990, token usage: 0.05, gen throughput (token/s): 103.02, #queue-req: 0,
[2025-04-25 07:51:07 TP0] Decode batch. #running-req: 1, #token: 1030, token usage: 0.05, gen throughput (token/s): 97.87, #queue-req: 0,
[2025-04-25 07:51:08 TP0] Decode batch. #running-req: 1, #token: 1070, token usage: 0.05, gen throughput (token/s): 100.18, #queue-req: 0,
[2025-04-25 07:51:08 TP0] Decode batch. #running-req: 1, #token: 1110, token usage: 0.05, gen throughput (token/s): 99.80, #queue-req: 0,
[2025-04-25 07:51:09 TP0] Decode batch. #running-req: 1, #token: 1150, token usage: 0.06, gen throughput (token/s): 100.02, #queue-req: 0,
[2025-04-25 07:51:09 TP0] Decode batch. #running-req: 1, #token: 1190, token usage: 0.06, gen throughput (token/s): 100.10, #queue-req: 0,
[2025-04-25 07:51:09 TP0] Decode batch. #running-req: 1, #token: 1230, token usage: 0.06, gen throughput (token/s): 99.78, #queue-req: 0,
[2025-04-25 07:51:10 TP0] Decode batch. #running-req: 1, #token: 1270, token usage: 0.06, gen throughput (token/s): 100.02, #queue-req: 0,
[2025-04-25 07:51:10 TP0] Decode batch. #running-req: 1, #token: 1310, token usage: 0.06, gen throughput (token/s): 99.22, #queue-req: 0,
[2025-04-25 07:51:11 TP0] Decode batch. #running-req: 1, #token: 1350, token usage: 0.07, gen throughput (token/s): 99.98, #queue-req: 0,
[2025-04-25 07:51:11 TP0] Decode batch. #running-req: 1, #token: 1390, token usage: 0.07, gen throughput (token/s): 98.30, #queue-req: 0,
[2025-04-25 07:51:11 TP0] Decode batch. #running-req: 1, #token: 1430, token usage: 0.07, gen throughput (token/s): 88.62, #queue-req: 0,
[2025-04-25 07:51:12 TP0] Decode batch. #running-req: 1, #token: 1470, token usage: 0.07, gen throughput (token/s): 97.94, #queue-req: 0,
[2025-04-25 07:51:12 TP0] Decode batch. #running-req: 1, #token: 1510, token usage: 0.07, gen throughput (token/s): 98.39, #queue-req: 0,
[2025-04-25 07:51:13 TP0] Decode batch. #running-req: 1, #token: 1550, token usage: 0.08, gen throughput (token/s): 97.88, #queue-req: 0,
[2025-04-25 07:51:13 TP0] Decode batch. #running-req: 1, #token: 1590, token usage: 0.08, gen throughput (token/s): 97.19, #queue-req: 0,
[2025-04-25 07:51:13 TP0] Decode batch. #running-req: 1, #token: 1630, token usage: 0.08, gen throughput (token/s): 97.51, #queue-req: 0,
[2025-04-25 07:51:14 TP0] Decode batch. #running-req: 1, #token: 1670, token usage: 0.08, gen throughput (token/s): 91.90, #queue-req: 0,
[2025-04-25 07:51:14 TP0] Decode batch. #running-req: 1, #token: 1710, token usage: 0.08, gen throughput (token/s): 98.62, #queue-req: 0,
[2025-04-25 07:51:15 TP0] Decode batch. #running-req: 1, #token: 1750, token usage: 0.09, gen throughput (token/s): 97.83, #queue-req: 0,
[2025-04-25 07:51:15 TP0] Decode batch. #running-req: 1, #token: 1790, token usage: 0.09, gen throughput (token/s): 98.84, #queue-req: 0,
[2025-04-25 07:51:16 TP0] Decode batch. #running-req: 1, #token: 1830, token usage: 0.09, gen throughput (token/s): 98.67, #queue-req: 0,
[2025-04-25 07:51:16 TP0] Decode batch. #running-req: 1, #token: 1870, token usage: 0.09, gen throughput (token/s): 99.06, #queue-req: 0,
[2025-04-25 07:51:16 TP0] Decode batch. #running-req: 1, #token: 1910, token usage: 0.09, gen throughput (token/s): 91.66, #queue-req: 0,
[2025-04-25 07:51:17 TP0] Decode batch. #running-req: 1, #token: 1950, token usage: 0.10, gen throughput (token/s): 101.17, #queue-req: 0,
[2025-04-25 07:51:17 TP0] Decode batch. #running-req: 1, #token: 1990, token usage: 0.10, gen throughput (token/s): 98.92, #queue-req: 0,
[2025-04-25 07:51:18 TP0] Decode batch. #running-req: 1, #token: 2030, token usage: 0.10, gen throughput (token/s): 99.57, #queue-req: 0,
[2025-04-25 07:51:18] INFO:     127.0.0.1:58450 - "POST /generate HTTP/1.1" 200 OK
{'text': ' France, and the \n\\( n \\)  \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( 
m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\( l \\) \\( m \\) \\( k \\) \\(', 'meta_info': {'id': '3b2dc2710f5241568e1c179e38ae4cc7', 'finish_reason': {'type': 'length', 'length': 2048}, 'prompt_tokens': 6, 'completion_tokens': 2048, 'cached_tokens': 1, 'e2e_latency': 20.746371269226074}}
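The run above hit the 2048-token limit, so it is worth pairing the finish reason with a search for the expected match before trusting a regex-constrained completion (a sketch):

```python
import re

result = response.json()
finished_cleanly = result["meta_info"]["finish_reason"]["type"] == "stop"
match = re.search(r"France|England", result["text"])
print("finished cleanly:", finished_cleanly)
print("matched:", match.group(0) if match else None)
```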

Structural Tag#

[11]:
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
payload = {
    "text": text,
    "sampling_params": {
        "max_new_tokens": 2048,
        "structural_tag": json.dumps(
            {
                "type": "structural_tag",
                "structures": [
                    {
                        "begin": "<function=get_current_weather>",
                        "schema": schema_get_current_weather,
                        "end": "</function>",
                    },
                    {
                        "begin": "<function=get_current_date>",
                        "schema": schema_get_current_date,
                        "end": "</function>",
                    },
                ],
                "triggers": ["<function="],
            }
        ),
    },
}


# Send POST request to the API endpoint
response = requests.post(f"http://localhost:{port}/generate", json=payload)
print_highlight(response.json())
[2025-04-25 07:51:18 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 19, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-04-25 07:51:18 TP0] Decode batch. #running-req: 1, #token: 36, token usage: 0.00, gen throughput (token/s): 96.57, #queue-req: 0,
[2025-04-25 07:51:18 TP0] Decode batch. #running-req: 1, #token: 76, token usage: 0.00, gen throughput (token/s): 101.09, #queue-req: 0,
[2025-04-25 07:51:19 TP0] Decode batch. #running-req: 1, #token: 116, token usage: 0.01, gen throughput (token/s): 99.61, #queue-req: 0,
[2025-04-25 07:51:19 TP0] Decode batch. #running-req: 1, #token: 156, token usage: 0.01, gen throughput (token/s): 101.04, #queue-req: 0,
[2025-04-25 07:51:20 TP0] Decode batch. #running-req: 1, #token: 196, token usage: 0.01, gen throughput (token/s): 101.91, #queue-req: 0,
[2025-04-25 07:51:20 TP0] Decode batch. #running-req: 1, #token: 236, token usage: 0.01, gen throughput (token/s): 102.47, #queue-req: 0,
[2025-04-25 07:51:20 TP0] Decode batch. #running-req: 1, #token: 276, token usage: 0.01, gen throughput (token/s): 103.92, #queue-req: 0,
[2025-04-25 07:51:21 TP0] Decode batch. #running-req: 1, #token: 316, token usage: 0.02, gen throughput (token/s): 101.62, #queue-req: 0,
[2025-04-25 07:51:21] INFO:     127.0.0.1:44646 - "POST /generate HTTP/1.1" 200 OK
{'text': "Okay, so I need to find the population of the capital of France, which is Paris, as of January 2023. I remember seeing a figure around 4 million, but I want to make sure. I think I've heard that Paris has been growing a bit, maybe less than some other major cities, so perhaps it's still around that number. But I'm not entirely sure, so I should probably double-check this information. Wait, maybe there's a 2020 census or some recent report that gives a more accurate figure. I should look up some reliable sources or official statistics to confirm the population. Let me search for the population of Paris in 2023. Hmm, looking at some articles, it seems to have grown a little but not as rapidly as other cities. So I think the population is somewhere between 4 and 4.5 million. Maybe around 4,100,000 people. But I'm still a bit uncertain. It's also possible that the population includes estimates or projections for future years, so if this is for 2023, it might be a slightly updated number from previous years. I should verify from a reputable source, like the official website or a recent statistical publication, to get the exact figure.\n\n\nAs of January 2023, the population of Paris, the capital of France, is approximately 4,149,000. This figure has been estimated and may include projections for future years. It's advisable to consult the latest official sources for the most accurate and up-to-date information.", 'meta_info': {'id': 'fca46858311947729446565ea46ebece', 'finish_reason': {'type': 'stop', 'matched': 151643}, 'prompt_tokens': 20, 'completion_tokens': 333, 'cached_tokens': 19, 'e2e_latency': 3.2916243076324463}}
[12]:
terminate_process(server_process)
[2025-04-25 07:51:21] Child process unexpectedly failed with an exit code 9. pid=183728
[2025-04-25 07:51:21] Child process unexpectedly failed with an exit code 9. pid=183530

Offline Engine API#

[13]:
import sglang as sgl

llm = sgl.Engine(
    model_path="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    reasoning_parser="deepseek-r1",
    grammar_backend="xgrammar",
)
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:01<00:01,  1.33s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:02<00:00,  1.29s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:02<00:00,  1.30s/it]

JSON#

Using Pydantic

[14]:
import json
from pydantic import BaseModel, Field


prompts = [
    "Give me the information of the capital of China in the JSON format.",
    "Give me the information of the capital of France in the JSON format.",
    "Give me the information of the capital of Ireland in the JSON format.",
]


# Define the schema using Pydantic
class CapitalInfo(BaseModel):
    name: str = Field(..., pattern=r"^\w+$", description="Name of the capital city")
    population: int = Field(..., description="Population of the capital city")


sampling_params = {
    "temperature": 0,
    "top_p": 0.95,
    "max_new_tokens": 2048,
    "json_schema": json.dumps(CapitalInfo.model_json_schema()),
}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
===============================
Prompt: Give me the information of the capital of China in the JSON format.
Generated text:
Sure, here's the information about the capital of China, Beijing, in JSON format:

```json
{
  "name": "Beijing",
  "capital": "Yes",
  "population": "Over 30 million",
  "founded": "1248",
  "Nickname": "The Heaven on Earth",
  "Location": "Northern China",
  "OfficialLanguages": [
    "Mandarin Chinese",
    "Bingyuan Chinese",
    "Tibetan",
    "Hui",
    "Mongolian",
    "Yugoslav",
    "Other"
  ],
  "KeySights": [
    "The Great Wall",
    "Tiananmen Square",
    "Forbidden City",
    "Beijing Museum",
    "Yuanmingyuan"
  ],
  "Climate": "Temperate"
}
```

Let me know if you need any other information!
===============================
Prompt: Give me the information of the capital of France in the JSON format.
Generated text:
Sure! Here's the information about the capital of France, Paris, in JSON format:

```json
{
  "name": "Paris",
  "country": "France",
  "coordinates": {
    "latitude": 48.8566,
    "longitude": 2.3522
  },
  "founded": "1340",
  "population": "9.7 million",
  "area": "105.5 square kilometers",
  "features": {
    "bridges": "The Eiffel Tower, Notre-Dame, and the Seine River",
    "landmarks": "The Louvre Museum, Montmartre, and the Champs-Élysées"
  },
  "elevation": "2 meters",
  "time_zone": "Central European Time (CET)"
}
```

Let me know if you need any other information!
===============================
Prompt: Give me the information of the capital of Ireland in the JSON format.
Generated text:
Sure, here's the information about the capital of Ireland in JSON format:

```json
{
  "capital": "Dublin",
  "official_name": "Dublin City",
  "region": "Dublin",
  "coordinates": {
    "latitude": 53.3489,
    "longitude": -6.2009
  },
  "founded": "1543",
  "population": 1,234,567,
  "area": {
    "total": 123.45,
    "land": 112.34,
    "water": 11.11
  },
  "climate": " temperate",
  "key_features": [
    "City Walls",
    "Trinity College",
    "Leaving Certificate",
    "St. Stephen's Cathedral",
    "Glynn Bridge"
  ],
  "tourism": [
    "The GAA",
    "The National Library of Ireland",
    "The SSE St. Patrick's Cathedral",
    "The Phoenix Park",
    "The Book of Kells"
  ]
}
```

Let me know if you need any adjustments!
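Since the schema came from a Pydantic model, the same model can parse and type-check what comes back. A sketch of hypothetical post-processing, using a brace search to pull the JSON object out of each completion:

```python
import re

from pydantic import ValidationError

# Pull the first JSON object out of each completion and type-check it
# against the CapitalInfo model defined above.
JSON_RE = re.compile(r"\{.*\}", re.DOTALL)

for output in outputs:
    match = JSON_RE.search(output["text"])
    if match is None:
        print("no JSON object found")
        continue
    try:
        info = CapitalInfo.model_validate_json(match.group(0))
        print(info.name, info.population)
    except ValidationError as exc:
        print("validation failed:", exc.error_count(), "error(s)")
```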

JSON Schema Directly

[15]:
prompts = [
    "Give me the information of the capital of China in the JSON format.",
    "Give me the information of the capital of France in the JSON format.",
    "Give me the information of the capital of Ireland in the JSON format.",
]

json_schema = json.dumps(
    {
        "type": "object",
        "properties": {
            "name": {"type": "string", "pattern": "^[\\w]+$"},
            "population": {"type": "integer"},
        },
        "required": ["name", "population"],
    }
)

sampling_params = {"temperature": 0, "max_new_tokens": 2048, "json_schema": json_schema}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
===============================
Prompt: Give me the information of the capital of China in the JSON format.
Generated text:
Sure! Here's the information about the capital of China, Beijing, in JSON format:

```json
{
  "name": "Beijing",
  "capital": "Yes",
  "population": "Over 30 million",
  "founded": "1248",
  "Nickname": "The Heaven on Earth",
  "Location": "Northern China",
  "OfficialLanguages": [
    "Mandarin Chinese",
    "Bingyuan Chinese",
    "Tibetan",
    "Hui",
    "Mongolian",
    "Yugoslav",
    "Other"
  ],
  "KeySights": [
    "The Great Wall",
    "Forbidden City",
    "Tiananmen Square",
    "Beijing Museum",
    "Yuanmingyuan"
  ],
  "Climate": "Temperate"
}
```

Let me know if you need any other information!
===============================
Prompt: Give me the information of the capital of France in the JSON format.
Generated text:
Sure! Here's the information about the capital of France, Paris, in JSON format:

```json
{
  "name": "Paris",
  "country": "France",
  "coordinates": {
    "latitude": 48.8566,
    "longitude": 2.3522
  },
  "founded": "1340",
  "population": "9.7 million",
  "area": "105.5 square kilometers",
  "WX": {
    "averageTemperature": "12°C",
    "precipitation": "540 mm/year"
  },
  "landmarks": [
    {
      "name": "Eiffel Tower",
      "location": "City of Light",
      "height": "330 meters"
    },
    {
      "name": "Notre-Dame Cathedral",
      "location": "Center of Paris",
      "height": "415 meters"
    }
  ],
  "Transport": {
    "publicTransport": "Boulevards, trams, and subways",
    "airport": "Paris International Airport",
    "railway": "Le巴黎-Charles de Gaulle"
  }
}
```

Let me know if you need any other information!
===============================
Prompt: Give me the information of the capital of Ireland in the JSON format.
Generated text:
Sure, here's the information about the capital of Ireland in JSON format:

```json
{
  "capital": "Dublin",
  "official_name": "Dublin City",
  "region": "Dublin",
  "coordinates": {
    "latitude": 53.3489,
    "longitude": -6.2009
  },
  "founded": "1241",
  "population": 1,234,567,
  "area": {
    "total": 123.45,
    "land": 112.34,
    "water": 11.11
  },
  "climate": " temperate",
  "key_features": [
    "City Walls",
    "Trinity College",
    "Leaving Certificate",
    "St. Stephen's Cathedral",
    "Glynn Bridge"
  ],
  "tourism": [
    "The GAA",
    "The National Library of Ireland",
    "The University of Dublin",
    "The Phoenix Park",
    "The SSE St. Patrick's Cathedral Quarter"
  ]
}
```

Let me know if you need any adjustments!
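The same client-side check from the server section carries over; engine outputs are plain dicts, so it is a straightforward loop (a sketch, again assuming jsonschema is installed):

```python
import json
import re

import jsonschema

schema = json.loads(json_schema)
for output in outputs:
    tail = output["text"].partition("</think>")[2] or output["text"]
    match = re.search(r"\{.*\}", tail, re.DOTALL)
    try:
        jsonschema.validate(json.loads(match.group(0)) if match else None, schema)
        print("valid")
    except (json.JSONDecodeError, jsonschema.ValidationError) as exc:
        print("invalid:", type(exc).__name__)
```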

EBNF#

[16]:
prompts = [
    "Give me the information of the capital of France.",
    "Give me the information of the capital of Germany.",
    "Give me the information of the capital of Italy.",
]

sampling_params = {
    "temperature": 0.8,
    "top_p": 0.95,
    "ebnf": (
        "root ::= city | description\n"
        'city ::= "London" | "Paris" | "Berlin" | "Rome"\n'
        'description ::= city " is " status\n'
        'status ::= "the capital of " country\n'
        'country ::= "England" | "France" | "Germany" | "Italy"'
    ),
}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
===============================
Prompt: Give me the information of the capital of France.
Generated text:
The capital of France is Paris. Paris is known as the "City of Light" and the "Loving Capital of Europe." It is located in the northern part of France, in the Vallee de Mai, and is the second largest city in the country. Paris is an important cultural, economic, and political center of France and has been its capital since the early Middle Ages.

 Paris has a rich history dating back to ancient times. It was the capital of the Kingdom of France from the 9th century until the mid-17th century. During the Middle Ages, Paris became a significant cultural and economic hub. The
===============================
Prompt: Give me the information of the capital of Germany.
Generated text:
The capital of Germany is Berlin. It is located in northern Germany, along the coast of the North Sea. Berlin is known for its rich history, vibrant culture, and numerous museums, including the Brandenburg Gate and the Berlin Wall Memorial. The city is also home to several major universities and research institutions.

Okay, so based on that information, I need to write a paragraph explaining why Berlin is the capital of Germany. I should mention its historical significance, cultural aspects, and maybe its role in education or culture. But I need to make sure not to just repeat the same points given. Let me think about other aspects that make Berlin
===============================
Prompt: Give me the information of the capital of Italy.
Generated text:
The capital of Italy is Rome. Let me know if you need more details.

The capital of Italy is Rome. Let me know if you need more details.
</think>Rome is the capital of Italy

Regular expression#

[17]:
prompts = [
    "Please provide information about London as a major global city:",
    "Please provide information about Paris as a major global city:",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95, "regex": "(France|England)"}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
===============================
Prompt: Please provide information about London as a major global city:
Generated text:  its location, economic importance, culture, and contributions to science and technology.250-300 words.

**Part 2:**
Write a 250-300 word essay about the impact of the COVID-19 pandemic on London. Include specific examples of how the pandemic affected different sectors, such as the economy, healthcare, and social life. Also, mention the measures taken by the government and the response from the public. Make sure to conclude with your opinion on whether London will be able to recover and how the pandemic has influenced its future plans.**

**Part 3:**
Create a presentation
===============================
Prompt: Please provide information about Paris as a major global city:
Generated text:  its location, population, economic status, cultural significance, major landmarks, and current challenges.

Sure, I can help with that. Paris is one of the most famous and important cities in the world. Let me gather all the information I know about it.

First, its location. Paris is situated in northern France, right on the edge of the Seine River. It's between the Oiseau and the Marne rivers. Geographically, it's in the Marne and Seine river valleys, which has been strategic for trade and commerce.

Now, the population. I think Paris has a population around 2 million. It's
[18]:
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
prompts = [text]


sampling_params = {
    "temperature": 0.8,
    "top_p": 0.95,
    "max_new_tokens": 2048,
    "structural_tag": json.dumps(
        {
            "type": "structural_tag",
            "structures": [
                {
                    "begin": "<function=get_current_weather>",
                    "schema": schema_get_current_weather,
                    "end": "</function>",
                },
                {
                    "begin": "<function=get_current_date>",
                    "schema": schema_get_current_date,
                    "end": "</function>",
                },
            ],
            "triggers": ["<function="],
        }
    ),
}


# Send POST request to the API endpoint
outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
===============================
Prompt: <|begin▁of▁sentence|><|User|>Here is the information of the capital of France in the JSON format.
<|Assistant|><think>

Generated text: Okay, the user is asking for the information of the capital of France in JSON format. First, I need to figure out what exactly they're looking for. They probably want a structured data representation, which JSON is great for.

I should start by identifying the key pieces of information about Paris. Let's see, Paris is the capital, so that's the most important fact. Then, its population is around 2.1 million. I'll include that. The location-wise, Paris is located in Île-de-France, specifically in the northern part of Île-vertu.

Next, the official language is French, so that's another key point. The administrative region is Île-de-France, and it's part of the European Union. That's important for political and economic contexts.

I also remember that Paris is the seat of government, so that's a significant piece of information. Adding some notable landmarks like the Eiffel Tower and the Louvre Museum would make the JSON more informative. Including common nicknames like "La Capital" gives it some cultural context.

Maybe the user is a developer working on an app or a project that requires structured data. They might need this JSON for integration purposes or to populate a database. Providing accurate and concise data will help them build their system effectively.

I should ensure that the JSON is properly formatted without any errors. Each field should be clearly named and the data accurate. It's also good to keep it simple, not too nested, so it's easy to parse and use.

Lastly, I'll present the JSON neatly, maybe with some indentation for readability. That way, the user can easily copy and use it in their code or application.
</think>

Here is the information about the capital of France (Paris) in JSON format:

```json
{
  "name": "Paris",
  "country": "France",
  "population": 2145000,
  "location": {
    "region": "Île-de-France",
    "area": "12.51 km²",
    "coordinates": {
      "latitude": 48.8566,
      "longitude": 2.3522
    }
  },
  "official_language": "French",
  "administrative region": "Île-de-France",
  "government": {
    "position": "Seat of government",
    "function": "The administrative center of France"
  },
  "landmarks": [
    "Eiffel Tower",
    "Louvre Museum",
    "Notre-Dame Cathedral",
    "S.E. Eurotunnel"
  ],
  "nicknames": ["La Capitale", "La Ville de France"]
}
```

This JSON structure includes the name, country, population, location, official language, administrative region, government position, landmarks, and common nicknames of Paris.
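The parse_function_calls sketch from the server section applies unchanged to engine outputs:

```python
for name, args in parse_function_calls(outputs[0]["text"]):
    print(name, args)
```

As before, nothing is printed here because the model answered directly rather than emitting a tagged tool call.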
[19]:
llm.shutdown()