Tool and Function Calling#

This guide demonstrates how to use SGLang’s Funcion calling functionality.

OpenAI Compatible API#

Launching the Server#

[1]:
from openai import OpenAI
import json
from sglang.utils import wait_for_server, print_highlight, terminate_process
from sglang.test.test_utils import is_in_ci

if is_in_ci():
    from patch import launch_server_cmd
else:
    from sglang.utils import launch_server_cmd


server_process, port = launch_server_cmd(
    "python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --tool-call-parser llama3 --host 0.0.0.0"  # llama3
)
wait_for_server(f"http://localhost:{port}")
[2025-02-23 08:35:46] server_args=ServerArgs(model_path='meta-llama/Meta-Llama-3.1-8B-Instruct', tokenizer_path='meta-llama/Meta-Llama-3.1-8B-Instruct', tokenizer_mode='auto', load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, quantization=None, context_length=None, device='cuda', served_model_name='meta-llama/Meta-Llama-3.1-8B-Instruct', chat_template=None, is_embedding=False, revision=None, skip_tokenizer_init=False, host='0.0.0.0', port=34913, mem_fraction_static=0.88, max_running_requests=200, max_total_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, cpu_offload_gb=0, prefill_only_one_req=False, tp_size=1, stream_interval=1, stream_output=False, random_seed=910824835, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_pth='sglang_storage', enable_cache_report=False, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='outlines', speculative_draft_model_path=None, speculative_algorithm=None, speculative_num_steps=5, speculative_num_draft_tokens=64, speculative_eagle_topk=8, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_jump_forward=False, disable_cuda_graph=True, disable_cuda_graph_padding=False, enable_nccl_nvls=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_ep_moe=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, return_hidden_states=False, enable_custom_logit_processor=False, tool_call_parser='llama3', enable_hierarchical_cache=False, enable_flashinfer_mla=False)
[2025-02-23 08:36:04 TP0] Init torch distributed begin.
[2025-02-23 08:36:04 TP0] Load weight begin. avail mem=78.81 GB
[2025-02-23 08:36:05 TP0] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:02,  1.06it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.70it/s]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:02<00:00,  1.34it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.18it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.24it/s]

[2025-02-23 08:36:09 TP0] Load weight end. type=LlamaForCausalLM, dtype=torch.bfloat16, avail mem=63.70 GB
[2025-02-23 08:36:09 TP0] KV Cache is allocated. K size: 1.25 GB, V size: 1.25 GB.
[2025-02-23 08:36:09 TP0] Memory pool end. avail mem=60.98 GB
[2025-02-23 08:36:10 TP0] max_total_num_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=200, context_len=131072
[2025-02-23 08:36:10] INFO:     Started server process [1034512]
[2025-02-23 08:36:10] INFO:     Waiting for application startup.
[2025-02-23 08:36:10] INFO:     Application startup complete.
[2025-02-23 08:36:10] INFO:     Uvicorn running on http://0.0.0.0:34913 (Press CTRL+C to quit)
[2025-02-23 08:36:10] INFO:     127.0.0.1:37594 - "GET /v1/models HTTP/1.1" 200 OK
[2025-02-23 08:36:11] INFO:     127.0.0.1:35466 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-02-23 08:36:11 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-02-23 08:36:14] INFO:     127.0.0.1:35482 - "POST /generate HTTP/1.1" 200 OK
[2025-02-23 08:36:14] The server is fired up and ready to roll!


NOTE: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are displayed in the original black color, while the notebook outputs are highlighted in blue.
We are running those notebooks in a CI parallel environment, so the throughput is not representative of the actual performance.

Note that --tool-call-parser defines the parser used to interpret responses. Currently supported parsers include:

  • llama3: Llama 3.1 / 3.2 (e.g. meta-llama/Llama-3.1-8B-Instruct, meta-llama/Llama-3.2-1B-Instruct).

  • mistral: Mistral (e.g. mistralai/Mistral-7B-Instruct-v0.3, mistralai/Mistral-Nemo-Instruct-2407, mistralai/ Mistral-Nemo-Instruct-2407, mistralai/Mistral-7B-v0.3).

  • qwen25: Qwen 2.5 (e.g. Qwen/Qwen2.5-1.5B-Instruct, Qwen/Qwen2.5-7B-Instruct).

Define Tools for Function Call#

Below is a Python snippet that shows how to define a tool as a dictionary. The dictionary includes a tool name, a description, and property defined Parameters.

[2]:
# Define tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {
                        "type": "string",
                        "description": "The city to find the weather for, e.g. 'San Francisco'",
                    },
                    "state": {
                        "type": "string",
                        "description": "the two-letter abbreviation for the state that the city is"
                        " in, e.g. 'CA' which would mean 'California'",
                    },
                    "unit": {
                        "type": "string",
                        "description": "The unit to fetch the temperature in",
                        "enum": ["celsius", "fahrenheit"],
                    },
                },
                "required": ["city", "state", "unit"],
            },
        },
    }
]

Define Messages#

[3]:
def get_messages():
    return [
        {
            "role": "user",
            "content": "What's the weather like in Boston today? Please respond with the format: Today's weather is :{function call result}",
        }
    ]


messages = get_messages()

Initialize the Client#

[4]:
# Initialize OpenAI-like client
client = OpenAI(api_key="None", base_url=f"http://0.0.0.0:{port}/v1")
model_name = client.models.list().data[0].id
[2025-02-23 08:36:15] INFO:     127.0.0.1:35496 - "GET /v1/models HTTP/1.1" 200 OK

Non-Streaming Request#

[5]:
# Non-streaming mode test
response_non_stream = client.chat.completions.create(
    model=model_name,
    messages=messages,
    temperature=0.8,
    top_p=0.8,
    stream=False,  # Non-streaming
    tools=tools,
)
print_highlight("Non-stream response:")
print(response_non_stream)
[2025-02-23 08:36:15 TP0] Prefill batch. #new-seq: 1, #new-token: 302, #cached-token: 1, cache hit rate: 0.32%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-02-23 08:36:16 TP0] Decode batch. #running-req: 0, #token: 0, token usage: 0.00, gen throughput (token/s): 6.67, #queue-req: 0
[2025-02-23 08:36:16] INFO:     127.0.0.1:35496 - "POST /v1/chat/completions HTTP/1.1" 200 OK
Non-stream response:
ChatCompletion(id='0b3c5a6efe374690be396130b6677d6e', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content=None, refusal=None, role='assistant', audio=None, function_call=None, tool_calls=[ChatCompletionMessageToolCall(id='0', function=Function(arguments='{"city": "Boston", "state": "MA", "unit": "fahrenheit"}', name='get_current_weather'), type='function')]), matched_stop=128008)], created=1740299776, model='meta-llama/Meta-Llama-3.1-8B-Instruct', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=32, prompt_tokens=303, total_tokens=335, completion_tokens_details=None, prompt_tokens_details=None))

Streaming Request#

[6]:
# Streaming mode test
print_highlight("Streaming response:")
response_stream = client.chat.completions.create(
    model=model_name,
    messages=messages,
    temperature=0.8,
    top_p=0.8,
    stream=True,  # Enable streaming
    tools=tools,
)

chunks = []
for chunk in response_stream:
    chunks.append(chunk)
    if chunk.choices[0].delta.tool_calls:
        print(chunk.choices[0].delta.tool_calls[0])
Streaming response:
[2025-02-23 08:36:16] INFO:     127.0.0.1:35496 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-02-23 08:36:16 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 302, cache hit rate: 49.43%, token usage: 0.01, #running-req: 0, #queue-req: 0
ChoiceDeltaToolCall(index=None, id='0', function=ChoiceDeltaToolCallFunction(arguments='', name='get_current_weather'), type='function')
ChoiceDeltaToolCall(index=None, id='0', function=ChoiceDeltaToolCallFunction(arguments='{"city": "', name=''), type='function')
ChoiceDeltaToolCall(index=None, id='0', function=ChoiceDeltaToolCallFunction(arguments='Boston"', name=''), type='function')
ChoiceDeltaToolCall(index=None, id='0', function=ChoiceDeltaToolCallFunction(arguments=', "state": "', name=''), type='function')
ChoiceDeltaToolCall(index=None, id='0', function=ChoiceDeltaToolCallFunction(arguments='MA"', name=''), type='function')
ChoiceDeltaToolCall(index=None, id='0', function=ChoiceDeltaToolCallFunction(arguments=', "unit": "', name=''), type='function')
ChoiceDeltaToolCall(index=None, id='0', function=ChoiceDeltaToolCallFunction(arguments='f', name=''), type='function')
ChoiceDeltaToolCall(index=None, id='0', function=ChoiceDeltaToolCallFunction(arguments='ahrenheit"}', name=''), type='function')

Handle Tool Calls#

When the engine determines it should call a particular tool, it will return arguments or partial arguments through the response. You can parse these arguments and later invoke the tool accordingly.

Non-Streaming Request

[7]:
name_non_stream = response_non_stream.choices[0].message.tool_calls[0].function.name
arguments_non_stream = (
    response_non_stream.choices[0].message.tool_calls[0].function.arguments
)

print_highlight(f"Final streamed function call name: {name_non_stream}")
print_highlight(f"Final streamed function call arguments: {arguments_non_stream}")
Final streamed function call name: get_current_weather
Final streamed function call arguments: {"city": "Boston", "state": "MA", "unit": "fahrenheit"}

Streaming Request

[8]:
# Parse and combine function call arguments
arguments = []
for chunk in chunks:
    choice = chunk.choices[0]
    delta = choice.delta
    if delta.tool_calls:
        tool_call = delta.tool_calls[0]
        if tool_call.function.name:
            print_highlight(f"Streamed function call name: {tool_call.function.name}")

        if tool_call.function.arguments:
            arguments.append(tool_call.function.arguments)
            print(f"Streamed function call arguments: {tool_call.function.arguments}")

# Combine all fragments into a single JSON string
full_arguments = "".join(arguments)
print_highlight(f"Final streamed function call arguments: {full_arguments}")
Streamed function call name: get_current_weather
Streamed function call arguments: {"city": "
Streamed function call arguments: Boston"
Streamed function call arguments: , "state": "
Streamed function call arguments: MA"
Streamed function call arguments: , "unit": "
Streamed function call arguments: f
Streamed function call arguments: ahrenheit"}
Final streamed function call arguments: {"city": "Boston", "state": "MA", "unit": "fahrenheit"}

Define a Tool Function#

[9]:
# This is a demonstration, define real function according to your usage.
def get_current_weather(city: str, state: str, unit: "str"):
    return (
        f"The weather in {city}, {state} is 85 degrees {unit}. It is "
        "partly cloudly, with highs in the 90's."
    )


available_tools = {"get_current_weather": get_current_weather}

Execute the Tool#

[10]:
call_data = json.loads(full_arguments)

messages.append(
    {
        "role": "user",
        "content": "",
        "tool_calls": {"name": "get_current_weather", "arguments": full_arguments},
    }
)

# Call the corresponding tool function
tool_name = messages[-1]["tool_calls"]["name"]
tool_to_call = available_tools[tool_name]
result = tool_to_call(**call_data)
print_highlight(f"Function call result: {result}")
messages.append({"role": "tool", "content": result, "name": tool_name})

print_highlight(f"Updated message history: {messages}")
Function call result: The weather in Boston, MA is 85 degrees fahrenheit. It is partly cloudly, with highs in the 90's.
Updated message history: [{'role': 'user', 'content': "What's the weather like in Boston today? Please respond with the format: Today's weather is :{function call result}"}, {'role': 'user', 'content': '', 'tool_calls': {'name': 'get_current_weather', 'arguments': '{"city": "Boston", "state": "MA", "unit": "fahrenheit"}'}}, {'role': 'tool', 'content': "The weather in Boston, MA is 85 degrees fahrenheit. It is partly cloudly, with highs in the 90's.", 'name': 'get_current_weather'}]

Send Results Back to Model#

[11]:
final_response = client.chat.completions.create(
    model=model_name,
    messages=messages,
    temperature=0.8,
    top_p=0.8,
    stream=False,
    tools=tools,
)
print_highlight("Non-stream response:")
print(final_response)
[2025-02-23 08:36:16 TP0] Prefill batch. #new-seq: 1, #new-token: 41, #cached-token: 300, cache hit rate: 63.21%, token usage: 0.01, #running-req: 0, #queue-req: 0
[2025-02-23 08:36:16 TP0] Decode batch. #running-req: 1, #token: 350, token usage: 0.02, gen throughput (token/s): 71.76, #queue-req: 0
[2025-02-23 08:36:17] INFO:     127.0.0.1:35496 - "POST /v1/chat/completions HTTP/1.1" 200 OK
Non-stream response:
ChatCompletion(id='131916be5dd54e72a3fa197fe74f0f13', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content=None, refusal=None, role='assistant', audio=None, function_call=None, tool_calls=[ChatCompletionMessageToolCall(id='0', function=Function(arguments='{"city": "Boston", "state": "MA", "unit": "fahrenheit"}', name='get_current_weather'), type='function')]), matched_stop=128008)], created=1740299777, model='meta-llama/Meta-Llama-3.1-8B-Instruct', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=32, prompt_tokens=341, total_tokens=373, completion_tokens_details=None, prompt_tokens_details=None))

Native API and SGLang Runtime (SRT)#

[12]:
from transformers import AutoTokenizer
import requests

# generate an answer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")

messages = get_messages()

input = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    tools=tools,
)

gen_url = f"http://localhost:{port}/generate"
gen_data = {"text": input, "sampling_params": {"skip_special_tokens": False}}
gen_response = requests.post(gen_url, json=gen_data).json()["text"]
print(gen_response)

# parse the response
parse_url = f"http://localhost:{port}/function_call"

function_call_input = {
    "text": gen_response,
    "tool_call_parser": "llama3",
    "tools": tools,
}

function_call_response = requests.post(parse_url, json=function_call_input)
function_call_response_json = function_call_response.json()
print("function name: ", function_call_response_json["calls"][0]["name"])
print("function arguments: ", function_call_response_json["calls"][0]["parameters"])
[2025-02-23 08:36:22 TP0] Prefill batch. #new-seq: 1, #new-token: 317, #cached-token: 1, cache hit rate: 47.48%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-02-23 08:36:22 TP0] Decode batch. #running-req: 1, #token: 335, token usage: 0.02, gen throughput (token/s): 6.58, #queue-req: 0
[2025-02-23 08:36:23] INFO:     127.0.0.1:56736 - "POST /generate HTTP/1.1" 200 OK
<|python_tag|>{"name": "get_current_weather", "parameters": {"unit": "celsius", "state": "MA", "city": "Boston"}}
[2025-02-23 08:36:23] INFO:     127.0.0.1:56744 - "POST /function_call HTTP/1.1" 200 OK
function name:  get_current_weather
function arguments:  {"unit": "celsius", "state": "MA", "city": "Boston"}
[13]:
terminate_process(server_process)

Offline Engine API#

[14]:
import sglang as sgl
from sglang.srt.function_call_parser import FunctionCallParser
from sglang.srt.managers.io_struct import Tool, Function

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")
tokenizer = llm.orchestrator.tokenizer
input_ids = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, tools=tools
)

sampling_params = {
    "max_new_tokens": 128,
    "temperature": 0.3,
    "top_p": 0.95,
    "skip_special_tokens": False,
}

# 1) Offline generation
result = llm.generate(input_ids=input_ids, sampling_params=sampling_params)
generated_text = result["text"]  # Assume there is only one prompt

print("=== Offline Engine Output Text ===")
print(generated_text)


# 2) Parse using FunctionCallParser
def convert_dict_to_tool(tool_dict: dict) -> Tool:
    function_dict = tool_dict.get("function", {})
    return Tool(
        type=tool_dict.get("type", "function"),
        function=Function(
            name=function_dict.get("name"),
            description=function_dict.get("description"),
            parameters=function_dict.get("parameters"),
        ),
    )


tools = [convert_dict_to_tool(raw_tool) for raw_tool in tools]

parser = FunctionCallParser(tools=tools, tool_call_parser="llama3")
normal_text, calls = parser.parse_non_stream(generated_text)

print("\n=== Parsing Result ===")
print("Normal text portion:", normal_text)
print("Function call portion:")
for call in calls:
    # call: ToolCallItem
    print(f"  - tool name: {call.name}")
    print(f"    parameters: {call.parameters}")

# 3) If needed, perform additional logic on the parsed functions, such as automatically calling the corresponding function to obtain a return value, etc.
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:02,  1.01it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.63it/s]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:02<00:00,  1.25it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.06s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.05it/s]

=== Offline Engine Output Text ===
<|python_tag|>{"name": "get_current_weather", "parameters": {"city": "Boston", "state": "MA", "unit": "fahrenheit"}}

=== Parsing Result ===
Normal text portion: <|python_tag|>{"name": "get_current_weather", "parameters": {"city": "Boston", "state": "MA", "unit": "fahrenheit"}}
Function call portion:
  - tool name: get_current_weather
    parameters: {"city": "Boston", "state": "MA", "unit": "fahrenheit"}
[15]:
llm.shutdown()

How to support a new model?#

  1. Update the TOOLS_TAG_LIST in sglang/srt/function_call_parser.py with the model’s tool tags. Currently supported tags include:

TOOLS_TAG_LIST = [
    “<|plugin|>“,
    “<function=“,
    “<tool_call>“,
    “<|python_tag|>“,
    “[TOOL_CALLS]”
]
  1. Create a new detector class in sglang/srt/function_call_parser.py that inherits from BaseFormatDetector. The detector should handle the model’s specific function call format. For example:

class NewModelDetector(BaseFormatDetector):
  1. Add the new detector to the MultiFormatParser class that manages all the format detectors.