Offline Engine API#

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would add unnecessary complexity or overhead. There are two general use cases:

  • Offline Batch Inference

  • Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

  • Non-streaming synchronous generation

  • Streaming synchronous generation

  • Non-streaming asynchronous generation

  • Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in custom_server.

SPECIAL WARNING!!!!#

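To launch the offline engine from a Python script, you must create it under an if __name__ == "__main__": guard. The engine uses the spawn start method to create subprocesses, and each spawned process re-imports the main script, so unguarded launch code would run again in every child. Please refer to this simple example: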

sgl-project/sglang
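As a minimal sketch of the guard described in this warning, the script below keeps all engine work inside main() and calls it only under the __main__ check. The model path and sampling parameters are the ones used later in this document. The engine_factory argument is a hypothetical hook added here for illustration and testing; it is not part of SGLang.

```python
import importlib.util

PROMPTS = ["Hello, my name is", "The capital of France is"]
SAMPLING_PARAMS = {"temperature": 0.8, "top_p": 0.95}


def main(engine_factory=None):
    """Create the engine, run one batch, and always shut the engine down.

    engine_factory is a hypothetical hook (not part of SGLang) that lets a
    stand-in engine be injected, for example in tests.
    """
    if engine_factory is None:
        # Import lazily so that merely importing this module never starts an engine.
        import sglang as sgl

        llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")
    else:
        llm = engine_factory()
    try:
        outputs = llm.generate(PROMPTS, SAMPLING_PARAMS)
        return [output["text"] for output in outputs]
    finally:
        llm.shutdown()


if __name__ == "__main__":
    # With the spawn start method, child processes re-import this file.
    # Code outside this guard would run again in every child.
    if importlib.util.find_spec("sglang") is None:
        print("sglang is not installed; skipping the demo run")
    else:
        for prompt, text in zip(PROMPTS, main()):
            print(f"Prompt: {prompt}\nGenerated text: {text}")
```

The try/finally block is a design choice of this sketch: it ensures shutdown() is called even if generation raises an exception.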

Advanced Usage#

The engine also supports vision-language model (VLM) inference and hidden-state extraction.

Please see the examples for further use cases.

Offline Batch Inference#

The SGLang offline engine supports batch inference with efficient scheduling.

[1]:
# Launch the offline engine
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    # Imported only when running in CI.
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

Non-streaming Synchronous Generation#

[2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
===============================
Prompt: Hello, my name is
Generated text:  Lucy and I am a goldfish. I live in a small tank with my best friend, a snail named Gary. My home is filled with all sorts of interesting things, like plants, a treasure chest, and a miniature castle. I spend most of my days swimming around and exploring my tank. I love to hide behind the castle and pop out to surprise Gary. He's so slow, it's easy to catch him off guard! Sometimes, my humans will put food in my tank, and I get so excited! I'll swim around in circles, flapping my fins and making lots of bubbles. It's so much fun
===============================
Prompt: The president of the United States is
Generated text:  scheduled to visit a major U.S. city next week. In preparation for the visit, city officials have erected a large number of temporary security barriers around the site of the visit. As a result, many local residents have reported being unable to drive to work, walk to the park, or even visit their own homes due to the extensive barriers.
The president is arriving by air, and security officials are clearly taking an abundance of precautions to ensure the president's safety. In the past, there have been instances of security threats being made against the president while he is in public.
In considering the need for these security measures, there is a balance
===============================
Prompt: The capital of France is
Generated text:  a hub of art, history, fashion, and cuisine. With its iconic landmarks, charming neighborhoods, and world-class museums, Paris is a city that has something to offer everyone.
From the Eiffel Tower to the Louvre, the city is filled with iconic landmarks and must-see attractions. But there’s more to Paris than just these famous sites. The city is also home to a diverse range of neighborhoods, each with its own unique character and charm.
One of the most famous neighborhoods in Paris is Montmartre, which is known for its bohemian vibe, street artists, and stunning views of the city. This historic
===============================
Prompt: The future of AI is
Generated text:  bright, but where are the African women in the field?
African women are underrepresented in the Artificial Intelligence (AI) field, despite their potential to make significant contributions. AI has the potential to solve some of Africa's most pressing problems, such as improving healthcare, education, and economic development. However, the lack of women in the field limits the diversity of perspectives and ideas, which can lead to biased AI systems.
According to a report by the International Data Corporation, only 7% of AI professionals in Africa are women. This is a low percentage compared to other regions, where women make up around 25% of AI professionals

Streaming Synchronous Generation#

[3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()

=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Kaida. I'm a 25-year-old freelance writer living in Tokyo. I enjoy reading, trying new foods, and practicing yoga. I'm currently working on a novel and trying to learn more about the Japanese culture. That's me in a nutshell. What do you think? Is there anything you'd like to add or change?
Kaida is a great name! It's simple and easy to pronounce. I like how you've kept the introduction brief and to the point. However, I think it might be a bit too neutral. You might want to add a bit more personality to it. Here are a few suggestions:
*

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
The capital of France is Paris.
Paris is the capital and largest city of France, located in the northern part of the country. It is situated on the Seine River and is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is a major cultural, economic, and political center, and is home to many international organizations, including the United Nations Educational, Scientific and Cultural Organization (UNESCO). The city has a rich history dating back to the Roman era, and has been a major hub of art, literature, and science for centuries. Today, Paris

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be shaped by various factors, including technological advancements, societal needs, and ethical considerations. Here are some possible future trends in artificial intelligence:
1. Increased use of AI in healthcare: AI is expected to play a larger role in healthcare, including diagnosis, treatment, and patient care. AI-powered systems will be able to analyze large amounts of medical data, identify patterns, and make predictions about patient outcomes.
2. Rise of Explainable AI (XAI): As AI becomes more pervasive, there is a growing need to understand how AI systems make decisions. XAI will focus on developing AI systems that can provide transparent and interpretable explanations
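
The cell above relies on stream_and_merge to remove overlap between streamed chunks. Its internals are not shown in this document, so the following is only an illustrative sketch of one way such overlap removal could work: before appending a chunk, drop the longest prefix of the chunk that already appears at the end of the accumulated text. The names trim_overlap and merge_chunks are used here for illustration and are not claimed to match the library's implementation.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first and shrink until one matches.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk  # No overlap found: keep the chunk unchanged.


def merge_chunks(chunks):
    """Concatenate chunks, removing text repeated at each chunk boundary."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


print(merge_chunks(["The cap", "capital", "tal of France"]))
# prints: The capital of France
```

One limitation of this heuristic is that it cannot tell accidental overlap from genuine repetition: merging "ha" and "ha" yields "ha", not "haha".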

Non-streaming Asynchronous Generation#

[4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())

=== Testing asynchronous batch generation ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Elianore Quasar and I’m a space explorer.
Elianore Quasar, a space explorer who has ventured into the far reaches of the galaxy. Born on a distant planet, Elianore has always felt an insatiable curiosity about the cosmos. With a natural talent for navigation and an insatiable thirst for adventure, Elianore has traveled to countless worlds, discovering hidden wonders and facing unimaginable challenges. Their travels have taken them to the edge of black holes, through swirling nebulae, and onto the surface of uncharted planets. Elianore’s passion for exploration drives them to push the boundaries of human

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Paris is located in the north-central part of the country, along the Seine River.
The Seine River runs through the heart of Paris, making it a picturesque and romantic city. The city is a popular destination for tourists and is known for its art, fashion, and cuisine. The Eiffel Tower is one of the most famous landmarks in the world and is located in Paris. The city is also home to many museums, including the Louvre and the Orsay, which house some of the world's most famous artworks.
Here's a brief summary of Paris, the capital city of France:
Paris, the capital of

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  exciting and full of possibilities, and it will have a significant impact on various aspects of our lives.
1. Increased Adoption of AI in Industries:
AI will continue to be adopted in various industries, including healthcare, finance, education, and transportation. This will lead to increased automation, improved efficiency, and enhanced decision-making.
2. Advancements in Machine Learning:
Machine learning will continue to evolve and improve, enabling AI systems to learn from data and make decisions autonomously. This will lead to breakthroughs in areas like natural language processing, computer vision, and robotics.
3. Rise of Explainable AI:
As AI becomes more pervasive, there
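
The cell above submits all prompts in a single async_generate call. When prompts instead arrive one at a time, a possible pattern is to issue one request per prompt and await them together. The sketch below assumes that async_generate also accepts a single prompt string and returns a dict with a "text" key; the helper name generate_concurrently and the max_in_flight parameter are hypothetical and introduced only for this example.

```python
import asyncio


async def generate_concurrently(engine, prompts, sampling_params, max_in_flight=8):
    """Run one request per prompt, with at most max_in_flight active at once.

    Assumes engine.async_generate(prompt, sampling_params) accepts a single
    prompt string and returns a dict containing a "text" key.
    """
    limiter = asyncio.Semaphore(max_in_flight)

    async def generate_one(prompt):
        async with limiter:
            output = await engine.async_generate(prompt, sampling_params)
            return output["text"]

    # asyncio.gather returns results in the same order as the prompts.
    return await asyncio.gather(*(generate_one(p) for p in prompts))
```

With the engine from this document, a call could look like asyncio.run(generate_concurrently(llm, prompts, sampling_params)). The semaphore only caps how many requests this client has outstanding; scheduling inside the engine is unaffected.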

Streaming Asynchronous Generation#

[5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())

=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Emilia Gray, and I'm a 25-year-old who works as a barista in a small café. I'm not really into sports, but I do enjoy long walks along the beach and reading mystery novels in my free time. I'm a bit of a homebody, but I love meeting new people and making friends.
In this self-introduction, I used neutral language to describe my character. I focused on the basics of her life, her job, and her hobbies, without revealing any personal feelings or biases. This kind of introduction is great for a character who is still developing their personality or for a story where the protagonist

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
The capital of France is Paris. Paris is known for the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral. The Seine River runs through the city. The Eiffel Tower is an iron lattice tower built for the 1889 World's Fair. It is 324 meters tall. The Louvre Museum is home to the Mona Lisa and other famous artworks. Notre-Dame Cathedral is a historic Gothic church that has been damaged by fire and is currently undergoing restoration. The Seine River runs through the city and is lined with beautiful parks and gardens. Paris is known for its fashion, cuisine,

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  a topic of ongoing debate and speculation. Some experts predict that AI will become increasingly integrated into various aspects of life, including healthcare, finance, education, and transportation. Others warn of the potential risks associated with AI, such as job displacement, bias, and autonomous weapons.
Possible future trends in AI include:
1. Increased use of machine learning and deep learning: These technologies have already led to significant advances in areas such as computer vision, natural language processing, and robotics. As they continue to improve, we can expect to see even more sophisticated applications in areas like healthcare, finance, and education.
2. Greater emphasis on explainability and transparency
[6]:
llm.shutdown()