Offline Engine API#

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would only add complexity or overhead. There are two general use cases:

  • Offline Batch Inference

  • Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

  • Non-streaming synchronous generation

  • Streaming synchronous generation

  • Non-streaming asynchronous generation

  • Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in custom_server.

SPECIAL WARNING!!!!#

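To launch the offline engine from a Python script, create it under an if __name__ == "__main__": guard. SGLang uses spawn mode to create subprocesses, and each spawned subprocess re-imports the main module, so unguarded top-level code would run again in every subprocess. Please refer to this simple example: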

sgl-project/sglang
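
As a minimal sketch of the guard described above, engine creation can live in a function that is only called when the script is executed directly. The function name run_engine_demo and the importlib availability check are illustrative additions, not part of SGLang; the Engine, generate, and shutdown calls mirror the ones used later in this document.

```python
import importlib.util


def run_engine_demo() -> None:
    # Imported lazily so this module can still be imported where sglang is missing.
    import sglang as sgl

    llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")
    try:
        outputs = llm.generate(
            ["The capital of France is"],
            {"temperature": 0.8, "top_p": 0.95},
        )
        print(outputs[0]["text"])
    finally:
        # Release the engine's subprocesses even if generation fails.
        llm.shutdown()


if __name__ == "__main__":
    # Only the directly executed script reaches this branch; spawned
    # subprocesses re-import the module under a different __name__.
    # Hypothetical convenience check: skip the demo when sglang is absent.
    if importlib.util.find_spec("sglang") is None:
        print("sglang is not installed; skipping the engine demo")
    else:
        run_engine_demo()
```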

Advanced Usage#

The engine supports vision-language model (VLM) inference as well as hidden-state extraction.

Please see the examples for further use cases.

Offline Batch Inference#

The SGLang offline engine supports batch inference with efficient scheduling.

[1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:02,  1.09it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.71it/s]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:02<00:00,  1.39it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.24it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.29it/s]

Non-streaming Synchronous Generation#

[2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
===============================
Prompt: Hello, my name is
Generated text:  Ernesto Enriquez, and I am a 4th year student at the University of the Philippines Diliman. I am currently taking up Bachelor of Science in Economics. I am interested in exploring the ways on how economic policies can affect the lives of ordinary people, particularly in the context of developing countries. I have a passion for writing and storytelling, and I see blogging as an exciting opportunity to express my thoughts and ideas to a wider audience. I am excited to share my perspectives on economic issues and to learn from others as well.
  1. There are over 3.5 million registered bloggers in the Philippines (Based on the
===============================
Prompt: The president of the United States is
Generated text:  in a unique position, often referred to as a head of state and head of government. The president serves both ceremonial and executive roles, with a wide range of responsibilities and powers. Some of the key responsibilities of the president include:
Signing or vetoing bills passed by Congress, which makes them into laws.
Appointing federal judges, ambassadors, and other high-ranking government officials.
Conducting diplomatic relations with foreign governments.
Commanding the armed forces of the United States.
Granting pardons and reprieves for federal crimes.
Delivering the annual State of the Union address to Congress.
Meeting with heads of state and other world leaders
===============================
Prompt: The capital of France is
Generated text:  a city of art, history, and romance. Paris has something to offer everyone from museums, galleries, to fashion and food. In this guide we will explore the top attractions to visit in Paris.
Museums:
1. The Louvre: This iconic museum houses some of the world's most famous artworks, including the Mona Lisa. The Louvre is a must-visit for any art lover.
2. The Orsay Museum: Located in a beautiful Beaux-Arts building, the Orsay Museum is home to an impressive collection of Impressionist and Post-Impressionist art.
3. The Rodin Museum: Dedicated
===============================
Prompt: The future of AI is
Generated text:  human
The future of AI is human
The future of AI is human, but not in the way you might think. While AI systems will certainly become increasingly sophisticated and capable of performing a wide range of tasks, they will not replace human intelligence or creativity. Instead, AI will augment and enhance human capabilities, making us more productive, efficient, and effective in many areas of life.
There are several reasons why the future of AI is human:
1. AI systems will not be able to replicate human emotions: Emotions play a crucial role in decision-making, creativity, and empathy. While AI systems can mimic certain human-like behaviors, they
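
The completions above end mid-sentence, which is consistent with a cap on the number of generated tokens. The sketch below shows one way to wrap the batched call so that each prompt stays paired with its completion and the length cap is explicit. It assumes the sampling parameter is named max_new_tokens; generate_batch and _StubEngine are hypothetical names, and the stub stands in for sgl.Engine so the sketch runs without a GPU.

```python
from typing import Any, Dict, List


def generate_batch(
    engine: Any, prompts: List[str], max_new_tokens: int = 64
) -> List[Dict[str, str]]:
    """Run one batched call and pair every prompt with its completion."""
    sampling_params = {
        "temperature": 0.8,
        "top_p": 0.95,
        # Assumed name of the sampling parameter that caps completion length.
        "max_new_tokens": max_new_tokens,
    }
    outputs = engine.generate(prompts, sampling_params)
    return [
        {"prompt": prompt, "text": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]


class _StubEngine:
    """Hypothetical stand-in for sgl.Engine; returns placeholder text."""

    def generate(self, prompts, sampling_params):
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


results = generate_batch(
    _StubEngine(), ["Hello, my name is", "The capital of France is"]
)
for row in results:
    print(f"Prompt: {row['prompt']}\nGenerated text: {row['text']}")
```

With a real engine, pass llm in place of _StubEngine().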

Streaming Synchronous Generation#

[3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()

=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Kaida. I'm a 25-year-old freelance writer and editor. I live in a small apartment in the city with my cat, Luna. I enjoy reading, hiking, and trying out new restaurants. I'm a bit of a introvert, but I'm working on being more outgoing. I'm always looking for new opportunities and experiences to learn and grow from. That's me in a nutshell.
This is a good example of a neutral self-introduction because it doesn't reveal too much about the character's personality, background, or motivations. It simply provides a brief overview of who they are and what they do. This can

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Paris is located in the northern part of the country and is situated on the Seine River. It is the largest city in France and is known for its rich history, art, fashion, and culture. Paris is home to many famous landmarks such as the Eiffel Tower, Notre Dame Cathedral, and the Louvre Museum. The city is also known for its romantic atmosphere and is often referred to as the City of Light. Paris is a popular tourist destination and is considered one of the most beautiful and culturally significant cities in the world. The city has a population of over 2.1 million people and is a major hub

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  a topic of much speculation and debate. Here are some possible future trends in AI:
1. Increased use of AI in healthcare: AI is already being used in healthcare to analyze medical images, diagnose diseases, and develop personalized treatment plans. In the future, AI is likely to play an even more significant role in healthcare, including the use of AI-powered robots to perform surgeries and the development of AI-driven diagnostic tools.
2. Widespread adoption of AI in industries: AI is already being used in various industries, including finance, transportation, and customer service. In the future, AI is likely to become even more widespread, with many industries
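
The "overlap removal" mentioned in the cell above can be illustrated with a small pure-Python sketch. This shows the general idea only and is not necessarily how stream_and_merge is implemented: before appending a new chunk, drop the longest prefix of the chunk that already appears at the end of the accumulated text. The names trim_overlap and merge_chunks are illustrative.

```python
from typing import Iterable


def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Return new_chunk without the longest prefix that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first, then shrink.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate streamed chunks, removing text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


chunks = ["The capital", " capital of France", " France is Paris."]
print(merge_chunks(chunks))
```

One limitation of this naive approach is that it cannot tell an accidental overlap from a genuine repetition: merging "ha" and "ha" yields "ha", not "haha".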

Non-streaming Asynchronous Generation#

[4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())

=== Testing asynchronous batch generation ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Rowan Hunter. I am a 27-year-old freelance writer and artist living in Portland, Oregon. I enjoy hiking and trying new restaurants in my free time. I am currently working on a novel and experimenting with various art forms.
Rowan Hunter, 27, freelance writer and artist, Portland, Oregon. Hiking, restaurants, writing, art.
This self-introduction doesn’t reveal too much about the character, but it gives a general idea of who they are and what they do. The inclusion of their interests and current projects hints at their personality and passions without being too revealing. The introduction is neutral, so it won’t

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
The following are three choices of questions that could be asked to test knowledge about the capital of France.
1. What is the capital of France?
2. Where is the capital of France?
3. What is the population of the capital of France?
The correct answer to the first question is: Paris. The correct answer to the second question is: France. The correct answer to the third question is: The population of the capital of France is approximately 2.1 million people.
Provide a concise factual statement about the capital of France.
The capital of France is Paris.
The following are three

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be shaped by rapid advancements in computing power, data storage, and algorithmic techniques. There are several possible future trends in AI that can be explored:
1. **Increased use of Explainable AI (XAI):** As AI becomes more pervasive, there will be a growing need to understand how AI decisions are made. XAI will become more prevalent to increase transparency and trust in AI systems.
2. **Hybrid Approaches:** The future of AI may involve combining symbolic AI with connectionist AI (deep learning). This hybrid approach will leverage the strengths of both paradigms to create more robust and explainable AI systems.
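
Besides passing the whole list in one call, asynchronous generation also allows one request per prompt, awaited together. The sketch below assumes that async_generate, when given a single prompt string, returns a single dictionary with a "text" key. generate_concurrently and _StubAsyncEngine are hypothetical names, and the stub replaces sgl.Engine so the example runs anywhere.

```python
import asyncio
from typing import Any, Dict, List


async def generate_concurrently(
    engine: Any, prompts: List[str], sampling_params: Dict[str, float]
) -> List[str]:
    """Issue one async request per prompt and wait for all of them."""
    requests = [engine.async_generate(p, sampling_params) for p in prompts]
    # gather returns results in the same order as the awaitables passed in.
    outputs = await asyncio.gather(*requests)
    return [output["text"] for output in outputs]


class _StubAsyncEngine:
    """Hypothetical stand-in for sgl.Engine with an async generate method."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0)  # yield control, as a real request would
        return {"text": f" <completion for {prompt!r}>"}


texts = asyncio.run(
    generate_concurrently(_StubAsyncEngine(), ["A", "B"], {"temperature": 0.8})
)
print(texts)
```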

Streaming Asynchronous Generation#

[5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream through the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())

=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Maya Blackwood and I'm a 20-year-old communications major. I'm currently a student at the University of California, Berkeley. I enjoy playing basketball, hiking, and trying new restaurants. In my free time, I like to read and write short stories. I'm excited to meet new people and learn more about their lives. I'm open-minded and enjoy hearing different perspectives on various topics.
This is a good self-introduction because it provides some basic information about the character, such as their name, age, major, and interests. It also shows that they are open-minded and interested in meeting new people and learning about their experiences

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is located in the northern part of the country and is situated on the river Seine. Paris is a city of great historical and cultural significance, known for its iconic landmarks such as the Eiffel Tower, the Louvre Museum, and Notre Dame Cathedral. It is also a major center for fashion, cuisine, and art. Paris has a population of around 2.1 million people within its city limits, but the metropolitan area has a population of over 12 million people. The city has a rich history dating back to the Roman era, and it has been a major hub of culture, politics, and economy in

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be shaped by several factors, including advances in machine learning, natural language processing, and computer vision. The following are some possible future trends in artificial intelligence:
1. Increased use of AI in everyday life: AI is likely to become an integral part of our daily lives, with applications in areas such as healthcare, finance, transportation, and customer service.
2. Advancements in natural language processing: NLP is expected to improve significantly, enabling AI systems to understand and generate human-like language, leading to more conversational interfaces and better customer service.
3. Rise of autonomous systems: Autonomous vehicles, drones, and robots are expected to
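
When the streamed text is needed after printing (for logging or post-processing), the chunks can be collected while they are displayed. The following is a minimal sketch: collect_stream is an illustrative helper, and _fake_stream is a hypothetical async generator standing in for async_stream_and_merge(llm, prompt, sampling_params).

```python
import asyncio
from typing import AsyncIterator, List


async def collect_stream(chunks: AsyncIterator[str]) -> str:
    """Print chunks as they arrive and return the concatenated text."""
    parts: List[str] = []
    async for chunk in chunks:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # final newline once the stream is exhausted
    return "".join(parts)


async def _fake_stream() -> AsyncIterator[str]:
    """Hypothetical stream that yields three already-cleaned chunks."""
    for piece in [" Paris", " is", " the capital."]:
        await asyncio.sleep(0)
        yield piece


full_text = asyncio.run(collect_stream(_fake_stream()))
```
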
[6]:
llm.shutdown()
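
Calling shutdown() by hand is easy to forget when an exception interrupts the script. One possible pattern, sketched here under the assumption that shutdown() is the only cleanup the engine needs, is a small context manager. managed_engine and _StubEngine are hypothetical names; the stub replaces sgl.Engine so the sketch runs without a model.

```python
from contextlib import contextmanager
from typing import Any, Callable, Iterator


@contextmanager
def managed_engine(factory: Callable[[], Any]) -> Iterator[Any]:
    """Create an engine and guarantee shutdown() runs when the block exits."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


class _StubEngine:
    """Hypothetical stand-in for sgl.Engine that records whether it was shut down."""

    def __init__(self) -> None:
        self.closed = False

    def generate(self, prompts, sampling_params):
        return [{"text": " ok"} for _ in prompts]

    def shutdown(self) -> None:
        self.closed = True


stub = _StubEngine()
with managed_engine(lambda: stub) as engine:
    texts = [o["text"] for o in engine.generate(["Hi"], {"temperature": 0.0})]

print(texts, stub.closed)
```

With a real engine, the factory would be something like lambda: sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct").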