Offline Engine API#
SGLang provides a direct inference engine that runs without an HTTP server, which is useful when the additional HTTP layer adds unnecessary complexity or overhead. There are two general use cases:
Offline Batch Inference
Custom Server on Top of the Engine
This document focuses on offline batch inference, demonstrating four inference modes:
Non-streaming synchronous generation
Streaming synchronous generation
Non-streaming asynchronous generation
Streaming asynchronous generation
Additionally, you can easily build a custom server on top of the SGLang offline engine. A detailed example in a standalone Python script can be found in custom_server.
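As a rough illustration of that second use case, the sketch below wraps the engine in a small FastAPI app. This is a hedged sketch, not the bundled custom_server example: the fastapi/uvicorn dependency, the /generate route, and the request shape are assumptions made here.

# Minimal sketch of a custom server on top of the offline engine.
# Assumptions: fastapi and uvicorn are installed, and a single-prompt call
# to async_generate returns one output dict. See custom_server for the
# authoritative example.
from fastapi import FastAPI

import sglang as sgl

app = FastAPI()
llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")

@app.post("/generate")
async def generate(request: dict):
    output = await llm.async_generate(
        request["prompt"], request.get("sampling_params", {})
    )
    return {"text": output["text"]}

# Run with, e.g.: uvicorn my_server:app --port 8000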
Nest Asyncio#
Note that if you want to use the offline engine in IPython or other code that is already running inside an event loop, you need to apply nest_asyncio first:
import nest_asyncio
nest_asyncio.apply()
Advanced Usage#
The engine supports vision-language model (VLM) inference as well as extracting hidden states.
Please see the examples for further use cases.
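For hidden states specifically, the sketch below shows the general shape of the call. It assumes the enable_return_hidden_states server argument together with a per-request return_hidden_states flag, and that the states are surfaced under meta_info; treat it as a hedged outline and consult the examples for the authoritative version.

# Hedged sketch of hidden-state extraction; the meta_info field name is an
# assumption here, so check the examples for the exact output layout.
import sglang as sgl

hs_llm = sgl.Engine(
    model_path="qwen/qwen2.5-0.5b-instruct",
    enable_return_hidden_states=True,  # server-level switch
)
out = hs_llm.generate(
    "The capital of France is",
    {"temperature": 0.0, "max_new_tokens": 8},
    return_hidden_states=True,  # per-request switch
)
hidden = out["meta_info"]["hidden_states"]  # assumed location of the states
hs_llm.shutdown()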
Offline Batch Inference#
The SGLang offline engine supports batch inference with efficient scheduling.
[1]:
# launch the offline engine
import asyncio
import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge
llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
[2025-11-12 15:34:21] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-11-12 15:34:21] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-11-12 15:34:21] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-11-12 15:34:23] INFO trace.py:69: opentelemetry package is not installed, tracing disabled
[2025-11-12 15:34:23] WARNING memory_pool_host.py:36: Current platform not support pin_memory
[2025-11-12 15:34:24] WARNING server_args.py:1197: Attention backend not explicitly specified. Use fa3 backend by default.
[2025-11-12 15:34:24] INFO engine.py:123: server_args=ServerArgs(model_path='qwen/qwen2.5-0.5b-instruct', tokenizer_path='qwen/qwen2.5-0.5b-instruct', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='127.0.0.1', port=30000, grpc_mode=False, skip_server_warmup=False, warmups=None, nccl_port=None, checkpoint_engine_wait_weights_before_ready=False, dtype='auto', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', enable_fp32_lm_head=False, modelopt_quant=None, modelopt_checkpoint_restore_path=None, modelopt_checkpoint_save_path=None, modelopt_export_path=None, quantize_and_serve=False, mem_fraction_static=0.835, max_running_requests=128, max_queued_requests=None, max_total_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', enable_priority_scheduling=False, abort_on_priority_when_disabled=False, schedule_low_priority_values_first=False, priority_scheduling_preemption_threshold=10, schedule_conservativeness=1.0, page_size=1, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, radix_eviction_policy='lru', device='cuda', tp_size=1, pp_size=1, pp_max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=1071819355, constrained_json_whitespace_pattern=None, constrained_json_disable_any_whitespace=False, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, log_level='error', log_level_http=None, log_requests=False, log_requests_level=2, crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, tokenizer_metrics_custom_labels_header='x-custom-labels', tokenizer_metrics_allowed_custom_labels=None, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, gc_warning_threshold_secs=0.0, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, enable_trace=False, otlp_traces_endpoint='localhost:4317', api_key=None, served_model_name='qwen/qwen2.5-0.5b-instruct', weight_version='default', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, sampling_defaults='model', dp_size=1, load_balance_method='round_robin', load_watch_interval=0.1, prefill_round_robin_balance=False, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_eviction_policy='lru', lora_backend='csgmv', max_lora_chunk_size=16, attention_backend='fa3', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, nsa_prefill_backend='flashmla_sparse', nsa_decode_backend='fa3', speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_draft_load_format=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, 
speculative_accept_threshold_acc=1.0, speculative_token_map=None, speculative_attention_mode='prefill', speculative_moe_runner_backend=None, speculative_ngram_min_match_window_size=1, speculative_ngram_max_match_window_size=12, speculative_ngram_min_bfs_breadth=1, speculative_ngram_max_bfs_breadth=10, speculative_ngram_match_type='BFS', speculative_ngram_branch_length=18, speculative_ngram_capacity=10000000, ep_size=1, moe_a2a_backend='none', moe_runner_backend='auto', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, elastic_ep_backend=None, mooncake_ib_device=None, max_mamba_cache_size=None, mamba_ssm_dtype='float32', mamba_full_memory_ratio=0.9, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, enable_lmcache=False, kt_weight_path=None, kt_method=None, kt_cpuinfer=None, kt_threadpool_count=None, kt_num_gpu_experts=None, kt_max_deferred_experts_per_token=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', multi_item_scoring_delimiter=None, disable_radix_cache=False, cuda_graph_max_bs=4, cuda_graph_bs=[1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256], disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_tokenizer_batch_decode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, enable_torch_symm_mem=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_single_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, enable_piecewise_cuda_graph=False, torch_compile_max_bs=32, piecewise_cuda_graph_max_tokens=4096, piecewise_cuda_graph_tokens=[4, 8, 12, 16, 20, 24, 28, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 288, 320, 352, 384, 416, 448, 480, 512, 640, 768, 896, 1024, 1152, 1280, 1408, 1536, 1664, 1792, 1920, 2048, 2176, 2304, 2432, 2560, 2688, 2816, 2944, 3072, 3200, 3328, 3456, 3584, 3712, 3840, 3968, 4096], piecewise_cuda_graph_compiler='eager', torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, triton_attention_split_tile_size=None, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, 
enable_weights_cpu_backup=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, keep_mm_feature_on_device=False, enable_return_hidden_states=False, scheduler_recv_interval=1, numa_node=None, enable_deterministic_inference=False, rl_on_policy_target=None, enable_dynamic_batch_tokenizer=False, dynamic_batch_tokenizer_batch_size=32, dynamic_batch_tokenizer_batch_timeout=0.002, debug_tensor_dump_output_folder=None, debug_tensor_dump_layers=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, disaggregation_decode_enable_offload_kvcache=False, num_reserved_decode_tokens=512, disaggregation_decode_polling_interval=1, custom_weight_loader=[], weight_loader_disable_mmap=False, remote_instance_weight_loader_seed_instance_ip=None, remote_instance_weight_loader_seed_instance_service_port=None, remote_instance_weight_loader_send_weights_group_ports=None, enable_pdmux=False, pdmux_config_path=None, sm_group_num=8, mm_max_concurrent_calls=32, mm_per_request_timeout=10.0, decrypted_config_file=None, decrypted_draft_config_file=None)
[2025-11-12 15:34:30] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-11-12 15:34:30] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-11-12 15:34:30] INFO utils.py:164: NumExpr defaulting to 16 threads.
WARNING:sglang.srt.mem_cache.memory_pool_host:Current platform not support pin_memory
[2025-11-12 15:34:32] INFO trace.py:69: opentelemetry package is not installed, tracing disabled
[2025-11-12 15:34:32] WARNING memory_pool_host.py:36: Current platform not support pin_memory
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 4.39it/s]
Capturing batches (bs=1 avail_mem=71.74 GB): 100%|██████████| 20/20 [00:01<00:00, 10.37it/s]
Non-streaming Synchronous Generation#
[2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)

for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
===============================
Prompt: Hello, my name is
Generated text: Mikaela and I am the owner of The Royal Eiffel Tower. I am passionate about living a sustainable lifestyle and want to share my personal stories and experiences with others who share the same values.
I decided to start my own business because I feel that it is important to be a part of something bigger than myself, to help others, and to create something that has a positive impact. I am always looking for new ideas and opportunities to make a difference and I am excited to start The Royal Eiffel Tower! I hope that my customers can find value in their experience, whether they are looking for a unique gift, a cozy
===============================
Prompt: The president of the United States is
Generated text: running for a second term. He will be replaced by a new president immediately after the inauguration. What is the probability that the president is re-elected given that he is defeated by his opponents in the election? To determine the probability that the president is re-elected given that he is defeated, we need to consider the following:
1. The president has a 50% chance of winning the election.
2. The president is defeated by his opponents.
3. Given that the president is defeated, the probability of him winning the election is the same as the probability of winning in an election with all opponents eliminated.
Since the president is defeated,
===============================
Prompt: The capital of France is
Generated text: :
A. Paris
B. London
C. Rome
D. Madrid
The capital of France is:
A. Paris
Paris is the capital of France. The other options are not capitals of France:
- London is the capital of England.
- Rome is the capital of Italy.
- Madrid is the capital of Spain.
None of these are the capitals of France. The correct answer is Paris.
To verify: Paris is indeed the capital of France, located in the southeast of the country. It's known for its historical landmarks such as the Eiffel Tower and the Louvre Museum. Paris is a popular tourist
===============================
Prompt: The future of AI is
Generated text: here and it is already taking over our lives. Predictive Analytics is a critical area of interest because we are already seeing the emergence of AI with a potentially life changing impact on society. With the recent release of the iPhone 12, Apple is the first major company to release a device that is built around AI. This is not the first time that Apple has embraced AI, but their iPhone 12 is a game changer in the market. If you are a follower of Apple and want to learn about the future of AI and how Apple is making it happen, then this book is a must read.
This book discusses the future of
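Besides the generated text, each output dict carries a meta_info entry with per-request bookkeeping. A hedged peek below; the exact keys, such as prompt_tokens and completion_tokens, are assumptions and may vary by release.

# Inspect the bookkeeping attached to the first output; key names are
# assumptions based on common SGLang releases.
meta = outputs[0]["meta_info"]
print(meta.get("prompt_tokens"), meta.get("completion_tokens"), meta.get("finish_reason"))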
Streaming Synchronous Generation#
[3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
=== Testing synchronous streaming generation with overlap removal ===
Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text: [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm passionate about [job title] and [job title]. I enjoy [job title] because [reason for interest]. What's your favorite hobby or activity? I love [hobby or activity]. What's your favorite book or movie? I love [book or movie]. What's your favorite food? I love [food]. What's your favorite color? I love [color]. What
Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text: Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament, the French National Museum, and the French Academy of Sciences. Paris is a bustling metropolis with a rich cultural heritage and is a major economic and political center in Europe. The city is known for its fashion, art, and cuisine, and is a popular tourist destination. Paris is also home to the Eiffel Tower, the Louvre Museum, and the Notre-Dame Cathedral. The city is known for its iconic landmarks and is a major economic and
Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text: likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. Some possible future trends in AI include:
1. Increased use of AI in healthcare: AI is already being used in healthcare to diagnose and treat diseases, and it has the potential to revolutionize the field. AI-powered diagnostic tools, such as AI-powered X-ray machines, could significantly improve patient outcomes.
2. Increased use of AI in finance: AI is already being used in finance to automate trading, fraud detection, and risk management. As AI technology continues to improve, we can expect to see even more sophisticated applications in finance.
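For reference, stream_and_merge wraps a plain streaming call: passing stream=True to generate yields chunk dicts as tokens arrive. A minimal sketch follows; whether each chunk's text is cumulative or incremental can differ by version, which is exactly the overlap the helper above removes.

# Raw synchronous streaming without the merging helper; chunks may overlap.
for chunk in llm.generate(prompts[0], sampling_params, stream=True):
    print(chunk["text"], end="", flush=True)
print()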
Non-streaming Asynchronous Generation#
[4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)
    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
=== Testing asynchronous batch generation ===
Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text: [Character's Name]. I am a [Age] year old [Occupation or Profession] who has always been passionate about [Why is it that you are passionate about [Occupation or Profession]].
I am always learning and growing, and I am always up for new challenges. I am a team player, always looking to contribute to the team and get the best out of everyone. I am an excellent communicator, always able to convey my ideas clearly and effectively. I am a hard worker, always putting in the extra effort to get things done.
And most importantly, I am a friend. I am always there for you, whether
Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text: Paris.
Paris is the largest city in France and the second-largest city in the European Union after Rome. It is known for its beautiful architecture, rich cultural heritage, and annual celebration of the Etoile de Paris (the Star of Paris). Paris is also a world-renowned music capital, home to iconic venues like the Opéra Garnier and the Théâtre du Châtelet. The city is known for its bustling street food, festivals, and world-class museums, including the Louvre and the Centre Pompidou. Paris is a popular tourist destination, known for its vibrant nightlife and extensive museums and art galleries. As
Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text: expected to be characterized by rapid development and innovation, as well as ongoing changes to the way that AI is used and deployed. Here are some possible trends that could influence the future of AI:
1. Increased focus on ethical considerations: As AI systems become more advanced, they will increasingly be used for decision-making processes that affect people's lives. Therefore, there is an increasing emphasis on ethical considerations and the potential for unintended consequences. AI developers will need to be mindful of the impact that their algorithms and models may have on society and take steps to mitigate any potential harms.
2. Continued development of deep learning: Deep learning is a key area of
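Because async_generate returns a coroutine per call, the prompts can also be issued as independent tasks rather than one batched call, letting the engine's scheduler batch them internally. A sketch, assuming a single-prompt call returns one output dict:

# Launch one task per prompt and gather the results concurrently.
async def gather_main():
    tasks = [llm.async_generate(p, sampling_params) for p in prompts]
    results = await asyncio.gather(*tasks)
    for prompt, result in zip(prompts, results):
        print(f"\nPrompt: {prompt}\nGenerated text: {result['text']}")


asyncio.run(gather_main())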
Streaming Asynchronous Generation#
[5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
=== Testing asynchronous streaming generation (no repeats) ===
Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text: Emily, and I'm a friendly, laid-back barista at a local coffee shop. I'm here to serve you all the time and make sure your drink is perfect for you. I love brewing coffee and helping people find their way around the bustling coffee shop scene. I'm a go-to for those who want to start their day with a caffeine fix or a smoothie. I'm here to assist you in finding the perfect cup of coffee and bring you the best experience possible. How can I help you today? I'll take care of you and make sure that you're getting the best experience possible. What do you need help with
Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text: Paris, an ancient city nestled in the Saintes-Maries-de-la-Soleil mountains on the Mediterranean coast.
Paris is a vibrant metropolis known for its rich history, cultural importance, and stunning architecture. The city's streets are lined with historic monuments, including the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. It is also home to iconic landmarks such as the Seine River and the Arc de Triomphe. Despite its size, Paris boasts a diverse population and is a major cultural and financial center in Europe. Its status as both a political and economic capital has made it a popular destination for tourists from
Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text: likely to be characterized by a number of different trends that will shape how the technology is used and developed. Here are some potential areas of development that could be expected in the coming years:
1. Increased efficiency and accuracy: One of the biggest challenges facing AI is its ability to process and analyze large amounts of data quickly and accurately. As we become more data-driven, we may see a growing trend toward more efficient and accurate AI systems, with the goal of making data-driven decisions with greater speed and precision.
2. Deep learning: Deep learning is a type of machine learning that involves building complex neural networks with many layers. As the technology continues
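As in the synchronous case, async_stream_and_merge wraps a plain streaming call: awaiting async_generate with stream=True yields an async generator of chunk dicts. A hedged sketch of the raw form:

# Raw asynchronous streaming without the merging helper; chunks may overlap.
async def raw_stream(prompt):
    generator = await llm.async_generate(prompt, sampling_params, stream=True)
    async for chunk in generator:
        print(chunk["text"], end="", flush=True)
    print()


asyncio.run(raw_stream(prompts[0]))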
[6]:
llm.shutdown()