# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would only add complexity or overhead. Two common use cases are:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in any other code that already runs inside an event loop, apply `nest_asyncio` first:
```python
import nest_asyncio

nest_asyncio.apply()

```
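To illustrate why the patch is needed, the following standard-library-only sketch reproduces the same class of error: starting an event loop while another one is already running, as happens inside IPython or Jupyter, raises `RuntimeError`. It does not use SGLang, and `inner` and `outer` are illustrative names rather than part of any API.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # A loop is already running here, much like inside a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # attempts to start a second, nested loop
    except RuntimeError as exc:
        coro.close()  # the coroutine never ran; close it to avoid a warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

Calling `nest_asyncio.apply()` patches asyncio so that such nested calls are permitted.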

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  David. I come from America. I have a pet dog named Max. This is Max's first birthday. I love Max very much. Here's a photo of me. Max's name is Max. He is 10 years old. He is a gold puppy. We meet in a big house. He is very happy. He lives with us. We love him very much. My parents love me very much, too. They like me very much. I have a brother, Peter, and a sister, Ann. Peter and Ann are the smallest. Ann is 5 years old. She likes to play with toys. And
Prompt: The president of the United States is
Generated text:  a very important person in the country. The president helps to make important decisions for the country. The president is also the head of the government, and he makes the laws that are passed. There are usually four members of the president's team, including the president himself. One of the duties of the president is to take care of the country. The president gets to be in charge of the country for four years, and then he has to
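In offline batch jobs the results are usually written to disk rather than printed. The sketch below pairs each prompt with its completion and serializes the pairs as JSON Lines. It relies only on what the example above shows: each output is a dict with a `text` key. The helper `to_jsonl` and the `fake_outputs` list are hypothetical stand-ins so that the snippet runs without a model.

```python
import json
from typing import Dict, Iterable, List


def to_jsonl(prompts: List[str], outputs: Iterable[Dict]) -> str:
    """Pair each prompt with its output's "text" field, one JSON object per line."""
    lines = [
        json.dumps({"prompt": p, "completion": o["text"]}, ensure_ascii=False)
        for p, o in zip(prompts, outputs)
    ]
    return "\n".join(lines)


# Stand-in for the list returned by llm.generate(prompts, sampling_params).
fake_outputs = [{"text": " Paris."}, {"text": " bright."}]
jsonl = to_jsonl(["The capital of France is", "The future of AI is"], fake_outputs)
print(jsonl)
```

With a real engine, `fake_outputs` would be replaced by the return value of `llm.generate(prompts, sampling_params)`.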

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert a short description of your character here]. I enjoy [insert a short description of your character's hobbies or interests]. I'm always looking for new challenges and opportunities to grow and learn. What's your favorite hobby or activity? I love [insert a short description of your favorite activity here]. I'm always looking for ways to improve myself and expand my knowledge. What's your favorite book or movie? I love [insert a short description

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, which is known for its iconic Eiffel Tower, Notre-Dame Cathedral, and vibrant French culture. It is also a major economic and political center, hosting numerous world-renowned museums, theaters, and landmarks. Paris is a popular tourist destination, attracting millions of visitors each year. The city is known for its rich history, art, and cuisine, and is a major hub for international business and diplomacy. The French capital is a cultural and political center of the world, and is a major tourist destination. Paris is a vibrant and dynamic city with a rich history and a diverse population. The city is also known for its iconic

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in several key areas, including:

1. Increased integration with human intelligence: AI systems are likely to become more integrated with human intelligence, allowing them to learn and adapt to new situations and tasks. This could lead to more sophisticated and flexible AI systems that can perform a wider range of tasks.

2. Enhanced machine learning capabilities: AI systems are likely to become more capable of learning from data and making more accurate predictions and decisions. This could lead to more efficient and effective use of resources, as well as better decision-making in various industries.

3. Improved privacy and security: As AI systems become more integrated with
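The "overlap removal" mentioned above matters when consecutive streamed chunks can repeat text that was already received. The following is a minimal sketch of one way to merge such chunks; it is not necessarily how `stream_and_merge` is implemented, and the names `strip_overlap` and `merge_chunks` are hypothetical.

```python
from typing import Iterable


def strip_overlap(existing: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing."""
    for size in range(min(len(existing), len(new_chunk)), 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, skipping text that repeats the end of the result so far."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


merged = merge_chunks(["Hello", "Hello, wor", "world!"])
print(merged)  # Hello, world!
```

The search starts from the longest possible overlap, so a chunk that fully repeats earlier text contributes nothing to the merged result.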



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [job title] for [company name]. I am a self-proclaimed [vacancy] and I have a passion for [vacancy]. I'm a [status] employee who is dedicated to [vacancy]. I'm confident in my abilities and always strive to [vacancy] in whatever role I choose. I'm always looking for ways to [vacancy] and [vacancy], and I'm eager to learn and grow in my career. I'm also [vacancy] and [vacancy], and I'm always ready to make a difference in the workplace. I'm confident in

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

The statement is: "Paris, the capital of France, is renowned for its rich cultural heritage, iconic landmarks such as the Eiffel Tower, and its role in fostering international trade and diplomacy." This concise statement encapsulates the essence of Paris as France's most important and cul
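Because the asynchronous call is awaitable, it can be combined with other coroutines using standard asyncio tools. The sketch below uses `asyncio.gather` to await several groups of prompts concurrently. `fake_async_generate` is a hypothetical stand-in for `llm.async_generate` so that the snippet runs without a model, and whether concurrent calls bring any benefit with a real engine is an assumption to verify.

```python
import asyncio
from typing import Dict, List


async def fake_async_generate(prompts: List[str]) -> List[Dict[str, str]]:
    # Hypothetical stand-in for `await llm.async_generate(prompts, sampling_params)`.
    await asyncio.sleep(0.01)
    return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def run_groups(groups: List[List[str]]) -> List[List[Dict[str, str]]]:
    # gather() runs the awaitables concurrently and preserves input order.
    return await asyncio.gather(*(fake_async_generate(g) for g in groups))


results = asyncio.run(
    run_groups([["Hello, my name is"], ["The capital of France is"]])
)
print(len(results), results[1][0]["text"])
```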

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [born in [Year] or born in [Year] (Age)] year old. I've always been fascinated by the world around me and the intricacies of human emotions, and I've always sought to use my background in psychology to help others in need. I'm currently pursuing a [university] degree in [my major], but I don't feel like I'm fully exploring all of it yet. I'm always looking for new ways to learn and grow, and I'm always open to new experiences and adventures. I enjoy working in teams and trying new things, and I'm always striving

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, a bustling metropolis with a rich history and culture, known for its iconic landmarks such as the Eiffel Tower, the Louvre Museum, and the Notre-Dame Cathedral, as well as its diverse cuisine, fashion, and nightlife. Paris is a hub for art, fashion, and culture, attracting millions of visitors each year, making it a global destination that is recognized as one of the world’s most culturally significant cities. The French capital is a perfect blend of tradition and modernity, and it continues to be a leader in international affairs and diplomacy.

Note: This statement includes the elements of the historical background, cultural

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by rapid advancements in several areas, including:

1. More advanced algorithms: AI systems will get even better at recognizing patterns and making predictions, while also learning from data to improve their performance.

2. Increased reliance on data: The more data we have, the better we can predict and make decisions based on it. This will drive even more innovation in AI.

3. AI for personal use: We'll see a growing focus on developing AI that is useful and easy to use for everyday tasks, such as home automation, healthcare, and education.

4. Robotic augmentation: Robots will become more integrated into our lives,

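When the streamed text is needed as a single string, for example for logging or post-processing, the chunks can be collected instead of printed. The sketch below shows the pattern with a hypothetical `fake_stream` async generator standing in for `async_stream_and_merge(llm, prompt, sampling_params)`. The helper name `collect` is also an assumption, not part of SGLang.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_stream(pieces: List[str]) -> AsyncIterator[str]:
    # Hypothetical stand-in for async_stream_and_merge(llm, prompt, sampling_params).
    for piece in pieces:
        await asyncio.sleep(0)  # yield control, as a real stream would
        yield piece


async def collect(stream: AsyncIterator[str]) -> str:
    """Consume an async stream of text chunks and join them into one string."""
    parts = []
    async for chunk in stream:
        parts.append(chunk)
    return "".join(parts)


text = asyncio.run(
    collect(fake_stream([" Paris", ",", " a", " bustling", " met", "ropolis"]))
)
print(text)
```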
```python
llm.shutdown()
```
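In scripts, it can be convenient to guarantee that `shutdown()` runs even when generation raises an exception. One possible pattern is a small context manager, sketched below. `FakeEngine` and `engine_session` are hypothetical names introduced for illustration. The only engine method assumed is `shutdown()`, which appears in the cell above.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine; only tracks shutdown()."""

    def __init__(self) -> None:
        self.closed = False

    def shutdown(self) -> None:
        self.closed = True


@contextmanager
def engine_session(factory: Callable[[], FakeEngine]) -> Iterator[FakeEngine]:
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()  # runs even if the body raises


with engine_session(FakeEngine) as engine:
    pass  # call engine.generate(...) here when using a real engine

print(engine.closed)
```

With a real engine, the factory could be something like `lambda: sgl.Engine(model_path=...)`, using the same constructor call as at the top of this document.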