Offline Engine API#

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. Two common use cases are:

  • Offline Batch Inference

  • Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

  • Non-streaming synchronous generation

  • Streaming synchronous generation

  • Non-streaming asynchronous generation

  • Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in custom_server.
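As a rough illustration of the custom-server idea, the sketch below wraps an engine in a plain request handler that a web framework of your choice could call. The function name `handle_generate_request`, the payload fields, and the default sampling values are hypothetical. The only engine behaviour assumed is that `generate(prompts, sampling_params)` returns a list of dicts with a `"text"` key, as in the batch examples in this document.

```python
from typing import Any, Dict


def handle_generate_request(engine: Any, payload: Dict[str, Any]) -> Dict[str, Any]:
    """Turn a JSON-like request payload into a JSON-like response.

    `engine` is any object exposing generate(prompts, sampling_params) that
    returns a list of dicts with a "text" key (an assumption that mirrors the
    batch examples in this document).
    """
    prompt = payload.get("prompt")
    if not isinstance(prompt, str) or not prompt:
        return {"error": "field 'prompt' must be a non-empty string"}

    # Hypothetical defaults; a request may override them.
    sampling_params = {"temperature": 0.8, "top_p": 0.95}
    sampling_params.update(payload.get("sampling_params", {}))

    # Send a one-element batch and unwrap the single result.
    outputs = engine.generate([prompt], sampling_params)
    return {"prompt": prompt, "text": outputs[0]["text"]}
```

Keeping the handler independent of any HTTP framework means it can be unit-tested with a stub engine and reused behind whichever server you prefer.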

Nest Asyncio#

If you use the offline engine in IPython or in other code that already runs inside an event loop, add the following code first:

import nest_asyncio

nest_asyncio.apply()
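If the same code has to run both as a plain script and inside a notebook, one option is to apply the patch only when an event loop is already running. The sketch below is an illustration under that assumption. The helper names are hypothetical, and whether a loop counts as "running" at the top level depends on the host environment.

```python
import asyncio


def running_inside_event_loop() -> bool:
    """Return True when called from code that already runs inside an event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # asyncio raises RuntimeError when no loop is running in this thread.
        return False
    return True


def apply_nest_asyncio_if_needed() -> bool:
    """Patch asyncio only when a loop is already running.

    Returns True if nest_asyncio was applied, False otherwise.
    """
    if not running_inside_event_loop():
        return False
    import nest_asyncio  # imported lazily so plain scripts do not need it

    nest_asyncio.apply()
    return True
```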

Advanced Usage#

The engine also supports VLM (vision-language model) inference and hidden-state extraction.

Please see the examples for further use cases.

Offline Batch Inference#

The SGLang offline engine supports batch inference with efficient scheduling.

[1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")

Non-streaming Synchronous Generation#

[2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
===============================
Prompt: Hello, my name is
Generated text:  Xander. I am 17 years old and I'm in high school. I have a very active social life, particularly with my friends. I enjoy playing sports and am very good at it. However, I also like to spend time on my phone and watch movies. Recently, I've been having trouble with anxiety, and I've been seeking help from a therapist.

What is your opinion on the therapy process and your own personal journey towards finding relief? Should I pursue therapy or skip it altogether? I'm not sure what I should do, and I'm feeling overwhelmed by the prospect of seeking help. Can you offer any advice
===============================
Prompt: The president of the United States is
Generated text:  trying to decide what sport to promote for the upcoming summer. After considering the budget, the cost of hiring a coach is $20,000, and the cost of buying a pool table is $10,000. The president also considers that the pool table will be installed in $400,000 budget and will be operational for 8 months. He wants to maximize the profit. What is the maximum profit the president can get from promoting the pool table? The profit from promoting the pool table would be the difference between the profit from the pool table and the cost of the pool table. The
===============================
Prompt: The capital of France is
Generated text:  _________. A. Paris B. London C. Rome D. New York
The capital of France is Paris. Therefore, the correct answer is:

A. Paris

Paris is the capital city of France and is renowned for its iconic Eiffel Tower and historical landmarks like the Louvre Museum. While London, Rome, and New York are also important cities in Europe, Paris stands out as the capital city of France due to its status as the country's economic, cultural, and political center. The Eiffel Tower, mentioned in your question, is a famous landmark in Paris. London is known for its Tower of London,
===============================
Prompt: The future of AI is
Generated text:  in the hands of the customers

There is a lot of buzz about the future of artificial intelligence (AI). There are a lot of discussions about how AI is evolving and how its impact will shape our lives. But what happens if the AI is not designed with the customer's needs in mind? What happens if the AI is not designed with the customer's needs in mind? The customer's needs are of a critical importance and an important part of the innovation process. This is why customer input is necessary in every phase of the AI development process. By incorporating customer insights into the AI development process, we can create solutions that better meet the needs
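For very large prompt lists, it can be convenient to submit the work in slices so that results can be saved incrementally. The sketch below shows one way to do that. `chunked`, `generate_in_chunks`, and the default `chunk_size` are hypothetical, and the engine is only assumed to behave like `llm.generate(prompts, sampling_params)` above, returning one dict with a `"text"` key per prompt.

```python
from typing import Any, Dict, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start : start + size])


def generate_in_chunks(
    engine: Any,
    prompts: Sequence[str],
    sampling_params: Dict[str, Any],
    chunk_size: int = 256,  # hypothetical default, tune for your workload
) -> List[Dict[str, str]]:
    """Run batch generation slice by slice and pair each prompt with its text."""
    results: List[Dict[str, str]] = []
    for batch in chunked(prompts, chunk_size):
        outputs = engine.generate(batch, sampling_params)
        for prompt, output in zip(batch, outputs):
            results.append({"prompt": prompt, "text": output["text"]})
    return results
```

Since the engine already schedules requests internally, slicing is a bookkeeping convenience here rather than a performance technique.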

Streaming Synchronous Generation#

[3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()

=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [Age] year old [Occupation]. I'm a [Type of Vehicle] driver, and I've been driving for [Number of Years] years. I'm passionate about [Favorite Activity/Interest], and I enjoy [Reason for Passion]. I'm always looking for new experiences and learning new things, and I'm always eager to improve myself. I'm a [Type of Person] who is [Positive Traits], and I'm always ready to help others. I'm a [Type of Person] who is [Positive Traits], and I'm always ready to help others. I'm a [

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament and the French Quarter, where many famous French artists and intellectuals have lived and worked. Paris is a bustling metropolis with a rich cultural heritage and is a major tourist destination. It is also known for its cuisine, including the famous croissants and its famous cheese, Brie. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly into one another. It is a city that has played a significant role in French history and culture

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by a number of trends that are expected to shape the way we live, work, and interact with technology. Here are some potential trends that are likely to shape the future of AI:

1. Increased use of AI in healthcare: AI is already being used in healthcare to improve patient outcomes and reduce costs. As AI technology continues to improve, we can expect to see even more widespread use of AI in healthcare, with more sophisticated algorithms and machine learning techniques being developed to improve diagnosis, treatment, and patient care.

2. Increased use of AI in finance: AI is already being used in finance to improve risk management, fraud
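The "overlap removal" mentioned above can be pictured as follows: when a streamed chunk begins with text that has already been emitted, the repeated prefix is dropped before the chunk is appended. The sketch below illustrates that idea in plain Python. It is not necessarily how `stream_and_merge` is implemented, and the names `strip_repeated_prefix` and `merge_chunks` are hypothetical.

```python
from typing import Iterable


def strip_repeated_prefix(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of `new_chunk` that `existing_text` already ends with."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first and shrink until one matches.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate streamed chunks while removing overlapping text."""
    merged = ""
    for chunk in chunks:
        merged += strip_repeated_prefix(merged, chunk)
    return merged


# Example: the second and third chunks repeat the tail of what came before.
print(merge_chunks(["Hello, my", " my name", "name is"]))  # Hello, my name is
```

One limitation of this heuristic is that text the model genuinely repeats (for example "ha" followed by "ha") is indistinguishable from an overlap and would be trimmed.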

Non-streaming Asynchronous Generation#

[4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())

=== Testing asynchronous batch generation ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I'm a [Type] who enjoy [Your interests or hobbies]. I'm a [Age] year old and I love [Your hobbies or activities]. I'm looking for a role that would allow me to [Your preferred role or task]. If you have any specific information about me that you'd like to share, please let me know and I'll be happy to share it with you. [Your Name] [Your Contact Information] [Your Image or Profile Picture] [Your Message] [Your Contact Information] [Your Image or Profile Picture] [Your Message]

(Click "Save" to save your

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the largest city and the most populous of the 20 provinces of France. The city sits at the foot of the Île de la Cité, near the Seine River, and is home to the Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, and the Palace of Versailles. Paris is known for its rich history, art, music, literature, and cuisine, as well as its status as a global city. The city has a diverse population of around 11 million people, which is the largest in the world. The population density of Paris is high, with around 3

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  full of possibilities and potential innovations. Some of the most exciting and promising trends include:

1. Increased use of AI in healthcare: AI is already being used in healthcare to improve diagnosis accuracy, predict patient outcomes, and personalize treatment plans. As more AI algorithms become more sophisticated, we can expect to see even more innovative applications in the future.

2. Autonomous vehicles: As autonomous vehicles become more prevalent, we can expect to see more AI-driven technology in transportation, including self-driving cars, drones, and robot arms.

3. Augmented reality and virtual reality: These technologies are already being used in education and entertainment, but as they become more
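When requests arrive independently rather than as one list, the asynchronous API can be combined with `asyncio.gather`. The sketch below is a minimal illustration. `generate_concurrently` and `max_concurrency` are hypothetical, and the engine is only assumed to offer `async_generate(prompts, sampling_params)` returning a list of dicts with a `"text"` key, as in the cell above.

```python
import asyncio
from typing import Any, Dict, List, Sequence


async def generate_concurrently(
    engine: Any,
    prompts: Sequence[str],
    sampling_params: Dict[str, Any],
    max_concurrency: int = 8,  # hypothetical cap on in-flight requests
) -> List[str]:
    """Issue one async request per prompt and return the texts in prompt order."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def generate_one(prompt: str) -> str:
        # The semaphore limits how many requests are awaited at the same time.
        async with semaphore:
            outputs = await engine.async_generate([prompt], sampling_params)
            return outputs[0]["text"]

    # gather preserves the order of the awaitables it is given.
    return list(await asyncio.gather(*(generate_one(p) for p in prompts)))
```

If all prompts are known up front, passing the whole list to a single `async_generate` call, as shown above, is simpler.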

Streaming Asynchronous Generation#

[5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())

=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am a young professional with [Number] years of experience in [Industry or field of work]. I am a passionate [reason for passion], and I believe that [reason for passion] has a significant impact on my ability to succeed and achieve success. I am a [reason for passion] and have always [reason for passion] in my professional life. I am a team player, and I believe that everyone has the potential to make a positive impact. I am excited to bring my unique perspective and skills to any project I am asked to work on, and I am always looking for ways to improve myself and my

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known as "La Ville Noire" or "La Rose de l'Amour" in French. It is the city of the kings, the city of the muses, and the city of love, with an estimated population of over 2 million people. Paris is famous for its architecture, art, and cuisine, and it has a long history dating back over 2000 years. The city is also a major center for business and finance, and it is a UNESCO World Heritage Site. The French people have a rich cultural heritage and enjoy a unique blend of art, literature, and cuisine. The city is

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  diverse and constantly evolving, with many potential trends shaping its development. Here are some of the most likely trends that could shape the AI industry in the coming years:

1. Natural Language Processing (NLP): AI that can understand and generate human language will continue to advance. NLP will become more advanced and more integrated with other AI technologies, such as image recognition and computer vision.

2. Deep Learning: Deep learning is the most prevalent AI technology in use today. It will continue to improve and become more efficient. Deep learning is particularly useful for image recognition, natural language processing, and speech recognition.

3. Reinforcement Learning: Reinforcement
[6]:
llm.shutdown()
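In a script, it is easy to skip `shutdown()` when an exception interrupts generation. One way to guard against that is a small context manager, sketched below. `managed_engine` is a hypothetical helper, and the engine is only assumed to expose the `shutdown()` method used above.

```python
from contextlib import contextmanager
from typing import Any, Callable, Iterator


@contextmanager
def managed_engine(factory: Callable[[], Any]) -> Iterator[Any]:
    """Create an engine via `factory` and always call shutdown() on exit."""
    engine = factory()
    try:
        yield engine
    finally:
        # Runs on normal exit and when the body raises.
        engine.shutdown()


# Usage sketch (requires sglang and a suitable GPU, so it is left commented out):
# import sglang as sgl
# with managed_engine(lambda: sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")) as llm:
#     outputs = llm.generate(["The capital of France is"], {"temperature": 0.8})
```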