Transformers fallback in SGLang
SGLang can fall back to using models that are available in transformers. This works for most decoder-style language models, and support for vision-language models is coming soon!
Example Launch Command
By default, we use the SGLang implementation of a model when one is available; otherwise, we fall back to the transformers one. You can also switch the implementation explicitly by setting impl to transformers.
python3 -m sglang.launch_server \
--model-path meta-llama/Llama-3.2-1B-Instruct \
--host 0.0.0.0 \
--port 30000 \
--impl transformers
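
Once the server is running, it exposes an OpenAI-compatible API. As a quick check, a request could look like the following (a minimal sketch, assuming the server launched above is reachable on localhost:30000):

import requests

# Query the OpenAI-compatible chat completions endpoint of the server
# launched above (assumes it is reachable on localhost:30000).
response = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "meta-llama/Llama-3.2-1B-Instruct",
        "messages": [{"role": "user", "content": "What is the capital of France?"}],
        "max_tokens": 64,
    },
)
print(response.json()["choices"][0]["message"]["content"])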
Supported features
Quantization
The transformers fallback supports most of the quantization methods available in SGLang (except GGUF). See the Quantization page for more information about supported quantization in SGLang.
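
For example, serving an AWQ-quantized checkpoint through the fallback could look like the following sketch (the model path is a placeholder; awq is one of the quantization methods the launcher accepts):

python3 -m sglang.launch_server \
--model-path <an-awq-quantized-model> \
--quantization awq \
--impl transformers \
--host 0.0.0.0 \
--port 30000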
Remote code
This fallback also means that any model on the Hub that can be used in transformers with trust_remote_code=True, and that correctly implements attention, can be used in production!
A model just needs the following two things:
from torch import nn
from transformers import PreTrainedModel
from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS


class MyAttention(nn.Module):
    def forward(self, hidden_states, **kwargs):  # <- kwargs are required
        ...
        # Dispatch to whichever attention backend the config selects.
        attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
        attn_output, attn_weights = attention_interface(
            self,
            query_states,
            key_states,
            value_states,
            **kwargs,
        )
        ...


class MyModel(PreTrainedModel):
    # Declares that the model works with pluggable attention backends.
    _supports_attention_backend = True
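
Assuming such a model is hosted on the Hub with custom modeling code, serving it through the fallback might look like this (the model path is a placeholder):

python3 -m sglang.launch_server \
--model-path <org>/<custom-model-with-remote-code> \
--impl transformers \
--trust-remote-code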
Here is what happens in the background:

1. The config is loaded.
2. The MyModel Python class is loaded from the auto_map, and we check that the model _supports_attention_backend.
3. The TransformersModel backend is used. See /srt/models/transformers, which leverages self.config._attn_implementation = "sglang", hence the need to use ALL_ATTENTION_FUNCTIONS (a sketch of this mechanism follows the list).
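
To illustrate that last step: because the model looks up its attention function via ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation], a serving backend only needs to register an attention function under its own name for the model to pick it up. A minimal sketch of that mechanism, using transformers' AttentionInterface (the names here are illustrative, not SGLang's actual registration code):

import torch
from transformers import AttentionInterface


def my_backend_attention(module, query, key, value, attention_mask=None, **kwargs):
    # Stand-in attention implementation; a real backend would call its own
    # optimized kernels here instead of plain SDPA.
    attn_output = torch.nn.functional.scaled_dot_product_attention(query, key, value)
    attn_output = attn_output.transpose(1, 2).contiguous()
    return attn_output, None


# Register under a custom name; setting config._attn_implementation to this
# name makes ALL_ATTENTION_FUNCTIONS[...] resolve to the function above.
AttentionInterface.register("my_backend", my_backend_attention)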
That’s it!