Serving Runtime

OME uses two CRDs for defining model serving environments: ServingRuntime and ClusterServingRuntime. Both manage the runtime environment for model serving.

The only difference between the two is scope: a ServingRuntime is namespace-scoped, while a ClusterServingRuntime is cluster-scoped.

A ClusterServingRuntime defines the templates for Pods that can serve one or more particular models. Each ClusterServingRuntime defines key information such as the container image of the runtime and a list of the models that the runtime supports. Other configuration settings for the runtime can be conveyed through environment variables in the container specification.

These CRDs allow for improved flexibility and extensibility, enabling users to quickly define or customize reusable runtimes without having to modify any controller code or any resources in the controller namespace.

The following is an example of a ClusterServingRuntime:

apiVersion: ome.io/v1beta1
kind: ClusterServingRuntime
metadata:
  name: srt-mistral-7b-instruct
spec:
  supportedModelFormats:
    - name: safetensors
      modelFormat:
        name: safetensors
        version: "1.0.0"
      modelFramework:
        name: transformers
        version: "4.36.2"
      modelArchitecture: MistralForCausalLM
      autoSelect: true
      priority: 1
  protocolVersions:
    - openAI
  modelSizeRange:
    max: 9B
    min: 5B
  engineConfig:
    runner:
      image: lmsysorg/sglang:v0.4.6.post6
      resources:
        requests:
          cpu: 10
          memory: 30Gi
          nvidia.com/gpu: 2
        limits:
          cpu: 10
          memory: 30Gi
          nvidia.com/gpu: 2
    minReplicas: 1
    maxReplicas: 3

Several out-of-the-box ClusterServingRuntimes are provided with OME so that users can quickly deploy common models without having to define the runtimes themselves.

SGLang Runtimes

Note: SGLang is our flagship runtime, offering the latest serving engine with the best performance. It provides cutting-edge features including multi-node serving, prefill-decode disaggregated serving, and large-scale cross-node Expert Parallelism (EP) for optimal performance at scale.

| Name | Model Framework | Model Format | Model Architecture |
|------|-----------------|--------------|---------------------|
| deepseek-rdma-pd-rt | transformers | safetensors | DeepseekV3ForCausalLM |
| deepseek-rdma-rt | transformers | safetensors | DeepseekV3ForCausalLM |
| e5-mistral-7b-instruct-rt | transformers | safetensors | MistralModel |
| llama-3-1-70b-instruct-pd-rt | transformers | safetensors | LlamaForCausalLM |
| llama-3-1-70b-instruct-rt | transformers | safetensors | LlamaForCausalLM |
| llama-3-2-11b-vision-instruct-rt | transformers | safetensors | MllamaForConditionalGeneration |
| llama-3-2-1b-instruct-pd-rt | transformers | safetensors | LlamaForCausalLM |
| llama-3-2-1b-instruct-rt | transformers | safetensors | LlamaForCausalLM |
| llama-3-2-3b-instruct-pd-rt | transformers | safetensors | LlamaForCausalLM |
| llama-3-2-3b-instruct-rt | transformers | safetensors | LlamaForCausalLM |
| llama-3-2-90b-vision-instruct-rt | transformers | safetensors | MllamaForConditionalGeneration |
| llama-3-3-70b-instruct-pd-rt | transformers | safetensors | LlamaForCausalLM |
| llama-3-3-70b-instruct-rt | transformers | safetensors | LlamaForCausalLM |
| llama-4-maverick-17b-128e-instruct-fp8-pd-rt | transformers | safetensors | Llama4ForConditionalGeneration |
| llama-4-maverick-17b-128e-instruct-fp8-rt | transformers | safetensors | Llama4ForConditionalGeneration |
| llama-4-scout-17b-16e-instruct-pd-rt | transformers | safetensors | Llama4ForConditionalGeneration |
| llama-4-scout-17b-16e-instruct-rt | transformers | safetensors | Llama4ForConditionalGeneration |
| mistral-7b-instruct-pd-rt | transformers | safetensors | MistralForCausalLM |
| mistral-7b-instruct-rt | transformers | safetensors | MistralForCausalLM |
| mixtral-8x7b-instruct-pd-rt | transformers | safetensors | MixtralForCausalLM |
| mixtral-8x7b-instruct-rt | transformers | safetensors | MixtralForCausalLM |

vLLM Runtimes

| Name | Model Framework | Model Format | Model Architecture |
|------|-----------------|--------------|---------------------|
| e5-mistral-7b-instruct-rt | transformers | safetensors | MistralModel |
| llama-3-1-405b-instruct-fp8-rt | transformers | safetensors | LlamaForCausalLM |
| llama-3-1-nemotron-nano-8b-v1-rt | transformers | safetensors | LlamaForCausalLM |
| llama-3-1-nemotron-ultra-253b-v1-rt | transformers | safetensors | DeciLMForCausalLM |
| llama-3-2-11b-vision-instruct-rt | transformers | safetensors | MllamaForConditionalGeneration |
| llama-3-2-1b-instruct-rt | transformers | safetensors | LlamaForCausalLM |
| llama-3-2-3b-instruct-rt | transformers | safetensors | LlamaForCausalLM |
| llama-3-3-70b-instruct-rt | transformers | safetensors | LlamaForCausalLM |
| llama-3-3-nemotron-super-49b-v1-rt | transformers | safetensors | DeciLMForCausalLM |
| llama-4-maverick-17b-128e-instruct-fp8-rt | transformers | safetensors | Llama4ForConditionalGeneration |
| llama-4-scout-17b-16e-instruct-rt | transformers | safetensors | Llama4ForConditionalGeneration |
| mistral-7b-instruct-rt | transformers | safetensors | MistralForCausalLM |
| mixtral-8x7b-instruct-rt | transformers | safetensors | MixtralForCausalLM |

Spec Attributes

Available attributes in the ServingRuntime spec:

Core Configuration

| Attribute | Description |
|-----------|-------------|
| disabled | Disables this runtime |
| supportedModelFormats | List of model formats, architectures, and types supported by the current runtime |
| supportedModelFormats[ ].name | Name of the model format (deprecated; use modelFormat.name instead) |
| supportedModelFormats[ ].modelFormat | ModelFormat specification including name and version |
| supportedModelFormats[ ].modelFormat.name | Name of the model format, e.g., "safetensors", "ONNX", "TensorFlow SavedModel" |
| supportedModelFormats[ ].modelFormat.version | Version of the model format, used in validating that a runtime supports a model. Can be "major", "major.minor", or "major.minor.patch" |
| supportedModelFormats[ ].modelFramework | ModelFramework specification including name and version |
| supportedModelFormats[ ].modelFramework.name | Name of the library, e.g., "transformers", "TensorFlow", "PyTorch", "ONNX", "TensorRTLLM" |
| supportedModelFormats[ ].modelFramework.version | Version of the framework library |
| supportedModelFormats[ ].modelArchitecture | Name of the model architecture, used in validating that a model is supported by a runtime, e.g., "LlamaForCausalLM", "GemmaForCausalLM" |
| supportedModelFormats[ ].quantization | Quantization scheme applied to the model, e.g., "fp8", "fbgemm_fp8", "int4" |
| supportedModelFormats[ ].autoSelect | Set to true to allow the ServingRuntime to be used for automatic model placement when this model format is specified with no explicit runtime. Defaults to false |
| supportedModelFormats[ ].priority | Priority of this serving runtime for auto-selection, used when more than one serving runtime supports the same model format. The value must be greater than zero; the higher the value, the higher the priority. Priority is not considered if autoSelect is false or unset, and auto-selection can be overridden by naming the runtime in the InferenceService |
| protocolVersions | Supported protocol versions (i.e., openAI, cohere, openInference-v1, or openInference-v2) |
| modelSizeRange | Range of model sizes supported by this runtime |
| modelSizeRange.min | Minimum size of model supported by this runtime (e.g., "5B") |
| modelSizeRange.max | Maximum size of model supported by this runtime (e.g., "9B") |

Component Configuration

The ServingRuntime spec supports three main component configurations:

Engine Configuration

| Attribute | Description |
|-----------|-------------|
| engineConfig | Engine configuration for model serving |
| engineConfig.runner | Container specification for the main engine container |
| engineConfig.runner.image | Container image for the engine |
| engineConfig.runner.resources | Kubernetes resource requests and limits |
| engineConfig.runner.env | List of environment variables to pass to the container |
| engineConfig.minReplicas | Minimum number of replicas; defaults to 1 but can be set to 0 to enable scale-to-zero |
| engineConfig.maxReplicas | Maximum number of replicas for autoscaling |
| engineConfig.scaleTarget | Integer target value for the autoscaler metric |
| engineConfig.scaleMetric | Scaling metric type (concurrency, rps, cpu, memory) |
| engineConfig.volumes | List of volumes that can be mounted by containers |
| engineConfig.nodeSelector | Node selector for pod scheduling |
| engineConfig.affinity | Affinity rules for pod scheduling |
| engineConfig.tolerations | Tolerations for pod scheduling |
| engineConfig.leader | Leader configuration for multi-node deployments |
| engineConfig.worker | Worker configuration for multi-node deployments |
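
As a concrete illustration of these fields, the following engineConfig sketch combines a runner with environment variables, concurrency-based autoscaling, and GPU-aware scheduling. The image is reused from the example above; the environment variable, scaleTarget value, node label, and toleration are illustrative assumptions rather than documented defaults.

engineConfig:
  runner:
    image: lmsysorg/sglang:v0.4.6.post6
    env:
      - name: LOG_LEVEL            # hypothetical variable, shown only to illustrate runner.env
        value: "info"
    resources:
      requests:
        nvidia.com/gpu: 2
      limits:
        nvidia.com/gpu: 2
  minReplicas: 1
  maxReplicas: 3
  scaleMetric: concurrency         # one of: concurrency, rps, cpu, memory
  scaleTarget: 8                   # illustrative target value for the autoscaler metric
  nodeSelector:
    node.kubernetes.io/instance-type: gpu-node   # hypothetical label value
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule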

Router Configuration

| Attribute | Description |
|-----------|-------------|
| routerConfig | Router configuration for request routing |
| routerConfig.runner | Container specification for the router container |
| routerConfig.config | Additional configuration parameters for the router |
| routerConfig.minReplicas | Minimum number of router replicas |
| routerConfig.maxReplicas | Maximum number of router replicas |

Decoder Configuration

| Attribute | Description |
|-----------|-------------|
| decoderConfig | Decoder configuration for PD (Prefill-Decode) disaggregated deployments |
| decoderConfig.runner | Container specification for the decoder container |
| decoderConfig.minReplicas | Minimum number of decoder replicas |
| decoderConfig.maxReplicas | Maximum number of decoder replicas |
| decoderConfig.leader | Leader configuration for multi-node decoder deployments |
| decoderConfig.worker | Worker configuration for multi-node decoder deployments |
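
In a PD disaggregated deployment, routerConfig and decoderConfig appear alongside engineConfig in the same spec: the engine typically handles the prefill stage, the decoder handles the decode stage, and the router directs requests between them. A minimal sketch, in which the router image and the config keys are hypothetical placeholders:

routerConfig:
  runner:
    image: ghcr.io/example/pd-router:latest   # hypothetical router image
  config:
    retry-policy: default                     # illustrative key; actual router parameters vary
  minReplicas: 1
  maxReplicas: 2
decoderConfig:
  runner:
    image: lmsysorg/sglang:v0.4.6.post6       # reusing the engine image from the example above
  minReplicas: 1
  maxReplicas: 3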

Multi-Node Configuration

For both engineConfig and decoderConfig, multi-node deployments are supported:

| Attribute | Description |
|-----------|-------------|
| leader | Leader node configuration for coordinating distributed processing |
| leader.runner | Container specification for the leader node |
| worker | Worker nodes configuration for distributed processing |
| worker.size | Number of worker pod instances |
| worker.runner | Container specification for worker nodes |
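
For example, a multi-node engineConfig with one leader coordinating two workers could look like the following sketch (the worker count and the reuse of the same image on leader and workers are illustrative choices):

engineConfig:
  leader:
    runner:
      image: lmsysorg/sglang:v0.4.6.post6   # leader coordinates the distributed group
  worker:
    size: 2                                  # number of worker pod instances
    runner:
      image: lmsysorg/sglang:v0.4.6.post6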

Note: ClusterServingRuntimes support the use of template variables of the form {{.Variable}} inside the container spec. These should map to fields inside an InferenceService's metadata object. The primary use of this is for passing InferenceService-specific information, such as a name, to the runtime environment.
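
For instance, a runner can surface the InferenceService name to the serving process through an environment variable. This sketch assumes the metadata name is exposed as {{.Name}}; the environment variable name itself is a hypothetical choice:

engineConfig:
  runner:
    image: lmsysorg/sglang:v0.4.6.post6
    env:
      - name: SERVED_MODEL_NAME    # hypothetical variable name
        value: "{{.Name}}"         # assumes the InferenceService's metadata name maps to .Name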

Using ClusterServingRuntimes

When defining an InferenceService, users can explicitly specify the name of a ClusterServingRuntime or ServingRuntime. For example:

apiVersion: ome.io/v1beta1
kind: InferenceService
metadata:
  name: mistral-7b-instruct
  namespace: mistral-7b-instruct
spec:
  engine:
    minReplicas: 1
    maxReplicas: 1
  model:
    name: mistral-7b-instruct
  runtime:
    name: srt-mistral-7b-instruct

Here, the runtime specified is srt-mistral-7b-instruct, so the OME controller will first search the namespace for a ServingRuntime with that name. If none exist, the controller will then search the list of ClusterServingRuntimes.

Users can also rely on implicit runtime selection: if the runtime field is omitted, the controller automatically selects a runtime whose supportedModelFormats entry has autoSelect set to true. For example:

apiVersion: ome.io/v1beta1
kind: InferenceService
metadata:
  name: mistral-7b-instruct
  namespace: mistral-7b-instruct
spec:
  engine:
    minReplicas: 1
    maxReplicas: 1
  model:
    name: mistral-7b-instruct

Runtime Selection Logic

The OME controller uses an enhanced runtime selection algorithm to automatically choose the best runtime for a given model. The selection process includes several steps:

Runtime Discovery

The controller searches for compatible runtimes in the following order:

  1. Namespace-scoped ServingRuntimes in the same namespace as the InferenceService
  2. Cluster-scoped ClusterServingRuntimes available across the cluster

Enabled Status

The runtime must not be disabled. Runtimes can be disabled by setting the disabled field to true in the ServingRuntime spec.
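
For example, to take a runtime out of rotation without deleting it, set disabled at the top level of the spec; the rest of the resource can stay unchanged:

apiVersion: ome.io/v1beta1
kind: ClusterServingRuntime
metadata:
  name: srt-mistral-7b-instruct
spec:
  disabled: true   # this runtime is skipped during runtime selection
  # ...remaining spec unchanged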

Model Format Support

The runtime must support the model’s complete format specification, which includes several components:

  • Model Format: The storage format of the model (e.g., “safetensors”, “ONNX”, “TensorFlow SavedModel”)
  • Model Format Version: The version of the model format (e.g., “1”, “2.0”)
  • Model Framework: The underlying framework or library (e.g., “transformers”, “TensorFlow”, “PyTorch”, “ONNX”, “TensorRTLLM”)
  • Model Framework Version: The version of the framework library (e.g., “4.0”, “2.1”)
  • Model Architecture: The specific model implementation (e.g., “LlamaForCausalLM”, “GemmaForCausalLM”, “MistralForCausalLM”)
  • Quantization: The quantization scheme applied to the model (e.g., “fp8”, “fbgemm_fp8”, “int4”)

All these attributes must match between the model and the runtime’s supportedModelFormats for the runtime to be considered compatible.
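
For example, an fp8-quantized Llama model only matches a runtime whose supportedModelFormats entry carries the same quantization value in addition to the matching format, framework, and architecture. A sketch of such an entry (version numbers are illustrative):

supportedModelFormats:
  - modelFormat:
      name: safetensors
      version: "1.0.0"
    modelFramework:
      name: transformers
      version: "4.36.2"
    modelArchitecture: LlamaForCausalLM
    quantization: fp8    # must match the model's quantization scheme
    autoSelect: true
    priority: 1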

Model Size Range

The modelSizeRange field defines the minimum and maximum model sizes that the runtime can support. This field is optional, but when provided, it helps the controller identify a runtime that matches the model size within the specified range. If multiple runtimes meet the size requirement, the controller will choose the runtime with the range closest to the model size.
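
For instance, assuming "closest" means the tighter fit around the model size, a roughly 7B-parameter model would be placed on Runtime A rather than Runtime B in the sketch below, since A's range brackets the model size more narrowly:

# Runtime A: tighter fit for a ~7B model
modelSizeRange:
  min: 5B
  max: 9B

# Runtime B: also matches, but the range is much wider
modelSizeRange:
  min: 1B
  max: 70B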

Protocol Version Support

The runtime must support the requested protocol version. Protocol versions include:

  • openAI: OpenAI-compatible API format
  • cohere: Cohere-compatible API format
  • openInference-v1: Open Inference Protocol version 1
  • openInference-v2: Open Inference Protocol version 2

If no protocol version is specified in the InferenceService, the controller defaults to openAI.
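
For example, a runtime that accepts both OpenAI-compatible and Open Inference v2 traffic would declare both versions in its spec:

protocolVersions:
  - openAI
  - openInference-v2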

Auto-Selection

The runtime must have autoSelect enabled for at least one supported format. This ensures that only runtimes explicitly marked for automatic selection are considered during the selection process.

Priority

If more than one serving runtime supports the same model architecture, format, framework, quantization, and size range at the same version, you can optionally specify a priority for each serving runtime. Based on the priority, a runtime is automatically selected when no runtime is explicitly specified. Note that priority is valid only if autoSelect is true, and a higher value means higher priority.

For example, consider the serving runtimes srt-mistral-7b-instruct and srt-mistral-7b-instruct-2 below. Both support the MistralForCausalLM model architecture, the transformers model framework, and the safetensors model format at the same version, and both support the openAI protocol version with autoSelect enabled. Because srt-mistral-7b-instruct-2 declares the higher priority (2 versus 1), it is the runtime chosen by auto-selection.

apiVersion: ome.io/v1beta1
kind: ClusterServingRuntime
metadata:
  name: srt-mistral-7b-instruct
spec:
  supportedModelFormats:
    - name: safetensors
      modelFormat:
        name: safetensors
        version: "1.0.0"
      modelFramework:
        name: transformers
        version: "4.36.2"
      modelArchitecture: MistralForCausalLM
      autoSelect: true
      priority: 1
  protocolVersions:
    - openAI
  modelSizeRange:
    max: 9B
    min: 5B
  engineConfig:
    runner:
      image: lmsysorg/sglang:v0.4.6.post6
      resources:
        requests:
          cpu: 10
          memory: 30Gi
          nvidia.com/gpu: 2
        limits:
          cpu: 10
          memory: 30Gi
          nvidia.com/gpu: 2
    minReplicas: 1
    maxReplicas: 3
---
apiVersion: ome.io/v1beta1
kind: ClusterServingRuntime
metadata:
  name: srt-mistral-7b-instruct-2
spec:
  supportedModelFormats:
    - name: safetensors
      modelFormat:
        name: safetensors
        version: "1.0.0"
      modelFramework:
        name: transformers
        version: "4.36.2"
      modelArchitecture: MistralForCausalLM
      autoSelect: true
      priority: 2
  protocolVersions:
    - openAI
  modelSizeRange:
    max: 9B
    min: 5B
  engineConfig:
    runner:
      image: lmsysorg/sglang:v0.4.6.post6
      resources:
        requests:
          cpu: 10
          memory: 30Gi
          nvidia.com/gpu: 2
        limits:
          cpu: 10
          memory: 30Gi
          nvidia.com/gpu: 2

Constraints of priority

  • A higher priority value means higher precedence; the value must be greater than 0.
  • Priority is valid only if autoSelect is enabled; otherwise it is not considered.
  • A serving runtime that specifies a priority takes precedence over one that does not.
  • Two supported model formats with the same name and the same version cannot have the same priority (see the sketch after this list).
  • If more than one serving runtime supports the model format and none of them specifies a priority, there is no guarantee which runtime will be selected.
  • If a serving runtime supports multiple versions of a model format, they should have the same priority.
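
As a sketch of the fourth constraint above, the following supportedModelFormats list is invalid because two entries share the same format name, version, and priority:

supportedModelFormats:
  - modelFormat:
      name: safetensors
      version: "1.0.0"
    modelArchitecture: LlamaForCausalLM
    autoSelect: true
    priority: 1
  - modelFormat:
      name: safetensors
      version: "1.0.0"
    modelArchitecture: MistralForCausalLM
    autoSelect: true
    priority: 1   # invalid: duplicates the priority of the entry above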

!!! Warning If multiple runtimes list the same format and/or version as auto-selectable and no priority is specified, the runtime is selected based on creationTimestamp, i.e., the most recently created runtime is selected, which makes the outcome effectively arbitrary over time. Users and cluster administrators should therefore enable autoSelect with care.