Serving Runtime

OME uses two CRDs for defining model serving environments: ServingRuntime and ClusterServingRuntime. Both manage the runtime environment for model serving.

The only difference between the two is scope: a ServingRuntime is namespace-scoped, while a ClusterServingRuntime is cluster-scoped.

A ClusterServingRuntime defines the templates for Pods that can serve one or more particular models. Each ClusterServingRuntime defines key information such as the container image of the runtime and a list of the models that the runtime supports. Other configuration settings for the runtime can be conveyed through environment variables in the container specification.

These CRDs allow for improved flexibility and extensibility, enabling users to quickly define or customize reusable runtimes without having to modify any controller code or any resources in the controller namespace.

The following is an example of a ClusterServingRuntime:

apiVersion: ome.io/v1beta1
kind: ClusterServingRuntime
metadata:
  name: srt-mistral-7b-instruct
spec:
  supportedModelFormats:
    - name: safetensors
      modelFormat:
        name: safetensors
        version: "1.0.0"
      modelFramework:
        name: transformers
        version: "4.36.2"
      modelArchitecture: MistralForCausalLM
      autoSelect: true
      priority: 1
  protocolVersions:
    - openAI
  modelSizeRange:
    max: 9B
    min: 5B
  engineConfig:
    runner:
      image: lmsysorg/sglang:v0.4.6.post6
      resources:
        requests:
          cpu: 10
          memory: 30Gi
          nvidia.com/gpu: 2
        limits:
          cpu: 10
          memory: 30Gi
          nvidia.com/gpu: 2
    minReplicas: 1
    maxReplicas: 3

Several out-of-the-box ClusterServingRuntimes are provided with OME so that users can quickly deploy common models without having to define the runtimes themselves.

SGLang Runtimes

Note: SGLang is our flagship runtime, offering the latest serving engine with the best performance. It provides cutting-edge features including multi-node serving, prefill-decode disaggregated serving, and large-scale cross-node Expert Parallelism (EP) for optimal performance at scale.

| Name | Model Framework | Model Format | Model Architecture |
|------|-----------------|--------------|---------------------|
| deepseek-rdma-pd-rt | transformers | safetensors | DeepseekV3ForCausalLM |
| deepseek-rdma-rt | transformers | safetensors | DeepseekV3ForCausalLM |
| e5-mistral-7b-instruct-rt | transformers | safetensors | MistralModel |
| llama-3-1-70b-instruct-pd-rt | transformers | safetensors | LlamaForCausalLM |
| llama-3-1-70b-instruct-rt | transformers | safetensors | LlamaForCausalLM |
| llama-3-2-11b-vision-instruct-rt | transformers | safetensors | MllamaForConditionalGeneration |
| llama-3-2-1b-instruct-pd-rt | transformers | safetensors | LlamaForCausalLM |
| llama-3-2-1b-instruct-rt | transformers | safetensors | LlamaForCausalLM |
| llama-3-2-3b-instruct-pd-rt | transformers | safetensors | LlamaForCausalLM |
| llama-3-2-3b-instruct-rt | transformers | safetensors | LlamaForCausalLM |
| llama-3-2-90b-vision-instruct-rt | transformers | safetensors | MllamaForConditionalGeneration |
| llama-3-3-70b-instruct-pd-rt | transformers | safetensors | LlamaForCausalLM |
| llama-3-3-70b-instruct-rt | transformers | safetensors | LlamaForCausalLM |
| llama-4-maverick-17b-128e-instruct-fp8-pd-rt | transformers | safetensors | Llama4ForConditionalGeneration |
| llama-4-maverick-17b-128e-instruct-fp8-rt | transformers | safetensors | Llama4ForConditionalGeneration |
| llama-4-scout-17b-16e-instruct-pd-rt | transformers | safetensors | Llama4ForConditionalGeneration |
| llama-4-scout-17b-16e-instruct-rt | transformers | safetensors | Llama4ForConditionalGeneration |
| mistral-7b-instruct-pd-rt | transformers | safetensors | MistralForCausalLM |
| mistral-7b-instruct-rt | transformers | safetensors | MistralForCausalLM |
| mixtral-8x7b-instruct-pd-rt | transformers | safetensors | MixtralForCausalLM |
| mixtral-8x7b-instruct-rt | transformers | safetensors | MixtralForCausalLM |

vLLM Runtimes

| Name | Model Framework | Model Format | Model Architecture |
|------|-----------------|--------------|---------------------|
| e5-mistral-7b-instruct-rt | transformers | safetensors | MistralModel |
| llama-3-1-405b-instruct-fp8-rt | transformers | safetensors | LlamaForCausalLM |
| llama-3-1-nemotron-nano-8b-v1-rt | transformers | safetensors | LlamaForCausalLM |
| llama-3-1-nemotron-ultra-253b-v1-rt | transformers | safetensors | DeciLMForCausalLM |
| llama-3-2-11b-vision-instruct-rt | transformers | safetensors | MllamaForConditionalGeneration |
| llama-3-2-1b-instruct-rt | transformers | safetensors | LlamaForCausalLM |
| llama-3-2-3b-instruct-rt | transformers | safetensors | LlamaForCausalLM |
| llama-3-3-70b-instruct-rt | transformers | safetensors | LlamaForCausalLM |
| llama-3-3-nemotron-super-49b-v1-rt | transformers | safetensors | DeciLMForCausalLM |
| llama-4-maverick-17b-128e-instruct-fp8-rt | transformers | safetensors | Llama4ForConditionalGeneration |
| llama-4-scout-17b-16e-instruct-rt | transformers | safetensors | Llama4ForConditionalGeneration |
| mistral-7b-instruct-rt | transformers | safetensors | MistralForCausalLM |
| mixtral-8x7b-instruct-rt | transformers | safetensors | MixtralForCausalLM |

Spec Attributes

Available attributes in the ServingRuntime spec:

Core Configuration

| Attribute | Description |
|-----------|-------------|
| disabled | Disables this runtime |
| supportedModelFormats | List of model formats, architectures, and types supported by the current runtime |
| supportedModelFormats[ ].name | Name of the model format (deprecated; use modelFormat.name instead) |
| supportedModelFormats[ ].modelFormat | ModelFormat specification including name and version |
| supportedModelFormats[ ].modelFormat.name | Name of the model format, e.g., "safetensors", "ONNX", "TensorFlow SavedModel" |
| supportedModelFormats[ ].modelFormat.version | Version of the model format, used in validating that a runtime supports a model. Can be "major", "major.minor", or "major.minor.patch" |
| supportedModelFormats[ ].modelFramework | ModelFramework specification including name and version |
| supportedModelFormats[ ].modelFramework.name | Name of the library, e.g., "transformers", "TensorFlow", "PyTorch", "ONNX", "TensorRTLLM" |
| supportedModelFormats[ ].modelFramework.version | Version of the framework library |
| supportedModelFormats[ ].modelArchitecture | Name of the model architecture, used in validating that a model is supported by a runtime, e.g., "LlamaForCausalLM", "GemmaForCausalLM" |
| supportedModelFormats[ ].quantization | Quantization scheme applied to the model, e.g., "fp8", "fbgemm_fp8", "int4" |
| supportedModelFormats[ ].autoSelect | Set to true to allow the ServingRuntime to be used for automatic model placement when this model format is specified with no explicit runtime. Defaults to false |
| supportedModelFormats[ ].priority | Priority of this serving runtime for auto-selection, used when more than one serving runtime supports the same model format. The value must be greater than zero; the higher the value, the higher the priority. Priority is not considered if autoSelect is false or unset, and auto-selection can be overridden by naming the runtime in the InferenceService |
| protocolVersions | Supported protocol versions (i.e., openAI, cohere, openInference-v1, or openInference-v2) |
| modelSizeRange | Range of model sizes supported by this runtime |
| modelSizeRange.min | Minimum size of model supported by this runtime (e.g., "5B") |
| modelSizeRange.max | Maximum size of model supported by this runtime (e.g., "9B") |

Component Configuration

The ServingRuntime spec supports three main component configurations:

Engine Configuration

| Attribute | Description |
|-----------|-------------|
| engineConfig | Engine configuration for model serving |
| engineConfig.runner | Container specification for the main engine container |
| engineConfig.runner.image | Container image for the engine |
| engineConfig.runner.resources | Kubernetes resource requests and limits |
| engineConfig.runner.env | List of environment variables to pass to the container |
| engineConfig.minReplicas | Minimum number of replicas; defaults to 1 but can be set to 0 to enable scale-to-zero |
| engineConfig.maxReplicas | Maximum number of replicas for autoscaling |
| engineConfig.scaleTarget | Integer target value for the autoscaler metric |
| engineConfig.scaleMetric | Scaling metric type (concurrency, rps, cpu, memory) |
| engineConfig.volumes | List of volumes that can be mounted by containers |
| engineConfig.nodeSelector | Node selector for pod scheduling |
| engineConfig.affinity | Affinity rules for pod scheduling |
| engineConfig.tolerations | Tolerations for pod scheduling |
| engineConfig.leader | Leader configuration for multi-node deployments |
| engineConfig.worker | Worker configuration for multi-node deployments |
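
As a concrete illustration of these fields, the following engineConfig sketch combines a runner with environment variables, concurrency-based autoscaling, and GPU-aware scheduling. The image is reused from the example above; the environment variable, scaleTarget value, node label, and toleration are illustrative assumptions rather than documented defaults.

engineConfig:
  runner:
    image: lmsysorg/sglang:v0.4.6.post6
    env:
      - name: LOG_LEVEL            # hypothetical variable, shown only to illustrate runner.env
        value: "info"
    resources:
      requests:
        nvidia.com/gpu: 2
      limits:
        nvidia.com/gpu: 2
  minReplicas: 1
  maxReplicas: 3
  scaleMetric: concurrency         # one of: concurrency, rps, cpu, memory
  scaleTarget: 8                   # illustrative target value for the autoscaler metric
  nodeSelector:
    node.kubernetes.io/instance-type: gpu-node   # hypothetical label value
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule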

Router Configuration

| Attribute | Description |
|-----------|-------------|
| routerConfig | Router configuration for request routing |
| routerConfig.runner | Container specification for the router container |
| routerConfig.config | Additional configuration parameters for the router |
| routerConfig.minReplicas | Minimum number of router replicas |
| routerConfig.maxReplicas | Maximum number of router replicas |

Decoder Configuration

| Attribute | Description |
|-----------|-------------|
| decoderConfig | Decoder configuration for PD (Prefill-Decode) disaggregated deployments |
| decoderConfig.runner | Container specification for the decoder container |
| decoderConfig.minReplicas | Minimum number of decoder replicas |
| decoderConfig.maxReplicas | Maximum number of decoder replicas |
| decoderConfig.leader | Leader configuration for multi-node decoder deployments |
| decoderConfig.worker | Worker configuration for multi-node decoder deployments |
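
In a PD disaggregated deployment, routerConfig and decoderConfig appear alongside engineConfig in the same spec: the engine typically handles the prefill stage, the decoder handles the decode stage, and the router directs requests between them. A minimal sketch, in which the router image and the config keys are hypothetical placeholders:

routerConfig:
  runner:
    image: ghcr.io/example/pd-router:latest   # hypothetical router image
  config:
    retry-policy: default                     # illustrative key; actual router parameters vary
  minReplicas: 1
  maxReplicas: 2
decoderConfig:
  runner:
    image: lmsysorg/sglang:v0.4.6.post6       # reusing the engine image from the example above
  minReplicas: 1
  maxReplicas: 3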

Multi-Node Configuration

For both engineConfig and decoderConfig, multi-node deployments are supported:

| Attribute | Description |
|-----------|-------------|
| leader | Leader node configuration for coordinating distributed processing |
| leader.runner | Container specification for the leader node |
| worker | Worker nodes configuration for distributed processing |
| worker.size | Number of worker pod instances |
| worker.runner | Container specification for worker nodes |
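
For example, a multi-node engineConfig with one leader coordinating two workers could look like the following sketch (the worker count and the reuse of the same image on leader and workers are illustrative choices):

engineConfig:
  leader:
    runner:
      image: lmsysorg/sglang:v0.4.6.post6   # leader coordinates the distributed group
  worker:
    size: 2                                  # number of worker pod instances
    runner:
      image: lmsysorg/sglang:v0.4.6.post6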

Note: ClusterServingRuntimes support the use of template variables of the form {{.Variable}} inside the container spec. These should map to fields inside an InferenceService's metadata object. The primary use of this is for passing InferenceService-specific information, such as a name, to the runtime environment.
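
For instance, a runner can surface the InferenceService name to the serving process through an environment variable. This sketch assumes the metadata name is exposed as {{.Name}}; the environment variable name itself is a hypothetical choice:

engineConfig:
  runner:
    image: lmsysorg/sglang:v0.4.6.post6
    env:
      - name: SERVED_MODEL_NAME    # hypothetical variable name
        value: "{{.Name}}"         # assumes the InferenceService's metadata name maps to .Name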

Using ClusterServingRuntimes

When defining an InferenceService, users can explicitly specify the name of a ClusterServingRuntime or ServingRuntime. For example:

apiVersion: ome.io/v1beta1
kind: InferenceService
metadata:
  name: mistral-7b-instruct
  namespace: mistral-7b-instruct
spec:
  engine:
    minReplicas: 1
    maxReplicas: 1
  model:
    name: mistral-7b-instruct
  runtime:
    name: srt-mistral-7b-instruct

Here, the runtime specified is srt-mistral-7b-instruct, so the OME controller will first search the namespace for a ServingRuntime with that name. If none exist, the controller will then search the list of ClusterServingRuntimes.

Users can also rely on implicit runtime selection: if the runtime field is omitted, the controller automatically selects a runtime whose supportedModelFormats entry has autoSelect set to true. For example:

apiVersion: ome.io/v1beta1
kind: InferenceService
metadata:
  name: mistral-7b-instruct
  namespace: mistral-7b-instruct
spec:
  engine:
    minReplicas: 1
    maxReplicas: 1
  model:
    name: mistral-7b-instruct

Runtime Selection Logic

The OME controller uses an enhanced runtime selection algorithm to automatically choose the best runtime for a given model. The selection process includes several steps:

Runtime Discovery

The controller searches for compatible runtimes in the following order:

  1. Namespace-scoped ServingRuntimes in the same namespace as the InferenceService
  2. Cluster-scoped ClusterServingRuntimes available across the cluster

Enabled Status

The runtime must not be disabled. Runtimes can be disabled by setting the disabled field to true in the ServingRuntime spec.
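
For example, to take a runtime out of rotation without deleting it, set disabled at the top level of the spec; the rest of the resource can stay unchanged:

apiVersion: ome.io/v1beta1
kind: ClusterServingRuntime
metadata:
  name: srt-mistral-7b-instruct
spec:
  disabled: true   # this runtime is skipped during runtime selection
  # ...remaining spec unchanged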

Model Format Support

The runtime must support the model’s complete format specification, which includes several components:

  • Model Format: The storage format of the model (e.g., “safetensors”, “ONNX”, “TensorFlow SavedModel”)
  • Model Format Version: The version of the model format (e.g., “1”, “2.0”)
  • Model Framework: The underlying framework or library (e.g., “transformers”, “TensorFlow”, “PyTorch”, “ONNX”, “TensorRTLLM”)
  • Model Framework Version: The version of the framework library (e.g., “4.0”, “2.1”)
  • Model Architecture: The specific model implementation (e.g., “LlamaForCausalLM”, “GemmaForCausalLM”, “MistralForCausalLM”)
  • Quantization: The quantization scheme applied to the model (e.g., “fp8”, “fbgemm_fp8”, “int4”)

All these attributes must match between the model and the runtime’s supportedModelFormats for the runtime to be considered compatible.
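
For example, an fp8-quantized Llama model only matches a runtime whose supportedModelFormats entry carries the same quantization value in addition to the matching format, framework, and architecture. A sketch of such an entry (version numbers are illustrative):

supportedModelFormats:
  - modelFormat:
      name: safetensors
      version: "1.0.0"
    modelFramework:
      name: transformers
      version: "4.36.2"
    modelArchitecture: LlamaForCausalLM
    quantization: fp8    # must match the model's quantization scheme
    autoSelect: true
    priority: 1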

Model Size Range

The modelSizeRange field defines the minimum and maximum model sizes that the runtime can support. This field is optional, but when provided, it helps the controller identify a runtime that matches the model size within the specified range. If multiple runtimes meet the size requirement, the controller will choose the runtime with the range closest to the model size.
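
For instance, assuming "closest" means the tighter fit around the model size, a roughly 7B-parameter model would be placed on Runtime A rather than Runtime B in the sketch below, since A's range brackets the model size more narrowly:

# Runtime A: tighter fit for a ~7B model
modelSizeRange:
  min: 5B
  max: 9B

# Runtime B: also matches, but the range is much wider
modelSizeRange:
  min: 1B
  max: 70B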

Protocol Version Support

The runtime must support the requested protocol version. Protocol versions include:

  • openAI: OpenAI-compatible API format
  • cohere: Cohere-compatible API format
  • openInference-v1: Open Inference Protocol version 1
  • openInference-v2: Open Inference Protocol version 2

If no protocol version is specified in the InferenceService, the controller defaults to openAI.
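
For example, a runtime that accepts both OpenAI-compatible and Open Inference v2 traffic would declare both versions in its spec:

protocolVersions:
  - openAI
  - openInference-v2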

Auto-Selection

The runtime must have autoSelect enabled for at least one supported format. This ensures that only runtimes explicitly marked for automatic selection are considered during the selection process.

Priority

If more than one serving runtime supports the same model architecture, format, framework, quantization, and size range at the same version, you can optionally specify a priority for each serving runtime. Based on the priority, a runtime is automatically selected when no runtime is explicitly specified. Note that priority is valid only if autoSelect is true, and a higher value means higher priority.

For example, consider the serving runtimes srt-mistral-7b-instruct and srt-mistral-7b-instruct-2 below. Both support the MistralForCausalLM model architecture, the transformers model framework, and the safetensors model format at the same version, and both support the openAI protocol version with autoSelect enabled. Because srt-mistral-7b-instruct-2 declares the higher priority (2 versus 1), it is the runtime chosen by auto-selection.

apiVersion: ome.io/v1beta1
kind: ClusterServingRuntime
metadata:
  name: srt-mistral-7b-instruct
spec:
  supportedModelFormats:
    - name: safetensors
      modelFormat:
        name: safetensors
        version: "1.0.0"
      modelFramework:
        name: transformers
        version: "4.36.2"
      modelArchitecture: MistralForCausalLM
      autoSelect: true
      priority: 1
  protocolVersions:
    - openAI
  modelSizeRange:
    max: 9B
    min: 5B
  engineConfig:
    runner:
      image: lmsysorg/sglang:v0.4.6.post6
      resources:
        requests:
          cpu: 10
          memory: 30Gi
          nvidia.com/gpu: 2
        limits:
          cpu: 10
          memory: 30Gi
          nvidia.com/gpu: 2
    minReplicas: 1
    maxReplicas: 3
---
apiVersion: ome.io/v1beta1
kind: ClusterServingRuntime
metadata:
  name: srt-mistral-7b-instruct-2
spec:
  supportedModelFormats:
    - name: safetensors
      modelFormat:
        name: safetensors
        version: "1.0.0"
      modelFramework:
        name: transformers
        version: "4.36.2"
      modelArchitecture: MistralForCausalLM
      autoSelect: true
      priority: 2
  protocolVersions:
    - openAI
  modelSizeRange:
    max: 9B
    min: 5B
  engineConfig:
    runner:
      image: lmsysorg/sglang:v0.4.6.post6
      resources:
        requests:
          cpu: 10
          memory: 30Gi
          nvidia.com/gpu: 2
        limits:
          cpu: 10
          memory: 30Gi
          nvidia.com/gpu: 2

Constraints of priority

  • A higher priority value means higher precedence; the value must be greater than 0.
  • Priority is valid only if autoSelect is enabled; otherwise it is not considered.
  • A serving runtime that specifies a priority takes precedence over one that does not.
  • Two supported model formats with the same name and the same version cannot have the same priority (see the sketch after this list).
  • If more than one serving runtime supports the model format and none of them specifies a priority, there is no guarantee which runtime will be selected.
  • If a serving runtime supports multiple versions of a model format, they should have the same priority.
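
As a sketch of the fourth constraint above, the following supportedModelFormats list is invalid because two entries share the same format name, version, and priority:

supportedModelFormats:
  - modelFormat:
      name: safetensors
      version: "1.0.0"
    modelArchitecture: LlamaForCausalLM
    autoSelect: true
    priority: 1
  - modelFormat:
      name: safetensors
      version: "1.0.0"
    modelArchitecture: MistralForCausalLM
    autoSelect: true
    priority: 1   # invalid: duplicates the priority of the entry above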

!!! Warning If multiple runtimes list the same format and/or version as auto-selectable and no priority is specified, the runtime is selected based on creationTimestamp, i.e., the most recently created runtime is selected, which makes the outcome effectively arbitrary over time. Users and cluster administrators should therefore enable autoSelect with care.