Inference Service

InferenceService is the primary resource that manages the deployment and serving of machine learning models in OME.

What is an InferenceService?

An InferenceService is the central Kubernetes resource in OME that orchestrates the complete lifecycle of model serving. It acts as a declarative specification that describes how you want your AI models deployed, scaled, and served across your cluster.

Think of InferenceService as the “deployment blueprint” for your AI workloads. It brings together models (defined by BaseModel/ClusterBaseModel), runtimes (defined by ServingRuntime/ClusterServingRuntime), and infrastructure configuration to create a complete serving solution.

Architecture Overview

OME uses a component-based architecture where InferenceService can be composed of multiple specialized components:

  • Model: References the AI model to serve (BaseModel/ClusterBaseModel)
  • Runtime: References the serving runtime environment (ServingRuntime/ClusterServingRuntime)
  • Engine: Main inference component that processes requests
  • Decoder: Optional component for disaggregated serving (prefill-decode separation)
  • Router: Optional component for request routing and load balancing

New vs Deprecated Architecture

OME's current architecture composes an InferenceService from the engine, decoder, and router components described above; the legacy predictor-based specification is deprecated. A minimal InferenceService using the new architecture looks like this:

apiVersion: ome.io/v1beta1
kind: InferenceService
spec:
  model:
    name: llama-3-70b-instruct
  runtime:
    name: vllm-text-generation
  engine:
    minReplicas: 1
    maxReplicas: 3
    resources:
      requests:
        nvidia.com/gpu: "1"

Component Types

Engine Component

The Engine is the primary inference component that processes model requests. It handles model loading, inference execution, and response generation.

spec:
  engine:
    # Pod-level configuration
    serviceAccountName: custom-sa
    nodeSelector:
      accelerator: nvidia-a100
    
    # Component configuration  
    minReplicas: 1
    maxReplicas: 10
    scaleMetric: cpu
    scaleTarget: 70
    
    # Container configuration
    runner:
      image: custom-vllm:latest
      resources:
        requests:
          nvidia.com/gpu: "2"
        limits:
          nvidia.com/gpu: "2"
      env:
        - name: CUDA_VISIBLE_DEVICES
          value: "0,1"

Decoder Component

The Decoder is used for disaggregated serving architectures where the prefill (prompt processing) and decode (token generation) phases are separated for better resource utilization.

spec:
  decoder:
    minReplicas: 2
    maxReplicas: 8
    runner:
      resources:
        requests:
          cpu: "4"
          memory: "8Gi"

Router Component

The Router handles request routing, cache-aware load balancing, and load balancing for prefill-decode disaggregation.

spec:
  router:
    minReplicas: 1
    maxReplicas: 3
    config:
      routing_strategy: "round_robin"
      health_check_interval: "30s"
    runner:
      resources:
        requests:
          cpu: "1"
          memory: "2Gi"

Deployment Modes

OME automatically selects the optimal deployment mode based on your configuration:

| Mode | Description | Use Cases | Infrastructure |
|------|-------------|-----------|----------------|
| Raw Deployment | Standard Kubernetes Deployment | Stable workloads, predictable traffic, no cold starts | Kubernetes Deployments + Services |
| Serverless | Knative-based auto-scaling | Variable workloads, cost optimization, scale-to-zero | Knative Serving |
| Multi-Node | Distributed inference across multiple nodes | Large models (e.g., DeepSeek), models that cannot fit on a single node | LeaderWorkerSet |
| Prefill-Decode Disaggregation | Disaggregated serving architecture | Maximizing resource utilization, better performance | Raw Deployments or LeaderWorkerSet (if the model cannot fit on a single node) |

Raw Deployment Mode (Default)

Uses standard Kubernetes Deployments with full control over pod lifecycle and scaling.

apiVersion: ome.io/v1beta1
kind: InferenceService
metadata:
  name: llama-chat
spec:
  model:
    name: llama-3-70b-instruct
  engine:
    minReplicas: 2
    maxReplicas: 10

This deployment mode offers direct Kubernetes management with standard HPA-based autoscaling, no cold starts, and is ideal for stable, predictable workloads.
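
Because Raw Deployment mode produces standard Kubernetes Deployments, the deploymentStrategy field described in the specification reference below can be used to control rollouts. A minimal sketch, assuming the standard Kubernetes DeploymentStrategy structure:

apiVersion: ome.io/v1beta1
kind: InferenceService
metadata:
  name: llama-chat
spec:
  model:
    name: llama-3-70b-instruct
  engine:
    minReplicas: 2
    maxReplicas: 10
    deploymentStrategy:
      type: RollingUpdate
      rollingUpdate:
        maxSurge: 1        # roll out one extra pod at a time
        maxUnavailable: 0  # keep full capacity during the rollout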

Serverless Mode

Leverages Knative Serving for automatic scaling including scale-to-zero capabilities.

apiVersion: ome.io/v1beta1
kind: InferenceService  
metadata:
  name: llama-chat
spec:
  model:
    name: llama-3-70b-instruct
  engine:
    minReplicas: 0  # Enables scale-to-zero
    maxReplicas: 10
    scaleTarget: 10  # Concurrent requests per pod

This deployment mode leverages Knative Serving for request-based autoscaling, scale-to-zero when idle, and is ideal for variable workloads and cost-sensitive environments.

⚠️ WARNING: Serverless mode can introduce additional startup latency for large language models due to cold starts and model loading time.

Multi-Node Mode

Enables distributed model serving across multiple nodes using LeaderWorkerSet or Ray clusters.

apiVersion: ome.io/v1beta1
kind: InferenceService
metadata:
  name: deepseek-chat
spec:
  model:
    name: deepseek-r1  # Large model requiring multiple GPUs
  engine:
    minReplicas: 1
    maxReplicas: 2
    # Worker node configuration  
    worker:
      size: 1  # Number of worker nodes

This deployment mode enables distributed inference using LeaderWorkerSet or Ray, supports multi-GPU and multi-node setups, and is optimized for large language models through automatic coordination between nodes.

⚠️ WARNING: Multi-node configurations typically require high-performance networking such as RoCE or InfiniBand, and performance may vary depending on the underlying network topology and hardware provided by different cloud vendors.

Disaggregated Serving (Prefill-Decode)

This mode combines the router, engine, and decoder components so that prompt processing (prefill) and token generation (decode) run on separately scaled pools:

apiVersion: ome.io/v1beta1
kind: InferenceService
metadata:
  name: deepseek-ep-disaggregated
spec:
  model:
    name: deepseek-r1
  
  # Router handles request routing and load balancing for prefill-decode disaggregation
  router:
    minReplicas: 1
    maxReplicas: 3
  
  # Engine handles prefill phase
  engine:
    minReplicas: 1
    maxReplicas: 3
  
  # Decoder handles token generation
  decoder:
    minReplicas: 2
    maxReplicas: 8
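
Once deployed, the router, engine (prefill), and decoder each run as separate pods. You can confirm this using the InferenceService label shown in the troubleshooting section (the exact label set may vary by OME version):

kubectl get pods -l serving.ome.io/inferenceservice=deepseek-ep-disaggregated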

Specification Reference

| Attribute | Type | Description |
|-----------|------|-------------|
| **Core References** | | |
| model | ModelRef | Reference to the BaseModel/ClusterBaseModel to serve |
| runtime | ServingRuntimeRef | Reference to the ServingRuntime/ClusterServingRuntime to use |
| **Components** | | |
| engine | EngineSpec | Main inference component configuration |
| decoder | DecoderSpec | Optional decoder component for disaggregated serving |
| router | RouterSpec | Optional router component for request routing |
| **Autoscaling** | | |
| kedaConfig | KedaConfig | KEDA event-driven autoscaling configuration |

ModelRef Specification

| Attribute | Type | Description |
|-----------|------|-------------|
| name | string | Name of the BaseModel/ClusterBaseModel |
| kind | string | Resource kind (defaults to “ClusterBaseModel”) |
| apiGroup | string | API group (defaults to “ome.io”) |
| fineTunedWeights | []string | Optional fine-tuned weight references |
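
A sketch of a fully specified model reference; the fine-tuned weight name is hypothetical:

spec:
  model:
    name: llama-3-70b-instruct
    kind: ClusterBaseModel   # default
    apiGroup: ome.io         # default
    fineTunedWeights:
      - llama-3-70b-support-lora   # hypothetical fine-tuned weight reference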

ServingRuntimeRef Specification

| Attribute | Type | Description |
|-----------|------|-------------|
| name | string | Name of the ServingRuntime/ClusterServingRuntime |
| kind | string | Resource kind (defaults to “ClusterServingRuntime”) |
| apiGroup | string | API group (defaults to “ome.io”) |
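
For example, to reference a namespace-scoped ServingRuntime instead of the cluster-scoped default (a sketch reusing the runtime name from earlier examples):

spec:
  runtime:
    name: vllm-text-generation
    kind: ServingRuntime   # defaults to ClusterServingRuntime
    apiGroup: ome.io       # default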

Component Configuration

All components (Engine, Decoder, Router) share this common configuration structure:

| Attribute | Type | Description |
|-----------|------|-------------|
| **Pod Configuration** | | |
| serviceAccountName | string | Service account for the component pods |
| nodeSelector | map[string]string | Node labels for pod placement |
| tolerations | []Toleration | Pod tolerations for tainted nodes |
| affinity | Affinity | Pod affinity and anti-affinity rules |
| volumes | []Volume | Additional volumes to mount |
| containers | []Container | Additional sidecar containers |
| **Scaling Configuration** | | |
| minReplicas | int | Minimum number of replicas (default: 1) |
| maxReplicas | int | Maximum number of replicas |
| scaleTarget | int | Target value for the autoscaling metric |
| scaleMetric | string | Metric to use for scaling (cpu, memory, concurrency, rps) |
| containerConcurrency | int64 | Maximum concurrent requests per container |
| timeoutSeconds | int64 | Request timeout in seconds |
| **Traffic Management** | | |
| canaryTrafficPercent | int64 | Percentage of traffic to route to the canary version |
| **Resource Configuration** | | |
| runner | RunnerSpec | Main container configuration |
| leader | LeaderSpec | Leader node configuration (multi-node only) |
| worker | WorkerSpec | Worker node configuration (multi-node only) |
| **Deployment Strategy** | | |
| deploymentStrategy | DeploymentStrategy | Kubernetes deployment strategy (RawDeployment only) |
| **KEDA Configuration** | | |
| kedaConfig | KedaConfig | Component-specific KEDA configuration |
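
The sketch below combines several of these fields on an engine component; all values are illustrative:

spec:
  engine:
    serviceAccountName: inference-sa        # hypothetical service account
    nodeSelector:
      accelerator: nvidia-a100
    tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
    minReplicas: 2
    maxReplicas: 6
    scaleMetric: concurrency
    scaleTarget: 8
    containerConcurrency: 8
    timeoutSeconds: 120
    canaryTrafficPercent: 10
    runner:
      image: custom-vllm:latest
      resources:
        requests:
          nvidia.com/gpu: "1"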

RunnerSpec Configuration

| Attribute | Type | Description |
|-----------|------|-------------|
| name | string | Container name |
| image | string | Container image |
| command | []string | Container command |
| args | []string | Container arguments |
| env | []EnvVar | Environment variables |
| resources | ResourceRequirements | CPU, memory, and GPU resource requirements |
| volumeMounts | []VolumeMount | Volume mount points |
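
A runner sketch that exercises the remaining fields (command, args, and volumeMounts); the launch command, volume name, and mount path are hypothetical and would need to match your runtime and a volume defined on the component:

runner:
  name: engine
  image: custom-vllm:latest
  command: ["python", "-m", "vllm.entrypoints.openai.api_server"]
  args: ["--model", "/mnt/models", "--tensor-parallel-size", "1"]
  env:
    - name: HF_HOME
      value: /tmp/hf
  resources:
    requests:
      nvidia.com/gpu: "1"
  volumeMounts:
    - name: model-cache      # hypothetical volume
      mountPath: /mnt/models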

KEDA Configuration

| Attribute | Type | Description |
|-----------|------|-------------|
| enableKeda | bool | Whether to enable KEDA autoscaling |
| promServerAddress | string | Prometheus server URL for metrics |
| customPromQuery | string | Custom Prometheus query for scaling |
| scalingThreshold | string | Threshold value for scaling decisions |
| scalingOperator | string | Comparison operator (GreaterThanOrEqual, LessThanOrEqual) |
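
A complete kedaConfig sketch using these fields; the Prometheus address and query are placeholders for your own monitoring stack:

kedaConfig:
  enableKeda: true
  promServerAddress: "http://prometheus.monitoring.svc.cluster.local:9090"
  customPromQuery: "sum(vllm:num_requests_running)"   # placeholder query
  scalingThreshold: "10"
  scalingOperator: "GreaterThanOrEqual"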

Status and Monitoring

InferenceService Status

The InferenceService status provides comprehensive information about the deployment state:

status:
  url: "http://llama-chat.default.example.com"
  address:
    url: "http://llama-chat.default.svc.cluster.local"
  conditions:
    - type: Ready
      status: "True"
      lastTransitionTime: "2024-01-15T10:30:00Z"
    - type: IngressReady
      status: "True"
      lastTransitionTime: "2024-01-15T10:25:00Z"
  components:
    engine:
      url: "http://llama-chat-engine.default.example.com"
      latestReadyRevision: "llama-chat-engine-00001"
      latestCreatedRevision: "llama-chat-engine-00001"
      traffic:
        - revisionName: "llama-chat-engine-00001"
          percent: 100
          latestRevision: true
    router:
      url: "http://llama-chat-router.default.example.com"
      latestReadyRevision: "llama-chat-router-00001"
  modelStatus:
    transitionStatus: "UpToDate"
    modelRevisionStates:
      activeModelState: "Loaded"
      targetModelState: "Loaded"
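
To read the external URL straight from the status:

kubectl get inferenceservice llama-chat -o jsonpath='{.status.url}'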

Condition Types

| Condition | Description |
|-----------|-------------|
| Ready | Overall readiness of the InferenceService |
| IngressReady | Network routing is configured and ready |
| EngineReady | Engine component is ready to serve requests |
| DecoderReady | Decoder component is ready (if configured) |
| RouterReady | Router component is ready (if configured) |
| PredictorReady | Deprecated: legacy predictor readiness |
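
Because these are standard Kubernetes conditions, you can block until the service is ready with kubectl wait:

kubectl wait --for=condition=Ready inferenceservice/llama-chat --timeout=10m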

Model Status States

| State | Description |
|-------|-------------|
| Pending | Model is not yet registered |
| Standby | Model is available but not loaded |
| Loading | Model is currently loading |
| Loaded | Model is loaded and ready for inference |
| FailedToLoad | Model failed to load |
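
To check which state the active model is in:

kubectl get inferenceservice llama-chat -o jsonpath='{.status.modelStatus.modelRevisionStates.activeModelState}'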

Deployment Mode Selection

Choose the appropriate deployment mode based on your requirements:

| Requirement | Recommended Mode |
|-------------|------------------|
| Stable, predictable load | Raw Deployment |
| No cold starts | Raw Deployment |
| Variable workload | Serverless |
| Cost optimization | Serverless |
| Scale-to-zero capability | Serverless |
| Large model requiring multiple GPUs | Multi-Node |
| Distributed inference | Multi-Node |
| Maximum performance | Multi-Node |

Best Practices

Resource Management

  1. GPU Allocation: Always specify GPU resources explicitly
runner:
  resources:
    requests:
      nvidia.com/gpu: "1"
    limits:
      nvidia.com/gpu: "1"
  2. Memory Sizing: Request 2-4x the model size in memory
runner:
  resources:
    requests:
      memory: "32Gi"  # For 8B parameter model
  3. CPU Allocation: Provide adequate CPU for preprocessing
runner:
  resources:
    requests:
      cpu: "4"

Scaling Configuration

  1. Set Appropriate Limits:
engine:
  minReplicas: 1     # Prevent scale-to-zero for latency
  maxReplicas: 10    # Control costs
  scaleTarget: 70    # 70% CPU utilization target
  2. Use KEDA for Custom Metrics:
kedaConfig:
  enableKeda: true
  customPromQuery: "avg_over_time(vllm:request_latency_seconds{service='%s'}[5m])"
  scalingThreshold: "0.5"  # 500ms latency threshold

Troubleshooting

  1. Check Component Status:
kubectl get inferenceservice llama-chat -o yaml
kubectl describe inferenceservice llama-chat
  2. Monitor Pod Logs:
kubectl logs -l serving.ome.io/inferenceservice=llama-chat
  3. Check Resource Usage:
kubectl top pods -l serving.ome.io/inferenceservice=llama-chat
