OME API

Generated API reference documentation for ome.io/v1beta1.

Package v1beta1 contains API Schema definitions for the serving v1beta1 API group

Resource Types

BaseModel

BaseModel is the Schema for the basemodels API

Field | Description
apiVersion
string
ome.io/v1beta1
kind
string
BaseModel
spec [Required]
BaseModelSpec
No description provided.
status [Required]
ModelStatusSpec
No description provided.

BenchmarkJob

BenchmarkJob is the Schema for the BenchmarkJobs API

Field | Description
apiVersion
string
ome.io/v1beta1
kind
string
BenchmarkJob
spec [Required]
BenchmarkJobSpec
No description provided.
status [Required]
BenchmarkJobStatus
No description provided.

ClusterBaseModel

ClusterBaseModel is the Schema for the basemodels API

Field | Description
apiVersion
string
ome.io/v1beta1
kind
string
ClusterBaseModel
spec [Required]
BaseModelSpec
No description provided.
status [Required]
ModelStatusSpec
No description provided.

ClusterServingRuntime

ClusterServingRuntime is the Schema for the servingruntimes API

Field | Description
apiVersion
string
ome.io/v1beta1
kind
string
ClusterServingRuntime
spec [Required]
ServingRuntimeSpec
No description provided.
status [Required]
ServingRuntimeStatus
No description provided.

FineTunedWeight

FineTunedWeight is the Schema for the finetunedweights API

Field | Description
apiVersion
string
ome.io/v1beta1
kind
string
FineTunedWeight
spec [Required]
FineTunedWeightSpec
No description provided.
status [Required]
ModelStatusSpec
No description provided.

InferenceService

InferenceService is the Schema for the InferenceServices API

Field | Description
apiVersion
string
ome.io/v1beta1
kind
string
InferenceService
spec [Required]
InferenceServiceSpec
No description provided.
status [Required]
InferenceServiceStatus
No description provided.

ServingRuntime

ServingRuntime is the Schema for the servingruntimes API

Field | Description
apiVersion
string
ome.io/v1beta1
kind
string
ServingRuntime
spec [Required]
ServingRuntimeSpec
No description provided.
status [Required]
ServingRuntimeStatus
No description provided.

BaseModelSpec

BaseModelSpec defines the desired state of BaseModel

Field | Description
modelFormat
ModelFormat
No description provided.
modelType
string

ModelType defines the architecture family of the model (e.g., "bert", "gpt2", "llama"). This value typically corresponds to the "model_type" field in a Hugging Face model's config.json. It is used to identify the transformer architecture and inform runtime selection and tokenizer behavior.

modelFramework
ModelFrameworkSpec

ModelFramework specifies the underlying framework used by the model, such as "ONNX", "TensorFlow", "PyTorch", "Transformer", or "TensorRTLLM". This value helps determine the appropriate runtime for model serving.

modelArchitecture
string

ModelArchitecture specifies the concrete model implementation or head, such as "LlamaForCausalLM", "GemmaForCausalLM", or "MixtralForCausalLM". This is often derived from the "architectures" field in Hugging Face config.json.

quantization
ModelQuantization

Quantization defines the quantization scheme applied to the model weights, such as "fp8", "fbgemm_fp8", or "int4". This influences runtime compatibility and performance.

modelParameterSize
string

ModelParameterSize indicates the total number of parameters in the model, expressed in human-readable form such as "7B", "13B", or "175B". This can be used for scheduling or runtime selection.

modelCapabilities
[]string

ModelCapabilities of the model, e.g., "TEXT_GENERATION", "TEXT_SUMMARIZATION", "TEXT_EMBEDDINGS"

modelConfiguration
k8s.io/apimachinery/pkg/runtime.RawExtension

Configuration of the model, stored as generic JSON for flexibility.

storage [Required]
StorageSpec

Storage configuration for the model

ModelExtensionSpec [Required]
ModelExtensionSpec
(Members of ModelExtensionSpec are embedded into this type.)

ModelExtension is the common extension of the model

servingMode [Required]
[]string
No description provided.
maxTokens
int32

MaxTokens is the maximum number of tokens that can be processed by the model

additionalMetadata
map[string]string

Additional metadata for the model
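
Combining the fields above, a BaseModel manifest might look like the sketch below. The model name, parameter size, and storage URI are hypothetical, and the storage block assumes a StorageSpec with a storageUri field (StorageSpec itself is defined elsewhere).

```yaml
# Illustrative BaseModel; names and values are placeholders.
apiVersion: ome.io/v1beta1
kind: BaseModel
metadata:
  name: llama-3-8b-instruct
  namespace: default
spec:
  modelType: llama                      # from config.json "model_type"
  modelArchitecture: LlamaForCausalLM   # from config.json "architectures"
  modelParameterSize: "8B"
  modelFormat:
    name: SafeTensors
  modelFramework:
    name: Transformer
  modelCapabilities:
    - TEXT_GENERATION
  maxTokens: 8192
  storage:
    storageUri: oci://models/llama-3-8b-instruct   # assumed StorageSpec field
```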

BenchmarkJobSpec

BenchmarkJobSpec defines the specification for a benchmark job. All fields within this specification collectively represent the desired state and configuration of a BenchmarkJob.

Field | Description
huggingFaceSecretReference
HuggingFaceSecretReference

HuggingFaceSecretReference is a reference to a Kubernetes Secret containing the Hugging Face API key. The referenced Secret must reside in the same namespace as the BenchmarkJob. This field replaces the raw HuggingFaceAPIKey field for improved security.

endpoint [Required]
EndpointSpec

Endpoint is the reference to the inference service to benchmark.

serviceMetadata
ServiceMetadata

ServiceMetadata records metadata about the backend model server or service being benchmarked. This includes details such as server engine, version, and GPU configuration for filtering experiments.

task [Required]
string

Task specifies the task to benchmark, in the pattern <input>-to-<output> (e.g., "text-to-text", "image-to-text").

trafficScenarios
[]string

TrafficScenarios contains a list of traffic scenarios to simulate during the benchmark. If not provided, defaults will be assigned via genai-bench.

numConcurrency
[]int

NumConcurrency defines a list of concurrency levels to test during the benchmark. If not provided, defaults will be assigned via genai-bench.

maxTimePerIteration [Required]
int

MaxTimePerIteration specifies the maximum time (in minutes) for a single iteration. Each iteration runs for a specific combination of TrafficScenarios and NumConcurrency.

maxRequestsPerIteration [Required]
int

MaxRequestsPerIteration specifies the maximum number of requests for a single iteration. Each iteration runs for a specific combination of TrafficScenarios and NumConcurrency.

additionalRequestParams
map[string]string

AdditionalRequestParams contains additional request parameters as a map.

dataset
StorageSpec

Dataset is the dataset used for benchmarking. It is optional and only required for tasks other than "text-to-<output>" tasks.

outputLocation [Required]
StorageSpec

OutputLocation specifies where the benchmark results will be stored (e.g., object storage).

resultFolderName
string

ResultFolderName specifies the name of the folder that stores the benchmark result. A default name will be assigned if not specified.

podOverride
PodOverride

PodOverride defines the pod configuration for the benchmark job. This is optional; if not provided, default values will be used.
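
Putting the fields above together, a BenchmarkJob might look like the sketch below. The service reference, bucket path, secret name, and iteration limits are all hypothetical, and the StorageSpec storageUri field is assumed from the referenced StorageSpec type rather than defined in this section.

```yaml
# Illustrative BenchmarkJob; names and values are placeholders.
apiVersion: ome.io/v1beta1
kind: BenchmarkJob
metadata:
  name: llama-bench
  namespace: default
spec:
  endpoint:
    inferenceService:
      name: llama-3-8b-instruct   # hypothetical InferenceService
      namespace: default
  task: text-to-text
  numConcurrency: [1, 4, 16]
  maxTimePerIteration: 15         # minutes per scenario/concurrency pair
  maxRequestsPerIteration: 1000
  outputLocation:
    storageUri: oci://bench-results/llama   # assumed StorageSpec field
  huggingFaceSecretReference:
    name: hf-api-key              # Secret in the same namespace
```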

BenchmarkJobStatus

BenchmarkJobStatus reflects the state and results of the benchmark job. It will be set and updated by the controller.

Field | Description
state [Required]
string

State represents the current state of the benchmark job: "Pending", "Running", "Completed", "Failed".

startTime
k8s.io/apimachinery/pkg/apis/meta/v1.Time

StartTime is the timestamp for when the benchmark job started.

completionTime
k8s.io/apimachinery/pkg/apis/meta/v1.Time

CompletionTime is the timestamp for when the benchmark job completed, either successfully or unsuccessfully.

lastReconcileTime
k8s.io/apimachinery/pkg/apis/meta/v1.Time

LastReconcileTime is the timestamp for the last time the job was reconciled by the controller.

failureMessage
string

FailureMessage contains any error messages if the benchmark job failed.

details
string

Details provide additional information or metadata about the benchmark job.

ComponentExtensionSpec

ComponentExtensionSpec defines the deployment configuration for a given InferenceService component

Field | Description
minReplicas
int

Minimum number of replicas, defaults to 1 but can be set to 0 to enable scale-to-zero.

maxReplicas
int

Maximum number of replicas for autoscaling.

scaleTarget
int

ScaleTarget specifies the integer target value of the metric type the Autoscaler watches for. concurrency and rps targets are supported by Knative Pod Autoscaler (https://knative.dev/docs/serving/autoscaling/autoscaling-targets/).

scaleMetric
ScaleMetric

ScaleMetric defines the scaling metric type watched by the autoscaler. Possible values are concurrency, rps, cpu, and memory; concurrency and rps are supported via the Knative Pod Autoscaler (https://knative.dev/docs/serving/autoscaling/autoscaling-metrics).

containerConcurrency
int64

ContainerConcurrency specifies how many requests can be processed concurrently; this sets the hard limit of the container concurrency (https://knative.dev/docs/serving/autoscaling/concurrency).

timeoutSeconds
int64

TimeoutSeconds specifies the number of seconds to wait before timing out a request to the component.

canaryTrafficPercent
int64

CanaryTrafficPercent defines the traffic split percentage between the candidate revision and the last ready revision

labels
map[string]string

Labels that will be added to the component pod. More info: http://kubernetes.io/docs/user-guide/labels

annotations
map[string]string

Annotations that will be added to the component pod. More info: http://kubernetes.io/docs/user-guide/annotations

deploymentStrategy
k8s.io/api/apps/v1.DeploymentStrategy

The deployment strategy to use to replace existing pods with new ones. Only applicable for raw deployment mode.

kedaConfig [Required]
KedaConfig
No description provided.
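
Since ComponentExtensionSpec is embedded into component specs (engine, decoder, etc.), its fields appear inline. A sketch of how these fields might look inside a component block, with illustrative values:

```yaml
# ComponentExtensionSpec fields inlined in a component spec (values illustrative):
minReplicas: 1            # 0 would enable scale-to-zero
maxReplicas: 4
scaleMetric: concurrency
scaleTarget: 10           # target concurrent requests per replica
containerConcurrency: 16  # hard concurrency limit per container
timeoutSeconds: 300
canaryTrafficPercent: 10
```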

ComponentStatusSpec

ComponentStatusSpec describes the state of the component

Field | Description
latestReadyRevision
string

Latest revision name that is in ready state

latestCreatedRevision
string

Latest revision name that is created

previousRolledoutRevision
string

Previous revision name that is rolled out with 100 percent traffic

latestRolledoutRevision
string

Latest revision name that is rolled out with 100 percent traffic

traffic
[]knative.dev/serving/pkg/apis/serving/v1.TrafficTarget

Traffic holds the configured traffic distribution for latest ready revision and previous rolled out revision.

url
knative.dev/pkg/apis.URL

URL holds the primary URL that will distribute traffic over the provided traffic targets. This will be one of the REST or gRPC endpoints that are available. It generally has the form http[s]://{route-name}.{route-namespace}.{cluster-level-suffix}

restURL
knative.dev/pkg/apis.URL

REST endpoint of the component if available.

address
knative.dev/pkg/apis/duck/v1.Addressable

Addressable endpoint for the InferenceService

DecoderSpec

DecoderSpec defines the configuration for the Decoder component (token generation in PD-disaggregated deployments). It is used specifically in prefill-decode disaggregated deployments to handle the token generation phase. Similar to EngineSpec in structure, it allows detailed pod and container configuration, but applies only to the decode phase when separating prefill and decode processes.

Field | Description
PodSpec
PodSpec
(Members of PodSpec are embedded into this type.)

This spec provides a full PodSpec for the decoder component. It allows complete customization of the Kubernetes Pod configuration, including containers, volumes, security contexts, affinity rules, and other pod settings.

ComponentExtensionSpec [Required]
ComponentExtensionSpec
(Members of ComponentExtensionSpec are embedded into this type.)

ComponentExtensionSpec defines deployment configuration like min/max replicas, scaling metrics, etc. Controls scaling behavior and resource allocation for the decoder component.

runner
RunnerSpec

Runner container override for customizing the main container. This is essentially a container spec that can override the default container. It defines the main decoder container configuration, including image, resource requests/limits, environment variables, and command.

leader
LeaderSpec

Leader node configuration (only used for MultiNode deployment). Defines the pod and container spec for the leader node that coordinates distributed token generation in multi-node deployments.

worker
WorkerSpec

Worker nodes configuration (only used for MultiNode deployment). Defines the pod and container spec for worker nodes that perform distributed token generation tasks as directed by the leader.

Endpoint

Endpoint defines a direct URL-based inference service with additional API configuration.

Field | Description
url [Required]
string

URL represents the endpoint URL for the inference service.

apiFormat [Required]
string

APIFormat specifies the type of API, such as "openai" or "oci-cohere".

modelName [Required]
string

ModelName specifies the name of the model being served at the endpoint. Useful for endpoints that require model-specific configuration; for instance, for the openai API format, this is a required field in the request payload.

EndpointSpec

EndpointSpec defines a reference to an inference service. It supports either a Kubernetes-style reference (InferenceService) or an Endpoint struct for a direct URL. Cross-namespace references are supported for InferenceService but require appropriate RBAC permissions to access resources in the target namespace.

Field | Description
inferenceService
InferenceServiceReference

InferenceService holds a Kubernetes reference to an internal inference service.

endpoint
Endpoint

Endpoint holds the details of a direct endpoint for an external inference service, including URL and API details.
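
The two reference styles above are alternatives. A sketch of both, with hypothetical names and URLs, as they might appear under a BenchmarkJob's endpoint field:

```yaml
# 1. Kubernetes-style reference to an internal InferenceService:
endpoint:
  inferenceService:
    name: llama-3-8b-instruct   # hypothetical service
    namespace: default

# 2. Direct URL to an external service, with API details:
endpoint:
  endpoint:                     # EndpointSpec.endpoint holds an Endpoint struct
    url: https://api.example.com/v1
    apiFormat: openai
    modelName: example-model    # placeholder
```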

EngineSpec

EngineSpec defines the configuration for the Engine component (usable for both single-node and multi-node deployments). It provides a comprehensive specification for deploying model serving containers and pods, allowing complete Kubernetes pod configuration including main containers, init containers, sidecars, volumes, and other pod-level settings. For distributed deployments, it supports leader-worker architecture configuration.

Field | Description
PodSpec
PodSpec
(Members of PodSpec are embedded into this type.)

This spec provides a full PodSpec for the engine component. It allows complete customization of the Kubernetes Pod configuration, including containers, volumes, security contexts, affinity rules, and other pod settings.

ComponentExtensionSpec [Required]
ComponentExtensionSpec
(Members of ComponentExtensionSpec are embedded into this type.)

ComponentExtensionSpec defines deployment configuration like min/max replicas, scaling metrics, etc. Controls scaling behavior and resource allocation for the engine component.

runner
RunnerSpec

Runner container override for customizing the engine container. This is essentially a container spec that can override the default container. It defines the main model runner container configuration, including image, resource requests/limits, environment variables, and command.

leader
LeaderSpec

Leader node configuration (only used for MultiNode deployment). Defines the pod and container spec for the leader node that coordinates distributed inference in multi-node deployments.

worker
WorkerSpec

Worker nodes configuration (only used for MultiNode deployment). Defines the pod and container spec for worker nodes that perform distributed processing tasks as directed by the leader.

FailureInfo

Field | Description
location
string

Name of component to which the failure relates (usually Pod name)

reason
FailureReason

High level class of failure

message
string

Detailed error message

modelRevisionName
string

Internal Revision/ID of model, tied to specific Spec contents

time
k8s.io/apimachinery/pkg/apis/meta/v1.Time

Time failure occurred or was discovered

exitCode
int32

Exit status from the last termination of the container

FailureReason

(Alias of string)

FailureReason enum

FineTunedWeightSpec

FineTunedWeightSpec defines the desired state of FineTunedWeight

Field | Description
baseModelRef [Required]
ObjectReference

Reference to the base model that this weight is fine-tuned from

modelType [Required]
string

ModelType of the fine-tuned weight, e.g., "Distillation", "Adapter", "Tfew"

hyperParameters [Required]
k8s.io/apimachinery/pkg/runtime.RawExtension

HyperParameters used for fine-tuning, stored as generic JSON for flexibility

ModelExtensionSpec [Required]
ModelExtensionSpec
(Members of ModelExtensionSpec are embedded into this type.)

ModelExtension is the common extension of the model

configuration
k8s.io/apimachinery/pkg/runtime.RawExtension

Configuration of the fine-tuned weight, stored as generic JSON for flexibility

storage [Required]
StorageSpec

Storage configuration for the fine-tuned weight

trainingJobRef
ObjectReference

TrainingJobRef is a reference to the training job that produced this weight
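
A FineTunedWeight manifest combining the fields above might look like the following sketch. The names, the hyperparameter payload (stored as generic JSON), and the StorageSpec storageUri field are all assumptions for illustration.

```yaml
# Illustrative FineTunedWeight; names and values are placeholders.
apiVersion: ome.io/v1beta1
kind: FineTunedWeight
metadata:
  name: llama-3-8b-support-adapter
spec:
  baseModelRef:
    name: llama-3-8b-instruct    # hypothetical BaseModel
    namespace: default
  modelType: Adapter
  hyperParameters:               # generic JSON; keys are hypothetical
    rank: 16
    alpha: 32
  storage:
    storageUri: oci://fine-tuned-weights/support-adapter   # assumed field
```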

HuggingFaceSecretReference

HuggingFaceSecretReference defines a reference to a Kubernetes Secret containing the Hugging Face API key. This secret must reside in the same namespace as the BenchmarkJob. Cross-namespace references are not allowed for security and simplicity.

Field | Description
name [Required]
string

Name of the secret containing the Hugging Face API key. The secret must reside in the same namespace as the BenchmarkJob.

InferenceServiceReference

InferenceServiceReference defines the reference to a Kubernetes inference service.

Field | Description
name [Required]
string

Name specifies the name of the inference service to benchmark.

namespace [Required]
string

Namespace specifies the Kubernetes namespace where the inference service is deployed. Cross-namespace references are allowed but require appropriate RBAC permissions.

InferenceServiceSpec

InferenceServiceSpec is the top level type for this resource

Field | Description
predictor
PredictorSpec

Predictor defines the model serving spec. It specifies how the model should be deployed and served, handling inference requests. Deprecated: Predictor is deprecated and will be removed in a future release. Please use Engine and Model fields instead.

engine
EngineSpec

Engine defines the serving engine spec. This provides detailed container and pod specifications for model serving. It allows defining the model runner (container spec), as well as complete pod specifications including init containers, sidecar containers, and other pod-level configurations. Engine can also be configured for multi-node deployments using leader and worker specifications.

decoder
DecoderSpec

Decoder defines the decoder spec. This is specifically used for PD (Prefill-Decode) disaggregated serving deployments. Similar to Engine in structure, it allows for container and pod specifications, but is only utilized when implementing the disaggregated serving pattern to separate the prefill and decode phases of inference.

model
ModelRef

Model defines the model to be used for inference, referencing either a BaseModel or a custom model. This allows models to be managed independently of the serving configuration.

runtime
ServingRuntimeRef

Runtime defines the serving runtime environment that will be used to execute the model. It is an inference service spec template that determines how the service should be deployed. Runtime is optional - if not defined, the operator will automatically select the best runtime based on the model's size, architecture, format, quantization, and framework.

router
RouterSpec

Router defines the router spec

kedaConfig [Required]
KedaConfig

KedaConfig defines the autoscaling configuration for KEDA. It provides settings for event-driven autoscaling using KEDA (Kubernetes Event-driven Autoscaling), allowing the service to scale based on custom metrics or event sources.
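
Putting the model/engine/runtime fields together, a minimal InferenceService might look like the sketch below. The model name is hypothetical, runtime is omitted so the operator auto-selects one, and the engine block assumes the embedded PodSpec/ComponentExtensionSpec fields appear inline as described above.

```yaml
# Illustrative InferenceService; names and resource values are placeholders.
apiVersion: ome.io/v1beta1
kind: InferenceService
metadata:
  name: llama-3-8b-instruct
  namespace: default
spec:
  model:
    name: llama-3-8b-instruct   # hypothetical ClusterBaseModel
    kind: ClusterBaseModel      # default kind
  # runtime omitted: the operator selects a runtime based on model
  # size, architecture, format, quantization, and framework.
  engine:
    minReplicas: 1
    maxReplicas: 4
    runner:                     # container-spec-style override
      resources:
        limits:
          nvidia.com/gpu: "1"
```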

InferenceServiceStatus

InferenceServiceStatus defines the observed state of InferenceService

Field | Description
Status [Required]
knative.dev/pkg/apis/duck/v1.Status
(Members of Status are embedded into this type.)

Conditions for the InferenceService

  • EngineRouteReady: engine route readiness condition;
  • DecoderRouteReady: decoder route readiness condition;
  • PredictorReady: predictor readiness condition;
  • RoutesReady (serverless mode only): aggregated routing condition, i.e. endpoint readiness condition;
  • LatestDeploymentReady (serverless mode only): aggregated configuration condition, i.e. latest deployment readiness condition;
  • Ready: aggregated condition;
address
knative.dev/pkg/apis/duck/v1.Addressable

Addressable endpoint for the InferenceService

url
knative.dev/pkg/apis.URL

URL holds the url that will distribute traffic over the provided traffic targets. It generally has the form http[s]://{route-name}.{route-namespace}.{cluster-level-suffix}

components [Required]
map[ComponentType]ComponentStatusSpec

Statuses for the components of the InferenceService

modelStatus [Required]
ModelStatus

Model related statuses

KedaConfig

KedaConfig stores the configuration settings for KEDA autoscaling within the InferenceService. It includes fields like the Prometheus server address, custom query, scaling threshold, and operator.

Field | Description
enableKeda [Required]
bool

EnableKeda determines whether KEDA autoscaling is enabled for the InferenceService.

  • true: KEDA will manage the autoscaling based on the provided configuration.
  • false: KEDA will not be used, and autoscaling will rely on other mechanisms (e.g., HPA).
promServerAddress [Required]
string

PromServerAddress specifies the address of the Prometheus server that KEDA will query to retrieve metrics for autoscaling decisions. This should be a fully qualified URL, including the protocol and port number.

Example: http://prometheus-operated.monitoring.svc.cluster.local:9090

customPromQuery [Required]
string

CustomPromQuery defines a custom Prometheus query that KEDA will execute to evaluate the desired metric for scaling. This query should return a single numerical value that represents the metric to be monitored.

Example: avg_over_time(http_requests_total{service="llama"}[5m])

scalingThreshold [Required]
string

ScalingThreshold sets the numerical threshold against which the result of the Prometheus query will be compared. Depending on the ScalingOperator, this threshold determines when to scale the number of replicas up or down.

Example: "10" - The Autoscaler will compare the metric value to 10.

scalingOperator [Required]
string

ScalingOperator specifies the comparison operator used by KEDA to decide whether to scale the Deployment. Common operators include:

  • "GreaterThanOrEqual": Scale up when the metric is >= ScalingThreshold.
  • "LessThanOrEqual": Scale down when the metric is <= ScalingThreshold.

This operator defines the condition under which scaling actions are triggered based on the evaluated metric.

Example: "GreaterThanOrEqual"
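
Combining the examples above, a complete kedaConfig block might look like this (the Prometheus address and query are taken from the field examples and remain placeholders for a real deployment):

```yaml
kedaConfig:
  enableKeda: true
  promServerAddress: http://prometheus-operated.monitoring.svc.cluster.local:9090
  customPromQuery: avg_over_time(http_requests_total{service="llama"}[5m])
  scalingThreshold: "10"
  scalingOperator: GreaterThanOrEqual
```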

LeaderSpec

LeaderSpec defines the configuration for a leader node in a multi-node component. The leader node coordinates the activities of worker nodes in distributed inference or token generation setups, handling task distribution and result aggregation.

Field | Description
PodSpec
PodSpec
(Members of PodSpec are embedded into this type.)

Pod specification for the leader node. This overrides the main PodSpec when specified, allowing customization of the Kubernetes Pod configuration specifically for the leader node.

runner
RunnerSpec

Runner container override for customizing the main container. This is essentially a container spec that can override the default container, providing fine-grained control over the container that executes the leader node's coordination logic.

LifeCycleState

(Alias of string)

LifeCycleState enum

ModelCopies

Field | Description
failedCopies [Required]
int

How many copies of this predictor's models failed to load recently

totalCopies
int

Total number of copies of this predictor's models that are currently loaded

ModelExtensionSpec

Field | Description
displayName
string

DisplayName is the user-friendly name of the model

version
string
No description provided.
disabled
bool

Disabled indicates whether the model is disabled

vendor
string

Vendor of the model, e.g., "NVIDIA", "Meta", "HuggingFace"

compartmentID
string

CompartmentID is the compartment ID of the model

ModelFormat

Field | Description
name [Required]
string

Name of the format in which the model is stored, e.g., "ONNX", "TensorFlow SavedModel", "PyTorch", "SafeTensors"

version
string

Version of the model format. Used in validating that a runtime supports a predictor. It can be "major", "major.minor" or "major.minor.patch".

ModelFrameworkSpec

Field | Description
name [Required]
string

Name of the library in which the model is stored, e.g., "ONNXRuntime", "TensorFlow", "PyTorch", "Transformer", "TensorRTLLM"

version
string

Version of the library. Used in validating that a runtime supports a predictor. It can be "major", "major.minor" or "major.minor.patch".

ModelQuantization

(Alias of string)

ModelRef

Field | Description
name [Required]
string

Name of the model being referenced. Identifies the specific model to be used for inference.

kind [Required]
string

Kind of the model being referenced. Defaults to ClusterBaseModel. Specifies the Kubernetes resource kind of the referenced model.

apiGroup [Required]
string

APIGroup of the resource being referenced. Defaults to ome.io. Specifies the Kubernetes API group of the referenced model.

fineTunedWeights
[]string

Optional references to fine-tuned weights that should be applied to the base model.

ModelRevisionStates

Field | Description
activeModelState [Required]
ModelState

High level state string: Pending, Standby, Loading, Loaded, FailedToLoad

targetModelState [Required]
ModelState
No description provided.

ModelSizeRangeSpec

ModelSizeRangeSpec defines the range of model sizes supported by this runtime

Field | Description
min
string

Minimum size of the model in bytes

max
string

Maximum size of the model in bytes

ModelSpec

Field | Description
runtime
string

Specific ClusterServingRuntime/ServingRuntime name to use for deployment.

PredictorExtensionSpec [Required]
PredictorExtensionSpec
(Members of PredictorExtensionSpec are embedded into this type.)

No description provided.
baseModel [Required]
string
No description provided.
fineTunedWeights [Required]
[]string
No description provided.

ModelState

(Alias of string)

ModelState enum

ModelStatus

Field | Description
transitionStatus [Required]
TransitionStatus

Whether the available predictor endpoints reflect the current Spec or are in transition

modelRevisionStates
ModelRevisionStates

State information of the predictor's model.

lastFailureInfo
FailureInfo

Details of last failure, when load of target model is failed or blocked.

modelCopies
ModelCopies

Model copy information of the predictor's model.

ModelStatusSpec

ModelStatusSpec defines the observed state of Model weight

Field | Description
lifecycle [Required]
string

LifeCycle is an enum of Deprecated, Experiment, Public, Internal

state [Required]
LifeCycleState

Status of the model weight

nodesReady [Required]
[]string
No description provided.
nodesFailed [Required]
[]string
No description provided.

ObjectReference

ObjectReference contains enough information to let you inspect or modify the referred object.

Field | Description
name [Required]
string

Name of the referenced object

namespace [Required]
string

Namespace of the referenced object

PodOverride

Field | Description
image
string

Image specifies the container image to use for the benchmark job.

env
[]k8s.io/api/core/v1.EnvVar

List of environment variables to set in the container.

envFrom
[]k8s.io/api/core/v1.EnvFromSource

List of sources to populate environment variables in the container.

volumeMounts
[]k8s.io/api/core/v1.VolumeMount

Pod volumes to mount into the container's filesystem.

resources
k8s.io/api/core/v1.ResourceRequirements

Compute Resources required by this container. Cannot be updated. More info: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/

tolerations
[]k8s.io/api/core/v1.Toleration

If specified, the pod's tolerations.

nodeSelector
map[string]string

NodeSelector is a selector which must be true for the pod to fit on a node: it must match the node's labels for the pod to be scheduled on that node. More info: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/

affinity
k8s.io/api/core/v1.Affinity

If specified, the pod's scheduling constraints

volumes
[]k8s.io/api/core/v1.Volume

List of volumes that can be mounted by containers belonging to the pod. More info: https://kubernetes.io/docs/concepts/storage/volumes
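
A podOverride block for a BenchmarkJob, combining the fields above, might look like this sketch. The image, environment variable, and node selector values are hypothetical.

```yaml
# Illustrative podOverride; image and values are placeholders.
podOverride:
  image: ghcr.io/example/genai-bench:latest   # hypothetical image
  env:
    - name: LOG_LEVEL
      value: debug
  resources:
    requests:
      cpu: "2"
      memory: 4Gi
  nodeSelector:
    node.kubernetes.io/instance-type: example-type   # placeholder label value
```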

PodSpec

PodSpec is a description of a pod.

Field | Description
volumes
[]k8s.io/api/core/v1.Volume

List of volumes that can be mounted by containers belonging to the pod. More info: https://kubernetes.io/docs/concepts/storage/volumes

initContainers [Required]
[]k8s.io/api/core/v1.Container

List of initialization containers belonging to the pod. Init containers are executed in order prior to containers being started. If any init container fails, the pod is considered to have failed and is handled according to its restartPolicy. The name for an init container or normal container must be unique among all containers. Init containers may not have Lifecycle actions, Readiness probes, Liveness probes, or Startup probes. The resourceRequirements of an init container are taken into account during scheduling by finding the highest request/limit for each resource type, and then using the max of that value or the sum of the normal containers. Limits are applied to init containers in a similar fashion. Init containers cannot currently be added or removed. Cannot be updated. More info: https://kubernetes.io/docs/concepts/workloads/pods/init-containers/

containers [Required]
[]k8s.io/api/core/v1.Container

List of containers belonging to the pod. Containers cannot currently be added or removed. There must be at least one container in a Pod. Cannot be updated.

ephemeralContainers
[]k8s.io/api/core/v1.EphemeralContainer

List of ephemeral containers run in this pod. Ephemeral containers may be run in an existing pod to perform user-initiated actions such as debugging. This list cannot be specified when creating a pod, and it cannot be modified by updating the pod spec. In order to add an ephemeral container to an existing pod, use the pod's ephemeralcontainers subresource.

restartPolicy
k8s.io/api/core/v1.RestartPolicy

Restart policy for all containers within the pod. One of Always, OnFailure, Never. In some contexts, only a subset of those values may be permitted. Default to Always. More info: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#restart-policy

terminationGracePeriodSeconds
int64

Optional duration in seconds the pod needs to terminate gracefully. May be decreased in delete request. Value must be non-negative integer. The value zero indicates stop immediately via the kill signal (no opportunity to shut down). If this value is nil, the default grace period will be used instead. The grace period is the duration in seconds after the processes running in the pod are sent a termination signal and the time when the processes are forcibly halted with a kill signal. Set this value longer than the expected cleanup time for your process. Defaults to 30 seconds.

activeDeadlineSeconds
int64

Optional duration in seconds the pod may be active on the node relative to StartTime before the system will actively try to mark it failed and kill associated containers. Value must be a positive integer.

dnsPolicy
k8s.io/api/core/v1.DNSPolicy

Set DNS policy for the pod. Defaults to "ClusterFirst". Valid values are 'ClusterFirstWithHostNet', 'ClusterFirst', 'Default' or 'None'. DNS parameters given in DNSConfig will be merged with the policy selected with DNSPolicy. To have DNS options set along with hostNetwork, you have to specify DNS policy explicitly to 'ClusterFirstWithHostNet'.

nodeSelector
map[string]string

NodeSelector is a selector which must be true for the pod to fit on a node. Selector which must match a node's labels for the pod to be scheduled on that node. More info: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/

serviceAccountName
string

ServiceAccountName is the name of the ServiceAccount to use to run this pod. More info: https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/

serviceAccount
string

DeprecatedServiceAccount is a deprecated alias for ServiceAccountName. Deprecated: Use serviceAccountName instead.

automountServiceAccountToken
bool

AutomountServiceAccountToken indicates whether a service account token should be automatically mounted.

nodeName
string

NodeName indicates in which node this pod is scheduled. If empty, this pod is a candidate for scheduling by the scheduler defined in schedulerName. Once this field is set, the kubelet for this node becomes responsible for the lifecycle of this pod. This field should not be used to express a desire for the pod to be scheduled on a specific node. https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodename

hostNetwork
bool

Host networking requested for this pod. Use the host's network namespace. If this option is set, the ports that will be used must be specified. Default to false.

hostPID
bool

Use the host's pid namespace. Optional: Default to false.

hostIPC
bool

Use the host's ipc namespace. Optional: Default to false.

shareProcessNamespace
bool

Share a single process namespace between all of the containers in a pod. When this is set containers will be able to view and signal processes from other containers in the same pod, and the first process in each container will not be assigned PID 1. HostPID and ShareProcessNamespace cannot both be set. Optional: Default to false.

securityContext
k8s.io/api/core/v1.PodSecurityContext

SecurityContext holds pod-level security attributes and common container settings. Optional: Defaults to empty. See type description for default values of each field.

imagePullSecrets
[]k8s.io/api/core/v1.LocalObjectReference

ImagePullSecrets is an optional list of references to secrets in the same namespace to use for pulling any of the images used by this PodSpec. If specified, these secrets will be passed to individual puller implementations for them to use. More info: https://kubernetes.io/docs/concepts/containers/images#specifying-imagepullsecrets-on-a-pod

hostname
string

Specifies the hostname of the Pod. If not specified, the pod's hostname will be set to a system-defined value.

subdomain
string

If specified, the fully qualified Pod hostname will be "<hostname>.<subdomain>.<pod namespace>.svc.<cluster domain>". If not specified, the pod will not have a domainname at all.

affinity
k8s.io/api/core/v1.Affinity

If specified, the pod's scheduling constraints

schedulerName
string

If specified, the pod will be dispatched by specified scheduler. If not specified, the pod will be dispatched by default scheduler.

tolerations
[]k8s.io/api/core/v1.Toleration

If specified, the pod's tolerations.

hostAliases
[]k8s.io/api/core/v1.HostAlias

HostAliases is an optional list of hosts and IPs that will be injected into the pod's hosts file if specified.

priorityClassName
string

If specified, indicates the pod's priority. "system-node-critical" and "system-cluster-critical" are two special keywords which indicate the highest priorities with the former being the highest priority. Any other name must be defined by creating a PriorityClass object with that name. If not specified, the pod priority will be default or zero if there is no default.

priority
int32

The priority value. Various system components use this field to find the priority of the pod. When Priority Admission Controller is enabled, it prevents users from setting this field. The admission controller populates this field from PriorityClassName. The higher the value, the higher the priority.

dnsConfig
k8s.io/api/core/v1.PodDNSConfig

Specifies the DNS parameters of a pod. Parameters specified here will be merged to the generated DNS configuration based on DNSPolicy.

readinessGates
[]k8s.io/api/core/v1.PodReadinessGate

If specified, all readiness gates will be evaluated for pod readiness. A pod is ready when all its containers are ready AND all conditions specified in the readiness gates have status equal to "True" More info: https://git.k8s.io/enhancements/keps/sig-network/580-pod-readiness-gates

runtimeClassName
string

RuntimeClassName refers to a RuntimeClass object in the node.k8s.io group, which should be used to run this pod. If no RuntimeClass resource matches the named class, the pod will not be run. If unset or empty, the "legacy" RuntimeClass will be used, which is an implicit class with an empty definition that uses the default runtime handler. More info: https://git.k8s.io/enhancements/keps/sig-node/585-runtime-class

enableServiceLinks
bool

EnableServiceLinks indicates whether information about services should be injected into pod's environment variables, matching the syntax of Docker links. Optional: Defaults to true.

preemptionPolicy
k8s.io/api/core/v1.PreemptionPolicy

PreemptionPolicy is the Policy for preempting pods with lower priority. One of Never, PreemptLowerPriority. Defaults to PreemptLowerPriority if unset.

overhead
k8s.io/api/core/v1.ResourceList

Overhead represents the resource overhead associated with running a pod for a given RuntimeClass. This field will be autopopulated at admission time by the RuntimeClass admission controller. If the RuntimeClass admission controller is enabled, overhead must not be set in Pod create requests. The RuntimeClass admission controller will reject Pod create requests which have the overhead already set. If RuntimeClass is configured and selected in the PodSpec, Overhead will be set to the value defined in the corresponding RuntimeClass, otherwise it will remain unset and treated as zero. More info: https://git.k8s.io/enhancements/keps/sig-node/688-pod-overhead/README.md

topologySpreadConstraints
[]k8s.io/api/core/v1.TopologySpreadConstraint

TopologySpreadConstraints describes how a group of pods ought to spread across topology domains. Scheduler will schedule pods in a way which abides by the constraints. All topologySpreadConstraints are ANDed.

setHostnameAsFQDN
bool

If true the pod's hostname will be configured as the pod's FQDN, rather than the leaf name (the default). In Linux containers, this means setting the FQDN in the hostname field of the kernel (the nodename field of struct utsname). In Windows containers, this means setting the registry value of hostname for the registry key HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters to FQDN. If a pod does not have FQDN, this has no effect. Default to false.

os
k8s.io/api/core/v1.PodOS

Specifies the OS of the containers in the pod. Some pod and container fields are restricted if this is set.

If the OS field is set to linux, the following fields must be unset:

  • securityContext.windowsOptions

If the OS field is set to windows, the following fields must be unset:

  • spec.hostPID
  • spec.hostIPC
  • spec.hostUsers
  • spec.securityContext.appArmorProfile
  • spec.securityContext.seLinuxOptions
  • spec.securityContext.seccompProfile
  • spec.securityContext.fsGroup
  • spec.securityContext.fsGroupChangePolicy
  • spec.securityContext.sysctls
  • spec.shareProcessNamespace
  • spec.securityContext.runAsUser
  • spec.securityContext.runAsGroup
  • spec.securityContext.supplementalGroups
  • spec.securityContext.supplementalGroupsPolicy
  • spec.containers[*].securityContext.appArmorProfile
  • spec.containers[*].securityContext.seLinuxOptions
  • spec.containers[*].securityContext.seccompProfile
  • spec.containers[*].securityContext.capabilities
  • spec.containers[*].securityContext.readOnlyRootFilesystem
  • spec.containers[*].securityContext.privileged
  • spec.containers[*].securityContext.allowPrivilegeEscalation
  • spec.containers[*].securityContext.procMount
  • spec.containers[*].securityContext.runAsUser
  • spec.containers[*].securityContext.runAsGroup

hostUsers
bool

Use the host's user namespace. Optional: Default to true. If set to true or not present, the pod will be run in the host user namespace, useful for when the pod needs a feature only available to the host user namespace, such as loading a kernel module with CAP_SYS_MODULE. When set to false, a new userns is created for the pod. Setting false is useful for mitigating container breakout vulnerabilities while still allowing users to run their containers as root without actually having root privileges on the host. This field is alpha-level and is only honored by servers that enable the UserNamespacesSupport feature.

schedulingGates
[]k8s.io/api/core/v1.PodSchedulingGate

SchedulingGates is an opaque list of values that if specified will block scheduling the pod. If schedulingGates is not empty, the pod will stay in the SchedulingGated state and the scheduler will not attempt to schedule the pod.

SchedulingGates can only be set at pod creation time, and may only be removed afterwards.

resourceClaims
[]k8s.io/api/core/v1.PodResourceClaim

ResourceClaims defines which ResourceClaims must be allocated and reserved before the Pod is allowed to start. The resources will be made available to those containers which consume them by name.

This is an alpha field and requires enabling the DynamicResourceAllocation feature gate.

This field is immutable.
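
The pod-level fields above are embedded directly into component specs. A minimal, hypothetical sketch of common scheduling-related settings; selector keys, toleration values, and names are illustrative, not defaults:

```yaml
# Hypothetical pod-level settings embedded in a component spec.
nodeSelector:
  nvidia.com/gpu.product: A100       # illustrative label
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
terminationGracePeriodSeconds: 120   # longer than expected cleanup time
serviceAccountName: model-serving-sa # illustrative ServiceAccount
imagePullSecrets:
  - name: registry-credentials       # illustrative Secret reference
```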

PredictorExtensionSpec

Appears in:

PredictorExtensionSpec defines configuration shared across all predictor frameworks

FieldDescription
storageUri
string

This field points to the location of the model which is mounted onto the pod.

runtimeVersion
string

Runtime version of the predictor docker image

protocolVersion
github.com/sgl-project/ome/pkg/constants.InferenceServiceProtocol

Protocol version to be used by the predictor (e.g., v1, v2, grpc-v1, or grpc-v2)

Container
k8s.io/api/core/v1.Container
(Members of Container are embedded into this type.)

Container enables overrides for the predictor. Each framework will have different defaults that are populated in the underlying container spec.
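
Because the Container members are embedded, they sit at the same level as the extension fields. A hypothetical sketch; the URI, version, and image are illustrative values, not defaults:

```yaml
# Hypothetical PredictorExtensionSpec usage; values are illustrative.
storageUri: "oci://n/my-namespace/b/models/o/llama-3-8b"
runtimeVersion: "0.5.3"
protocolVersion: v2
# Embedded Container members (e.g. image, env) appear inline:
image: ghcr.io/example/predictor:latest
```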

PredictorSpec

Appears in:

PredictorSpec defines the configuration for a predictor. The following fields follow a "1-of" semantic: users must specify exactly one spec.

FieldDescription
model [Required]
ModelSpec

Model spec for any arbitrary framework.

PodSpec [Required]
PodSpec
(Members of PodSpec are embedded into this type.)

This spec is dual purpose.

  1. Provide a full PodSpec for a custom predictor. The field PodSpec.Containers is mutually exclusive with other predictors (i.e. TFServing).
  2. Provide a predictor (i.e. TFServing) and specify PodSpec overrides; in this case you must not provide PodSpec.Containers.
ComponentExtensionSpec [Required]
ComponentExtensionSpec
(Members of ComponentExtensionSpec are embedded into this type.)

Component extension defines the deployment configurations for a predictor

workerSpec
WorkerSpec

WorkerSpec for the predictor; this is used for multi-node serving without a Ray Cluster
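
As a sketch, a PredictorSpec in its custom-PodSpec form: PodSpec members are embedded, so containers sits directly under the predictor, and the model field is left unset per the "1-of" rule. All names and images below are illustrative, not defaults:

```yaml
# Hypothetical predictor using the custom-PodSpec form ("1-of": containers
# instead of a model spec). Container name and image are illustrative.
predictor:
  containers:
    - name: ome-container
      image: ghcr.io/example/custom-server:latest
  workerSpec:
    size: 2        # multi-node serving without a Ray Cluster
```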

RouterSpec

Appears in:

RouterSpec defines the configuration for the Router component, which handles request routing

FieldDescription
PodSpec [Required]
PodSpec
(Members of PodSpec are embedded into this type.)

PodSpec defines the container configuration for the router

ComponentExtensionSpec [Required]
ComponentExtensionSpec
(Members of ComponentExtensionSpec are embedded into this type.)

ComponentExtensionSpec defines deployment configuration like min/max replicas, scaling metrics, etc.

runner
RunnerSpec

This is essentially a container spec that can override the default container

config
map[string]string

Additional configuration parameters for the runner. This can include framework-specific settings
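
A hypothetical router section tying these fields together: runner overrides the default container, and config carries free-form framework-specific settings. The field names come from this reference; the router key placement, image, and config keys are illustrative assumptions:

```yaml
# Hypothetical RouterSpec usage; image and config keys are illustrative.
router:
  runner:
    image: ghcr.io/example/router:latest   # overrides the default container
  config:
    routing-strategy: least-request        # framework-specific setting (example)
```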

RunnerSpec

Appears in:

RunnerSpec defines container configuration plus additional config settings. The Runner is the primary container that executes the model serving or token generation logic.

FieldDescription
Container
k8s.io/api/core/v1.Container
(Members of Container are embedded into this type.)

Container spec for the runner. Provides complete Kubernetes container configuration for the primary execution container.

ScaleMetric

(Alias of string)

Appears in:

ScaleMetric enum

ServiceMetadata

Appears in:

ServiceMetadata contains metadata fields for recording the backend model server's configuration and version details. This information helps track experiment context, enabling users to filter and query experiments based on server properties.

FieldDescription
engine [Required]
string

Engine specifies the backend model server engine. Supported values: "vLLM", "SGLang", "TGI".

version [Required]
string

Version specifies the version of the model server (e.g., "0.5.3").

gpuType [Required]
string

GpuType specifies the type of GPU used by the model server. Supported values: "H100", "A100", "MI300", "A10".

gpuCount [Required]
int

GpuCount indicates the number of GPU cards available on the model server.
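
A minimal sketch of a ServiceMetadata block, assuming it is set under a serviceMetadata key in the enclosing spec (the key placement is an assumption; the field names and allowed values come from this reference):

```yaml
# Hypothetical ServiceMetadata recording the backend server's configuration.
serviceMetadata:
  engine: SGLang     # one of "vLLM", "SGLang", "TGI"
  version: "0.5.3"
  gpuType: H100      # one of "H100", "A100", "MI300", "A10"
  gpuCount: 8
```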

ServingRuntimePodSpec

Appears in:

FieldDescription
containers
[]k8s.io/api/core/v1.Container

List of containers belonging to the pod. Containers cannot currently be added or removed. Cannot be updated.

volumes
[]k8s.io/api/core/v1.Volume

List of volumes that can be mounted by containers belonging to the pod. More info: https://kubernetes.io/docs/concepts/storage/volumes

nodeSelector
map[string]string

NodeSelector is a selector which must be true for the pod to fit on a node. Selector which must match a node's labels for the pod to be scheduled on that node. More info: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/

affinity
k8s.io/api/core/v1.Affinity

If specified, the pod's scheduling constraints

tolerations
[]k8s.io/api/core/v1.Toleration

If specified, the pod's tolerations.

labels
map[string]string

Labels that will be added to the pod. More info: http://kubernetes.io/docs/user-guide/labels

annotations
map[string]string

Annotations that will be added to the pod. More info: http://kubernetes.io/docs/user-guide/annotations

imagePullSecrets
[]k8s.io/api/core/v1.LocalObjectReference

ImagePullSecrets is an optional list of references to secrets in the same namespace to use for pulling any of the images used by this PodSpec. If specified, these secrets will be passed to individual puller implementations for them to use. More info: https://kubernetes.io/docs/concepts/containers/images#specifying-imagepullsecrets-on-a-pod

schedulerName
string

If specified, the pod will be dispatched by specified scheduler. If not specified, the pod will be dispatched by default scheduler.

hostIPC
bool

Use the host's ipc namespace. Optional: Default to false.

dnsPolicy
k8s.io/api/core/v1.DNSPolicy

Set DNS policy for the pod. Defaults to "ClusterFirst". Valid values are 'ClusterFirstWithHostNet', 'ClusterFirst', 'Default' or 'None'. DNS parameters given in DNSConfig will be merged with the policy selected with DNSPolicy. To have DNS options set along with hostNetwork, you have to specify DNS policy explicitly to 'ClusterFirstWithHostNet'.

hostNetwork
bool

Host networking requested for this pod. Use the host's network namespace. If this option is set, the ports that will be used must be specified. Default to false.

ServingRuntimeRef

Appears in:

FieldDescription
name [Required]
string

Name of the runtime being referenced. Identifies the specific runtime environment to be used for model execution.

kind [Required]
string

Kind of the runtime being referenced. Defaults to ClusterServingRuntime. Specifies the Kubernetes resource kind of the referenced runtime: ClusterServingRuntime is a cluster-wide runtime, while ServingRuntime is namespace-scoped.

apiGroup [Required]
string

APIGroup of the resource being referenced. Defaults to ome.io. Specifies the Kubernetes API group of the referenced runtime.
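
A minimal sketch of a runtime reference, assuming it appears under a runtime key in the referencing spec (the key placement and runtime name are assumptions; kind and apiGroup defaults are as documented above):

```yaml
# Hypothetical ServingRuntimeRef; name is illustrative. kind defaults to
# ClusterServingRuntime and apiGroup defaults to ome.io when omitted.
runtime:
  name: srt-llama-runtime
  kind: ServingRuntime      # namespace-scoped alternative
  apiGroup: ome.io
```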

ServingRuntimeSpec

Appears in:

ServingRuntimeSpec defines the desired state of ServingRuntime. This spec is currently provisional and is subject to change as details regarding single-model serving and multi-model serving are hammered out.

FieldDescription
supportedModelFormats [Required]
[]SupportedModelFormat

Model formats and versions supported by this runtime

modelSizeRange
ModelSizeRangeSpec

ModelSizeRange is the range of model sizes supported by this runtime

disabled
bool

Set to true to disable use of this runtime

routerConfig
RouterSpec

Router configuration for this runtime

engineConfig
EngineSpec

Engine configuration for this runtime

decoderConfig
DecoderSpec

Decoder configuration for this runtime

protocolVersions
[]github.com/sgl-project/ome/pkg/constants.InferenceServiceProtocol

Supported protocol versions (e.g., openAI, cohere, openInference-v1, or openInference-v2)

ServingRuntimePodSpec [Required]
ServingRuntimePodSpec
(Members of ServingRuntimePodSpec are embedded into this type.)

PodSpec for the serving runtime

workers
WorkerPodSpec

WorkerPodSpec for the serving runtime; this is used for multi-node serving without a Ray Cluster
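
Putting the ServingRuntimeSpec fields together, a hypothetical ClusterServingRuntime manifest. Field names follow this reference; the metadata name, model format contents, image, and sizes are illustrative assumptions:

```yaml
# Hypothetical ClusterServingRuntime; all values are illustrative.
apiVersion: ome.io/v1beta1
kind: ClusterServingRuntime
metadata:
  name: example-sglang-runtime
spec:
  supportedModelFormats:
    - modelFormat:
        name: safetensors      # ModelFormat contents illustrative
      autoSelect: true
      priority: 1              # must be greater than zero
  protocolVersions:
    - openAI
  disabled: false
  containers:                  # embedded ServingRuntimePodSpec member
    - name: ome-container
      image: ghcr.io/example/sglang:latest
  workers:
    size: 2                    # multi-node serving without a Ray Cluster
```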

ServingRuntimeStatus

Appears in:

ServingRuntimeStatus defines the observed state of ServingRuntime

StorageSpec

Appears in:

FieldDescription
path
string

Path is the absolute path where the model will be downloaded and stored on the node.

schemaPath
string

SchemaPath is the path to the model schema or configuration file within the storage system. This can be used to validate the model or customize how it's loaded.

parameters
map[string]string

Parameters contain key-value pairs to override default storage credentials or configuration. These values are typically used to configure access to object storage or mount options.

key
string

StorageKey is the name of the key in a Kubernetes Secret used to authenticate access to the model storage. This key will be used to fetch credentials during model download or access.

storageUri [Required]
string

StorageUri specifies the source URI of the model in a supported storage backend. Supported formats:

  • OCI Object Storage: oci://n/{namespace}/b/{bucket}/o/{object_path}
  • Persistent Volume: pvc://{pvc-name}/{sub-path}
  • Vendor-specific: vendor://{vendor-name}/{resource-type}/{resource-path}

This field is required.

nodeSelector
map[string]string

NodeSelector defines a set of key-value label pairs that must be present on a node for the model to be scheduled and downloaded onto that node.

nodeAffinity
k8s.io/api/core/v1.NodeAffinity

NodeAffinity describes the node affinity rules that further constrain which nodes are eligible to download and store this model, based on advanced scheduling policies.
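
A sketch of a StorageSpec using the documented URI format and node targeting, assuming it sits under a storage key in the enclosing model spec (the key placement, secret key name, parameter keys, and node labels are illustrative assumptions):

```yaml
# Hypothetical StorageSpec; all values are illustrative.
storage:
  storageUri: "oci://n/my-namespace/b/model-bucket/o/llama-3-8b"
  path: /mnt/models/llama-3-8b          # absolute path on the node
  key: my-storage-secret-key            # key in a Kubernetes Secret
  parameters:
    region: us-ashburn-1                # credential/config override (example)
  nodeSelector:
    models.example.com/llama-3: "true"  # labels required on eligible nodes
```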

SupportedModelFormat

Appears in:

FieldDescription
name
string

Name of the model. TODO: this field is being used as the model format name, which is not correct; we should deprecate this and use Name from ModelFormat.

modelFormat [Required]
ModelFormat

ModelFormat of the model, e.g., "PyTorch", "TensorFlow", "ONNX", "SafeTensors"

modelType
string

DEPRECATED: This field is deprecated and will be removed in future releases.

version
string

Version of the model format. Used in validating that a runtime supports a predictor. It can be "major", "major.minor" or "major.minor.patch".

modelFramework [Required]
ModelFrameworkSpec

ModelFramework of the model, e.g., "PyTorch", "TensorFlow", "ONNX", "Transformers"

modelArchitecture
string

ModelArchitecture of the model, e.g., "LlamaForCausalLM", "GemmaForCausalLM", "MixtralForCausalLM"

quantization
ModelQuantization

Quantization of the model, e.g., "fp8", "fbgemm_fp8", "int4"

autoSelect
bool

Set to true to allow the ServingRuntime to be used for automatic model placement if this model format is specified with no explicit runtime.

priority
int32

Priority of this serving runtime for auto selection. This is used to select the serving runtime if more than one serving runtime supports the same model format. The value should be greater than zero. The higher the value, the higher the priority. Priority is not considered if AutoSelect is either false or not specified. Priority can be overridden by specifying the runtime in the InferenceService.
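
A hypothetical supportedModelFormats entry wired for auto selection. Field names and value shapes follow this reference; the format name, architecture, and quantization values are illustrative:

```yaml
# Hypothetical SupportedModelFormat list entry; values are illustrative.
- modelFormat:
    name: safetensors               # ModelFormat contents illustrative
  version: "1"                      # "major", "major.minor", or "major.minor.patch"
  modelArchitecture: LlamaForCausalLM
  quantization: fp8
  autoSelect: true                  # eligible for automatic model placement
  priority: 2                       # must be greater than zero
```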

TransitionStatus

(Alias of string)

Appears in:

TransitionStatus enum

WorkerPodSpec

Appears in:

FieldDescription
size
int

Size of the worker; this is the number of pods in the worker.

ServingRuntimePodSpec
ServingRuntimePodSpec
(Members of ServingRuntimePodSpec are embedded into this type.)

PodSpec for the worker

WorkerSpec

Appears in:

WorkerSpec defines the configuration for worker nodes in a multi-node component. Worker nodes perform the distributed processing tasks assigned by the leader node, enabling horizontal scaling for compute-intensive workloads.

FieldDescription
PodSpec
PodSpec
(Members of PodSpec are embedded into this type.)

PodSpec for the worker. Allows customization of the Kubernetes Pod configuration specifically for worker nodes.

size
int

Size of the worker; this is the number of pods in the worker. Controls how many worker pod instances will be deployed for horizontal scaling.

runner
RunnerSpec

Runner container override for customizing the main container. This is essentially a container spec that can override the default container. Provides fine-grained control over the container that executes the worker node's processing logic.
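
A minimal sketch of a WorkerSpec, assuming it appears under a worker key in the enclosing component spec (the key placement and image are illustrative assumptions; size and runner are documented above):

```yaml
# Hypothetical WorkerSpec; image is illustrative.
worker:
  size: 4                 # number of worker pods to deploy
  runner:
    image: ghcr.io/example/sglang-worker:latest   # overrides default container
  nodeSelector:           # embedded PodSpec member (example)
    nvidia.com/gpu.product: H100
```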