Package v1beta1 contains API Schema definitions for the serving v1beta1 API group
BaseModel
Appears in:
BaseModel is the Schema for the basemodels API
Field | Description |
---|---|
apiVersion string | ome.io/v1beta1 |
kind string | BaseModel |
spec [Required]BaseModelSpec | No description provided. |
status [Required]ModelStatusSpec | No description provided. |
BenchmarkJob
Appears in:
BenchmarkJob is the schema for the BenchmarkJobs API
Field | Description |
---|---|
apiVersion string | ome.io/v1beta1 |
kind string | BenchmarkJob |
spec [Required]BenchmarkJobSpec | No description provided. |
status [Required]BenchmarkJobStatus | No description provided. |
ClusterBaseModel
Appears in:
ClusterBaseModel is the Schema for the basemodels API
Field | Description |
---|---|
apiVersion string | ome.io/v1beta1 |
kind string | ClusterBaseModel |
spec [Required]BaseModelSpec | No description provided. |
status [Required]ModelStatusSpec | No description provided. |
ClusterServingRuntime
Appears in:
ClusterServingRuntime is the Schema for the servingruntimes API
Field | Description |
---|---|
apiVersion string | ome.io/v1beta1 |
kind string | ClusterServingRuntime |
spec [Required]ServingRuntimeSpec | No description provided. |
status [Required]ServingRuntimeStatus | No description provided. |
FineTunedWeight
Appears in:
FineTunedWeight is the Schema for the finetunedweights API
Field | Description |
---|---|
apiVersion string | ome.io/v1beta1 |
kind string | FineTunedWeight |
spec [Required]FineTunedWeightSpec | No description provided. |
status [Required]ModelStatusSpec | No description provided. |
InferenceService
Appears in:
InferenceService is the Schema for the InferenceServices API
Field | Description |
---|---|
apiVersion string | ome.io/v1beta1 |
kind string | InferenceService |
spec [Required]InferenceServiceSpec | No description provided. |
status [Required]InferenceServiceStatus | No description provided. |
ServingRuntime
Appears in:
ServingRuntime is the Schema for the servingruntimes API
Field | Description |
---|---|
apiVersion string | ome.io/v1beta1 |
kind string | ServingRuntime |
spec [Required]ServingRuntimeSpec | No description provided. |
status [Required]ServingRuntimeStatus | No description provided. |
BaseModelSpec
Appears in:
BaseModelSpec defines the desired state of BaseModel
Field | Description |
---|---|
modelFormat ModelFormat | No description provided. |
modelType string | ModelType defines the architecture family of the model (e.g., "bert", "gpt2", "llama"). This value typically corresponds to the "model_type" field in a Hugging Face model's config.json. It is used to identify the transformer architecture and inform runtime selection and tokenizer behavior. |
modelFramework ModelFrameworkSpec | ModelFramework specifies the underlying framework used by the model, such as "ONNX", "TensorFlow", "PyTorch", "Transformer", or "TensorRTLLM". This value helps determine the appropriate runtime for model serving. |
modelArchitecture string | ModelArchitecture specifies the concrete model implementation or head, such as "LlamaForCausalLM", "GemmaForCausalLM", or "MixtralForCausalLM". This is often derived from the "architectures" field in Hugging Face config.json. |
quantization ModelQuantization | Quantization defines the quantization scheme applied to the model weights, such as "fp8", "fbgemm_fp8", or "int4". This influences runtime compatibility and performance. |
modelParameterSize string | ModelParameterSize indicates the total number of parameters in the model, expressed in human-readable form such as "7B", "13B", or "175B". This can be used for scheduling or runtime selection. |
modelCapabilities []string | ModelCapabilities of the model, e.g., "TEXT_GENERATION", "TEXT_SUMMARIZATION", "TEXT_EMBEDDINGS" |
modelConfiguration k8s.io/apimachinery/pkg/runtime.RawExtension | Configuration of the model, stored as generic JSON for flexibility. |
storage [Required]StorageSpec | Storage configuration for the model |
ModelExtensionSpec [Required]ModelExtensionSpec | (Members of ModelExtensionSpec are embedded into this type.) ModelExtension is the common extension of the model |
servingMode [Required][]string | No description provided. |
maxTokens int32 | MaxTokens is the maximum number of tokens that can be processed by the model |
additionalMetadata map[string]string | Additional metadata for the model |
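Putting the BaseModel and BaseModelSpec tables together, a minimal manifest might look like the following sketch, written as a plain Python dict. The model name, vendor, and the `storageUri` key are illustrative assumptions; StorageSpec's own fields are not documented in this section.

```python
# Hypothetical BaseModel manifest mirroring the tables above; concrete
# names and values are illustrative, not taken from the API definitions.
base_model = {
    "apiVersion": "ome.io/v1beta1",
    "kind": "BaseModel",
    "metadata": {"name": "llama-3-8b-instruct"},          # assumed name
    "spec": {
        "modelType": "llama",                     # HF config.json "model_type"
        "modelArchitecture": "LlamaForCausalLM",  # HF config.json "architectures"
        "modelFramework": {"name": "Transformer"},
        "modelFormat": {"name": "SafeTensors"},
        "modelParameterSize": "8B",
        "modelCapabilities": ["TEXT_GENERATION"],
        "vendor": "Meta",                         # embedded ModelExtensionSpec field
        "storage": {"storageUri": "oci://bucket/models/llama-3-8b"},  # assumed field
    },
}

# spec is the user-supplied required field; status is controller-managed.
assert base_model["apiVersion"] == "ome.io/v1beta1"
assert "storage" in base_model["spec"]
```

The same spec shape applies to ClusterBaseModel, which is cluster-scoped but shares BaseModelSpec.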
BenchmarkJobSpec
Appears in:
BenchmarkJobSpec defines the specification for a benchmark job. All fields within this specification collectively represent the desired state and configuration of a BenchmarkJob.
Field | Description |
---|---|
huggingFaceSecretReference HuggingFaceSecretReference | HuggingFaceSecretReference is a reference to a Kubernetes Secret containing the Hugging Face API key. The referenced Secret must reside in the same namespace as the BenchmarkJob. This field replaces the raw HuggingFaceAPIKey field for improved security. |
endpoint [Required]EndpointSpec | Endpoint is the reference to the inference service to benchmark. |
serviceMetadata ServiceMetadata | ServiceMetadata records metadata about the backend model server or service being benchmarked. This includes details such as server engine, version, and GPU configuration for filtering experiments. |
task [Required]string | Task specifies the task to benchmark, in the pattern "&lt;input&gt;-to-&lt;output&gt;" (e.g., "text-to-text", "image-to-text"). |
trafficScenarios []string | TrafficScenarios contains a list of traffic scenarios to simulate during the benchmark. If not provided, defaults will be assigned via genai-bench. |
numConcurrency []int | NumConcurrency defines a list of concurrency levels to test during the benchmark. If not provided, defaults will be assigned via genai-bench. |
maxTimePerIteration [Required]int | MaxTimePerIteration specifies the maximum time (in minutes) for a single iteration. Each iteration runs for a specific combination of TrafficScenarios and NumConcurrency. |
maxRequestsPerIteration [Required]int | MaxRequestsPerIteration specifies the maximum number of requests for a single iteration. Each iteration runs for a specific combination of TrafficScenarios and NumConcurrency. |
additionalRequestParams map[string]string | AdditionalRequestParams contains additional request parameters as a map. |
dataset StorageSpec | Dataset is the dataset used for benchmarking. It is optional and only required for tasks other than "text-to-&lt;output&gt;". |
outputLocation [Required]StorageSpec | OutputLocation specifies where the benchmark results will be stored (e.g., object storage). |
resultFolderName string | ResultFolderName specifies the name of the folder that stores the benchmark result. A default name will be assigned if not specified. |
podOverride PodOverride | Pod defines the pod configuration for the benchmark job. This is optional, if not provided, default values will be used. |
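A minimal BenchmarkJob spec sketched from the table above; the service reference, iteration limits, and the `storageUri` key inside StorageSpec are illustrative assumptions.

```python
# Hypothetical BenchmarkJob referencing an in-cluster InferenceService.
benchmark = {
    "apiVersion": "ome.io/v1beta1",
    "kind": "BenchmarkJob",
    "metadata": {"name": "llama-bench", "namespace": "bench"},
    "spec": {
        # EndpointSpec: a Kubernetes-style reference (cross-namespace needs RBAC).
        "endpoint": {
            "inferenceService": {"name": "llama-3-8b", "namespace": "serving"},
        },
        "task": "text-to-text",
        "numConcurrency": [1, 4, 16],
        "maxTimePerIteration": 15,         # minutes per iteration
        "maxRequestsPerIteration": 1000,
        "outputLocation": {"storageUri": "oci://bucket/bench-results"},  # assumed field
        # trafficScenarios omitted: genai-bench assigns defaults.
    },
}

# One iteration runs per (traffic scenario, concurrency) combination, so the
# concurrency list alone gives a lower bound on the number of iterations.
assert len(benchmark["spec"]["numConcurrency"]) == 3
```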
BenchmarkJobStatus
Appears in:
BenchmarkJobStatus reflects the state and results of the benchmark job. It will be set and updated by the controller.
Field | Description |
---|---|
state [Required]string | State represents the current state of the benchmark job: "Pending", "Running", "Completed", "Failed". |
startTime k8s.io/apimachinery/pkg/apis/meta/v1.Time | StartTime is the timestamp for when the benchmark job started. |
completionTime k8s.io/apimachinery/pkg/apis/meta/v1.Time | CompletionTime is the timestamp for when the benchmark job completed, either successfully or unsuccessfully. |
lastReconcileTime k8s.io/apimachinery/pkg/apis/meta/v1.Time | LastReconcileTime is the timestamp for the last time the job was reconciled by the controller. |
failureMessage string | FailureMessage contains any error messages if the benchmark job failed. |
details string | Details provide additional information or metadata about the benchmark job. |
ComponentExtensionSpec
Appears in:
ComponentExtensionSpec defines the deployment configuration for a given InferenceService component
Field | Description |
---|---|
minReplicas int | Minimum number of replicas, defaults to 1 but can be set to 0 to enable scale-to-zero. |
maxReplicas int | Maximum number of replicas for autoscaling. |
scaleTarget int | ScaleTarget specifies the integer target value of the metric type the Autoscaler watches for. concurrency and rps targets are supported by Knative Pod Autoscaler (https://knative.dev/docs/serving/autoscaling/autoscaling-targets/). |
scaleMetric ScaleMetric | ScaleMetric defines the scaling metric type watched by autoscaler possible values are concurrency, rps, cpu, memory. concurrency, rps are supported via Knative Pod Autoscaler(https://knative.dev/docs/serving/autoscaling/autoscaling-metrics). |
containerConcurrency int64 | ContainerConcurrency specifies how many requests can be processed concurrently, this sets the hard limit of the container concurrency(https://knative.dev/docs/serving/autoscaling/concurrency). |
timeoutSeconds int64 | TimeoutSeconds specifies the number of seconds to wait before timing out a request to the component. |
canaryTrafficPercent int64 | CanaryTrafficPercent defines the traffic split percentage between the candidate revision and the last ready revision |
labels map[string]string | Labels that will be added to the component pod. More info: http://kubernetes.io/docs/user-guide/labels |
annotations map[string]string | Annotations that will be added to the component pod. More info: http://kubernetes.io/docs/user-guide/annotations |
deploymentStrategy k8s.io/api/apps/v1.DeploymentStrategy | The deployment strategy to use to replace existing pods with new ones. Only applicable for raw deployment mode. |
kedaConfig [Required]KedaConfig | No description provided. |
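An illustrative ComponentExtensionSpec fragment, assuming a Knative concurrency-based autoscaling setup; the specific numbers are arbitrary.

```python
# Scale between 1 and 5 replicas on a Knative concurrency target of 10
# in-flight requests; values are illustrative.
component_ext = {
    "minReplicas": 1,              # 0 would enable scale-to-zero
    "maxReplicas": 5,
    "scaleMetric": "concurrency",  # concurrency | rps | cpu | memory
    "scaleTarget": 10,             # soft target the autoscaler steers toward
    "containerConcurrency": 20,    # hard per-container request limit
    "timeoutSeconds": 300,
}

# The soft target should not exceed the hard concurrency limit, or the
# autoscaler would never see pressure it can relieve before requests queue.
assert component_ext["scaleTarget"] <= component_ext["containerConcurrency"]
```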
ComponentStatusSpec
Appears in:
ComponentStatusSpec describes the state of the component
Field | Description |
---|---|
latestReadyRevision string | Latest revision name that is in ready state |
latestCreatedRevision string | Latest revision name that is created |
previousRolledoutRevision string | Previous revision name that is rolled out with 100 percent traffic |
latestRolledoutRevision string | Latest revision name that is rolled out with 100 percent traffic |
traffic []knative.dev/serving/pkg/apis/serving/v1.TrafficTarget | Traffic holds the configured traffic distribution for latest ready revision and previous rolled out revision. |
url knative.dev/pkg/apis.URL | URL holds the primary url that will distribute traffic over the provided traffic targets. This will be one of the REST or gRPC endpoints that are available. It generally has the form http[s]://{route-name}.{route-namespace}.{cluster-level-suffix} |
restURL knative.dev/pkg/apis.URL | REST endpoint of the component if available. |
address knative.dev/pkg/apis/duck/v1.Addressable | Addressable endpoint for the InferenceService |
DecoderSpec
Appears in:
DecoderSpec defines the configuration for the Decoder component (token generation in PD-disaggregated deployment). Used specifically for prefill-decode disaggregated deployments to handle the token generation phase. Similar to EngineSpec in structure, it allows for detailed pod and container configuration, but is specifically used for the decode phase when separating prefill and decode processes.
Field | Description |
---|---|
PodSpec PodSpec | (Members of PodSpec are embedded into this type.) This spec provides a full PodSpec for the decoder component. Allows complete customization of the Kubernetes Pod configuration including containers, volumes, security contexts, affinity rules, and other pod settings. |
ComponentExtensionSpec [Required]ComponentExtensionSpec | (Members of ComponentExtensionSpec are embedded into this type.) ComponentExtensionSpec defines deployment configuration like min/max replicas, scaling metrics, etc. Controls scaling behavior and resource allocation for the decoder component. |
runner RunnerSpec | Runner container override for customizing the main container. This is essentially a container spec that can override the default container. Defines the main decoder container configuration, including image, resource requests/limits, environment variables, and command. |
leader LeaderSpec | Leader node configuration (only used for MultiNode deployment). Defines the pod and container spec for the leader node that coordinates distributed token generation in multi-node deployments. |
worker WorkerSpec | Worker nodes configuration (only used for MultiNode deployment). Defines the pod and container spec for worker nodes that perform distributed token generation tasks as directed by the leader. |
Endpoint
Appears in:
Endpoint defines a direct URL-based inference service with additional API configuration.
Field | Description |
---|---|
url [Required]string | URL represents the endpoint URL for the inference service. |
apiFormat [Required]string | APIFormat specifies the type of API, such as "openai" or "oci-cohere". |
modelName [Required]string | ModelName specifies the name of the model being served at the endpoint. Useful for endpoints that require model-specific configuration. For instance, for openai API, this is a required field in the payload |
EndpointSpec
Appears in:
EndpointSpec defines a reference to an inference service. It supports either a Kubernetes-style reference (InferenceService) or an Endpoint struct for a direct URL. Cross-namespace references are supported for InferenceService but require appropriate RBAC permissions to access resources in the target namespace.
Field | Description |
---|---|
inferenceService InferenceServiceReference | InferenceService holds a Kubernetes reference to an internal inference service. |
endpoint Endpoint | Endpoint holds the details of a direct endpoint for an external inference service, including URL and API details. |
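The two mutually exclusive shapes of EndpointSpec can be sketched as below. The URL and model names are assumptions, and the exactly-one-of-two check reflects the "supports either" wording above rather than a documented validation rule.

```python
# EndpointSpec accepts either a Kubernetes reference or a direct Endpoint.
internal = {"inferenceService": {"name": "llama-3-8b", "namespace": "serving"}}
external = {
    "endpoint": {
        "url": "https://api.example.com/v1",  # assumed URL
        "apiFormat": "openai",                # or "oci-cohere"
        "modelName": "llama-3-8b",            # required in openai-style payloads
    }
}

def is_valid_endpoint_spec(spec: dict) -> bool:
    """Assumed constraint: exactly one of the two reference styles is set."""
    return ("inferenceService" in spec) != ("endpoint" in spec)

assert is_valid_endpoint_spec(internal) and is_valid_endpoint_spec(external)
```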
EngineSpec
Appears in:
EngineSpec defines the configuration for the Engine component (can be used for both single-node and multi-node deployments). Provides a comprehensive specification for deploying model serving containers and pods. It allows for complete Kubernetes pod configuration including main containers, init containers, sidecars, volumes, and other pod-level settings. For distributed deployments, it supports leader-worker architecture configuration.
Field | Description |
---|---|
PodSpec PodSpec | (Members of PodSpec are embedded into this type.) This spec provides a full PodSpec for the engine component. Allows complete customization of the Kubernetes Pod configuration including containers, volumes, security contexts, affinity rules, and other pod settings. |
ComponentExtensionSpec [Required]ComponentExtensionSpec | (Members of ComponentExtensionSpec are embedded into this type.) ComponentExtensionSpec defines deployment configuration like min/max replicas, scaling metrics, etc. Controls scaling behavior and resource allocation for the engine component. |
runner RunnerSpec | Runner container override for customizing the engine container. This is essentially a container spec that can override the default container. Defines the main model runner container configuration, including image, resource requests/limits, environment variables, and command. |
leader LeaderSpec | Leader node configuration (only used for MultiNode deployment). Defines the pod and container spec for the leader node that coordinates distributed inference in multi-node deployments. |
worker WorkerSpec | Worker nodes configuration (only used for MultiNode deployment). Defines the pod and container spec for worker nodes that perform distributed processing tasks as directed by the leader. |
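A hypothetical EngineSpec combining a runner override with a leader/worker split for multi-node serving. The image name, GPU resource key, and environment variables are illustrative assumptions; RunnerSpec, LeaderSpec, and WorkerSpec likely carry more fields than shown.

```python
# Sketch of an EngineSpec; embedded ComponentExtensionSpec fields
# (minReplicas, maxReplicas, ...) sit at the top level of the spec.
engine = {
    "minReplicas": 1,
    "maxReplicas": 2,
    # Runner overrides the default model-serving container.
    "runner": {
        "image": "example.io/serving-runtime:latest",        # assumed image
        "resources": {"limits": {"nvidia.com/gpu": "8"}},
    },
    # Leader/worker are only honored for MultiNode deployments.
    "leader": {"runner": {"env": [{"name": "NODE_ROLE", "value": "leader"}]}},
    "worker": {"runner": {"env": [{"name": "NODE_ROLE", "value": "worker"}]}},
}

assert engine["runner"]["resources"]["limits"]["nvidia.com/gpu"] == "8"
```

A DecoderSpec for PD-disaggregated deployments takes the same shape, applied to the decode phase.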
FailureInfo
Appears in:
Field | Description |
---|---|
location string | Name of component to which the failure relates (usually Pod name) |
reason FailureReason | High level class of failure |
message string | Detailed error message |
modelRevisionName string | Internal Revision/ID of model, tied to specific Spec contents |
time k8s.io/apimachinery/pkg/apis/meta/v1.Time | Time failure occurred or was discovered |
exitCode int32 | Exit status from the last termination of the container |
FailureReason
(Alias of string)
Appears in:
FailureReason enum
FineTunedWeightSpec
Appears in:
FineTunedWeightSpec defines the desired state of FineTunedWeight
Field | Description |
---|---|
baseModelRef [Required]ObjectReference | Reference to the base model that this weight is fine-tuned from |
modelType [Required]string | ModelType of the fine-tuned weight, e.g., "Distillation", "Adapter", "Tfew" |
hyperParameters [Required]k8s.io/apimachinery/pkg/runtime.RawExtension | HyperParameters used for fine-tuning, stored as generic JSON for flexibility |
ModelExtensionSpec [Required]ModelExtensionSpec | (Members of ModelExtensionSpec are embedded into this type.) ModelExtension is the common extension of the model |
configuration k8s.io/apimachinery/pkg/runtime.RawExtension | Configuration of the fine-tuned weight, stored as generic JSON for flexibility |
storage [Required]StorageSpec | Storage configuration for the fine-tuned weight |
trainingJobRef ObjectReference | TrainingJobID is the ID of the training job that produced this weight |
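A hypothetical FineTunedWeight tying an adapter to its base model; the names, hyperparameters, and the `storageUri` key are illustrative assumptions (HyperParameters and Configuration are free-form JSON by design).

```python
# Sketch of a FineTunedWeight resource per the table above.
ft_weight = {
    "apiVersion": "ome.io/v1beta1",
    "kind": "FineTunedWeight",
    "metadata": {"name": "llama-3-8b-support-adapter"},      # assumed name
    "spec": {
        "baseModelRef": {"name": "llama-3-8b", "namespace": "models"},
        "modelType": "Adapter",
        "hyperParameters": {"rank": 16, "alpha": 32},        # free-form JSON
        "storage": {"storageUri": "oci://bucket/adapters/support"},  # assumed field
    },
}

# modelType examples given in the table.
assert ft_weight["spec"]["modelType"] in {"Distillation", "Adapter", "Tfew"}
```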
HuggingFaceSecretReference
Appears in:
HuggingFaceSecretReference defines a reference to a Kubernetes Secret containing the Hugging Face API key. This secret must reside in the same namespace as the BenchmarkJob. Cross-namespace references are not allowed for security and simplicity.
Field | Description |
---|---|
name [Required]string | Name of the secret containing the Hugging Face API key. The secret must reside in the same namespace as the BenchmarkJob. |
InferenceServiceReference
Appears in:
InferenceServiceReference defines the reference to a Kubernetes inference service.
Field | Description |
---|---|
name [Required]string | Name specifies the name of the inference service to benchmark. |
namespace [Required]string | Namespace specifies the Kubernetes namespace where the inference service is deployed. Cross-namespace references are allowed but require appropriate RBAC permissions. |
InferenceServiceSpec
Appears in:
InferenceServiceSpec is the top level type for this resource
Field | Description |
---|---|
predictor PredictorSpec | Predictor defines the model serving spec. It specifies how the model should be deployed and served, handling inference requests. Deprecated: Predictor is deprecated and will be removed in a future release. Please use Engine and Model fields instead. |
engine EngineSpec | Engine defines the serving engine spec. This provides detailed container and pod specifications for model serving. It allows defining the model runner (container spec), as well as complete pod specifications including init containers, sidecar containers, and other pod-level configurations. Engine can also be configured for multi-node deployments using leader and worker specifications. |
decoder DecoderSpec | Decoder defines the decoder spec. This is specifically used for PD (Prefill-Decode) disaggregated serving deployments. Similar to Engine in structure, it allows for container and pod specifications, but is only utilized when implementing the disaggregated serving pattern to separate the prefill and decode phases of inference. |
model ModelRef | Model defines the model to be used for inference, referencing either a BaseModel or a custom model. This allows models to be managed independently of the serving configuration. |
runtime ServingRuntimeRef | Runtime defines the serving runtime environment that will be used to execute the model. It is an inference service spec template that determines how the service should be deployed. Runtime is optional - if not defined, the operator will automatically select the best runtime based on the model's size, architecture, format, quantization, and framework. |
router RouterSpec | Router defines the router spec |
kedaConfig [Required]KedaConfig | KedaConfig defines the autoscaling configuration for KEDA. Provides settings for event-driven autoscaling using KEDA (Kubernetes Event-driven Autoscaling), allowing the service to scale based on custom metrics or event sources. |
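A minimal InferenceService sketch using the recommended model + engine fields instead of the deprecated predictor; names and resource values are illustrative assumptions.

```python
# Hypothetical InferenceService; runtime is omitted so the operator
# auto-selects a ServingRuntime from the model's size, architecture,
# format, quantization, and framework.
isvc = {
    "apiVersion": "ome.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "llama-3-8b", "namespace": "serving"},
    "spec": {
        "model": {"name": "llama-3-8b", "kind": "ClusterBaseModel"},
        "engine": {
            "minReplicas": 1,
            "maxReplicas": 3,
            "runner": {"resources": {"limits": {"nvidia.com/gpu": "1"}}},
        },
    },
}

assert "predictor" not in isvc["spec"]  # avoid the deprecated field
```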
InferenceServiceStatus
Appears in:
InferenceServiceStatus defines the observed state of InferenceService
Field | Description |
---|---|
Status [Required]knative.dev/pkg/apis/duck/v1.Status | (Members of Status are embedded into this type.) Conditions for the InferenceService |
address knative.dev/pkg/apis/duck/v1.Addressable | Addressable endpoint for the InferenceService |
url knative.dev/pkg/apis.URL | URL holds the url that will distribute traffic over the provided traffic targets. It generally has the form http[s]://{route-name}.{route-namespace}.{cluster-level-suffix} |
components [Required]map[ComponentType]ComponentStatusSpec | Statuses for the components of the InferenceService |
modelStatus [Required]ModelStatus | Model related statuses |
KedaConfig
Appears in:
KedaConfig stores the configuration settings for KEDA autoscaling within the InferenceService. It includes fields like the Prometheus server address, custom query, scaling threshold, and operator.
Field | Description |
---|---|
enableKeda [Required]bool | EnableKeda determines whether KEDA autoscaling is enabled for the InferenceService. |
promServerAddress [Required]string | PromServerAddress specifies the address of the Prometheus server that KEDA will query to retrieve metrics for autoscaling decisions. This should be a fully qualified URL, including the protocol and port number. Example: http://prometheus-operated.monitoring.svc.cluster.local:9090 |
customPromQuery [Required]string | CustomPromQuery defines a custom Prometheus query that KEDA will execute to evaluate the desired metric for scaling. This query should return a single numerical value that represents the metric to be monitored. Example: avg_over_time(http_requests_total{service="llama"}[5m]) |
scalingThreshold [Required]string | ScalingThreshold sets the numerical threshold against which the result of the Prometheus query will be compared. Depending on the ScalingOperator, this threshold determines when to scale the number of replicas up or down. Example: "10" - The Autoscaler will compare the metric value to 10. |
scalingOperator [Required]string | ScalingOperator specifies the comparison operator used by KEDA to decide whether to scale the Deployment, such as "GreaterThanOrEqual". This operator defines the condition under which scaling actions are triggered based on the evaluated metric. Example: "GreaterThanOrEqual" |
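The fields above compose as follows; the Prometheus address and query come from the table's own examples, and the comparison helper is a sketch of the thresholding logic for one assumed operator value, not the controller's actual implementation.

```python
# Illustrative KedaConfig: scale up once the 5-minute average of the
# queried metric meets or exceeds 10.
keda = {
    "enableKeda": True,
    "promServerAddress": "http://prometheus-operated.monitoring.svc.cluster.local:9090",
    "customPromQuery": 'avg_over_time(http_requests_total{service="llama"}[5m])',
    "scalingThreshold": "10",   # compared numerically against the query result
    "scalingOperator": "GreaterThanOrEqual",
}

def should_scale_up(metric: float, cfg: dict) -> bool:
    """Sketch of the threshold comparison for the GreaterThanOrEqual case."""
    return (cfg["scalingOperator"] == "GreaterThanOrEqual"
            and metric >= float(cfg["scalingThreshold"]))

assert should_scale_up(12.0, keda) and not should_scale_up(9.9, keda)
```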
LeaderSpec
Appears in:
LeaderSpec defines the configuration for a leader node in a multi-node component. The leader node coordinates the activities of worker nodes in distributed inference or token generation setups, handling task distribution and result aggregation.
Field | Description |
---|---|
PodSpec PodSpec | (Members of PodSpec are embedded into this type.) Pod specification for the leader node. This overrides the main PodSpec when specified. Allows customization of the Kubernetes Pod configuration specifically for the leader node. |
runner RunnerSpec | Runner container override for customizing the main container. This is essentially a container spec that can override the default container. Provides fine-grained control over the container that executes the leader node's coordination logic. |
LifeCycleState
(Alias of string)
Appears in:
LifeCycleState enum
ModelCopies
Appears in:
Field | Description |
---|---|
failedCopies [Required]int | How many copies of this predictor's models failed to load recently |
totalCopies int | Total number of copies of this predictor's models that are currently loaded |
ModelExtensionSpec
Appears in:
Field | Description |
---|---|
displayName string | DisplayName is the user-friendly name of the model |
version string | No description provided. |
disabled bool | Disabled indicates whether the model is disabled |
vendor string | Vendor of the model, e.g., "NVIDIA", "Meta", "HuggingFace" |
compartmentID string | CompartmentID is the compartment ID of the model |
ModelFormat
Appears in:
Field | Description |
---|---|
name [Required]string | Name of the format in which the model is stored, e.g., "ONNX", "TensorFlow SavedModel", "PyTorch", "SafeTensors" |
version string | Version of the model format. Used in validating that a runtime supports a predictor. It can be "major", "major.minor" or "major.minor.patch". |
ModelFrameworkSpec
Appears in:
Field | Description |
---|---|
name [Required]string | Name of the library in which the model is stored, e.g., "ONNXRuntime", "TensorFlow", "PyTorch", "Transformer", "TensorRTLLM" |
version string | Version of the library. Used in validating that a runtime supports a predictor. It can be "major", "major.minor" or "major.minor.patch". |
ModelQuantization
(Alias of string)
Appears in:
ModelRef
Appears in:
Field | Description |
---|---|
name [Required]string | Name of the model being referenced. Identifies the specific model to be used for inference. |
kind [Required]string | Kind of the model being referenced. Defaults to ClusterBaseModel. Specifies the Kubernetes resource kind of the referenced model. |
apiGroup [Required]string | APIGroup of the resource being referenced. Defaults to ome.io |
fineTunedWeights []string | Optional FineTunedWeights references. References to fine-tuned weights that should be applied to the base model. |
ModelRevisionStates
Appears in:
Field | Description |
---|---|
activeModelState [Required]ModelState | High level state string: Pending, Standby, Loading, Loaded, FailedToLoad |
targetModelState [Required]ModelState | No description provided. |
ModelSizeRangeSpec
Appears in:
ModelSizeRangeSpec defines the range of model sizes supported by this runtime
Field | Description |
---|---|
min string | Minimum size of the model in bytes |
max string | Maximum size of the model in bytes |
ModelSpec
Appears in:
Field | Description |
---|---|
runtime string | Specific ClusterServingRuntime/ServingRuntime name to use for deployment. |
PredictorExtensionSpec [Required]PredictorExtensionSpec | (Members of PredictorExtensionSpec are embedded into this type.) No description provided. |
baseModel [Required]string | No description provided. |
fineTunedWeights [Required][]string | No description provided. |
ModelState
(Alias of string)
Appears in:
ModelState enum
ModelStatus
Appears in:
Field | Description |
---|---|
transitionStatus [Required]TransitionStatus | Whether the available predictor endpoints reflect the current Spec or is in transition |
modelRevisionStates ModelRevisionStates | State information of the predictor's model. |
lastFailureInfo FailureInfo | Details of last failure, when load of target model is failed or blocked. |
modelCopies ModelCopies | Model copy information of the predictor's model. |
ModelStatusSpec
Appears in:
ModelStatusSpec defines the observed state of Model weight
Field | Description |
---|---|
lifecycle [Required]string | LifeCycle is an enum of Deprecated, Experiment, Public, Internal |
state [Required]LifeCycleState | Status of the model weight |
nodesReady [Required][]string | No description provided. |
nodesFailed [Required][]string | No description provided. |
ObjectReference
Appears in:
ObjectReference contains enough information to let you inspect or modify the referred object.
Field | Description |
---|---|
name [Required]string | Name of the referenced object |
namespace [Required]string | Namespace of the referenced object |
PodOverride
Appears in:
Field | Description |
---|---|
image string | Image specifies the container image to use for the benchmark job. |
env []k8s.io/api/core/v1.EnvVar | List of environment variables to set in the container. |
envFrom []k8s.io/api/core/v1.EnvFromSource | List of sources to populate environment variables in the container. |
volumeMounts []k8s.io/api/core/v1.VolumeMount | Pod volumes to mount into the container's filesystem. |
resources k8s.io/api/core/v1.ResourceRequirements | Compute Resources required by this container. Cannot be updated. More info: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/ |
tolerations []k8s.io/api/core/v1.Toleration | If specified, the pod's tolerations. |
nodeSelector map[string]string | NodeSelector is a selector which must be true for the pod to fit on a node. Selector which must match a node's labels for the pod to be scheduled on that node. More info: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/ |
affinity k8s.io/api/core/v1.Affinity | If specified, the pod's scheduling constraints |
volumes []k8s.io/api/core/v1.Volume | List of volumes that can be mounted by containers belonging to the pod. More info: https://kubernetes.io/docs/concepts/storage/volumes |
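A hypothetical PodOverride for a benchmark pod; the image, label key, resource values, and toleration are illustrative assumptions layered onto the fields above.

```python
# Sketch of a PodOverride raising the benchmark pod's resources and
# steering its placement; all concrete values are illustrative.
pod_override = {
    "image": "example.io/genai-bench:latest",             # assumed image
    "env": [{"name": "HF_HOME", "value": "/cache/hf"}],
    "resources": {
        "requests": {"cpu": "4", "memory": "8Gi"},
        "limits": {"cpu": "8", "memory": "16Gi"},
    },
    "nodeSelector": {"bench-pool": "true"},               # assumed label
    "tolerations": [
        {"key": "bench", "operator": "Exists", "effect": "NoSchedule"}
    ],
}

assert pod_override["resources"]["requests"]["cpu"] == "4"
```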
PodSpec
Appears in:
PodSpec is a description of a pod.
Field | Description |
---|---|
volumes []k8s.io/api/core/v1.Volume | List of volumes that can be mounted by containers belonging to the pod. More info: https://kubernetes.io/docs/concepts/storage/volumes |
initContainers [Required][]k8s.io/api/core/v1.Container | List of initialization containers belonging to the pod. Init containers are executed in order prior to containers being started. If any init container fails, the pod is considered to have failed and is handled according to its restartPolicy. The name for an init container or normal container must be unique among all containers. Init containers may not have Lifecycle actions, Readiness probes, Liveness probes, or Startup probes. The resourceRequirements of an init container are taken into account during scheduling by finding the highest request/limit for each resource type, and then using the max of that value or the sum of the normal containers. Limits are applied to init containers in a similar fashion. Init containers cannot currently be added or removed. Cannot be updated. More info: https://kubernetes.io/docs/concepts/workloads/pods/init-containers/ |
containers [Required][]k8s.io/api/core/v1.Container | List of containers belonging to the pod. Containers cannot currently be added or removed. There must be at least one container in a Pod. Cannot be updated. |
ephemeralContainers []k8s.io/api/core/v1.EphemeralContainer | List of ephemeral containers run in this pod. Ephemeral containers may be run in an existing pod to perform user-initiated actions such as debugging. This list cannot be specified when creating a pod, and it cannot be modified by updating the pod spec. In order to add an ephemeral container to an existing pod, use the pod's ephemeralcontainers subresource. |
restartPolicy k8s.io/api/core/v1.RestartPolicy | Restart policy for all containers within the pod. One of Always, OnFailure, Never. In some contexts, only a subset of those values may be permitted. Default to Always. More info: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#restart-policy |
terminationGracePeriodSeconds int64 | Optional duration in seconds the pod needs to terminate gracefully. May be decreased in delete request. Value must be non-negative integer. The value zero indicates stop immediately via the kill signal (no opportunity to shut down). If this value is nil, the default grace period will be used instead. The grace period is the duration in seconds after the processes running in the pod are sent a termination signal and the time when the processes are forcibly halted with a kill signal. Set this value longer than the expected cleanup time for your process. Defaults to 30 seconds. |
activeDeadlineSeconds int64 | Optional duration in seconds the pod may be active on the node relative to StartTime before the system will actively try to mark it failed and kill associated containers. Value must be a positive integer. |
dnsPolicy k8s.io/api/core/v1.DNSPolicy | Set DNS policy for the pod. Defaults to "ClusterFirst". Valid values are 'ClusterFirstWithHostNet', 'ClusterFirst', 'Default' or 'None'. DNS parameters given in DNSConfig will be merged with the policy selected with DNSPolicy. To have DNS options set along with hostNetwork, you have to specify DNS policy explicitly to 'ClusterFirstWithHostNet'. |
nodeSelector map[string]string | NodeSelector is a selector which must be true for the pod to fit on a node. Selector which must match a node's labels for the pod to be scheduled on that node. More info: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/ |
serviceAccountName string | ServiceAccountName is the name of the ServiceAccount to use to run this pod. More info: https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/ |
serviceAccount string | DeprecatedServiceAccount is a deprecated alias for ServiceAccountName. Deprecated: Use serviceAccountName instead. |
automountServiceAccountToken bool | AutomountServiceAccountToken indicates whether a service account token should be automatically mounted. |
nodeName string | NodeName indicates in which node this pod is scheduled. If empty, this pod is a candidate for scheduling by the scheduler defined in schedulerName. Once this field is set, the kubelet for this node becomes responsible for the lifecycle of this pod. This field should not be used to express a desire for the pod to be scheduled on a specific node. https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodename |
hostNetwork bool | Host networking requested for this pod. Use the host's network namespace. If this option is set, the ports that will be used must be specified. Default to false. |
hostPID bool | Use the host's pid namespace. Optional: Default to false. |
hostIPC bool | Use the host's ipc namespace. Optional: Default to false. |
shareProcessNamespace bool | Share a single process namespace between all of the containers in a pod. When this is set containers will be able to view and signal processes from other containers in the same pod, and the first process in each container will not be assigned PID 1. HostPID and ShareProcessNamespace cannot both be set. Optional: Default to false. |
securityContext k8s.io/api/core/v1.PodSecurityContext | SecurityContext holds pod-level security attributes and common container settings. Optional: Defaults to empty. See type description for default values of each field. |
imagePullSecrets []k8s.io/api/core/v1.LocalObjectReference | ImagePullSecrets is an optional list of references to secrets in the same namespace to use for pulling any of the images used by this PodSpec. If specified, these secrets will be passed to individual puller implementations for them to use. More info: https://kubernetes.io/docs/concepts/containers/images#specifying-imagepullsecrets-on-a-pod |
hostname string | Specifies the hostname of the Pod. If not specified, the pod's hostname will be set to a system-defined value. |
subdomain string | If specified, the fully qualified Pod hostname will be "&lt;hostname&gt;.&lt;subdomain&gt;.&lt;pod namespace&gt;.svc.&lt;cluster domain&gt;". If not specified, the pod will not have a domain name at all. |
affinity k8s.io/api/core/v1.Affinity | If specified, the pod's scheduling constraints |
schedulerName string | If specified, the pod will be dispatched by specified scheduler. If not specified, the pod will be dispatched by default scheduler. |
tolerations []k8s.io/api/core/v1.Toleration | If specified, the pod's tolerations. |
hostAliases []k8s.io/api/core/v1.HostAlias | HostAliases is an optional list of hosts and IPs that will be injected into the pod's hosts file if specified. |
priorityClassName string | If specified, indicates the pod's priority. "system-node-critical" and "system-cluster-critical" are two special keywords which indicate the highest priorities with the former being the highest priority. Any other name must be defined by creating a PriorityClass object with that name. If not specified, the pod priority will be default or zero if there is no default. |
priority int32 | The priority value. Various system components use this field to find the priority of the pod. When Priority Admission Controller is enabled, it prevents users from setting this field. The admission controller populates this field from PriorityClassName. The higher the value, the higher the priority. |
dnsConfig k8s.io/api/core/v1.PodDNSConfig | Specifies the DNS parameters of a pod. Parameters specified here will be merged to the generated DNS configuration based on DNSPolicy. |
readinessGates []k8s.io/api/core/v1.PodReadinessGate | If specified, all readiness gates will be evaluated for pod readiness. A pod is ready when all its containers are ready AND all conditions specified in the readiness gates have status equal to "True" More info: https://git.k8s.io/enhancements/keps/sig-network/580-pod-readiness-gates |
runtimeClassName string | RuntimeClassName refers to a RuntimeClass object in the node.k8s.io group, which should be used to run this pod. If no RuntimeClass resource matches the named class, the pod will not be run. If unset or empty, the "legacy" RuntimeClass will be used, which is an implicit class with an empty definition that uses the default runtime handler. More info: https://git.k8s.io/enhancements/keps/sig-node/585-runtime-class |
enableServiceLinks bool | EnableServiceLinks indicates whether information about services should be injected into pod's environment variables, matching the syntax of Docker links. Optional: Defaults to true. |
preemptionPolicy k8s.io/api/core/v1.PreemptionPolicy | PreemptionPolicy is the Policy for preempting pods with lower priority. One of Never, PreemptLowerPriority. Defaults to PreemptLowerPriority if unset. |
overhead k8s.io/api/core/v1.ResourceList | Overhead represents the resource overhead associated with running a pod for a given RuntimeClass. This field will be autopopulated at admission time by the RuntimeClass admission controller. If the RuntimeClass admission controller is enabled, overhead must not be set in Pod create requests. The RuntimeClass admission controller will reject Pod create requests which have the overhead already set. If RuntimeClass is configured and selected in the PodSpec, Overhead will be set to the value defined in the corresponding RuntimeClass, otherwise it will remain unset and treated as zero. More info: https://git.k8s.io/enhancements/keps/sig-node/688-pod-overhead/README.md |
topologySpreadConstraints []k8s.io/api/core/v1.TopologySpreadConstraint | TopologySpreadConstraints describes how a group of pods ought to spread across topology domains. Scheduler will schedule pods in a way which abides by the constraints. All topologySpreadConstraints are ANDed. |
setHostnameAsFQDN bool | If true the pod's hostname will be configured as the pod's FQDN, rather than the leaf name (the default). In Linux containers, this means setting the FQDN in the hostname field of the kernel (the nodename field of struct utsname). In Windows containers, this means setting the registry value of hostname for the registry key HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters to FQDN. If a pod does not have FQDN, this has no effect. Default to false. |
os k8s.io/api/core/v1.PodOS | Specifies the OS of the containers in the pod. Some pod and container fields are restricted if this is set. If the OS field is set to linux, the following fields must be unset: securityContext.windowsOptions. If the OS field is set to windows, Linux-specific fields such as hostPID, hostIPC, shareProcessNamespace, securityContext.seLinuxOptions, securityContext.seccompProfile, and securityContext.fsGroup must be unset. |
hostUsers bool | Use the host's user namespace. Optional: Default to true. If set to true or not present, the pod will be run in the host user namespace, useful for when the pod needs a feature only available to the host user namespace, such as loading a kernel module with CAP_SYS_MODULE. When set to false, a new userns is created for the pod. Setting false is useful for mitigating container breakout vulnerabilities, while still allowing users to run their containers as root without actually having root privileges on the host. This field is alpha-level and is only honored by servers that enable the UserNamespacesSupport feature. |
schedulingGates []k8s.io/api/core/v1.PodSchedulingGate | SchedulingGates is an opaque list of values that if specified will block scheduling the pod. If schedulingGates is not empty, the pod will stay in the SchedulingGated state and the scheduler will not attempt to schedule the pod. SchedulingGates can only be set at pod creation time, and be removed only afterwards. |
resourceClaims []k8s.io/api/core/v1.PodResourceClaim | ResourceClaims defines which ResourceClaims must be allocated and reserved before the Pod is allowed to start. The resources will be made available to those containers which consume them by name. This is an alpha field and requires enabling the DynamicResourceAllocation feature gate. This field is immutable. |
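Because these PodSpec members are embedded directly into component specs, they appear inline rather than under a nested `podSpec` key. A minimal hedged sketch (all values are hypothetical, not defaults from this API):

```yaml
# Hypothetical fragment: embedded PodSpec fields set inline on a component.
nodeSelector:
  nvidia.com/gpu.product: H100        # assumed node label
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
terminationGracePeriodSeconds: 60     # allow in-flight requests to drain
serviceAccountName: model-serving-sa  # hypothetical ServiceAccount name
```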
PredictorExtensionSpec
Appears in:
PredictorExtensionSpec defines configuration shared across all predictor frameworks
Field | Description |
---|---|
storageUri string | This field points to the location of the model which is mounted onto the pod. |
runtimeVersion string | Runtime version of the predictor docker image |
protocolVersion github.com/sgl-project/ome/pkg/constants.InferenceServiceProtocol | Protocol version to use by the predictor (i.e. v1 or v2 or grpc-v1 or grpc-v2) |
Container k8s.io/api/core/v1.Container | (Members of Container are embedded into this type.)Container enables overrides for the predictor. Each framework will have different defaults that are populated in the underlying container spec. |
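Since Container members are embedded, container-level overrides sit at the same level as the predictor extension fields. A hedged sketch (the storage URI scheme and resource values are assumptions for illustration):

```yaml
# Hypothetical fragment: shared predictor extension fields.
storageUri: pvc://models/my-model   # hypothetical model location
runtimeVersion: "0.5.3"             # predictor image version
protocolVersion: v2
# Embedded Container members override the framework-populated container spec:
resources:
  limits:
    nvidia.com/gpu: "1"
```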
PredictorSpec
Appears in:
PredictorSpec defines the configuration for a predictor. The following fields follow a "1-of" semantic: users must specify exactly one spec.
Field | Description |
---|---|
model [Required]ModelSpec | Model spec for any arbitrary framework. |
PodSpec [Required]PodSpec | (Members of PodSpec are embedded into this type.)This spec is dual purpose: users may provide a full PodSpec for a custom predictor, or provide a model spec and use the embedded PodSpec fields as overrides. |
ComponentExtensionSpec [Required]ComponentExtensionSpec | (Members of ComponentExtensionSpec are embedded into this type.)Component extension defines the deployment configurations for a predictor |
workerSpec WorkerSpec | WorkerSpec for the predictor, this is used for multi-node serving without Ray Cluster |
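A hedged sketch of how the "1-of" predictor fields might combine with the embedded ComponentExtensionSpec and an optional worker (names and replica counts are hypothetical):

```yaml
# Hypothetical fragment: a predictor with a model spec and a multi-node worker.
predictor:
  model:
    modelFormat:
      name: SafeTensors
    storageUri: pvc://models/llama-3   # hypothetical URI
  minReplicas: 1                       # embedded ComponentExtensionSpec fields
  maxReplicas: 3
  workerSpec:
    size: 2                            # two worker pods, no Ray Cluster needed
```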
RouterSpec
Appears in:
RouterSpec defines the configuration for the Router component, which handles request routing
Field | Description |
---|---|
PodSpec [Required]PodSpec | (Members of PodSpec are embedded into this type.)PodSpec defines the container configuration for the router |
ComponentExtensionSpec [Required]ComponentExtensionSpec | (Members of ComponentExtensionSpec are embedded into this type.)ComponentExtensionSpec defines deployment configuration like min/max replicas, scaling metrics, etc. |
runner RunnerSpec | This is essentially a container spec that can override the default container |
config map[string]string | Additional configuration parameters for the runner This can include framework-specific settings |
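A hedged sketch of a router with a runner override and framework-specific config (the image and config key are hypothetical, not values defined by this API):

```yaml
# Hypothetical fragment: a router overriding the default container.
router:
  minReplicas: 2                          # embedded ComponentExtensionSpec field
  runner:
    name: router
    image: my-registry/router:latest      # hypothetical override image
  config:
    routing-strategy: least-loaded        # hypothetical framework-specific key
```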
RunnerSpec
Appears in:
RunnerSpec defines container configuration plus additional config settings. The Runner is the primary container that executes the model serving or token generation logic.
Field | Description |
---|---|
Container k8s.io/api/core/v1.Container | (Members of Container are embedded into this type.)Container spec for the runner Provides complete Kubernetes container configuration for the primary execution container. |
ScaleMetric
(Alias of string)
Appears in:
ScaleMetric enum
ServiceMetadata
Appears in:
ServiceMetadata contains metadata fields for recording the backend model server's configuration and version details. This information helps track experiment context, enabling users to filter and query experiments based on server properties.
Field | Description |
---|---|
engine [Required]string | Engine specifies the backend model server engine. Supported values: "vLLM", "SGLang", "TGI". |
version [Required]string | Version specifies the version of the model server (e.g., "0.5.3"). |
gpuType [Required]string | GpuType specifies the type of GPU used by the model server. Supported values: "H100", "A100", "MI300", "A10". |
gpuCount [Required]int | GpuCount indicates the number of GPU cards available on the model server. |
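A minimal sketch using the supported values listed above (the surrounding field placement is assumed):

```yaml
# Hypothetical fragment: recording the backend model server's configuration.
serviceMetadata:
  engine: SGLang      # one of "vLLM", "SGLang", "TGI"
  version: "0.5.3"
  gpuType: H100       # one of "H100", "A100", "MI300", "A10"
  gpuCount: 8
```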
ServingRuntimePodSpec
Appears in:
Field | Description |
---|---|
containers []k8s.io/api/core/v1.Container | List of containers belonging to the pod. Containers cannot currently be added or removed. Cannot be updated. |
volumes []k8s.io/api/core/v1.Volume | List of volumes that can be mounted by containers belonging to the pod. More info: https://kubernetes.io/docs/concepts/storage/volumes |
nodeSelector map[string]string | NodeSelector is a selector which must be true for the pod to fit on a node. Selector which must match a node's labels for the pod to be scheduled on that node. More info: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/ |
affinity k8s.io/api/core/v1.Affinity | If specified, the pod's scheduling constraints |
tolerations []k8s.io/api/core/v1.Toleration | If specified, the pod's tolerations. |
labels map[string]string | Labels that will be added to the pod. More info: http://kubernetes.io/docs/user-guide/labels |
annotations map[string]string | Annotations that will be added to the pod. More info: http://kubernetes.io/docs/user-guide/annotations |
imagePullSecrets []k8s.io/api/core/v1.LocalObjectReference | ImagePullSecrets is an optional list of references to secrets in the same namespace to use for pulling any of the images used by this PodSpec. If specified, these secrets will be passed to individual puller implementations for them to use. More info: https://kubernetes.io/docs/concepts/containers/images#specifying-imagepullsecrets-on-a-pod |
schedulerName string | If specified, the pod will be dispatched by specified scheduler. If not specified, the pod will be dispatched by default scheduler. |
hostIPC bool | Use the host's ipc namespace. Optional: Default to false. |
dnsPolicy k8s.io/api/core/v1.DNSPolicy | Set DNS policy for the pod. Defaults to "ClusterFirst". Valid values are 'ClusterFirstWithHostNet', 'ClusterFirst', 'Default' or 'None'. DNS parameters given in DNSConfig will be merged with the policy selected with DNSPolicy. To have DNS options set along with hostNetwork, you have to specify DNS policy explicitly to 'ClusterFirstWithHostNet'. |
hostNetwork bool | Host networking requested for this pod. Use the host's network namespace. If this option is set, the ports that will be used must be specified. Default to false. |
ServingRuntimeRef
Appears in:
Field | Description |
---|---|
name [Required]string | Name of the runtime being referenced. Identifies the specific runtime environment to be used for model execution. |
kind [Required]string | Kind of the runtime being referenced. Defaults to ClusterServingRuntime. Specifies the Kubernetes resource kind of the referenced runtime. ClusterServingRuntime is a cluster-wide runtime, while ServingRuntime is namespace-scoped. |
apiGroup [Required]string | APIGroup of the resource being referenced. Defaults to |
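A hedged sketch of a runtime reference, assuming it appears under a `runtime` field on the referencing resource (the runtime name is hypothetical, and the apiGroup value is an assumption based on this API group):

```yaml
# Hypothetical fragment: referencing a cluster-scoped runtime by name.
runtime:
  name: srt-llama-runtime       # hypothetical runtime name
  kind: ClusterServingRuntime   # default when omitted
  apiGroup: ome.io              # assumed API group
```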
ServingRuntimeSpec
Appears in:
ServingRuntimeSpec defines the desired state of ServingRuntime. This spec is currently provisional and is subject to change as details regarding single-model serving and multi-model serving are hammered out.
Field | Description |
---|---|
supportedModelFormats [Required][]SupportedModelFormat | Model formats and versions supported by this runtime |
modelSizeRange ModelSizeRangeSpec | ModelSizeRange is the range of model sizes supported by this runtime |
disabled bool | Set to true to disable use of this runtime |
routerConfig RouterSpec | Router configuration for this runtime |
engineConfig EngineSpec | Engine configuration for this runtime |
decoderConfig DecoderSpec | Decoder configuration for this runtime |
protocolVersions []github.com/sgl-project/ome/pkg/constants.InferenceServiceProtocol | Supported protocol versions (i.e. openAI or cohere or openInference-v1 or openInference-v2) |
ServingRuntimePodSpec [Required]ServingRuntimePodSpec | (Members of ServingRuntimePodSpec are embedded into this type.)PodSpec for the serving runtime |
workers WorkerPodSpec | WorkerPodSpec for the serving runtime, this is used for multi-node serving without Ray Cluster |
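A hedged sketch of how these fields assemble into a runtime spec, with the embedded ServingRuntimePodSpec members appearing inline (container name and image are hypothetical):

```yaml
# Hypothetical sketch of a ClusterServingRuntime spec.
spec:
  supportedModelFormats:
    - modelFormat:
        name: SafeTensors
      autoSelect: true
      priority: 1
  protocolVersions:
    - openAI
  containers:                          # embedded ServingRuntimePodSpec member
    - name: ome-container              # hypothetical container name
      image: my-registry/sglang:latest
  workers:
    size: 2                            # multi-node serving without Ray Cluster
```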
ServingRuntimeStatus
Appears in:
ServingRuntimeStatus defines the observed state of ServingRuntime
StorageSpec
Appears in:
Field | Description |
---|---|
path string | Path is the absolute path where the model will be downloaded and stored on the node. |
schemaPath string | SchemaPath is the path to the model schema or configuration file within the storage system. This can be used to validate the model or customize how it's loaded. |
parameters map[string]string | Parameters contain key-value pairs to override default storage credentials or configuration. These values are typically used to configure access to object storage or mount options. |
key string | StorageKey is the name of the key in a Kubernetes Secret used to authenticate access to the model storage. This key will be used to fetch credentials during model download or access. |
storageUri [Required]string | StorageUri specifies the source URI of the model in a supported storage backend. |
nodeSelector map[string]string | NodeSelector defines a set of key-value label pairs that must be present on a node for the model to be scheduled and downloaded onto that node. |
nodeAffinity k8s.io/api/core/v1.NodeAffinity | NodeAffinity describes the node affinity rules that further constrain which nodes are eligible to download and store this model, based on advanced scheduling policies. |
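A hedged sketch of a storage spec that pins model downloads to labeled nodes (the URI scheme, paths, secret key, and node label are all hypothetical):

```yaml
# Hypothetical fragment: a storage spec constraining where the model lands.
storage:
  storageUri: pvc://models/llama      # hypothetical source URI
  path: /mnt/models/llama             # absolute download path on the node
  key: my-storage-credentials         # Secret key used to fetch credentials
  nodeSelector:
    models.ome.io/gpu: "true"         # hypothetical node label
```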
SupportedModelFormat
Appears in:
Field | Description |
---|---|
name string | Name of the model. TODO: this field is being used as the model format name, which is not correct; it should be deprecated in favor of Name from ModelFormat. |
modelFormat [Required]ModelFormat | ModelFormat of the model, e.g., "PyTorch", "TensorFlow", "ONNX", "SafeTensors" |
modelType string | DEPRECATED: This field is deprecated and will be removed in future releases. |
version string | Version of the model format. Used in validating that a runtime supports a predictor. It can be "major", "major.minor" or "major.minor.patch". |
modelFramework [Required]ModelFrameworkSpec | ModelFramework of the model, e.g., "PyTorch", "TensorFlow", "ONNX", "Transformers" |
modelArchitecture string | ModelArchitecture of the model, e.g., "LlamaForCausalLM", "GemmaForCausalLM", "MixtralForCausalLM" |
quantization ModelQuantization | Quantization of the model, e.g., "fp8", "fbgemm_fp8", "int4" |
autoSelect bool | Set to true to allow the ServingRuntime to be used for automatic model placement if this model format is specified with no explicit runtime. |
priority int32 | Priority of this serving runtime for auto selection. This is used to select the serving runtime if more than one serving runtime supports the same model format. The value should be greater than zero. The higher the value, the higher the priority. Priority is not considered if AutoSelect is either false or not specified. Priority can be overridden by specifying the runtime in the InferenceService. |
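A hedged sketch of an entry that opts into auto-selection; when two runtimes advertise the same format with autoSelect enabled, the higher priority wins (the architecture and quantization values mirror the examples in the table):

```yaml
# Hypothetical fragment: a supported model format entry for auto placement.
supportedModelFormats:
  - modelFormat:
      name: SafeTensors
    modelArchitecture: LlamaForCausalLM
    quantization: fp8
    autoSelect: true
    priority: 2    # preferred over a competing runtime with priority 1
```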
TransitionStatus
(Alias of string)
Appears in:
TransitionStatus enum
WorkerPodSpec
Appears in:
Field | Description |
---|---|
size int | Size of the worker; this is the number of pods in the worker. |
ServingRuntimePodSpec ServingRuntimePodSpec | (Members of ServingRuntimePodSpec are embedded into this type.)PodSpec for the worker |
WorkerSpec
Appears in:
WorkerSpec defines the configuration for worker nodes in a multi-node component Worker nodes perform the distributed processing tasks assigned by the leader node, enabling horizontal scaling for compute-intensive workloads.
Field | Description |
---|---|
PodSpec PodSpec | (Members of PodSpec are embedded into this type.)PodSpec for the worker Allows customization of the Kubernetes Pod configuration specifically for worker nodes. |
size int | Size of the worker; this is the number of pods in the worker. Controls how many worker pod instances will be deployed for horizontal scaling. |
runner RunnerSpec | Runner container override for customizing the main container This is essentially a container spec that can override the default container Provides fine-grained control over the container that executes the worker node's processing logic. |
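A hedged sketch of a worker spec scaling to four pods with a runner override, with an embedded PodSpec member inline (image and instance-type label are hypothetical):

```yaml
# Hypothetical fragment: worker nodes for a multi-node component.
workerSpec:
  size: 4                                    # four worker pod instances
  runner:
    name: worker
    image: my-registry/sglang-worker:latest  # hypothetical override image
  nodeSelector:                              # embedded PodSpec member
    node.kubernetes.io/instance-type: gpu-a100
```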