Ingress and External Access

Configure external access to your AI inference services through ingress controllers, load balancers, and service routing.

What is Ingress in OME?

Ingress in OME provides external access to your AI inference services running inside the Kubernetes cluster. When you deploy an InferenceService, OME automatically creates the appropriate ingress resources based on your deployment mode and cluster configuration, allowing external clients to make API calls to your models.

Think of ingress as the “front door” to your AI services - it handles incoming HTTP requests from outside the cluster and routes them to the correct model endpoints inside your cluster.

Deployment Modes and Ingress Types

OME supports three different ingress strategies depending on how you deploy your inference services:

Serverless Mode (Knative)

Best for: Variable workloads, auto-scaling, cost optimization

When deploying in serverless mode, OME integrates with Knative and creates Istio VirtualService resources for routing:

apiVersion: ome.io/v1beta1
kind: InferenceService
metadata:
  name: llama-chat
spec:
  engine:
    model: llama-3-70b-instruct
    resources:
      requests:
        nvidia.com/gpu: 1

This automatically creates external access at:

https://llama-chat.your-namespace.example.com
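
You can read the assigned URL back from the InferenceService status and inspect the VirtualService OME created for it; the exact hostname depends on your cluster's domain configuration:

# Read the externally routable URL from the service status
kubectl get inferenceservice llama-chat -o jsonpath='{.status.url}'

# Inspect the Istio VirtualService created for routing
kubectl get virtualservice -l ome.io/inferenceservice=llama-chat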

Raw Deployment Mode

Best for: Consistent workloads, dedicated resources, custom configurations

Raw deployments use standard Kubernetes ingress controllers:

apiVersion: ome.io/v1beta1
kind: InferenceService
metadata:
  name: llama-chat
  annotations:
    ome.io/deployment-mode: "RawDeployment"
spec:
  engine:
    model: llama-3-70b-instruct
    resources:
      requests:
        nvidia.com/gpu: 2

This creates a Kubernetes Ingress resource accessible at:

https://llama-chat.your-namespace.your-cluster.com
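
For orientation, the generated object is a standard networking.k8s.io/v1 Ingress along the lines of the sketch below. This is illustrative only: the names, annotations, and defaults OME actually emits depend on your cluster configuration, and the backend Service name here is an assumption based on the engine Service referenced later under Troubleshooting.

# Illustrative sketch; not the literal resource OME generates
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: llama-chat
  namespace: your-namespace
spec:
  rules:
  - host: llama-chat.your-namespace.your-cluster.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: llama-chat-engine   # assumed engine Service name
            port:
              number: 80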

MultiNode Mode

Best for: Large models requiring multiple GPUs, distributed inference

MultiNode deployments support the same ingress options as Raw deployments:

apiVersion: ome.io/v1beta1
kind: InferenceService
metadata:
  name: llama-405b
  annotations:
    ome.io/deployment-mode: "MultiNode"
spec:
  engine:
    model: llama-3-405b-instruct
    resources:
      requests:
        nvidia.com/gpu: 8
    parallelism: 4  # Distributed across 4 nodes

Component-Based Routing

OME’s inference services can include multiple components that work together. The ingress system automatically creates the right routing rules based on which components you deploy:

Engine Only (Basic Inference)

spec:
  engine:
    model: llama-3-70b-instruct

Routing: All requests → Engine service

Engine + Router (Advanced Inference)

spec:
  engine:
    model: llama-3-70b-instruct
  router:
    template:
      spec:
        containers:
        - name: router
          image: custom-router:latest

Routing:

  • Top-level requests → Router service
  • Router processes and forwards → Engine service

Engine + Decoder (Post-Processing)

spec:
  engine:
    model: llama-3-70b-instruct
  decoder:
    template:
      spec:
        containers:
        - name: decoder
          image: custom-decoder:latest

Routing:

  • Top-level requests → Engine service
  • Decoder-specific requests (/v1/decoder/...) → Decoder service

Full Pipeline (Engine + Router + Decoder)

spec:
  engine:
    model: llama-3-70b-instruct
  router:
    template:
      spec:
        containers:
        - name: router
          image: custom-router:latest
  decoder:
    template:
      spec:
        containers:
        - name: decoder
          image: custom-decoder:latest

Routing:

  • Top-level requests → Router service
  • Router-specific requests (/v1/router/...) → Router service
  • Decoder-specific requests (/v1/decoder/...) → Decoder service
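
On clusters using the Gateway API, the routing above corresponds to path-prefix rules roughly like the following HTTPRoute. This is a conceptual sketch, not the literal resource OME emits; the Gateway name, decoder Service name, and ports are assumptions for illustration.

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llama-chat
spec:
  parentRefs:
  - name: ome-gateway                  # assumed Gateway name
  hostnames:
  - llama-chat.your-namespace.example.com
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /v1/decoder
    backendRefs:
    - name: llama-chat-decoder         # assumed decoder Service name
      port: 80
  - matches:
    - path:
        type: PathPrefix
        value: /v1/router
    backendRefs:
    - name: llama-chat                 # router fronts the top-level service
      port: 80
  - matches:
    - path:
        type: PathPrefix
        value: /
    backendRefs:
    - name: llama-chat                 # all other requests also go to the router
      port: 80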

Making API Calls

Once your inference service is deployed and ingress is configured, you can make API calls using standard HTTP clients:

Basic Text Generation

# Get the ingress URL
kubectl get inferenceservice llama-chat -o jsonpath='{.status.url}'

# Make a completion request
curl -X POST https://llama-chat.your-namespace.example.com/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3-70b-instruct",
    "prompt": "Explain quantum computing in simple terms:",
    "max_tokens": 150,
    "temperature": 0.7
  }'

Chat Completions

curl -X POST https://llama-chat.your-namespace.example.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3-70b-instruct",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ],
    "max_tokens": 200
  }'

Component-Specific Endpoints

When using services with multiple components, you can access specific endpoints:

# Router-specific endpoint
curl -X POST https://llama-chat.your-namespace.example.com/v1/router/route \
  -H "Content-Type: application/json" \
  -d '{"request": "route this to the best model"}'

# Decoder-specific endpoint  
curl -X POST https://llama-chat.your-namespace.example.com/v1/decoder/decode \
  -H "Content-Type: application/json" \
  -d '{"tokens": [1, 2, 3, 4], "decode_format": "text"}'

# Health checks
curl -H "Accept: application/json" \
  https://llama-chat.your-namespace.example.com/health
curl -H "Accept: application/json" \
  https://llama-chat.your-namespace.example.com/v1/router/health
curl -H "Accept: application/json" \
  https://llama-chat.your-namespace.example.com/v1/decoder/health

Model Information

# List available models
curl -H "Accept: application/json" \
  https://llama-chat.your-namespace.example.com/v1/models

# Get model details
curl -H "Accept: application/json" \
  https://llama-chat.your-namespace.example.com/v1/models/llama-3-70b-instruct

Ingress Configuration Options

Custom Ingress Class

apiVersion: ome.io/v1beta1
kind: InferenceService
metadata:
  name: llama-chat
  annotations:
    ome.io/ingress-class: "nginx"  # Use nginx instead of the default Istio ingress
spec:
  engine:
    model: llama-3-70b-instruct

Cluster-Local Services (Internal Only)

apiVersion: ome.io/v1beta1
kind: InferenceService
metadata:
  name: internal-llama
  labels:
    networking.knative.dev/visibility: cluster-local
spec:
  engine:
    model: llama-3-70b-instruct

This creates a service accessible only from within the cluster:

# From inside the cluster
curl -X POST http://internal-llama.your-namespace.svc.cluster.local/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3-70b-instruct",
    "prompt": "Internal cluster request",
    "max_tokens": 100
  }'

Load Balancer Services

apiVersion: ome.io/v1beta1
kind: InferenceService
metadata:
  name: llama-chat
  annotations:
    ome.io/disable-ingress: "true"
    ome.io/service-type: "LoadBalancer"
spec:
  engine:
    model: llama-3-70b-instruct
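
With ingress creation disabled, clients connect to the load balancer address directly. A sketch, assuming the external address is attached to the engine Service (list the Services in your namespace to confirm the exact name):

# Look up the external address assigned by your cloud provider
# (some providers report a hostname instead of an IP)
kubectl get service llama-chat-engine \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}'

# Call the model through the load balancer address
curl -X POST http://<EXTERNAL-IP>/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3-70b-instruct", "prompt": "Hello", "max_tokens": 50}'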

Troubleshooting Ingress

Check Service Status

# Check inference service status
kubectl get inferenceservice llama-chat -o yaml

# Look for ingress readiness condition
kubectl get inferenceservice llama-chat -o jsonpath='{.status.conditions[?(@.type=="IngressReady")]}'

Verify Ingress Resources

# Check ingress resources
kubectl get ingress -l ome.io/inferenceservice=llama-chat

# For serverless mode, check VirtualService
kubectl get virtualservice -l ome.io/inferenceservice=llama-chat

# For Gateway API
kubectl get httproute -l ome.io/inferenceservice=llama-chat

Test Internal Connectivity

# Test engine service directly
kubectl port-forward service/llama-chat-engine 8080:80
curl -H "Accept: application/json" http://localhost:8080/health

# Test the router service (if it exists)
kubectl port-forward service/llama-chat 8080:80
curl -H "Accept: application/json" http://localhost:8080/health

Common Issues

Ingress Not Created: Check that ingress isn’t disabled and components are ready:

kubectl get inferenceservice llama-chat -o yaml | grep -A 5 "conditions:"

503/404 Errors: Verify target services exist and are healthy:

kubectl get services -l ome.io/inferenceservice=llama-chat
kubectl get pods -l ome.io/inferenceservice=llama-chat

DNS Issues: Ensure your ingress controller is properly configured with DNS:

kubectl get ingress llama-chat -o yaml
nslookup llama-chat.your-namespace.example.com

Per-Service Ingress Configuration

While cluster-wide ingress settings are configured in the OME ConfigMap, you can override specific ingress settings per InferenceService using annotations. This allows flexible customization without changing cluster defaults.

Available Annotation Overrides

  • ome.io/ingress-domain-template: custom domain template (e.g. {{.Name}}.{{.Namespace}}.ml.example.com)
  • ome.io/ingress-domain: fixed ingress domain (e.g. ml-services.example.com)
  • ome.io/ingress-additional-domains: additional domains, comma-separated (e.g. backup.example.com,alt.example.com)
  • ome.io/ingress-url-scheme: URL scheme override (e.g. https)
  • ome.io/ingress-disable-istio-virtualhost: disable the Istio VirtualService ("true")
  • ome.io/ingress-path-template: custom path template (e.g. /models/{{.Name}}/{{.Namespace}})
  • ome.io/ingress-disable-creation: skip ingress creation entirely ("true")
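
Several of these annotations can be combined on a single InferenceService. A sketch, with illustrative domain and path values:

apiVersion: ome.io/v1beta1
kind: InferenceService
metadata:
  name: llama-chat
  annotations:
    ome.io/ingress-domain: "ml-services.example.com"
    ome.io/ingress-url-scheme: "https"
    ome.io/ingress-path-template: "/models/{{.Name}}/{{.Namespace}}"
spec:
  engine:
    model: llama-3-70b-instruct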

Security Considerations

Authentication

OME ingress supports various authentication methods:

# Using API keys (if configured)
curl -X POST https://llama-chat.your-namespace.example.com/v1/completions \
  -H "Authorization: Bearer your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3-70b-instruct",
    "prompt": "API key authentication",
    "max_tokens": 100
  }'

# Using basic auth (if configured)  
curl -X POST https://llama-chat.your-namespace.example.com/v1/completions \
  -u username:password \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3-70b-instruct",
    "prompt": "Basic auth request",
    "max_tokens": 100
  }'

TLS/SSL

All external ingress endpoints should use HTTPS in production. Check your ingress controller documentation for TLS certificate configuration.
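
As a rough illustration, TLS on a standard Kubernetes Ingress is usually wired up with a tls stanza referencing a pre-provisioned certificate Secret. The host and Secret names below are assumptions, and how TLS is attached to the ingress OME creates depends on your controller and certificate tooling:

spec:
  tls:
  - hosts:
    - llama-chat.your-namespace.example.com
    secretName: llama-chat-tls   # assumed pre-provisioned TLS Secret in the same namespace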

Next Steps

For production deployments and cluster configuration, see the Administration section.

