Deploy a Simple Inference Service
This page shows you how to deploy a simple inference service using OME. You’ll learn how to create an InferenceService that serves a pre-trained model for real-time inference using SGLang and OpenAI-compatible APIs.
Before you begin
You need to have the following:
- A Kubernetes cluster with OME installed
- kubectl configured to communicate with your cluster
- GPU nodes available in your cluster (A100, H100, H200, or B4)
- Access to the OME container registry (ghcr.io/sgl-project/)
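Before starting, it can also help to confirm that kubectl is pointed at the intended cluster. This is a generic kubectl check, not an OME-specific one:

```bash
# Show which cluster/context kubectl is currently using
kubectl config current-context
```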
Step 1: Verify prerequisites
Check that OME is installed and running:
kubectl get pods -n ome
Expected output:
NAME READY STATUS RESTARTS AGE
ome-controller-manager-xxx 2/2 Running 0 5m
ome-model-controller-xxx 1/1 Running 0 5m
ome-model-agent-daemonset-xxx 1/1 Running 0 5m
Check available serving runtimes:
kubectl get clusterservingruntimes
Example output:
NAME AGE
srt-llama-3-2-1b-instruct 1d
srt-llama-3-2-3b-instruct 1d
srt-llama-3-3-70b-instruct 1d
srt-deepseek-r1 1d
srt-mistral-7b-instruct 1d
Verify GPU availability:
kubectl get nodes -o custom-columns="NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
Step 2: Deploy a small model (1B parameters)
Let’s start with a small model that requires only one GPU:
kubectl apply -f - <<EOF
apiVersion: v1
kind: Namespace
metadata:
  name: llama-1b-demo
---
apiVersion: ome.io/v1beta1
kind: InferenceService
metadata:
  name: llama-3-2-1b-instruct
  namespace: llama-1b-demo
spec:
  predictor:
    model:
      baseModel: llama-3-2-1b-instruct
      protocolVersion: openAI
    minReplicas: 1
    maxReplicas: 1
EOF
Step 3: Monitor deployment progress
Check the deployment status:
kubectl get inferenceservice -n llama-1b-demo
Monitor the pods:
kubectl get pods -n llama-1b-demo -w
Check the events for troubleshooting:
kubectl get events -n llama-1b-demo --sort-by=.metadata.creationTimestamp
The deployment is ready when the pod status shows Running
and the readiness probe passes.
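If you would rather block until the service is ready than watch pods, `kubectl wait` works on any resource that exposes a Ready condition. Assuming OME's InferenceService reports one (as KServe-style services do), something like this returns once the model is serving:

```bash
# Block (up to 15 minutes) until the InferenceService reports Ready
kubectl wait --for=condition=Ready \
  inferenceservice/llama-3-2-1b-instruct \
  -n llama-1b-demo --timeout=15m
```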
Step 4: Test the service
Method 1: Port Forward (for testing)
Forward the service port to your local machine:
kubectl port-forward -n llama-1b-demo svc/llama-3-2-1b-instruct 8080:8080
Test with a simple chat completion:
curl -X POST "http://localhost:8080/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3-2-1b-instruct",
    "messages": [
      {"role": "user", "content": "Hello! Can you introduce yourself?"}
    ],
    "max_tokens": 100,
    "temperature": 0.7
  }'
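As a quick sanity check that the OpenAI-compatible endpoint is up, you can also list the model IDs the server advertises. This assumes the port-forward above is still running and that the runtime exposes the standard /v1/models route, as SGLang's OpenAI-compatible server does:

```bash
# List the models served by the endpoint
curl http://localhost:8080/v1/models
```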
Method 2: In-Cluster Access
Create a test pod to access the service:
kubectl run test-client --rm -i --tty --image=curlimages/curl -- /bin/sh
From within the pod:
curl -X POST "http://llama-3-2-1b-instruct.llama-1b-demo:8080/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3-2-1b-instruct",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "max_tokens": 50
  }'
Step 5: Deploy a larger model (70B parameters)
For larger models, you’ll need multiple GPUs and more resources:
kubectl apply -f - <<EOF
apiVersion: v1
kind: Namespace
metadata:
  name: llama-70b-demo
---
apiVersion: ome.io/v1beta1
kind: InferenceService
metadata:
  name: llama-3-3-70b-instruct
  namespace: llama-70b-demo
spec:
  predictor:
    model:
      baseModel: llama-3-3-70b-instruct
      protocolVersion: openAI
      runtime: srt-llama-3-3-70b-instruct
    minReplicas: 1
    maxReplicas: 1
EOF
This configuration will:
- Use tensor parallelism across 4 GPUs (tp=4)
- Require ~160GB GPU memory
- Target H100/H200 GPU nodes
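Once the pod is scheduled, you can confirm that it was actually granted four GPUs with a generic kubectl query:

```bash
# Show the GPU limits requested by each container in the namespace
kubectl get pods -n llama-70b-demo \
  -o custom-columns='NAME:.metadata.name,GPUS:.spec.containers[*].resources.limits.nvidia\.com/gpu'
```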
Step 6: Deploy a multi-node model (600B+ parameters)
For very large models like DeepSeek-R1, use multi-node deployment with RDMA:
kubectl apply -f - <<EOF
apiVersion: v1
kind: Namespace
metadata:
  name: deepseek-r1
---
apiVersion: ome.io/v1beta1
kind: InferenceService
metadata:
  name: deepseek-r1
  namespace: deepseek-r1
  annotations:
    ome.io/deploymentMode: "MultiNode"
spec:
  predictor:
    model:
      baseModel: deepseek-r1
      protocolVersion: openAI
      runtime: srt-multi-node-deepseek-r1-rdma
    minReplicas: 1
    maxReplicas: 1
EOF
This deployment:
- Uses multi-node RDMA networking for optimal performance
- Supports 670B-parameter models
- Provides specialized reasoning capabilities
- Requires cluster nodes with RDMA networking support
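Before applying this manifest, check that your GPU nodes actually advertise RDMA resources. The exact resource name depends on the RDMA device plugin installed in your cluster, so the grep below is only a rough check:

```bash
# Look for RDMA-related resources and labels on a GPU node
kubectl describe node <gpu-node-name> | grep -i rdma
```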
Advanced Configuration Options
Custom Resource Requirements
Override the default resource requirements:
apiVersion: ome.io/v1beta1
kind: InferenceService
metadata:
  name: custom-resources
spec:
  predictor:
    model:
      baseModel: llama-3-2-3b-instruct
      resources:
        requests:
          cpu: "16"
          memory: 64Gi
          nvidia.com/gpu: 1
        limits:
          cpu: "16"
          memory: 64Gi
          nvidia.com/gpu: 1
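After the pod starts, you can verify that the overrides made it into the container spec. This assumes the serving container is listed first in the pod:

```bash
# Print the resource requests and limits of the first container
kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{.spec.containers[0].resources}'
```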
Environment Variables
Pass custom environment variables to the serving container:
spec:
  predictor:
    model:
      baseModel: llama-3-2-1b-instruct
      env:
        - name: CUSTOM_SETTING
          value: "production"
        - name: LOG_LEVEL
          value: "INFO"
Node Selection
Target specific node types:
spec:
  predictor:
    nodeSelector:
      node.kubernetes.io/instance-type: BM.GPU.H100.8
    tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
Monitoring and Debugging
Check Service Health
All OME serving runtimes include health endpoints:
# Basic health check
curl http://llama-3-2-1b-instruct.llama-1b-demo:8080/health
# Advanced health check (includes model loading status)
curl http://llama-3-2-1b-instruct.llama-1b-demo:8080/health_generate
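A simple way to wait for the model to finish loading is to poll the health endpoint in a loop. This sketch assumes you are port-forwarding the service to localhost:8080 as in Step 4:

```bash
# Poll until the server answers the health check successfully
until curl -sf http://localhost:8080/health > /dev/null; do
  echo "waiting for the model server..."
  sleep 5
done
echo "model server is healthy"
```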
View Metrics
OME exposes Prometheus metrics on port 8080:
curl http://llama-3-2-1b-instruct.llama-1b-demo:8080/metrics
Key metrics include:
- sglang_prompt_tokens_total - Total prompt tokens processed
- sglang_generation_tokens_total - Total tokens generated
- sglang_request_duration_seconds - Request latency distribution
- sglang_concurrent_requests - Current concurrent requests
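To spot-check these counters without a full Prometheus setup, filter the raw metrics output. This assumes the same port-forward as in Step 4, and the exact metric names may vary with the SGLang version:

```bash
# Show only the SGLang counters from the Prometheus endpoint
curl -s http://localhost:8080/metrics | grep '^sglang_'
```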
Debug Common Issues
Pod won’t start:
kubectl describe pod -n llama-1b-demo <pod-name>
kubectl logs -n llama-1b-demo <pod-name> -c ome-container
Model loading fails:
# Check if base model exists
kubectl get clusterbasemodels
# Check serving runtime compatibility
kubectl describe clusterservingruntime srt-llama-3-2-1b-instruct
GPU resource issues:
# Check GPU allocation
kubectl describe node <gpu-node-name> | grep nvidia.com/gpu
# View GPU utilization
kubectl exec -it -n llama-1b-demo <pod-name> -- nvidia-smi
Supported Models and Runtimes
Single-Node Models (1-8 GPUs)
- LLaMA 3.2 1B/3B: Single GPU deployment
- LLaMA 3.3 70B: 4-GPU tensor parallelism
- Mistral 7B: Single GPU with high throughput
- Mixtral 8x7B: Mixture of Experts architecture
Large Models (Multi-Node)
- DeepSeek-V3 (670B): Multi-node RDMA deployment
- DeepSeek-R1 (670B): Reasoning-optimized multi-node
- LLaMA 3.1 405B: FP8 quantized multi-node
Specialized Models
- E5-Mistral 7B: Text embedding generation
- LLaMA Vision: Multi-modal text and image processing
Performance Optimization
Tensor Parallelism
For multi-GPU models, OME automatically configures tensor parallelism:
- 1B models: tp=1 (single GPU)
- 3B models: tp=1 with memory optimization
- 70B models: tp=4 across 4 GPUs
- 400B+ models: Multi-node distribution
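For reference, the serving runtime passes the tensor-parallel degree straight to SGLang. A 70B runtime typically launches the server roughly like this (a sketch, not the exact OME runtime definition; check `python3 -m sglang.launch_server --help` for the flags in your SGLang version):

```bash
# Shard the model across 4 GPUs with tensor parallelism
python3 -m sglang.launch_server \
  --model-path="$MODEL_PATH" \
  --tp 4
```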
Memory Management
Configure memory fraction for optimal GPU utilization:
# Defined in the serving runtime; 0.9 lets SGLang use 90% of GPU memory
args:
  - |
    python3 -m sglang.launch_server \
      --mem-fraction-static 0.9 \
      --model-path="$MODEL_PATH"
Compilation Optimization
Enable PyTorch compilation for better performance:
args:
  - |
    python3 -m sglang.launch_server \
      --enable-torch-compile \
      --torch-compile-max-bs 1 \
      --model-path="$MODEL_PATH"
Next Steps
- Run Performance Benchmarks - Test your model’s performance
- Setup Autoscaling - Configure dynamic scaling
- Monitor with Prometheus - Set up comprehensive monitoring
- Deploy Multiple Models - Run multiple models efficiently
Cleanup
To remove the inference service:
kubectl delete inferenceservice -n llama-1b-demo llama-3-2-1b-instruct
kubectl delete inferenceservice -n llama-70b-demo llama-3-3-70b-instruct
kubectl delete inferenceservice -n deepseek-r1 deepseek-r1
This will clean up all associated resources including deployments, services, and storage.
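If the demo namespaces were created only for this walkthrough, you can remove them as well; this deletes anything still left inside them:

```bash
# Remove the demo namespaces created in Steps 2, 5, and 6
kubectl delete namespace llama-1b-demo llama-70b-demo deepseek-r1
```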