Model Agent Administration

Complete guide to Model Agent architecture, configuration, and operational management.

The Model Agent is the core component responsible for downloading, managing, and distributing models across your OME cluster. This guide provides comprehensive information for cluster administrators who need to configure, monitor, and troubleshoot the Model Agent in production environments.

Architecture Overview

DaemonSet Deployment

The Model Agent is deployed as a Kubernetes DaemonSet, ensuring it runs on every node in your cluster. This distributed architecture provides several benefits:

  • Parallel Downloads: Models are downloaded simultaneously across all selected nodes
  • High Availability: No single point of failure for model distribution
  • Local Storage: Models are stored locally on each node for optimal performance
  • Node-Specific Configuration: Each agent can be configured for the specific hardware and storage on its node

Model Agent Lifecycle

When you create a BaseModel resource, here’s the detailed workflow:

1. Resource Discovery

  • Kubernetes Informers: Each Model Agent uses Kubernetes informers to watch for BaseModel and ClusterBaseModel resources
  • Change Detection: The agent detects new models, updates to existing models, and model deletions in real-time
  • Event Processing: Changes are queued and processed asynchronously to prevent blocking
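
For illustration, the sketch below shows how such a watch can be wired up with client-go dynamic informers; the group/version and the handler bodies are assumptions, not the agent's exact implementation.

import (
    "time"

    "k8s.io/apimachinery/pkg/runtime/schema"
    "k8s.io/client-go/dynamic"
    "k8s.io/client-go/dynamic/dynamicinformer"
    "k8s.io/client-go/rest"
    "k8s.io/client-go/tools/cache"
)

func watchClusterBaseModels(cfg *rest.Config, stopCh <-chan struct{}) error {
    client, err := dynamic.NewForConfig(cfg)
    if err != nil {
        return err
    }

    // Resync every 10 minutes, mirroring --model-watch-resync-period.
    factory := dynamicinformer.NewDynamicSharedInformerFactory(client, 10*time.Minute)

    // GVR for ClusterBaseModel resources (version is an assumption).
    gvr := schema.GroupVersionResource{Group: "ome.io", Version: "v1beta1", Resource: "clusterbasemodels"}

    informer := factory.ForResource(gvr).Informer()
    informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
        AddFunc:    func(obj interface{}) { /* enqueue a Download task */ },
        UpdateFunc: func(oldObj, newObj interface{}) { /* enqueue a DownloadOverride task */ },
        DeleteFunc: func(obj interface{}) { /* enqueue a Delete task */ },
    })

    factory.Start(stopCh)
    factory.WaitForCacheSync(stopCh)
    return nil
}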

2. Node Selection Evaluation

  • Label Matching: The agent evaluates nodeSelector labels against the current node’s labels
  • Affinity Evaluation: Complex nodeAffinity rules are processed using Kubernetes’ standard affinity logic
  • Eligibility Decision: The agent determines if the current node should host this model
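
A minimal sketch of the nodeSelector half of this check; full nodeAffinity evaluation follows the standard Kubernetes semantics and is omitted here:

// nodeSelectorMatches reports whether a model's nodeSelector is satisfied
// by the labels on the node this agent runs on.
func nodeSelectorMatches(nodeSelector, nodeLabels map[string]string) bool {
    for key, want := range nodeSelector {
        if got, ok := nodeLabels[key]; !ok || got != want {
            return false
        }
    }
    return true // an empty selector matches every node
}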

3. Task Creation and Queuing

The agent creates different types of tasks based on the operation:

  • Download Task: For new models that need to be downloaded
  • DownloadOverride Task: For existing models that need to be updated
  • Delete Task: For models that should be removed from the node
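
A simplified sketch of how these task kinds might be modeled; the type and field names are illustrative, not the agent's exact identifiers:

// TaskType distinguishes the operations the agent queues per model.
type TaskType int

const (
    Download         TaskType = iota // fetch a model that is not yet on this node
    DownloadOverride                 // re-fetch a model whose spec changed
    Delete                           // remove a model that no longer targets this node
)

// Task pairs an operation with the model it applies to.
type Task struct {
    Type     TaskType
    ModelKey string // namespace/name of the BaseModel or ClusterBaseModel
}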

4. Download Execution

The download process varies by storage backend but follows this general pattern:

OCI Object Storage Downloads

  1. Authentication: Establish connection using configured auth method (Instance Principal, User Principal, etc.)
  2. Object Listing: List all objects under the specified prefix
  3. Bulk Download: Download files concurrently with configurable parallelism
  4. Verification: Verify file integrity using MD5 checksums
  5. Atomic Placement: Move verified files to final destination

Hugging Face Downloads

  1. Repository Analysis: Query the Hugging Face API for repository information
  2. File Filtering: Determine which files are needed based on model format
  3. LFS Handling: Handle Git LFS files seamlessly
  4. Progressive Download: Download files with progress tracking
  5. Cache Management: Manage local cache for efficiency

Authentication

The Model Agent supports flexible authentication for Hugging Face models:

  1. Secret-based Authentication: Use Kubernetes secrets to store tokens
  2. Parameter-based Authentication: Include tokens directly in model parameters
  3. Custom Secret Key Names: Configure the secret key name (defaults to “token”)

Example with custom secret key:

spec:
  storage:
    storageUri: "hf://meta-llama/Llama-2-7b-hf"
    key: "hf-credentials"
    parameters:
      secretKey: "access-token"  # Custom key name in the secret

This allows you to store Hugging Face tokens in secrets with any key name, not just “token”.

5. Model Parsing and Analysis

After successful download, the agent performs comprehensive model analysis:

Configuration Parsing

  • config.json Analysis: Parse the model configuration file to extract metadata
  • Architecture Detection: Identify the model architecture using specialized parsers
  • Capability Inference: Determine model capabilities based on architecture and configuration
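
As a rough sketch, parsing a config.json for a few common Hugging Face fields might look like the following; the field set shown is a small, assumed subset of what the agent actually reads:

import (
    "encoding/json"
    "os"
)

// modelConfig captures a few common config.json fields; real model
// configurations carry many more architecture-specific keys.
type modelConfig struct {
    Architectures   []string `json:"architectures"`
    ModelType       string   `json:"model_type"`
    NumHiddenLayers int      `json:"num_hidden_layers"`
    HiddenSize      int      `json:"hidden_size"`
    TorchDtype      string   `json:"torch_dtype"`
}

func parseModelConfig(path string) (*modelConfig, error) {
    data, err := os.ReadFile(path)
    if err != nil {
        return nil, err
    }
    var cfg modelConfig
    if err := json.Unmarshal(data, &cfg); err != nil {
        return nil, err
    }
    return &cfg, nil
}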

SafeTensors Analysis

For models using SafeTensors format:

  • Metadata Extraction: Read tensor metadata from SafeTensors headers
  • Parameter Counting: Calculate exact parameter counts from tensor shapes
  • Memory Estimation: Estimate memory requirements based on data types
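
Because a SafeTensors file begins with a length-prefixed JSON table, parameter counting only needs the header. A minimal sketch, independent of the agent's actual parser:

import (
    "encoding/binary"
    "encoding/json"
    "io"
    "os"
)

// tensorInfo mirrors one entry in a SafeTensors header.
type tensorInfo struct {
    Dtype string  `json:"dtype"`
    Shape []int64 `json:"shape"`
}

// countParameters reads the 8-byte little-endian header length, parses the
// JSON header, and sums the element counts of every tensor.
func countParameters(path string) (int64, error) {
    f, err := os.Open(path)
    if err != nil {
        return 0, err
    }
    defer f.Close()

    var headerLen uint64
    if err := binary.Read(f, binary.LittleEndian, &headerLen); err != nil {
        return 0, err
    }
    headerBytes := make([]byte, headerLen)
    if _, err := io.ReadFull(f, headerBytes); err != nil {
        return 0, err
    }

    var header map[string]tensorInfo
    if err := json.Unmarshal(headerBytes, &header); err != nil {
        return 0, err
    }

    var total int64
    for name, t := range header {
        if name == "__metadata__" { // optional free-form metadata entry
            continue
        }
        count := int64(1)
        for _, dim := range t.Shape {
            count *= dim
        }
        total += count
    }
    return total, nil
}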

6. Status Updates and Node Labeling

  • ConfigMap Updates: Update per-node ConfigMaps with model status
  • Node Labeling: Apply labels to nodes indicating model availability
  • Metric Emission: Update Prometheus metrics for monitoring
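
For illustration, node labeling can be done with a small merge patch through client-go; the label key and value shown here are assumptions:

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/types"
    "k8s.io/client-go/kubernetes"
)

// labelNodeModelReady patches the node with a label marking the model as
// available on this node.
func labelNodeModelReady(ctx context.Context, client kubernetes.Interface, nodeName, modelLabel string) error {
    patch := []byte(fmt.Sprintf(`{"metadata":{"labels":{%q:"Ready"}}}`, modelLabel))
    _, err := client.CoreV1().Nodes().Patch(ctx, nodeName, types.MergePatchType, patch, metav1.PatchOptions{})
    return err
}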

Configuration Reference

Command Line Arguments

The Model Agent supports extensive configuration through command-line arguments:

Download Configuration

| Argument | Default | Description |
|---|---|---|
| --download-retry | 3 | Number of retry attempts for failed downloads |
| --concurrency | 4 | Number of concurrent file downloads per model |
| --multipart-concurrency | 4 | Number of concurrent chunks for large file downloads |
| --num-download-worker | 5 | Number of parallel download workers across all models |
| --hf-max-workers | 4 | Maximum concurrent workers for Hugging Face downloads |
| --hf-max-retries | 10 | Maximum retry attempts for Hugging Face API calls |
| --hf-retry-interval | 15s | Base retry interval for Hugging Face API errors |

Storage Configuration

| Argument | Default | Description |
|---|---|---|
| --models-root-dir | /mnt/models | Root directory for storing models on nodes |
| --temp-dir | /tmp/model-downloads | Temporary directory for downloads |
| --cleanup-temp | true | Whether to clean up temporary files after download |

Node and Cluster Configuration

| Argument | Default | Description |
|---|---|---|
| --node-name | $NODE_NAME | Name of the current node (usually from environment) |
| --namespace | ome | Kubernetes namespace for ConfigMaps and status tracking |
| --node-label-retry | 5 | Number of retries for updating node labels |

Logging and Monitoring

| Argument | Default | Description |
|---|---|---|
| --log-level | info | Log verbosity (debug, info, warn, error) |
| --log-format | text | Log format (text, json) |
| --port | 8080 | HTTP port for health checks and metrics |
| --metrics-port | 8080 | Port for Prometheus metrics endpoint |

Advanced Configuration

| Argument | Default | Description |
|---|---|---|
| --config-map-sync-interval | 30s | Interval for syncing ConfigMaps |
| --model-watch-resync-period | 10m | Resync period for model watchers |
| --max-concurrent-reconciles | 10 | Maximum concurrent reconciliation operations |

Environment Variables

The Model Agent also supports configuration through environment variables:

| Variable | Description |
|---|---|
| NODE_NAME | Name of the current Kubernetes node |
| POD_NAMESPACE | Namespace where the Model Agent pod is running |
| OCI_CONFIG_FILE | Path to OCI configuration file |
| HUGGINGFACE_TOKEN | Default Hugging Face access token |

Advanced Download Features

TensorRT-LLM Support and Shape Filtering

For TensorRT-LLM models, the Model Agent provides intelligent shape filtering:

GPU Shape Detection

  1. Hardware Detection: The agent detects the GPU configuration on the current node
  2. Shape Identification: Maps GPU hardware to TensorRT-LLM shape identifiers (e.g., GPU.A100.4, GPU.H100.8)
  3. File Filtering: Only downloads model files that match the detected GPU shape

Shape Filtering Logic

// Simplified shape filtering logic
func (d *TensorRTLLMDownloader) filterByShape(files []string, nodeShape string) []string {
    var filtered []string
    for _, file := range files {
        if strings.Contains(file, nodeShape) || isShapeAgnostic(file) {
            filtered = append(filtered, file)
        }
    }
    return filtered
}

This optimization can save significant storage space and download time for large TensorRT-LLM models that may contain multiple GPU shape variants.

Concurrent Download Optimization

Bulk Download Strategy

For OCI Object Storage, the agent implements a sophisticated bulk download strategy:

  1. Object Listing: List all objects under the model prefix
  2. Size-Based Chunking: Split large files into chunks for parallel download
  3. Connection Pooling: Reuse HTTP connections for efficiency
  4. Rate Limiting: Respect storage backend rate limits

Multipart Download Logic

Large files (>200MB) are automatically split into chunks:

type MultipartDownload struct {
    URL        string
    ChunkSize  int64
    TotalSize  int64
    Chunks     []ChunkInfo
}

type ChunkInfo struct {
    Start  int64
    End    int64
    Status DownloadStatus
}

Resume Capability

The Model Agent supports resuming interrupted downloads:

  1. Progress Tracking: Track download progress for each file
  2. Partial File Detection: Detect partially downloaded files on restart
  3. Range Requests: Use HTTP range requests to resume from last position
  4. Integrity Verification: Verify resumed downloads maintain file integrity
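
A simplified sketch of step 3, resuming from the current size of a partial file with an HTTP Range request; production code also re-verifies checksums and handles servers that ignore Range:

import (
    "fmt"
    "io"
    "net/http"
    "os"
)

// resumeDownload appends the remainder of a file to a partial local copy.
func resumeDownload(url, localPath string) error {
    f, err := os.OpenFile(localPath, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
    if err != nil {
        return err
    }
    defer f.Close()

    stat, err := f.Stat()
    if err != nil {
        return err
    }

    req, err := http.NewRequest(http.MethodGet, url, nil)
    if err != nil {
        return err
    }
    // Ask the server only for the bytes we do not have yet.
    req.Header.Set("Range", fmt.Sprintf("bytes=%d-", stat.Size()))

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusPartialContent {
        return fmt.Errorf("server did not honor range request: %s", resp.Status)
    }

    _, err = io.Copy(f, resp.Body)
    return err
}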

Verification and Integrity

Comprehensive File Verification

Every downloaded file undergoes rigorous verification:

Size Verification

func verifyFileSize(localPath string, expectedSize int64) error {
    stat, err := os.Stat(localPath)
    if err != nil {
        return err
    }
    if stat.Size() != expectedSize {
        return fmt.Errorf("size mismatch: expected %d, got %d", expectedSize, stat.Size())
    }
    return nil
}

Checksum Verification

  • MD5 Checksums: Computed and verified against object storage metadata
  • SHA256 Support: For storage backends that provide SHA256 checksums
  • Custom Checksums: Support for vendor-specific checksum methods

Atomic Operations

Files are downloaded to temporary locations and only moved to final destinations after successful verification:

  1. Temporary Download: Download to .tmp extension
  2. Verification: Verify size and checksum
  3. Atomic Move: Rename to final filename (atomic operation on most filesystems)
  4. Cleanup: Remove temporary files on failure
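
A minimal sketch of the final two steps, assuming the temporary file sits next to its destination with a .tmp suffix:

import "os"

// placeAtomically renames a verified ".tmp" download onto its final path.
// os.Rename is atomic on most filesystems when both paths are on the same
// mount; a cross-filesystem move would need a copy-and-rename fallback.
func placeAtomically(finalPath string) error {
    tmpPath := finalPath + ".tmp"
    if err := os.Rename(tmpPath, finalPath); err != nil {
        os.Remove(tmpPath) // drop the partial file so a retry starts clean
        return err
    }
    return nil
}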

Thread Safety and Concurrency

ConfigMap Coordination

The Model Agent uses sophisticated locking for thread-safe ConfigMap operations:

type ConfigMapManager struct {
    mutex     sync.RWMutex
    configMap *v1.ConfigMap
    client    kubernetes.Interface
}

func (cm *ConfigMapManager) UpdateModelStatus(modelKey string, status ModelStatus) error {
    cm.mutex.Lock()
    defer cm.mutex.Unlock()

    // Update the ConfigMap data for this model, then write the ConfigMap
    // back to the API server, retrying on update conflicts.
    // (Details elided in this excerpt.)
    return nil
}

Model Update Handling

When an existing model is updated:

  1. Change Detection: Deep comparison of model specifications
  2. Status Transition: Set model status to “Updating”
  3. Graceful Replacement: Download new version alongside existing
  4. Verification: Verify new download before removing old version
  5. Atomic Switch: Atomically replace old model with new version

Download Cancellation

The Model Agent supports graceful cancellation of ongoing downloads:

  1. Active Download Tracking: Maintains a registry of all active downloads
  2. Immediate Cancellation: When a model is deleted, any ongoing download is cancelled immediately
  3. Context Propagation: Uses Go contexts to propagate cancellation throughout the download pipeline
  4. Cleanup: Ensures partial downloads are cleaned up after cancellation

This prevents the situation where deleting a model resource had to wait for the entire download to complete before the deletion could proceed.

Note: For OCI Object Storage downloads, cancellation is best-effort as the underlying bulk download doesn’t support granular cancellation yet. However, Hugging Face downloads support immediate cancellation.
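
A simplified sketch of how such a cancellation registry can be built around Go contexts; the names are illustrative:

import (
    "context"
    "sync"
)

// downloadRegistry tracks the cancel function for every in-flight download
// so a model deletion can stop its download immediately.
type downloadRegistry struct {
    mu      sync.Mutex
    cancels map[string]context.CancelFunc // keyed by namespace/name
}

func (r *downloadRegistry) start(parent context.Context, modelKey string) context.Context {
    ctx, cancel := context.WithCancel(parent)
    r.mu.Lock()
    r.cancels[modelKey] = cancel
    r.mu.Unlock()
    return ctx // pass this context through the whole download pipeline
}

func (r *downloadRegistry) cancel(modelKey string) {
    r.mu.Lock()
    defer r.mu.Unlock()
    if cancel, ok := r.cancels[modelKey]; ok {
        cancel()
        delete(r.cancels, modelKey)
    }
}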

Worker Pool Management

The agent uses worker pools for concurrent operations:

type WorkerPool struct {
    workers    int
    taskQueue  chan Task
    resultChan chan Result
    ctx        context.Context
    cancel     context.CancelFunc
}

Monitoring and Observability

Health Check Endpoints

The Model Agent exposes several HTTP endpoints for monitoring:

Basic Health Check (/healthz)

func (h *HealthHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
    // Check if models root directory is accessible
    if _, err := os.Stat(h.modelsRootDir); err != nil {
        http.Error(w, "Models directory not accessible", http.StatusServiceUnavailable)
        return
    }
    
    w.WriteHeader(http.StatusOK)
    w.Write([]byte("OK"))
}

Readiness Check (/readyz)

Checks if the agent is ready to process models:

  • Kubernetes API connectivity
  • Storage backend accessibility
  • Required directories exist

Liveness Check (/livez)

Verifies the agent is functioning correctly:

  • Worker pool status
  • Recent operation success
  • Memory usage within limits

Comprehensive Metrics

The Model Agent provides detailed Prometheus metrics:

Download Metrics

# Total successful downloads
model_agent_downloads_success_total{model_type="llama", namespace="default", name="llama-70b"} 1

# Total failed downloads  
model_agent_downloads_failed_total{model_type="llama", namespace="default", name="llama-70b"} 0

# Download duration
model_agent_download_duration_seconds{model_type="llama", namespace="default", name="llama-70b"} 1234.56

# Download size in bytes
model_agent_download_bytes_total{model_type="llama", namespace="default", name="llama-70b"} 140737488355328

Verification Metrics

# Verification results
model_agent_verifications_total{model_type="llama", namespace="default", name="llama-70b", result="success"} 1

# Verification duration
model_agent_verification_duration_seconds 12.34

# MD5 checksum failures
model_agent_md5_checksum_failed_total{model_type="llama", namespace="default", name="llama-70b"} 0

Runtime Metrics

# Current goroutines
go_goroutines_current 45

# Memory allocation
go_memory_alloc_bytes 67108864

# GC pause time
go_gc_pause_duration_seconds_custom 0.001234

Agent-Specific Metrics

# Active download workers
model_agent_active_workers 3

# Queue depth
model_agent_task_queue_depth 2

# ConfigMap update operations
model_agent_configmap_operations_total{operation="update", result="success"} 15

Rate Limiting Protection

The Model Agent includes sophisticated rate limiting protection for the Hugging Face API:

Automatic Backoff

  • Exponential Backoff: Automatically increases wait time between retries
  • Jitter: Adds randomness to prevent thundering herd
  • Retry-After: Respects server-provided retry delays
  • Max Retries: Configurable retry limit (default: 10)
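
A sketch of the backoff computation described above, assuming a positive base interval such as the 15s default for --hf-retry-interval:

import (
    "math/rand"
    "net/http"
    "strconv"
    "time"
)

// retryDelay computes the wait before the next attempt: exponential growth
// from the base interval plus random jitter, overridden by a Retry-After
// header when the server provides one.
func retryDelay(base time.Duration, attempt int, resp *http.Response) time.Duration {
    if resp != nil {
        if ra := resp.Header.Get("Retry-After"); ra != "" {
            if secs, err := strconv.Atoi(ra); err == nil {
                return time.Duration(secs) * time.Second
            }
        }
    }
    backoff := base << attempt                        // exponential: base, 2x, 4x, ...
    jitter := time.Duration(rand.Int63n(int64(base))) // spread retries so agents do not align
    return backoff + jitter
}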

Staggered Start

When multiple agents start simultaneously (e.g., after cluster restart), they automatically stagger their initialization:

  • Each node gets a deterministic delay based on its name (0-30 seconds)
  • Prevents all agents from hitting the API at once
  • Reduces initial rate limiting issues
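
A minimal sketch of deriving such a deterministic delay from the node name; the hash function and window are assumptions:

import (
    "hash/fnv"
    "time"
)

// startupDelay maps a node name to a stable 0-30 second delay so agents
// across the cluster do not hit the Hugging Face API simultaneously.
func startupDelay(nodeName string) time.Duration {
    h := fnv.New32a()
    h.Write([]byte(nodeName))
    return time.Duration(h.Sum32()%31) * time.Second
}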

Best Practices for Large Clusters

  1. Limit concurrent downloads: Use fewer download workers for large clusters
  2. Increase retry intervals: Set longer base retry intervals
  3. Monitor rate limits: Watch for 429 errors in logs
  4. Use regional endpoints: Consider using region-specific Hugging Face endpoints

Example configuration for large clusters:

args:
- --hf-max-workers=2
- --hf-max-retries=15
- --hf-retry-interval=30s
- --num-download-worker=3

Troubleshooting Guide

Common Issues and Solutions

Downloads Fail with Permission Errors

Symptoms:

  • HTTP 403 Forbidden errors
  • Authentication failures in logs

Diagnosis:

# Check OCI authentication
kubectl exec -it <model-agent-pod> -- oci iam user get --user-id <user-ocid>

# Check Hugging Face token
kubectl get secret <hf-token-secret> -o yaml

Solutions:

  • Verify OCI Instance Principal permissions
  • Check Hugging Face token validity
  • Ensure secrets are properly mounted

Model Parsing Failures

Symptoms:

  • Models download but status shows “Failed”
  • Parsing errors in agent logs

Diagnosis:

# Check model directory contents
kubectl exec -it <model-agent-pod> -- ls -la /mnt/models/<model-name>/

# Verify config.json exists and is valid
kubectl exec -it <model-agent-pod> -- cat /mnt/models/<model-name>/config.json | jq .

Solutions:

  • Verify model directory structure
  • Check if config.json is valid JSON
  • Consider disabling auto-parsing with annotation

Storage Space Issues

Symptoms:

  • Downloads fail with “no space left on device”
  • Node storage metrics show high usage

Diagnosis:

# Check node storage usage
kubectl exec -it <model-agent-pod> -- df -h

# Check model directory sizes
kubectl exec -it <model-agent-pod> -- du -sh /mnt/models/*

Solutions:

  • Increase node storage capacity
  • Implement model cleanup policies
  • Use node affinity to target nodes with sufficient storage

Performance Issues

Symptoms:

  • Slow download speeds
  • High memory usage
  • Agent pod restarts

Diagnosis:

# Check resource usage
kubectl top pod <model-agent-pod>

# Check agent configuration
kubectl describe pod <model-agent-pod>

# Review agent metrics
curl http://<model-agent-pod>:8080/metrics | grep model_agent

Solutions:

  • Adjust concurrency settings
  • Increase resource limits
  • Optimize storage backend configuration

Debug Mode

Enable debug logging for detailed troubleshooting:

spec:
  containers:
  - name: model-agent
    args:
    - --log-level=debug
    - --log-format=json

Log Analysis

Key log patterns to monitor:

# Download progress
grep "downloading file" /var/log/model-agent.log

# Verification results
grep "verification" /var/log/model-agent.log

# Configuration updates
grep "configmap update" /var/log/model-agent.log

# Error patterns
grep "ERROR\|Failed\|Error" /var/log/model-agent.log

Production Best Practices

Resource Planning

Memory Requirements

  • Base Memory: 256Mi minimum for agent operations
  • Download Buffers: Additional 1-2Gi for concurrent downloads
  • Model Parsing: 512Mi-1Gi for parsing large model configurations

Storage Planning

  • Local Storage: Use fast local storage (NVMe SSDs) for model paths
  • Capacity Planning: Plan for 2-3x model size for download + verification
  • Cleanup Policies: Implement automated cleanup for old model versions

Network Bandwidth

  • Download Bandwidth: Ensure sufficient bandwidth for multiple concurrent model downloads
  • Egress Costs: Consider egress costs for cloud storage downloads
  • Regional Placement: Place agents in the same region as storage when possible

Security Configuration

RBAC Requirements

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: model-agent
rules:
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["get", "list", "watch", "create", "update", "patch"]
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get", "list", "watch", "patch"]
- apiGroups: ["ome.io"]
  resources: ["basemodels", "clusterbasemodels"]
  verbs: ["get", "list", "watch"]

Secret Management

  • Use least-privilege service accounts
  • Rotate credentials regularly
  • Implement secret scanning and monitoring

Monitoring Setup

Alerting Rules

groups:
- name: model-agent
  rules:
  - alert: ModelDownloadFailure
    expr: increase(model_agent_downloads_failed_total[5m]) > 0
    for: 2m
    annotations:
      summary: "Model download failure detected"
      
  - alert: ModelAgentDown
    expr: up{job="model-agent"} == 0
    for: 1m
    annotations:
      summary: "Model Agent is down"

Dashboard Metrics

Key metrics to dashboard:

  • Download success/failure rates
  • Download duration and throughput
  • Storage usage per node
  • Agent resource utilization
  • ConfigMap update frequency

This comprehensive guide provides the operational knowledge needed to effectively manage the Model Agent in production OME deployments.