Cloud & DevOps

Kubernetes at the Edge: Deploying AI Workloads Closer to Data

Our approach to running inference models on edge clusters with K3s, minimizing latency for real-time vision processing.

Sindika DevOps Mar 1, 2026 8 min read

Your AI model runs beautifully in the cloud — 50ms inference on an A100 GPU. Then you deploy it to a factory floor where the nearest data center is 200km away, and suddenly your “real-time” vision system has 800ms latency. The safety alert arrives after the incident.

Edge computing solves this by bringing compute to the data. But running Kubernetes at the edge isn't the same as running it in AWS. The hardware is constrained, the network is unreliable, and there's nobody on-site to SSH in and restart a pod. Here's what we learned deploying AI inference clusters across petrochemical plants, manufacturing lines, and logistics hubs.

“We deploy K3s clusters at petrochemical plants running YOLOv8 inference on NVIDIA Jetson devices. 15ms inference latency, self-healing pods, and zero-touch updates — all running on hardware that fits in your palm.”

— Sindika DevOps

Chapter 1: Why K3s, Not K8s

Full Kubernetes is a 300MB+ binary that expects reliable networking, abundant RAM, and a control plane that can talk to etcd without interruption. Edge environments have none of that. K3s — Rancher's lightweight Kubernetes distribution — strips Kubernetes down to a single 70MB binary that runs on ARM64 devices with 512MB of RAM.

K3s achieves this by replacing etcd with embedded SQLite, bundling the Flannel CNI directly into the binary, and removing cloud-provider-specific controllers. You get the full Kubernetes API surface — pods, deployments, services, ingress — without the operational weight.

Full Kubernetes vs K3s at the Edge

Full Kubernetes:
  • etcd cluster (3+ nodes)
  • kube-apiserver, kube-scheduler, kube-controller-manager
  • cloud-controller-manager
  • 300MB+ binary
  • 2GB+ RAM minimum

K3s (Edge-Ready):
  • SQLite (embedded, single file)
  • Single binary (all-in-one)
  • ARM64 + x86 native
  • No cloud dependencies
  • Flannel CNI built-in
  • 70MB binary
  • Works in 512MB RAM

K3s removes cloud dependencies and heavy components, making Kubernetes viable on ARM64 edge hardware with limited resources.

# Install K3s on a Jetson device — single command
curl -sfL https://get.k3s.io | \
  INSTALL_K3S_EXEC="--disable=traefik --disable=servicelb" \
  K3S_KUBECONFIG_MODE="644" \
  sh -

# Verify it's running
sudo k3s kubectl get nodes
# NAME          STATUS   ROLES                  AGE   VERSION
# jetson-01     Ready    control-plane,master   30s   v1.29.2+k3s1

# Add a worker node (another Jetson). First copy the join token from the
# server (jetson-01): sudo cat /var/lib/rancher/k3s/server/node-token
# Then, on the worker:
curl -sfL https://get.k3s.io | \
  K3S_URL=https://jetson-01:6443 \
  K3S_TOKEN=<token-from-server> \
  sh -

That's it. Two commands and you have a production-grade Kubernetes cluster running on hardware that costs $500. Compare that to the weeks of setup, the dedicated etcd cluster, and the $50K/month bill a managed K8s service with cloud GPU instances can run to.

Chapter 2: The Edge AI Architecture

An edge AI deployment isn't a single device — it's a fleet. We typically deploy K3s clusters across 5-50 physical sites, each running the same inference workloads with local customizations (camera feeds, model variants, site-specific thresholds). The architecture has three layers: cloud control plane, edge clusters, and inference pods.

Edge AI Deployment Architecture

  • Cloud / data center: GitLab CI, container registry, Fleet manager, monitoring
  • WAN / 4G / 5G links connect the cloud to every site
  • Sites 1-N: identical K3s clusters, each running YOLO inference, RTSP ingestion, and a data buffer on Jetson Orin / AGX hardware

Each site runs an identical K3s cluster with YOLO inference, RTSP video ingestion, and local data buffering. Fleet manages every site from a single cloud control plane.

✅ Architecture Design Decisions

  • Inference at the edge, training in the cloud — never train models on edge devices. Train in the cloud, export optimized models (TensorRT / ONNX), deploy to edge for inference only.
  • Send alerts, not video — streaming raw video to the cloud costs $500+/month/camera in bandwidth. Send only detections, bounding boxes, and thumbnail crops.
  • Each site is autonomous — if cloud connectivity drops, the edge cluster keeps running. Alerts are stored locally and synced when connectivity returns.
  • Identical clusters, site-specific config — the K3s manifests are identical across sites. Only ConfigMaps differ (camera URLs, alert thresholds, location metadata).
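To make the last point concrete, a per-site ConfigMap might look like the sketch below. The object name, keys, and values are illustrative assumptions, not our actual manifests:

```shell
# Illustrative per-site ConfigMap: workload manifests stay identical across
# sites; only this object differs from site to site.
cat <<'EOF' > site-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: site-config
  namespace: ai-inference
data:
  SITE_NAME: "plant-alpha"
  CAMERA_URLS: "rtsp://10.0.1.100/stream1,rtsp://10.0.1.101/stream1"
  CONFIDENCE_THRESHOLD: "0.55"
EOF
# Apply per site: kubectl apply -f site-config.yaml
```

Pods then consume these values via envFrom, so the same Deployment spec works unchanged at every site.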

Chapter 3: The GPU Scheduling Problem

Running AI workloads on Kubernetes means GPU scheduling. At the edge, you typically have one GPU per device — not a pool of cloud GPUs. The NVIDIA Device Plugin for Kubernetes lets you request GPU resources in your pod spec, but the real challenge is sharing a single GPU across multiple inference workloads without one starving the others.
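Enabling that in K3s means deploying the device plugin as a DaemonSet. A sketch, with the caveats that the release version below is an assumption (pin whatever matches your cluster) and that on Jetson the NVIDIA container runtime must already be containerd's default runtime:

```shell
# Deploy the NVIDIA device plugin DaemonSet (v0.14.5 is an assumed version).
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.5/nvidia-device-plugin.yml

# Verify the node now advertises the GPU as an allocatable resource
kubectl get nodes -o jsonpath='{.items[*].status.allocatable.nvidia\.com/gpu}'
```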

GPU Resource Scheduling on an NVIDIA Jetson Orin (2048 CUDA cores, 16GB unified memory), allocated via the NVIDIA Device Plugin (nvidia.com/gpu: 1):

| Workload | GPU Memory | GPU Utilization |
| --- | --- | --- |
| YOLOv8 Inference | 4.2 GB | 65% |
| OCR Pipeline | 2.1 GB | 20% |
| Anomaly Model | 1.8 GB | 15% |

Multiple inference pods share a single Jetson GPU. Careful memory budgeting prevents OOM kills that would crash the entire inference pipeline.

# Pod spec requesting GPU + memory limits
apiVersion: v1
kind: Pod
metadata:
  name: yolo-inference
  labels:
    app: vision-pipeline
spec:
  containers:
  - name: inference
    image: registry.sindika.io/yolov8:v2.1-jetson
    resources:
      limits:
        nvidia.com/gpu: 1         # Request GPU access
        memory: "4Gi"             # Cap container memory
      requests:
        memory: "2Gi"
        cpu: "500m"
    env:
    - name: MODEL_PATH
      value: /models/yolov8n-fp16.engine   # TensorRT optimized
    - name: CONFIDENCE_THRESHOLD
      value: "0.45"
    - name: MAX_BATCH_SIZE
      value: "4"                  # Process 4 frames at once
    volumeMounts:
    - name: models
      mountPath: /models
      readOnly: true
    - name: rtsp-config
      mountPath: /config
    livenessProbe:
      httpGet:
        path: /health/inference    # Checks GPU health, not just HTTP
        port: 8080
      initialDelaySeconds: 30
      periodSeconds: 10
      failureThreshold: 3         # 3 failures → restart pod
  volumes:
  - name: models
    persistentVolumeClaim:
      claimName: model-storage    # Local NVMe, not network mount
  - name: rtsp-config
    configMap:
      name: camera-config         # Site-specific camera URLs (name illustrative)

🤔 GPU Gotchas at the Edge

  • OOM kills are silent — a GPU out-of-memory crash doesn't always produce a log entry. Your liveness probe must actively test inference, not just check HTTP readiness.
  • Jetson unified memory — Jetson shares RAM between CPU and GPU. A 16GB Orin doesn't have 16GB for your model — the OS, K3s, and other pods need their share too.
  • TensorRT compilation is device-specific — a model compiled for Jetson Orin won't run on Jetson Xavier. Build TensorRT engines on the target hardware, not in CI.
  • Thermal throttling — Jetsons in enclosed industrial cases throttle at 80°C. Your 15ms inference becomes 40ms. Monitor temperature and design proper ventilation.
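One way to act on that last bullet: scrape tegrastats and alert before throttling kicks in. A minimal sketch, where the sample line and the GPU@..C token format are assumptions (real tegrastats output varies by JetPack version), as is the 75C warning threshold:

```shell
# Parse the GPU temperature out of a tegrastats-style line and classify it.
line="RAM 11234/15823MB GR3D_FREQ 65% GPU@76.5C CPU@71.2C"
gpu_temp=$(echo "$line" | grep -o 'GPU@[0-9.]*C' | tr -d 'GPU@C')
status=$(awk -v t="$gpu_temp" 'BEGIN { if (t >= 75) print "WARN"; else print "OK" }')
echo "GPU ${gpu_temp}C -> ${status}"   # 76.5C is inside the throttling danger zone
```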

Chapter 4: Model Optimization for Edge

A YOLOv8-Large model runs at 50ms on an A100 but takes 200ms on a Jetson Orin. Edge deployment demands model optimization — not as an afterthought, but as a core part of your ML pipeline.

# Model optimization pipeline for Jetson deployment

# 1. Export PyTorch → ONNX
python export.py \
    --weights yolov8n.pt \
    --format onnx \
    --simplify \
    --input-shape 1,3,640,640

# 2. Convert ONNX → TensorRT (ON the Jetson device)
# --fp16: half-precision, ~2x faster with ~1% accuracy loss
# --workspace: 4GB of scratch memory for the optimizer
trtexec \
    --onnx=yolov8n.onnx \
    --saveEngine=yolov8n-fp16.engine \
    --fp16 \
    --workspace=4096 \
    --verbose

# 3. Benchmark on device
trtexec --loadEngine=yolov8n-fp16.engine --batch=4

# Results on Jetson Orin NX (16GB):
# ┌────────────┬───────┬───────┬───────┐
# │ Model      │ FP32  │ FP16  │ INT8  │
# ├────────────┼───────┼───────┼───────┤
# │ YOLOv8n    │ 22ms  │ 12ms  │ 8ms   │
# │ YOLOv8s    │ 38ms  │ 18ms  │ 13ms  │
# │ YOLOv8m    │ 85ms  │ 42ms  │ 28ms  │
# │ YOLOv8l    │ 200ms │ 95ms  │ 62ms  │
# └────────────┴───────┴───────┴───────┘

✅ Edge Model Optimization Checklist

  • Use FP16 always — half-precision runs 2x faster on Jetson with negligible accuracy loss (~0.5-1% mAP drop). This is the single biggest win.
  • Use the smallest model that meets accuracy — YOLOv8n (nano) at 12ms beats YOLOv8l at 95ms if both detect your target objects reliably. Test on real data.
  • TensorRT is mandatory — raw PyTorch on Jetson is 5-10x slower than TensorRT. Always convert to .engine files for production.
  • Batch inference when possible — processing 4 frames at once is almost the same latency as 1 frame. Batch from multiple cameras for throughput.
  • Lower resolution if acceptable — 640×640 → 416×416 reduces compute by 40%. For many use cases (PPE detection, vehicle counting), lower resolution works fine.
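The batching claim is worth a quick sanity check. Using the 12ms FP16 single-frame number from the table above and an assumed (not measured) ~14ms latency for a batch of 4:

```shell
# Back-of-envelope throughput with batching.
awk -v single=12 -v batch4=14 -v n=4 'BEGIN {
  printf "unbatched: %.1f ms/frame, %.0f fps\n", single, 1000 / single
  printf "batched:   %.1f ms/frame, %.0f fps\n", batch4 / n, n * 1000 / batch4
}'
```

If those assumptions hold, batching roughly triples per-camera throughput for a ~2ms hit to end-to-end latency.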

Chapter 5: Cloud vs Edge — The Numbers

The cloud vs edge decision isn't religious — it's mathematical. Here's how the two approaches compare across the metrics that actually matter for production AI inference:

Cloud vs Edge Inference Comparison

| Metric | Cloud Inference | Edge Inference |
| --- | --- | --- |
| Inference latency | 50ms + 200ms network | 15ms local |
| Total TTFB | 250-800ms | 15-30ms |
| Bandwidth cost | High (video streams) | Low (only alerts) |
| Offline capable | No | Yes (full autonomy) |
| GPU flexibility | Any cloud GPU | Fixed hardware |
| Model size limit | Unlimited | 8-16GB VRAM |
| Scale to N sites | Easy (cloud scale) | N hardware purchases |
| Data sovereignty | Data leaves site | Data stays on-prem |

The pattern is clear: edge wins on latency, bandwidth, offline capability, and data sovereignty. Cloud wins on flexibility, model size, and scaling. For real-time safety-critical applications (PPE detection, forklift alerts, anomaly monitoring), edge is the only viable choice. For batch processing or training, cloud is better.
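The bandwidth row deserves numbers. A back-of-envelope comparison, where the 4 Mbps stream bitrate, 20KB thumbnails, 50 events/min, and $0.50/GB cellular data rate are all assumptions rather than measured figures:

```shell
# Video streaming vs alert-only bandwidth, per camera per month.
awk 'BEGIN {
  secs  = 30 * 86400                        # seconds in a month
  video = 4 / 8 * secs / 1024               # 4 Mbps -> MB/s -> GB/month
  alert = 50 * 20 * 60 * 24 * 30 / 1024^2   # 50/min x 20KB thumbnails -> GB/month
  printf "raw video:   %4.0f GB/mo (~$%.0f at $0.50/GB cellular)\n", video, video * 0.5
  printf "alerts only: %4.0f GB/mo (~$%.0f at $0.50/GB cellular)\n", alert, alert * 0.5
}'
```

At those rates a raw stream lands around $600/month/camera on cellular, consistent with the figure quoted earlier; alert traffic is a rounding error.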

“Our petrochemical client needed sub-50ms alert latency for safety compliance. Cloud inference gave them 800ms. Edge gave them 15ms. The decision was obvious — not because edge is always better, but because this use case demanded it.”

— Sindika DevOps

Chapter 6: Zero-Touch Updates with Fleet

You can't SSH into every factory floor device to update a model. Edge clusters need GitOps-driven updates that deploy automatically when you push to a repository. We use Fleet (by Rancher) to manage K3s clusters across all sites from a single control plane.

GitOps Fleet Update Flow: Git push (model v2.1) → CI build (container image) → Fleet sync (detect change) → Rolling update (all sites). Auto-rollback if health checks fail after the update; the previous version is restored automatically with no manual intervention needed.

Push a new model version → CI builds the container → Fleet syncs to all sites → Auto-rollback if health checks fail.

# fleet.yaml — Deploy to all edge clusters
defaultNamespace: ai-inference
helm:
  releaseName: yolo-inference
  chart: ./charts/inference
  values:
    image:
      repository: registry.sindika.io/yolov8
      tag: v2.1-jetson          # Updated via CI pipeline
    model:
      path: /models/yolov8n-fp16.engine
      version: "2.1"
    resources:
      limits:
        nvidia.com/gpu: 1
        memory: "4Gi"
    healthcheck:
      inferenceTest: true       # Run real inference in health check
      
targetCustomizations:
# Site-specific overrides
- name: plant-alpha
  clusterSelector:
    matchLabels:
      location: plant-alpha
  helm:
    values:
      cameras:
        - rtsp://10.0.1.100/stream1
        - rtsp://10.0.1.101/stream1
      alerts:
        confidenceThreshold: 0.55  # Higher threshold for noisy environment

- name: warehouse-beta
  clusterSelector:
    matchLabels:
      location: warehouse-beta
  helm:
    values:
      cameras:
        - rtsp://10.0.2.50/stream1
      alerts:
        confidenceThreshold: 0.40  # Lower threshold for critical safety zone

The pattern: push a new model version to your container registry, update the image tag in your GitOps repo, and Fleet rolls it out to all edge clusters with automatic rollback if health checks fail after update. No human needs to be on-site. No SSH. No manual intervention.
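The "update the image tag" step is usually a one-line CI job. A hypothetical sketch: the file name follows the fleet.yaml above, while the stand-in file contents and sed pattern are illustrative:

```shell
# Simulate the GitOps repo file, then bump the tag the way a CI job would.
printf 'image:\n  repository: registry.sindika.io/yolov8\n  tag: v2.1-jetson\n' > fleet.yaml

NEW_TAG="v2.2-jetson"   # hypothetical next release, set by the CI pipeline
sed -i "s|tag: v[0-9][0-9.]*-jetson|tag: ${NEW_TAG}|" fleet.yaml
grep 'tag:' fleet.yaml
# In a real pipeline this is followed by: git commit -am "deploy ${NEW_TAG}" && git push
# Fleet picks up the commit and rolls the change out from there.
```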

Chapter 7: Offline-First Design

Edge clusters will lose cloud connectivity. Not “might” — will. Industrial sites have unreliable WAN links. 4G connections drop during storms. Network maintenance windows happen during business hours. Your edge system must operate autonomously for hours, days, or even weeks.

Offline-First Edge Architecture

  • Connected mode: stream results to the cloud, sync configs & models, report health metrics
  • Offline mode: buffer results locally, use cached models, continue inference
  • Reconnection strategy: exponential backoff → reconnect → drain buffer → resume streaming

Zero data loss: the local SQLite buffer holds up to 7 days of events.

Connected mode streams results to the cloud. Offline mode buffers locally. On reconnection, the buffer drains automatically with zero data loss.

# Offline-first event buffer — SQLite-backed queue
# config/buffer.yaml
buffer:
  backend: sqlite
  path: /data/event-buffer.db
  maxSizeMB: 512              # 7 days of events at ~50 events/min
  retentionDays: 14
  
sync:
  cloudEndpoint: https://api.sindika.io/v1/events
  batchSize: 100              # Send 100 events per request
  retryPolicy:
    initialDelay: 5s
    maxDelay: 5m
    backoffMultiplier: 2
  
  onReconnect:
    drainBuffer: true         # Send all buffered events
    prioritize: "alerts"      # Send alerts before routine detections
    maxDrainRate: 500/min     # Don't overwhelm the API on reconnect

✅ Offline Resilience Checklist

  • Local event buffer — SQLite queue holds 7+ days of events. No data loss even during extended outages.
  • Local alerting — critical alerts trigger local actions (sirens, displays, relay closures) without cloud dependency.
  • Model cache — models are stored on local NVMe. The system never downloads models at startup — only during scheduled update windows.
  • Image pre-pull — container images are pulled during maintenance windows. Deployments reference locally cached images, not remote registries.
  • Graceful reconnection — exponential backoff with jitter prevents thundering herd when all sites reconnect simultaneously after an outage.
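The reconnection bullet can be sketched in a few lines of shell. The doubling and the 5s/5m bounds mirror the retry policy above; the up-to-50% jitter fraction is an assumption:

```shell
# Exponential backoff with jitter: delay doubles each attempt, capped at 5 minutes,
# with random jitter so sites don't all reconnect in lockstep after an outage.
delay=5; max=300
for attempt in 1 2 3 4 5 6 7 8; do
  jitter=$(( RANDOM % (delay / 2 + 1) ))   # up to +50% of the base delay
  echo "attempt ${attempt}: sleep $(( delay + jitter ))s"
  delay=$(( delay * 2 ))
  [ "$delay" -gt "$max" ] && delay=$max
done
```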

Chapter 8: Monitoring a Fleet You Can't Touch

You can't walk up to a Jetson device in a petrochemical plant and check if it's running. You need proactive monitoring that tells you when something's wrong before a plant manager calls your support line.

# Key metrics to monitor on every edge node:

GPU Metrics:
  - gpu_utilization_percent    # Should be 30-80%
  - gpu_memory_used_mb         # Track for memory leaks
  - gpu_temperature_celsius    # Alert > 75°C, critical > 85°C
  - inference_latency_p99_ms   # Should be < 30ms for real-time
  - inference_errors_total     # Any non-zero = investigate

System Metrics:
  - disk_usage_percent         # Alert > 80% (logs, buffer filling up)
  - memory_usage_percent       # Track container memory leaks
  - uptime_seconds             # Detect unexpected reboots
  - network_connectivity       # boolean — is cloud reachable?

Application Metrics:
  - detections_per_minute      # Sudden drop = camera offline or model crash
  - buffer_events_pending      # Growing = connectivity issue
  - last_cloud_sync_seconds    # > 3600 = connectivity problem
  - model_version              # Must match expected version

The most important alert is the absence-of-signal alert: if a node hasn't reported metrics in 10 minutes, something is fundamentally wrong — the device might be powered off, the network might be down, or K3s might have crashed. This single alert catches more issues than all other metrics combined.
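That absence-of-signal alert is a few lines of Prometheus config. A sketch, assuming Prometheus scrapes each site's metrics endpoint; the job name, rule file name, and 10-minute window are illustrative:

```shell
# Hypothetical Prometheus alerting rule for silent edge nodes.
cat > edge-silence-rule.yaml <<'EOF'
groups:
- name: edge-fleet
  rules:
  - alert: EdgeSiteSilent
    # Fires when a site's metrics endpoint has been unreachable for 10 minutes.
    expr: up{job="edge-node"} == 0
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: "No metrics from {{ $labels.instance }} for 10+ minutes"
EOF
```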

“The edge isn't a second-class deployment target — it's where your AI models create the most value. The trick is making it feel as easy to deploy to a factory floor as it is to deploy to a cloud region. K3s, Fleet, and GitOps make that possible.”

— Sindika DevOps

The Bottom Line

Edge AI isn't about shrinking your cloud infrastructure — it's about bringing intelligence to where decisions happen. K3s makes Kubernetes viable on resource-constrained hardware. TensorRT makes inference fast. Fleet makes updates automatic. Offline-first design makes it reliable.

15ms inference. Self-healing clusters. Zero-touch deployments. Offline resilience. That's not a cloud demo — that's what we run in production on hardware that fits in your palm.