Your AI model runs beautifully in the cloud — 50ms inference on an A100 GPU. Then you deploy it to a factory floor where the nearest data center is 200km away, and suddenly your “real-time” vision system has 800ms latency. The safety alert arrives after the incident.
Edge computing solves this by bringing compute to the data. But running Kubernetes at the edge isn't the same as running it in AWS. The hardware is constrained, the network is unreliable, and there's nobody on-site to SSH in and restart a pod. Here's what we learned deploying AI inference clusters across petrochemical plants, manufacturing lines, and logistics hubs.
“We deploy K3s clusters at petrochemical plants running YOLOv8 inference on NVIDIA Jetson devices. 15ms inference latency, self-healing pods, and zero-touch updates — all running on hardware that fits in your palm.”
— Sindika DevOps
Chapter 1: Why K3s, Not K8s
Full Kubernetes ships as 300MB+ of binaries and expects reliable networking, abundant RAM, and a control plane that can talk to etcd without interruption. Edge environments have none of that. K3s — Rancher's lightweight Kubernetes distribution — strips Kubernetes down to a single 70MB binary that runs on ARM64 devices with 512MB of RAM.
K3s achieves this by replacing etcd with embedded SQLite, bundling the Flannel CNI directly into the binary, and removing cloud-provider-specific controllers. You get the full Kubernetes API surface — pods, deployments, services, ingress — without the operational weight.
K3s removes cloud dependencies and heavy components, making Kubernetes viable on ARM64 edge hardware with limited resources.
# Install K3s on a Jetson device — single command
curl -sfL https://get.k3s.io | \
INSTALL_K3S_EXEC="--disable=traefik --disable=servicelb" \
K3S_KUBECONFIG_MODE="644" \
sh -
# Verify it's running
sudo k3s kubectl get nodes
# NAME STATUS ROLES AGE VERSION
# jetson-01 Ready control-plane,master 30s v1.29.2+k3s1
# Add a worker node (another Jetson)
# First, read the join token ON THE SERVER (the node-token file lives there, not on the worker):
#   sudo cat /var/lib/rancher/k3s/server/node-token
# Then, on the worker:
curl -sfL https://get.k3s.io | \
K3S_URL=https://jetson-01:6443 \
K3S_TOKEN=<token-from-server> \
sh -

That's it. Two commands and you have a production-grade Kubernetes cluster running on hardware that costs $500. Compare that to the weeks of setup, the dedicated etcd cluster, and the five-figure monthly cloud bill once you add managed GPU nodes.
Chapter 2: The Edge AI Architecture
An edge AI deployment isn't a single device — it's a fleet. We typically deploy K3s clusters across 5-50 physical sites, each running the same inference workloads with local customizations (camera feeds, model variants, site-specific thresholds). The architecture has three layers: cloud control plane, edge clusters, and inference pods.
Each site runs an identical K3s cluster with YOLO inference, RTSP video ingestion, and local data buffering. Fleet manages every site from a single pane of glass in the cloud.
✅ Architecture Design Decisions
- ✓ Inference at the edge, training in the cloud — never train models on edge devices. Train in the cloud, export optimized models (TensorRT / ONNX), deploy to edge for inference only.
- ✓ Send alerts, not video — streaming raw video to the cloud costs $500+/month/camera in bandwidth. Send only detections, bounding boxes, and thumbnail crops.
- ✓ Each site is autonomous — if cloud connectivity drops, the edge cluster keeps running. Alerts are stored locally and synced when connectivity returns.
- ✓ Identical clusters, site-specific config — the K3s manifests are identical across sites. Only ConfigMaps differ (camera URLs, alert thresholds, location metadata).
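The "identical clusters, site-specific config" split might look like the sketch below. The key names and values are illustrative assumptions (the camera URLs and threshold echo the plant-alpha example later in this article), not a schema the platform requires:

```yaml
# Hypothetical per-site ConfigMap — the only manifest that differs between sites.
apiVersion: v1
kind: ConfigMap
metadata:
  name: site-config
  namespace: ai-inference
data:
  location: "plant-alpha"
  camera_urls: |
    rtsp://10.0.1.100/stream1
    rtsp://10.0.1.101/stream1
  confidence_threshold: "0.55"
  alert_sink: "local"   # alerts fire locally even when the cloud link is down
```

Everything else — deployments, services, probes — ships unchanged to every site, so a fleet rollout only ever has one variable per location.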
Chapter 3: The GPU Scheduling Problem
Running AI workloads on Kubernetes means GPU scheduling. At the edge, you typically have one GPU per device — not a pool of cloud GPUs. The NVIDIA Device Plugin for Kubernetes lets you request GPU resources in your pod spec, but the real challenge is sharing a single GPU across multiple inference workloads without one starving the others.
Multiple inference pods share a single Jetson GPU. Careful memory budgeting prevents OOM kills that would crash the entire inference pipeline.
# Pod spec requesting GPU + memory limits
apiVersion: v1
kind: Pod
metadata:
name: yolo-inference
labels:
app: vision-pipeline
spec:
containers:
- name: inference
image: registry.sindika.io/yolov8:v2.1-jetson
resources:
limits:
nvidia.com/gpu: 1 # Request GPU access
memory: "4Gi" # Cap container memory
requests:
memory: "2Gi"
cpu: "500m"
env:
- name: MODEL_PATH
value: /models/yolov8n-fp16.engine # TensorRT optimized
- name: CONFIDENCE_THRESHOLD
value: "0.45"
- name: MAX_BATCH_SIZE
value: "4" # Process 4 frames at once
volumeMounts:
- name: models
mountPath: /models
readOnly: true
- name: rtsp-config
mountPath: /config
livenessProbe:
httpGet:
path: /health/inference # Checks GPU health, not just HTTP
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
failureThreshold: 3 # 3 failures → restart pod
volumes:
- name: models
persistentVolumeClaim:
claimName: model-storage # Local NVMe, not network mount

🤔 GPU Gotchas at the Edge
- ▸ OOM kills are silent — a GPU out-of-memory crash doesn't always produce a log entry. Your liveness probe must actively test inference, not just check HTTP readiness.
- ▸ Jetson unified memory — Jetson shares RAM between CPU and GPU. A 16GB Orin doesn't have 16GB for your model — the OS, K3s, and other pods need their share too.
- ▸ TensorRT compilation is device-specific — a model compiled for Jetson Orin won't run on Jetson Xavier. Build TensorRT engines on the target hardware, not in CI.
- ▸ Thermal throttling — Jetsons in enclosed industrial cases throttle at 80°C. Your 15ms inference becomes 40ms. Monitor temperature and design proper ventilation.
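A minimal bash sketch of a thermal guard for that last gotcha. The thresholds are the 75°C warn / 85°C critical levels we alert on in monitoring; the function name is ours, and the sysfs path varies by Jetson board, so treat both as assumptions:

```shell
# Classify SoC temperature (millidegrees Celsius, as sysfs reports it).
check_throttle_risk() {
  local temp_mc=$1
  if   [ "$temp_mc" -ge 85000 ]; then echo "critical"
  elif [ "$temp_mc" -ge 75000 ]; then echo "warn"
  else                                echo "ok"
  fi
}

# On the device you would feed it the SoC thermal zone, e.g.:
#   check_throttle_risk "$(cat /sys/devices/virtual/thermal/thermal_zone0/temp)"
check_throttle_risk 78000   # → warn
```

Wire the "warn" state into your metrics exporter rather than acting on it locally; throttling is a ventilation problem, not something a restart fixes.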
Chapter 4: Model Optimization for Edge
A YOLOv8-Large model runs at 50ms on an A100 but takes 200ms on a Jetson Orin. Edge deployment demands model optimization — not as an afterthought, but as a core part of your ML pipeline.
# Model optimization pipeline for Jetson deployment
# 1. Export PyTorch → ONNX
python export.py \
--weights yolov8n.pt \
--format onnx \
--simplify \
--input-shape 1,3,640,640
# 2. Convert ONNX → TensorRT (ON the Jetson device)
# --fp16:      half-precision, 2x faster, ~1% accuracy loss
# --workspace: 4GB scratch space for the optimizer
# (comments can't follow a trailing backslash, so they live up here)
trtexec \
--onnx=yolov8n.onnx \
--saveEngine=yolov8n-fp16.engine \
--fp16 \
--workspace=4096 \
--verbose
# 3. Benchmark on device
trtexec --loadEngine=yolov8n-fp16.engine --batch=4
# Results on Jetson Orin NX (16GB):
# ┌─────────────────────────────────┐
# │ Model │ FP32 │ FP16 │ INT8 │
# ├─────────────────────────────────┤
# │ YOLOv8n │ 22ms │ 12ms │ 8ms │
# │ YOLOv8s │ 38ms │ 18ms │ 13ms │
# │ YOLOv8m │ 85ms │ 42ms │ 28ms │
# │ YOLOv8l │ 200ms │ 95ms │ 62ms │
# └─────────────────────────────────┘

✅ Edge Model Optimization Checklist
- ✓ Use FP16 always — half-precision runs 2x faster on Jetson with negligible accuracy loss (~0.5-1% mAP drop). This is the single biggest win.
- ✓ Use the smallest model that meets accuracy — YOLOv8n (nano) at 12ms beats YOLOv8l at 95ms if both detect your target objects reliably. Test on real data.
- ✓ TensorRT is mandatory — raw PyTorch on Jetson is 5-10x slower than TensorRT. Always convert to .engine files for production.
- ✓ Batch inference when possible — processing 4 frames at once takes almost the same time as 1 frame. Batch frames from multiple cameras for throughput.
- ✓ Lower resolution if acceptable — dropping from 640×640 to 416×416 cuts the pixel count (and compute) by roughly 58%. For many use cases (PPE detection, vehicle counting), lower resolution works fine.
Chapter 5: Cloud vs Edge — The Numbers
The cloud vs edge decision isn't religious — it's mathematical. Here's how the two approaches compare across the metrics that actually matter for production AI inference:
Cloud vs Edge Inference Comparison
| Metric | Cloud Inference | Edge Inference |
|---|---|---|
| Inference Latency | 50ms + 200ms network | ✓ 15ms local |
| Total TTFB | 250-800ms | ✓ 15-30ms |
| Bandwidth Cost | High (video streams) | ✓ Low (only alerts) |
| Offline Capable | No | ✓ Yes (full autonomy) |
| GPU Flexibility | ✓ Any cloud GPU | Fixed hardware |
| Model Size Limit | ✓ Unlimited | 8-16GB VRAM |
| Scale to N Sites | ✓ Easy (cloud scale) | N hardware purchases |
| Data Sovereignty | Data leaves site | ✓ Data stays on-prem |
The pattern is clear: edge wins on latency, bandwidth, offline capability, and data sovereignty. Cloud wins on flexibility, model size, and scaling. For real-time safety-critical applications (PPE detection, forklift alerts, anomaly monitoring), edge is the only viable choice. For batch processing or training, cloud is better.
“Our petrochemical client needed sub-50ms alert latency for safety compliance. Cloud inference gave them 800ms. Edge gave them 15ms. The decision was obvious — not because edge is always better, but because this use case demanded it.”
— Sindika DevOps
Chapter 6: Zero-Touch Updates with Fleet
You can't SSH into every factory floor device to update a model. Edge clusters need GitOps-driven updates that deploy automatically when you push to a repository. We use Fleet (by Rancher) to manage K3s clusters across all sites from a single control plane.
Push a new model version → CI builds the container → Fleet syncs to all sites → Auto-rollback if health checks fail.
# fleet.yaml — Deploy to all edge clusters
defaultNamespace: ai-inference
helm:
releaseName: yolo-inference
chart: ./charts/inference
values:
image:
repository: registry.sindika.io/yolov8
tag: v2.1-jetson # Updated via CI pipeline
model:
path: /models/yolov8n-fp16.engine
version: "2.1"
resources:
limits:
nvidia.com/gpu: 1
memory: "4Gi"
healthcheck:
inferenceTest: true # Run real inference in health check
targetCustomizations:
# Site-specific overrides
- name: plant-alpha
clusterSelector:
matchLabels:
location: plant-alpha
helm:
values:
cameras:
- rtsp://10.0.1.100/stream1
- rtsp://10.0.1.101/stream1
alerts:
confidenceThreshold: 0.55 # Higher threshold for noisy environment
- name: warehouse-beta
clusterSelector:
matchLabels:
location: warehouse-beta
helm:
values:
cameras:
- rtsp://10.0.2.50/stream1
alerts:
confidenceThreshold: 0.40 # Lower threshold for critical safety zone

The pattern: push a new model version to your container registry, update the image tag in your GitOps repo, and Fleet rolls it out to all edge clusters with automatic rollback if health checks fail after the update. No human needs to be on-site. No SSH. No manual intervention.
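The "update the image tag" step is a one-line edit in CI. A minimal sketch, assuming GNU sed; the file here is a stand-in for the real fleet.yaml in your GitOps repo, and the tag format is ours:

```shell
# Bump the image tag that Fleet watches.
NEW_TAG="v2.2-jetson"
printf '    tag: v2.1-jetson\n' > fleet-snippet.yaml         # stand-in file
sed -i "s|tag: .*|tag: ${NEW_TAG}|" fleet-snippet.yaml
grep "tag:" fleet-snippet.yaml   # → "    tag: v2.2-jetson"
# In CI this is followed by: git add, git commit, git push.
# Fleet sees the new commit and syncs every cluster whose labels
# match the targetCustomizations selectors.
```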
Chapter 7: Offline-First Design
Edge clusters will lose cloud connectivity. Not “might” — will. Industrial sites have unreliable WAN links. 4G connections drop during storms. Network maintenance windows happen during business hours. Your edge system must operate autonomously for hours, days, or even weeks.
Connected mode streams results to the cloud. Offline mode buffers locally. On reconnection, the buffer drains automatically with zero data loss.
# Offline-first event buffer — SQLite-backed queue
# config/buffer.yaml
buffer:
backend: sqlite
path: /data/event-buffer.db
maxSizeMB: 512 # 7 days of events at ~50 events/min
retentionDays: 14
sync:
cloudEndpoint: https://api.sindika.io/v1/events
batchSize: 100 # Send 100 events per request
retryPolicy:
initialDelay: 5s
maxDelay: 5m
backoffMultiplier: 2
onReconnect:
drainBuffer: true # Send all buffered events
prioritize: "alerts" # Send alerts before routine detections
maxDrainRate: 500/min # Don't overwhelm the API on reconnect

✅ Offline Resilience Checklist
- ✓ Local event buffer — SQLite queue holds 7+ days of events. No data loss even during extended outages.
- ✓ Local alerting — critical alerts trigger local actions (sirens, displays, relay closures) without cloud dependency.
- ✓ Model cache — models are stored on local NVMe. The system never downloads models at startup — only during scheduled update windows.
- ✓ Image pre-pull — container images are pulled during maintenance windows. Deployments reference locally cached images, not remote registries.
- ✓ Graceful reconnection — exponential backoff with jitter prevents a thundering herd when all sites reconnect simultaneously after an outage.
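The "exponential backoff with jitter" in that last bullet, sketched in bash. The function name is ours; the 5s initial delay, 5m cap, and 2x multiplier mirror the retryPolicy in the buffer config above:

```shell
# Delay (seconds) before reconnect attempt N: 5 * 2^(N-1), capped at 300,
# plus up to 50% random jitter so sites don't reconnect in lockstep.
backoff_delay() {
  local attempt=$1
  local base=5 max=300
  local delay=$(( base * (2 ** (attempt - 1)) ))
  [ "$delay" -gt "$max" ] && delay=$max
  echo $(( delay + RANDOM % (delay / 2 + 1) ))
}

backoff_delay 1    # somewhere in 5-7s
backoff_delay 4    # somewhere in 40-60s
backoff_delay 10   # capped: somewhere in 300-450s
```

Without the jitter term, every site that lost the same WAN link retries on the same schedule and hammers the API in synchronized waves.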
Chapter 8: Monitoring a Fleet You Can't Touch
You can't walk up to a Jetson device in a petrochemical plant and check if it's running. You need proactive monitoring that tells you when something's wrong before a plant manager calls your support line.
# Key metrics to monitor on every edge node:
GPU Metrics:
- gpu_utilization_percent # Should be 30-80%
- gpu_memory_used_mb # Track for memory leaks
- gpu_temperature_celsius # Alert > 75°C, critical > 85°C
- inference_latency_p99_ms # Should be < 30ms for real-time
- inference_errors_total # Any non-zero = investigate
System Metrics:
- disk_usage_percent # Alert > 80% (logs, buffer filling up)
- memory_usage_percent # Track container memory leaks
- uptime_seconds # Detect unexpected reboots
- network_connectivity # boolean — is cloud reachable?
Application Metrics:
- detections_per_minute # Sudden drop = camera offline or model crash
- buffer_events_pending # Growing = connectivity issue
- last_cloud_sync_seconds # > 3600 = connectivity problem
- model_version # Must match expected version

The most important alert is the absence-of-signal alert: if a node hasn't reported metrics in 10 minutes, something is fundamentally wrong — the device might be powered off, the network might be down, or K3s might have crashed. This single alert catches more issues than all other metrics combined.
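In Prometheus terms, the absence-of-signal alert can be written as a staleness check. A hedged sketch: the metric name is an assumption (any per-node timestamp each site pushes will do), and 600s matches the 10-minute window in the text:

```yaml
# Hypothetical Prometheus alerting rule: fire when an edge site goes silent.
groups:
  - name: edge-fleet
    rules:
      - alert: EdgeNodeSilent
        # last_report_timestamp_seconds is an assumed metric, pushed by each node
        expr: time() - max by (site) (last_report_timestamp_seconds) > 600
        labels:
          severity: critical
        annotations:
          summary: "Edge site {{ $labels.site }} has not reported in 10 minutes"
```

One rule, evaluated in the cloud, covers power loss, WAN failure, and K3s crashes alike — precisely because it assumes nothing about which layer failed.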
“The edge isn't a second-class deployment target — it's where your AI models create the most value. The trick is making it feel as easy to deploy to a factory floor as it is to deploy to a cloud region. K3s, Fleet, and GitOps make that possible.”
— Sindika DevOps
The Bottom Line
Edge AI isn't about shrinking your cloud infrastructure — it's about bringing intelligence to where decisions happen. K3s makes Kubernetes viable on resource-constrained hardware. TensorRT makes inference fast. Fleet makes updates automatic. Offline-first design makes it reliable.
15ms inference. Self-healing clusters. Zero-touch deployments. Offline resilience. That's not a cloud demo — that's what we run in production on hardware that fits in your palm.