Production-Ready GPU Inference Autoscaling on EKS with Karpenter, KEDA, and Dragonfly
This article presents a production-ready architecture for GPU inference autoscaling on Amazon EKS using Karpenter, KEDA, and Dragonfly. The solution enables scaling GPU workloads from zero, reduces cold start times, and optimizes costs through spot-first provisioning and P2P image distribution. The entire setup is GitOps-driven with ArgoCD and reproducible via Terraform.
- In the author's benchmarks, cold starts take 84 seconds and warm starts take 7 seconds (with a small image); Dragonfly's P2P image distribution speeds up image pulls.
- It enables scale-to-zero capabilities, minimizing GPU spend during idle periods.
- Spot-first provisioning with Karpenter helps cut costs while maintaining fast burst capacity for spiky workloads.
- The system integrates with GitOps workflows using ArgoCD and infrastructure as code via Terraform.
- Benchmark tests show the solution effectively handles load with predictable scaling and fast warm starts.
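
To make the scale-to-zero idea above concrete, here is a minimal Python sketch of how a queue-driven replica count might be computed. It is a simplified model for illustration only, not KEDA's API or the author's configuration: the function name `desired_replicas` and its parameters (`queue_length`, `target_per_replica`, `max_replicas`) are hypothetical, and a real deployment would express this through an autoscaler configuration rather than application code.

```python
import math


def desired_replicas(queue_length: int, target_per_replica: int, max_replicas: int) -> int:
    """Return a replica count for a queue-driven workload that may scale to zero.

    Hypothetical helper: a simplified model of target-based autoscaling,
    not the logic of any specific autoscaler.
    """
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    if queue_length <= 0:
        # Idle: run no pods, so no GPU node has to stay provisioned.
        return 0
    # Enough replicas to keep each at or below its target load, capped at the maximum.
    return min(max_replicas, math.ceil(queue_length / target_per_replica))


if __name__ == "__main__":
    for pending in (0, 1, 25, 1000):
        print(pending, "pending requests ->", desired_replicas(pending, 10, 8), "replicas")
```

With no pending work the count drops to zero, which is what lets the GPU nodes be removed during idle periods; the first request after that pays the cold-start cost described above.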
Opening excerpt (first ~120 words)
Mark Johnson • Posted on May 17 • Originally published at codingwithtaz.blog
#kubernetes #aws #devops #gpu
TL;DR: This architecture uses Karpenter + KEDA + Dragonfly on EKS to scale GPU inference pods from zero, pull model images quicker, and cut GPU spend with spot-first provisioning. Cold starts are 84s; warm starts are 7s (with small image).
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at DEV.to (Top).