Is Karpenter production-ready?

Yes. Karpenter reached v1.0 in August 2024 and is now the recommended node autoscaler for EKS. It is running in production at hundreds of companies, including some of the largest AWS customers.

How much can Spot instances save vs On-Demand?

Typically 60–80%, depending on instance type and availability zone. The risk is interruption: AWS gives a 2-minute notice before reclaiming a Spot instance. With Karpenter's interruption handling enabled (it drains the node when the notice arrives and launches a replacement), several replicas per service, and pod disruption budgets on your deployments, most stateless workloads handle this transparently.

Can I apply these optimisations to GKE or AKS too?

The principles are identical. GKE has Autopilot mode, which handles much of this automatically. AKS has its own spot node pools, and KEDA is cloud-agnostic. Karpenter also has a provider for Azure, and a GCP provider is in beta.

Kubernetes Cost Optimisation: How to Cut AWS Bills by 40% | CodeXOps

The Hidden Cost of Kubernetes

Kubernetes abstracts away servers so well that teams stop thinking about capacity. Nodes get provisioned for peak load and then run at 15% CPU utilisation 23 hours a day. We have audited over 30 EKS clusters in the last two years, and the average over-provisioning rate is 62%.

Step 1: Measure First

Install Kubecost (free tier covers most clusters) and let it run for two weeks. Look at the namespace-level cost allocation report — you will almost certainly find that 80% of cost comes from 20% of workloads. Focus there first.
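To make the 80/20 check concrete, here is a minimal Python sketch that ranks namespaces by cost and returns the smallest group covering 80% of the total. The namespace names and dollar figures are invented for illustration; in practice the input would come from whatever export your cost tool provides.

```python
def top_spenders(costs, share=0.80):
    """Return the namespaces (largest first) that together cover `share` of total cost."""
    total = sum(costs.values())
    selected, running = [], 0.0
    # Walk namespaces from most to least expensive until the target share is covered.
    for name, cost in sorted(costs.items(), key=lambda kv: kv[1], reverse=True):
        if running >= share * total:
            break
        selected.append(name)
        running += cost
    return selected


# Hypothetical monthly costs in USD per namespace (illustrative numbers only).
monthly_costs = {
    "checkout": 4200.0,
    "search": 3100.0,
    "ml-inference": 900.0,
    "staging": 800.0,
    "monitoring": 600.0,
    "batch": 250.0,
    "sandbox": 150.0,
}

print(top_spenders(monthly_costs))
```

The namespaces it returns are the ones worth right-sizing first; the long tail can wait.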

Step 2: Right-Size Requests and Limits

Resource requests are the #1 driver of wasted spend. An application requesting 2 CPU cores but using 200m on average still reserves (and is billed for) 2 cores, even on an autoscaling cluster, because the scheduler and the node autoscaler place pods by their requests, not by their actual usage. Use the Goldilocks controller (from Fairwinds) to get VPA-based right-sizing recommendations for every deployment, then roll them out incrementally, load testing each change.
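The arithmetic behind that example can be sketched in a few lines of Python. The sizing rule below (observed p95 usage plus 30% headroom) is an assumption chosen for illustration, not how Goldilocks or the VPA computes its recommendations, and the 350m p95 figure is made up.

```python
def idle_fraction(requested_millicores, avg_used_millicores):
    """Fraction of the reserved CPU that sits idle on average."""
    return 1 - avg_used_millicores / requested_millicores


def suggested_request(p95_used_millicores, headroom=0.30):
    """Hypothetical sizing rule: observed p95 usage plus a safety margin, in millicores."""
    return round(p95_used_millicores * (1 + headroom))


# The example from the text: 2 cores (2000m) requested, 200m used on average.
print(f"idle share of the reservation: {idle_fraction(2000, 200):.0%}")

# Assumed p95 usage of 350m for the same workload (illustrative figure).
print(f"suggested request: {suggested_request(350)}m")
```

Sizing against a high percentile rather than the average keeps some room for bursts, which is why the load test after each change matters.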

Step 3: Node Autoscaling with Karpenter

Replace the Cluster Autoscaler with Karpenter. Karpenter provisions nodes in under 60 seconds (vs. 3–5 minutes with the Cluster Autoscaler), consolidates underutilised nodes automatically, and can mix instance types and Spot/On-Demand capacity in a single NodePool. In our benchmarks, Karpenter reduces node count by 20–35% compared to Cluster Autoscaler on identical workloads.
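As a rough sketch of what such a NodePool might look like, the Python below builds a manifest and prints it as JSON, which kubectl accepts alongside YAML. It assumes the Karpenter v1 API (karpenter.sh/v1) on AWS and an existing EC2NodeClass named "default". The pool name, instance categories, CPU limit, and timings are placeholder choices, not recommendations.

```python
import json

# Sketch of a Karpenter v1 NodePool mixing Spot and On-Demand capacity.
# All names and values are illustrative assumptions.
node_pool = {
    "apiVersion": "karpenter.sh/v1",
    "kind": "NodePool",
    "metadata": {"name": "general-purpose"},
    "spec": {
        "template": {
            "spec": {
                # Assumes an EC2NodeClass called "default" already exists.
                "nodeClassRef": {
                    "group": "karpenter.k8s.aws",
                    "kind": "EC2NodeClass",
                    "name": "default",
                },
                "requirements": [
                    # Allow both capacity types in the same pool.
                    {
                        "key": "karpenter.sh/capacity-type",
                        "operator": "In",
                        "values": ["spot", "on-demand"],
                    },
                    # Allow a broad mix of instance families.
                    {
                        "key": "karpenter.k8s.aws/instance-category",
                        "operator": "In",
                        "values": ["c", "m", "r"],
                    },
                ],
            }
        },
        "disruption": {
            # Consolidate nodes that are empty or underutilised.
            "consolidationPolicy": "WhenEmptyOrUnderutilized",
            "consolidateAfter": "1m",
        },
        # Upper bound on the total CPU this pool may provision.
        "limits": {"cpu": "200"},
    },
}

manifest_json = json.dumps(node_pool, indent=2)
print(manifest_json)
```

Saved as a script (for example a hypothetical nodepool.py), the output could be piped to `kubectl apply -f -`. Keeping the instance requirements broad gives Karpenter more options, which tends to help both price and Spot availability.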

Step 4: Spot Instances for Stateless Workloads

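Most web services, batch jobs, and ML inference workloads are stateless and can tolerate a 2-minute interruption notice. Running them on Spot instances saves 60–80% on compute vs On-Demand. Use PodDisruptionBudgets to ensure you never lose more than N% of replicas simultaneously, and Karpenter's spec.disruption.budgets on the NodePool to cap how many nodes are consolidated or replaced at once. The NodePool budgets only govern voluntary disruption; they do not hold back a Spot reclaim, so keep enough replicas spread across nodes and zones.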
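A minimal sketch of the two guardrails, again built in Python and printed as JSON. The PodDisruptionBudget limits how many replicas of one workload can be evicted at a time; the NodePool fragment limits how many nodes Karpenter disrupts voluntarily (consolidation, drift), and does not apply to Spot reclaims initiated by AWS. The app label, names, and percentages are assumptions for illustration.

```python
import json

# Replica-level guardrail: at most 25% of the "web" pods may be evicted at once.
# The label app=web and the namespace are hypothetical.
pod_disruption_budget = {
    "apiVersion": "policy/v1",
    "kind": "PodDisruptionBudget",
    "metadata": {"name": "web-pdb", "namespace": "default"},
    "spec": {
        "maxUnavailable": "25%",
        "selector": {"matchLabels": {"app": "web"}},
    },
}

# Node-level guardrail: fragment to merge into a Karpenter v1 NodePool spec.
# Limits voluntary disruption to 10% of the pool's nodes at a time.
node_pool_disruption = {
    "spec": {
        "disruption": {
            "budgets": [{"nodes": "10%"}],
        }
    }
}

print(json.dumps(pod_disruption_budget, indent=2))
print(json.dumps(node_pool_disruption, indent=2))
```

The two settings work at different levels, so it is worth setting both: one protects each service, the other keeps node churn gradual.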

Step 5: Scheduled Scaling at Night

Implement KEDA (Kubernetes Event-driven Autoscaling) with a cron scaler to scale down non-critical workloads outside business hours. A staging cluster that runs at full capacity 24/7 is a common oversight; scheduled scale-to-zero can cut staging costs by about 70%.
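The sketch below shows where a figure of that size comes from and what a cron-based KEDA ScaledObject might look like. It assumes a 10-hour working day, Monday to Friday, and a hypothetical Deployment called staging-api in a staging namespace. The timezone and replica counts are placeholders.

```python
import json


def weekly_savings(hours_per_day, days_per_week):
    """Share of the week during which the workload is scaled to zero."""
    return 1 - (hours_per_day * days_per_week) / (24 * 7)


# KEDA ScaledObject with a cron trigger: 3 replicas from 08:00 to 18:00 on
# weekdays, zero replicas the rest of the time. Names are hypothetical.
scaled_object = {
    "apiVersion": "keda.sh/v1alpha1",
    "kind": "ScaledObject",
    "metadata": {"name": "staging-api-office-hours", "namespace": "staging"},
    "spec": {
        "scaleTargetRef": {"name": "staging-api"},
        "minReplicaCount": 0,
        "maxReplicaCount": 3,
        "triggers": [
            {
                "type": "cron",
                "metadata": {
                    "timezone": "Europe/London",
                    "start": "0 8 * * 1-5",
                    "end": "0 18 * * 1-5",
                    "desiredReplicas": "3",
                },
            }
        ],
    },
}

print(f"time scaled to zero: {weekly_savings(10, 5):.0%}")
print(json.dumps(scaled_object, indent=2))
```

Running 50 of the 168 hours in a week leaves the workload off roughly 70% of the time. The bill only follows if the node autoscaler then removes the emptied nodes, so this step pairs naturally with node consolidation.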

Results Across Our Client Portfolio

Applying this playbook to 8 production EKS clusters in 2024: average cost reduction of 41%, ranging from 28% to 57% depending on initial over-provisioning levels. Median time to implement: 6 weeks.
