Cloud Cost Management
Cloud costs can spiral out of control without proper governance. FinOps (a portmanteau of "Finance" and "DevOps") is the practice of bringing financial accountability to the cloud's variable spend model. It combines systems, best practices, and culture so that an organization can understand its cloud costs and make informed business decisions.
Cost Optimization Pillars
- Right-Sizing: Match instance types to actual workload requirements
- Pricing Models: Use Reserved Instances, Savings Plans, and Spot instances
- Autoscaling: Scale resources based on demand, not peak capacity
- Waste Elimination: Remove unused resources, orphaned volumes, idle load balancers
- Architecture Optimization: Use serverless, managed services, and efficient storage tiers
Right-Sizing Kubernetes Resources
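One of the most impactful optimizations is sizing Pod resource requests and limits accurately. Requests determine how much node capacity the scheduler reserves, so over-provisioned requests pay for capacity that sits idle. Under-provisioned requests and limits lead to CPU throttling, evictions, and OOM kills.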
# Check actual resource usage vs requests
kubectl top pods -n production
# View resource requests and limits for all pods
kubectl get pods -n production -o custom-columns='NAME:.metadata.name,CPU_REQ:.spec.containers[0].resources.requests.cpu,CPU_LIM:.spec.containers[0].resources.limits.cpu,MEM_REQ:.spec.containers[0].resources.requests.memory,MEM_LIM:.spec.containers[0].resources.limits.memory'
# Use Vertical Pod Autoscaler in recommendation mode
# to get right-sizing suggestions without auto-applying
# vpa-recommender.yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Off"  # Recommendation only; does not auto-apply
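To make the comparison between `kubectl top` output and configured requests concrete, the following Python sketch converts Kubernetes quantity strings into plain numbers and reports how much of a request is actually used. The function names, the sample values, and the 20% headroom are illustrative assumptions, not part of any Kubernetes tooling, and only the common suffixes are handled (`m` for CPU; `Ki`, `Mi`, `Gi` for memory).

```python
import math

# Binary memory suffixes used by Kubernetes resource quantities.
_MEMORY_UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}


def parse_cpu_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as '250m' or '2' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(round(float(quantity) * 1000))


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity such as '256Mi' to bytes."""
    for suffix, factor in _MEMORY_UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: already plain bytes


def utilization(used: float, requested: float) -> float:
    """Fraction of the request that is actually used."""
    return used / requested


def suggested_cpu_request(peak_millicores: int, headroom_percent: int = 20) -> str:
    """Observed peak plus headroom (20% is an assumed default), in millicores."""
    target = math.ceil(peak_millicores * (100 + headroom_percent) / 100)
    return f"{target}m"


# Hypothetical pod: requests 1 CPU / 1Gi, while `kubectl top` shows 120m / 300Mi.
cpu_util = utilization(parse_cpu_millicores("120m"), parse_cpu_millicores("1"))
mem_util = utilization(parse_memory("300Mi"), parse_memory("1Gi"))
print(f"CPU used vs. request:    {cpu_util:.0%}")
print(f"Memory used vs. request: {mem_util:.0%}")
print("Suggested CPU request:  ", suggested_cpu_request(120))
```

A single `kubectl top` reading is only a snapshot, so in practice the peak would come from a longer observation window, which is what the VPA recommender provides.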
Spot/Preemptible Instances
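Spot Instances (AWS), Spot VMs and the older Preemptible VMs (GCP), and Spot VMs (Azure) are typically discounted by 60-90% relative to on-demand pricing. The provider can reclaim them on short notice (about two minutes on AWS, about 30 seconds on GCP and Azure), so they are best suited to fault-tolerant, stateless workloads.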
# eks-spot-nodegroup.yaml (AWS EKS managed node group with Spot)
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster
  region: us-east-1
managedNodeGroups:
  - name: on-demand
    instanceType: t3.medium
    minSize: 2
    maxSize: 4
    desiredCapacity: 2
    labels:
      lifecycle: on-demand
  - name: spot-workers
    instanceTypes: ["t3.medium", "t3.large", "t3a.medium", "t3a.large"]
    minSize: 0
    maxSize: 20
    desiredCapacity: 3
    spot: true
    labels:
      lifecycle: spot
    taints:
      - key: spot
        value: "true"
        effect: PreferNoSchedule
# Schedule workloads on spot instances with tolerations
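apiVersion: apps/v1
kind: Deployment
metadata:
  name: worker-deployment
spec:
  replicas: 10
  # apps/v1 requires a selector that matches the Pod template labels.
  # The "app: worker" label is an illustrative choice.
  selector:
    matchLabels:
      app: worker
  template:
    metadata:
      labels:
        app: worker
    spec:
      # Tolerate the taint set on the spot-workers node group
      tolerations:
        - key: spot
          operator: Equal
          value: "true"
          effect: PreferNoSchedule
      # Prefer (but do not require) nodes labeled lifecycle=spot
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 90
              preference:
                matchExpressions:
                  - key: lifecycle
                    operator: In
                    values: ["spot"]
      containers:
        - name: worker
          image: myapp/worker:1.0
          resources:
            requests:
              cpu: 250m
              memory: 256Mi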
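As a rough illustration of what a mixed pool saves, the sketch below computes the blended hourly cost of the node counts used in the eksctl config above (2 on-demand, 3 Spot). The 70% Spot discount is an assumed value inside the 60-90% range, all nodes are assumed to be the same size, and the function names are hypothetical.

```python
def blended_hourly_cost(on_demand_nodes: int, spot_nodes: int,
                        on_demand_price: float, spot_discount: float) -> float:
    """Hourly cost of a pool mixing on-demand and Spot nodes of the same size."""
    spot_price = on_demand_price * (1 - spot_discount)
    return on_demand_nodes * on_demand_price + spot_nodes * spot_price


def savings_fraction(on_demand_nodes: int, spot_nodes: int,
                     spot_discount: float) -> float:
    """Savings relative to running every node on-demand.

    The on-demand price cancels out, so a unit price of 1.0 is used.
    """
    total_nodes = on_demand_nodes + spot_nodes
    if total_nodes == 0:
        return 0.0
    mixed = blended_hourly_cost(on_demand_nodes, spot_nodes, 1.0, spot_discount)
    return 1 - mixed / total_nodes


# 2 on-demand + 3 Spot nodes, assuming a 70% Spot discount.
print(f"Savings vs. all on-demand: {savings_fraction(2, 3, 0.70):.0%}")
```

The calculation ignores interruptions: if Spot capacity is reclaimed and the workload falls back to on-demand nodes, the realized savings will be lower.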
Reserved Instances and Savings Plans
| Pricing Model | Typical Discount vs. On-Demand | Commitment | Best For |
|---|---|---|---|
| On-Demand | 0% | None | Short-term, unpredictable workloads |
| Reserved (1yr) | ~40% | 1 year | Steady-state workloads |
| Reserved (3yr) | ~60% | 3 years | Long-term production workloads |
| Savings Plans | ~40-60% | 1-3 years | Flexible commitment ($/hr) |
| Spot/Preemptible | 60-90% | None | Fault-tolerant, stateless workloads |
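A commitment only pays off if the capacity is actually used. Under the simplifying assumption that a reservation is billed for every hour of the term at the discounted rate, it beats on-demand once the instance runs for more than `1 - discount` of the term. The sketch below works through that arithmetic using the approximate discounts from the table; the function names are illustrative, and real pricing depends on payment option, region, and instance family.

```python
HOURS_PER_YEAR = 365 * 24  # 8760


def break_even_utilization(discount: float) -> float:
    """Fraction of the term an instance must run for a reservation to pay off."""
    return 1 - discount


def cheaper_option(discount: float, hours_used: float,
                   hours_in_term: int = HOURS_PER_YEAR) -> str:
    """Compare costs measured in on-demand-hour equivalents."""
    committed_cost = (1 - discount) * hours_in_term  # billed whether used or not
    on_demand_cost = hours_used                      # billed only while running
    return "reserved" if committed_cost < on_demand_cost else "on-demand"


# 1-year reservation at ~40% off: break-even at about 60% utilization.
print(f"Break-even utilization: {break_even_utilization(0.40):.0%}")
print("Runs 4000 h/year:", cheaper_option(0.40, 4000))
print("Runs 8760 h/year:", cheaper_option(0.40, 8760))
```

This is why the table recommends commitments for steady-state workloads: a machine that runs only part of the year may cost less on-demand despite the higher hourly rate.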
Cost Monitoring Tools
# AWS Cost Explorer CLI
aws ce get-cost-and-usage \
  --time-period Start=2026-03-01,End=2026-04-01 \
  --granularity MONTHLY \
  --metrics "BlendedCost" \
  --group-by Type=DIMENSION,Key=SERVICE
# Install Kubecost for Kubernetes cost monitoring
helm repo add kubecost https://kubecost.github.io/cost-analyzer/
helm install kubecost kubecost/cost-analyzer \
  --namespace kubecost \
  --create-namespace
# Access Kubecost dashboard
kubectl port-forward svc/kubecost-cost-analyzer 9090:9090 -n kubecost
# Find unused/orphaned resources
# Unattached EBS volumes
aws ec2 describe-volumes --filters "Name=status,Values=available" \
  --query "Volumes[*].{ID:VolumeId,Size:Size,Created:CreateTime}" --output table
# Unused Elastic IPs
aws ec2 describe-addresses --query "Addresses[?AssociationId==null]" --output table
# Idle load balancers (zero healthy targets)
aws elbv2 describe-target-health --target-group-arn <arn>
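The Cost Explorer command above returns JSON with a `ResultsByTime` list whose entries contain `Groups`, each with `Keys` and `Metrics`. A minimal Python sketch for ranking services by cost from that output might look like the following; the `cost_by_service` helper is hypothetical and the sample data is hand-written for illustration, not real billing output.

```python
import json


def cost_by_service(response: dict, metric: str = "BlendedCost") -> list[tuple[str, float]]:
    """Sum the chosen metric per service and sort from most to least expensive."""
    totals: dict[str, float] = {}
    for period in response.get("ResultsByTime", []):
        for group in period.get("Groups", []):
            service = group["Keys"][0]
            amount = float(group["Metrics"][metric]["Amount"])
            totals[service] = totals.get(service, 0.0) + amount
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hand-written sample in the shape of the CLI response (amounts are made up).
sample = json.loads("""
{
  "ResultsByTime": [
    {
      "TimePeriod": {"Start": "2026-03-01", "End": "2026-04-01"},
      "Groups": [
        {"Keys": ["Amazon Simple Storage Service"],
         "Metrics": {"BlendedCost": {"Amount": "310.25", "Unit": "USD"}}},
        {"Keys": ["Amazon Elastic Compute Cloud - Compute"],
         "Metrics": {"BlendedCost": {"Amount": "1200.5", "Unit": "USD"}}}
      ]
    }
  ]
}
""")

for service, amount in cost_by_service(sample):
    print(f"{service:<45} {amount:>10.2f}")
```

To run it against real data, the CLI output could be saved with `--output json > costs.json` and loaded with `json.load` in place of the sample.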
Cost Optimization Checklist
Quick Wins
1. Delete unused resources (orphaned volumes, idle LBs, stopped instances)
2. Right-size instances and Pod resource requests based on actual usage
3. Use Spot instances for stateless workloads (workers, CI/CD runners)
4. Enable autoscaling (HPA + Cluster Autoscaler) to match capacity to demand
5. Purchase Reserved Instances or Savings Plans for baseline workloads
6. Use appropriate storage classes (S3 Infrequent Access, Glacier for archives)
7. Set up billing alerts and cost budgets
8. Schedule non-production environments to shut down outside business hours
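The last item is easy to quantify. As a sketch, assuming compute is billed per hour and a non-production environment only needs to run 12 hours a day on weekdays (both assumptions, as is the function name), the share of compute hours saved is:

```python
HOURS_PER_WEEK = 7 * 24  # 168


def scheduled_savings(hours_per_day: int, days_per_week: int) -> float:
    """Fraction of weekly compute hours saved by running only on a schedule."""
    running_hours = hours_per_day * days_per_week
    return 1 - running_hours / HOURS_PER_WEEK


# 12 hours/day, 5 days/week = 60 of 168 hours running.
print(f"Compute hours saved: {scheduled_savings(12, 5):.0%}")
```

Storage attached to stopped instances (for example, EBS volumes) continues to bill, so the saving on the total invoice is smaller than the saving on compute hours.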