Autoscaling in Kubernetes
Kubernetes provides several autoscaling mechanisms to automatically adjust the resources allocated to your workloads based on demand. The most commonly used is the Horizontal Pod Autoscaler (HPA), which automatically scales the number of Pod replicas based on observed CPU utilization, memory usage, or custom metrics.
Types of Autoscaling
- Horizontal Pod Autoscaler (HPA): Scales the number of Pod replicas up or down
- Vertical Pod Autoscaler (VPA): Adjusts CPU and memory requests/limits for containers
- Cluster Autoscaler: Adds nodes when Pods are pending because no node has room for them, and removes nodes that stay underutilized
- KEDA: Event-driven autoscaling for Kubernetes (scales on external metrics like queue length)
Horizontal Pod Autoscaler (HPA)
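HPA automatically adjusts the number of Pod replicas in a Deployment, ReplicaSet, or StatefulSet. The HPA controller evaluates metrics every 15 seconds by default. It reads CPU and memory usage from the resource metrics API (normally served by Metrics Server) and reads custom and external metrics from their own metrics APIs. It then calculates the desired replica count from the ratio of the current metric value to the target metric value: desiredReplicas = ceil(currentReplicas * currentMetricValue / targetMetricValue).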
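The replica calculation described above (the ratio of current to target metric value, applied to the current replica count and rounded up) can be sketched in a few lines of shell. This is an illustration only: the function name `desired_replicas` is made up for this example, and the sketch ignores details of the real controller such as the tolerance band around the target and the minReplicas/maxReplicas limits.

```shell
#!/bin/sh
# desired_replicas CURRENT_REPLICAS CURRENT_METRIC TARGET_METRIC
# Hypothetical helper: prints ceil(CURRENT_REPLICAS * CURRENT_METRIC / TARGET_METRIC)
# using integer arithmetic only.
desired_replicas() {
  current_replicas=$1
  current_metric=$2
  target_metric=$3
  # Integer ceiling division: (a + b - 1) / b
  echo $(( (current_replicas * current_metric + target_metric - 1) / target_metric ))
}

# 4 replicas averaging 90% CPU against a 70% target
desired_replicas 4 90 70   # prints 6
```

With 4 replicas at 90% against a 70% target the ratio is about 1.29, so the sketch proposes 6 replicas. At 35% it would propose 2.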
Prerequisites
# Install Metrics Server (required for HPA to scale on CPU and memory)
# Many managed Kubernetes services preinstall it or offer it as an add-on
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
# Verify Metrics Server is running
kubectl get deployment metrics-server -n kube-system
kubectl top nodes
kubectl top pods
HPA with CPU Metrics
# hpa-cpu.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 4
          periodSeconds: 60
        - type: Percent
          value: 100
          periodSeconds: 60
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120
      selectPolicy: Min
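To make the scaleUp policies in this manifest concrete, the sketch below estimates the highest replica count the HPA could reach one 60-second period later. It assumes the values shown above (Pods=4, Percent=100, selectPolicy Max, maxReplicas 20). The function name `max_after_scale_up` is hypothetical, and the real controller also applies the stabilization window, which is not modeled here.

```shell
#!/bin/sh
# max_after_scale_up CURRENT_REPLICAS
# Hypothetical helper: upper bound on replicas after one 60-second period
# under the scaleUp policies from hpa-cpu.yaml.
max_after_scale_up() {
  current=$1
  pods_step=4        # "Pods" policy: add at most 4 Pods per period
  percent_step=100   # "Percent" policy: grow by at most 100% per period
  max_replicas=20    # spec.maxReplicas

  by_pods=$(( current + pods_step ))
  # Round the percentage-based limit up to a whole Pod
  by_percent=$(( (current * (100 + percent_step) + 99) / 100 ))

  # selectPolicy: Max picks whichever policy allows the larger change
  if [ "$by_pods" -gt "$by_percent" ]; then
    limit=$by_pods
  else
    limit=$by_percent
  fi

  # The result never exceeds maxReplicas
  if [ "$limit" -gt "$max_replicas" ]; then
    limit=$max_replicas
  fi
  echo "$limit"
}

max_after_scale_up 2    # Pods policy wins: 2 + 4
max_after_scale_up 10   # Percent policy wins: 10 doubled
```

At low replica counts the fixed step of 4 Pods dominates, and above 4 replicas the 100% rule takes over. The scaleDown side is simpler: with a single policy of 1 Pod per 120 seconds, the Deployment shrinks by at most one replica every two minutes.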
HPA with Multiple Metrics
# hpa-multi-metric.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 3
  maxReplicas: 50
  metrics:
    # Scale based on CPU utilization
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
    # Scale based on memory utilization
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 75
    # Scale based on custom metric (requests per second)
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "1000"
    # Scale based on external metric (SQS queue length)
    - type: External
      external:
        metric:
          name: sqs_queue_length
          selector:
            matchLabels:
              queue: order-processing
        target:
          type: AverageValue
          averageValue: "5"
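When several metrics are listed, the HPA works out a proposed replica count for each metric separately and uses the largest one. The sketch below walks through that with made-up observations for the four metrics in this manifest. The helper names `ceil_div` and `largest_proposal` and all observed values are assumptions for illustration, not output from a real cluster.

```shell
#!/bin/sh
# ceil_div A B: integer ceiling of A / B
ceil_div() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

# largest_proposal N1 [N2 ...]: print the largest per-metric replica proposal
largest_proposal() {
  max=$1
  for n in "$@"; do
    if [ "$n" -gt "$max" ]; then
      max=$n
    fi
  done
  echo "$max"
}

replicas=5   # assumed current replica count

cpu=$(ceil_div $(( replicas * 90 )) 60)        # 90% observed vs 60% target
mem=$(ceil_div $(( replicas * 60 )) 75)        # 60% observed vs 75% target
rps=$(ceil_div $(( replicas * 1800 )) 1000)    # 1800 req/s per Pod vs 1000 target
queue=$(ceil_div 30 5)                         # 30 queued messages vs 5 per Pod

largest_proposal "$cpu" "$mem" "$rps" "$queue"   # prints 9
```

Here memory alone would allow scaling down to 4 replicas, but the requests-per-second metric asks for 9, so 9 wins. A single metric under pressure is enough to keep the workload scaled up.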
Imperative HPA Management
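# Create an HPA imperatively (an alternative to applying hpa-cpu.yaml)
# Without --name, the HPA is named after the Deployment (api-server)
kubectl autoscale deployment api-server \
  --name=api-server-hpa \
  --cpu-percent=70 \
  --min=2 \
  --max=20 \
  -n production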
# View HPA status
kubectl get hpa -n production
kubectl describe hpa api-server-hpa -n production
# Watch HPA in action
kubectl get hpa -n production -w
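# Generate load to test autoscaling
# Assumes a Service named api-server exists in the production namespace
kubectl run load-generator --image=busybox --rm -it --restart=Never -n production -- /bin/sh -c "while true; do wget -q -O- http://api-server; done"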
# Delete HPA
kubectl delete hpa api-server-hpa -n production
Vertical Pod Autoscaler (VPA)
# vpa.yaml
# Requires the VPA components (CRDs and controllers) to be installed; they are not part of core Kubernetes
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Auto"  # Off, Initial, Recreate, or Auto
  resourcePolicy:
    containerPolicies:
      - containerName: api
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: 2
          memory: 2Gi
        # If an HPA already scales this Deployment on CPU, limit this to ["memory"]
        controlledResources: ["cpu", "memory"]
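The minAllowed and maxAllowed fields bound what the VPA will recommend for a container. A rough way to picture this is a clamp applied to the raw recommendation, sketched below for the CPU bounds in this manifest (100m to 2 cores, written as millicores). The function name `clamp` and the sample recommendations are assumptions for illustration.

```shell
#!/bin/sh
# clamp VALUE MIN MAX: print VALUE limited to the range [MIN, MAX]
clamp() {
  value=$1
  min=$2
  max=$3
  if [ "$value" -lt "$min" ]; then
    echo "$min"
  elif [ "$value" -gt "$max" ]; then
    echo "$max"
  else
    echo "$value"
  fi
}

# CPU bounds from vpa.yaml in millicores: minAllowed 100m, maxAllowed 2 (2000m)
clamp 40 100 2000     # below the floor, raised to 100
clamp 750 100 2000    # inside the range, unchanged
clamp 3500 100 2000   # above the ceiling, capped at 2000
```

Setting these bounds keeps an idle container from being shrunk to an unusably small request and keeps a misbehaving one from claiming a whole node.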
Key Takeaways
- HPA scales the number of Pod replicas based on CPU, memory, or custom metrics
- Always set resource requests on containers, because HPA calculates utilization as a percentage of the requested amount
- Use scaling behavior to control the speed of scale-up and scale-down
- VPA adjusts resource requests/limits instead of replica count
- Do not let HPA and VPA act on the same resource metric (CPU or memory) for the same workload; they will conflict
- Cluster Autoscaler works alongside HPA to add/remove nodes when needed
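The point about resource requests is easier to see with numbers. The sketch below computes utilization the way the takeaway describes it, as usage relative to the requested amount, across a set of Pods that share the same CPU request. The function name `avg_utilization` and the sample figures are hypothetical, and the real controller's rounding may differ.

```shell
#!/bin/sh
# avg_utilization REQUEST_MILLICORES USAGE_1 [USAGE_2 ...]
# Hypothetical helper: prints total usage as a percentage of total requests,
# assuming every Pod has the same CPU request.
avg_utilization() {
  request=$1
  shift
  total=0
  count=0
  for usage in "$@"; do
    total=$(( total + usage ))
    count=$(( count + 1 ))
  done
  if [ "$request" -le 0 ] || [ "$count" -eq 0 ]; then
    # Without a request there is nothing to divide by
    echo "unknown"
    return 1
  fi
  echo $(( total * 100 / (request * count) ))
}

# Three Pods requesting 500m each, using 300m, 400m, and 350m
avg_utilization 500 300 400 350   # prints 70
```

Because the percentage is relative to the request and not to the node or the limit, a Pod using 500m against a 250m request reports 200%. A container with no request has no utilization figure at all, so a Utilization target cannot be evaluated for it.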