TechLead
Lesson 12 of 25
5 min read
Cloud & Kubernetes

Horizontal Pod Autoscaling

Configure Horizontal Pod Autoscalers (HPA) and Vertical Pod Autoscalers (VPA) for automatic scaling based on metrics

Autoscaling in Kubernetes

Kubernetes provides several autoscaling mechanisms to automatically adjust the resources allocated to your workloads based on demand. The most commonly used is the Horizontal Pod Autoscaler (HPA), which automatically scales the number of Pod replicas based on observed CPU utilization, memory usage, or custom metrics.

Types of Autoscaling

  • Horizontal Pod Autoscaler (HPA): Scales the number of Pod replicas up or down
  • Vertical Pod Autoscaler (VPA): Adjusts CPU and memory requests/limits for containers
  • Cluster Autoscaler: Scales the number of nodes in the cluster based on pending Pods
  • KEDA: Event-driven autoscaling for Kubernetes (scales on external metrics like queue length)

Horizontal Pod Autoscaler (HPA)

HPA automatically adjusts the number of Pod replicas in a Deployment, ReplicaSet, or StatefulSet. It queries the Metrics Server every 15 seconds by default and calculates the desired replica count based on the ratio of current metric value to target metric value.
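The replica calculation itself is simple enough to sketch. The snippet below is an illustration of the documented formula (desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue), with a default 10% tolerance band inside which no scaling happens), not actual controller code:

```python
import math

def desired_replicas(current_replicas, current_value, target_value, tolerance=0.1):
    """Sketch of the HPA formula:
    desired = ceil(current * (currentValue / targetValue)).
    If the ratio is within the tolerance band, keep the current count."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # within tolerance: no change
    return math.ceil(current_replicas * ratio)

# 4 replicas averaging 90% CPU against a 70% target -> scale up to 6
print(desired_replicas(4, 90, 70))  # 6
# 4 replicas at 72% against 70% is inside the 10% band -> no change
print(desired_replicas(4, 72, 70))  # 4
```

The ceiling means the HPA rounds up rather than risk under-provisioning, and the tolerance band prevents flapping when the metric hovers near the target.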

Prerequisites

# Install Metrics Server (required for HPA)
# Most managed K8s services have this pre-installed
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Verify Metrics Server is running
kubectl get deployment metrics-server -n kube-system
kubectl top nodes
kubectl top pods

HPA with CPU Metrics

# hpa-cpu.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Pods
        value: 4
        periodSeconds: 60
      - type: Percent
        value: 100
        periodSeconds: 60
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Pods
        value: 1
        periodSeconds: 120
      selectPolicy: Min
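The behavior block above rate-limits scaling. During scale-up, each policy caps how many replicas may be added per periodSeconds, and selectPolicy: Max picks the more permissive cap. A sketch of that interplay (illustrative, not controller code):

```python
import math

def max_scale_up(current, pods_limit=4, percent_limit=100):
    """Sketch of the scaleUp policies above with selectPolicy: Max.
    Each policy yields a ceiling on the new replica count;
    Max chooses the more permissive one."""
    by_pods = current + pods_limit                               # add at most 4 Pods
    by_percent = math.ceil(current * (1 + percent_limit / 100))  # add at most 100%
    return max(by_pods, by_percent)

print(max_scale_up(2))   # Pods policy allows 6, Percent allows 4 -> 6
print(max_scale_up(10))  # Pods policy allows 14, Percent allows 20 -> 20
```

Note the asymmetry in the manifest: scale-up is aggressive (up to doubling per minute, short stabilization window), while scale-down is conservative (one Pod per two minutes, five-minute window), which is the usual pattern for latency-sensitive services.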

HPA with Multiple Metrics

# hpa-multi-metric.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 3
  maxReplicas: 50
  metrics:
  # Scale based on CPU utilization
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
  # Scale based on memory utilization
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 75
  # Scale based on custom metric (requests per second)
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "1000"
  # Scale based on external metric (SQS queue length)
  - type: External
    external:
      metric:
        name: sqs_queue_length
        selector:
          matchLabels:
            queue: order-processing
      target:
        type: AverageValue
        averageValue: "5"
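When several metrics are configured, the HPA evaluates each one independently and uses the largest resulting replica count, so the most demanding metric wins. A sketch with hypothetical values:

```python
import math

def combined_desired(current, metrics):
    """Sketch: with multiple metrics, HPA computes a desired replica
    count per metric and takes the maximum.
    Each tuple is (currentValue, targetValue) for one metric."""
    return max(math.ceil(current * (cur / tgt)) for cur, tgt in metrics)

# 5 replicas: CPU 50% vs 60% target, memory 80% vs 75%, 1500 rps vs 1000 per Pod
print(combined_desired(5, [(50, 60), (80, 75), (1500, 1000)]))  # 8
```

Here CPU alone would allow 5 replicas and memory would ask for 6, but the requests-per-second metric demands 8, so 8 is used.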

Imperative HPA Management

# Create an HPA imperatively
kubectl autoscale deployment api-server \
  --cpu-percent=70 \
  --min=2 \
  --max=20 \
  -n production

# View HPA status
kubectl get hpa -n production
kubectl describe hpa api-server-hpa -n production

# Watch HPA in action
kubectl get hpa -n production -w

# Generate load to test autoscaling
kubectl run load-generator --image=busybox --rm -it -- /bin/sh -c \
  "while true; do wget -q -O- http://api-server; done"

# Delete HPA
kubectl delete hpa api-server-hpa -n production

Vertical Pod Autoscaler (VPA)

# vpa.yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Auto"    # Off, Initial, Recreate, or Auto
  resourcePolicy:
    containerPolicies:
    - containerName: api
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: 2
        memory: 2Gi
      controlledResources: ["cpu", "memory"]
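The resourcePolicy above bounds what the VPA recommender may set for each container. The clamping behavior can be sketched as follows (values in CPU millicores and MiB for simplicity; illustrative, not VPA code):

```python
def clamp_recommendation(recommended, min_allowed, max_allowed):
    """Sketch: VPA clamps its per-container recommendation to the
    [minAllowed, maxAllowed] range from the resourcePolicy."""
    return {res: min(max(recommended[res], min_allowed[res]), max_allowed[res])
            for res in recommended}

# Recommender suggests 3000m CPU / 64Mi memory;
# policy allows 100m-2000m CPU and 128Mi-2048Mi memory
print(clamp_recommendation({"cpu": 3000, "memory": 64},
                           {"cpu": 100, "memory": 128},
                           {"cpu": 2000, "memory": 2048}))
# {'cpu': 2000, 'memory': 128}
```

In "Auto" mode the VPA applies the clamped values by evicting and recreating Pods, so pair it with a PodDisruptionBudget to avoid taking down too many replicas at once.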

Key Takeaways

  • HPA scales the number of Pod replicas based on CPU, memory, or custom metrics
  • Always set resource requests on containers — HPA uses them to calculate utilization
  • Use scaling behavior to control the speed of scale-up and scale-down
  • VPA adjusts resource requests/limits instead of replica count
  • Do not use HPA and VPA on the same metric simultaneously — they will conflict
  • Cluster Autoscaler works alongside HPA to add/remove nodes when needed
