
Horizontal Pod Autoscaler in Microservices - Deep Dive

Overview - Horizontal Pod Autoscaler
What is it?
Horizontal Pod Autoscaler (HPA) is a system that automatically adjusts the number of running copies of a service (called pods) based on how busy they are. It watches metrics like CPU use or custom signals and adds or removes pods to keep the service running smoothly. This helps services handle changes in demand without manual intervention. It is commonly used in container orchestration platforms like Kubernetes.
Why it matters
Without HPA, services would either be overwhelmed during busy times or waste resources when demand is low. Manually scaling services is slow and error-prone, leading to poor user experience or high costs. HPA ensures services stay responsive and efficient by automatically matching resources to workload changes in real time.
Where it fits
Before learning HPA, you should understand containers, pods, and basic Kubernetes concepts like deployments and services. After mastering HPA, you can explore advanced scaling techniques like Vertical Pod Autoscaler, Cluster Autoscaler, and custom metrics for fine-tuned scaling.
Mental Model
Core Idea
Horizontal Pod Autoscaler automatically adjusts the number of service instances to match workload demand by monitoring resource usage or custom metrics.
Think of it like...
Imagine a restaurant that adds or removes tables based on how many customers arrive. When more people come in, the manager sets up more tables to serve them quickly. When fewer customers are present, some tables are removed to save space and staff effort.
┌─────────────────────────────┐
│       Horizontal Pod        │
│         Autoscaler          │
├─────────────┬───────────────┤
│  Metrics    │  Controller   │
│ (CPU, etc.) │  (Decision)   │
├─────────────┴───────────────┤
│  Adjust Pod Count (Scale)   │
│  ┌───────────────┐          │
│  │ Pod Instances │◄─────────┤
│  └───────────────┘          │
└─────────────────────────────┘
Build-Up - 7 Steps
1
Foundation: What is a Pod in Kubernetes
Concept: Introduce the basic unit of deployment called a pod, which runs one or more containers.
A pod is the smallest deployable unit in Kubernetes. It can contain one or more containers that share storage, network, and specifications. Pods are ephemeral and can be created or destroyed as needed.
Result
Understanding pods helps grasp what the Horizontal Pod Autoscaler scales — the number of these pod units.
Knowing pods as the basic building blocks clarifies what 'scaling pods' means in practice.
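As a concrete sketch (the name, image, and request values are placeholders), a minimal Pod manifest looks like this:

```yaml
# A minimal Pod: the smallest deployable unit that HPA counts and scales.
apiVersion: v1
kind: Pod
metadata:
  name: myapp-pod        # placeholder name
spec:
  containers:
  - name: web
    image: nginx:1.25    # example container image
    resources:
      requests:
        cpu: 100m        # CPU request; HPA utilization targets are relative to this
```

Declaring a CPU request matters later: HPA's percentage-based Utilization targets are computed against each container's requested resources.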
2
Foundation: Why Scale Pods Horizontally
Concept: Explain the need to increase or decrease pod count to handle varying workloads.
When a service gets more requests, a single pod may not handle all traffic efficiently. Adding more pods spreads the load, improving performance. Conversely, reducing pods saves resources when demand is low.
Result
Learners see the practical reason for horizontal scaling: balancing performance and cost.
Understanding the trade-off between resource use and responsiveness motivates autoscaling.
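To make the contrast concrete, here is a sketch of the manual alternative: a Deployment with a fixed replica count that someone must edit by hand whenever demand shifts (all names are placeholders):

```yaml
# Without HPA, the replica count is fixed until a human changes it.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 3            # static count: too few under load, wasteful when idle
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: web
        image: myapp:1.0   # placeholder image
        resources:
          requests:
            cpu: 250m
```

HPA automates exactly this `replicas` field, adjusting it continuously instead of waiting for an operator.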
3
Intermediate: How HPA Monitors Metrics
🤔 Before reading on: do you think HPA only uses CPU usage to decide scaling, or can it use other metrics too? Commit to your answer.
Concept: HPA watches resource metrics like CPU or memory, and can also use custom metrics to decide when to scale pods.
HPA periodically checks metrics from pods or external sources. The default is CPU utilization, but it can be configured to use memory or custom application metrics. This flexibility allows scaling based on what matters most for the service.
Result
Learners understand that HPA is not limited to one metric but can adapt to different workload signals.
Knowing HPA's metric flexibility enables designing smarter scaling strategies tailored to service needs.
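As an illustrative sketch (resource names are placeholders), an autoscaling/v2 HPA can list several metrics at once; when more than one is given, HPA follows whichever metric proposes the most replicas:

```yaml
# HPA is not limited to CPU: this sketch targets both CPU and memory.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 70
```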
4
Intermediate: HPA Scaling Algorithm Basics
🤔 Before reading on: do you think HPA instantly adds many pods when load spikes, or scales gradually? Commit to your answer.
Concept: HPA uses a control loop that compares current metrics to target values and adjusts pod count gradually to avoid instability.
HPA calculates the desired pod count as desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue), then moves toward that number smoothly. It avoids sudden large changes to prevent thrashing (rapid scaling up and down).
Result
Learners see how HPA balances responsiveness with stability in scaling decisions.
Understanding gradual scaling prevents expecting instant reactions and helps design stable systems.
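In the autoscaling/v2 API, this gradual behavior can be tuned through the spec.behavior field of the HPA; the fragment below is a sketch with illustrative values, not recommendations:

```yaml
# Fragment of an HPA spec: tuning how fast the control loop scales.
behavior:
  scaleUp:
    stabilizationWindowSeconds: 0      # react quickly to load spikes
    policies:
    - type: Percent
      value: 100                       # at most double the pod count per period
      periodSeconds: 60
  scaleDown:
    stabilizationWindowSeconds: 300    # wait 5 minutes before removing pods
    policies:
    - type: Pods
      value: 1                         # remove at most one pod per minute
      periodSeconds: 60
```

The asymmetry (fast up, slow down) is a common pattern: scaling up late hurts users, while scaling down late only costs a little money.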
5
Intermediate: Configuring HPA in Kubernetes
Concept: Show how to define HPA using Kubernetes YAML manifests specifying target metrics and pod limits.
An HPA resource specifies the target deployment, minimum and maximum pod counts, and metric targets. For example, you might target 50% CPU utilization with a minimum of 2 and a maximum of 10 pods. Kubernetes then manages the pod count automatically.
Result
Learners can create and customize HPA resources to control scaling behavior.
Knowing configuration options empowers learners to tailor autoscaling to their applications.
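A sketch of the manifest described above (resource names are placeholders):

```yaml
# HPA matching the example in the text: target 50% CPU, between 2 and 10 pods.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa          # placeholder name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp            # the Deployment whose replica count HPA manages
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
```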
6
Advanced: Using Custom Metrics for Scaling
🤔 Before reading on: do you think HPA can scale based on business metrics like request rate, or only system metrics? Commit to your answer.
Concept: HPA supports custom metrics, allowing scaling based on application-specific signals like request count or queue length.
By integrating with metrics adapters, HPA can use any metric exposed by the application or monitoring system. This enables scaling on meaningful business or performance indicators beyond CPU or memory.
Result
Learners appreciate how to implement smarter, context-aware autoscaling.
Understanding custom metrics unlocks advanced scaling strategies aligned with real business needs.
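An illustrative metrics fragment, assuming a metrics adapter (such as the Prometheus adapter) exposes a hypothetical requests_per_second metric for the target pods:

```yaml
# Fragment of an HPA spec: scaling on an application-level signal.
metrics:
- type: Pods
  pods:
    metric:
      name: requests_per_second   # hypothetical custom metric name
    target:
      type: AverageValue
      averageValue: "100"         # scale so each pod averages ~100 req/s
```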
7
Expert: HPA Interaction with Cluster Autoscaler
🤔 Before reading on: do you think HPA can add pods even if the cluster has no free nodes? Commit to your answer.
Concept: HPA scales pods, but if the cluster lacks resources, Cluster Autoscaler can add nodes to accommodate new pods, working together for full scaling.
HPA increases pod count based on metrics, but pods need nodes to run. If no nodes are free, Cluster Autoscaler adds nodes automatically. This coordination ensures scaling works end-to-end from workload to infrastructure.
Result
Learners understand the layered scaling system in Kubernetes and how HPA fits into it.
Knowing HPA's limits and its cooperation with Cluster Autoscaler prevents surprises in scaling behavior.
Under the Hood
HPA runs a control loop inside the Kubernetes control plane. It queries metrics APIs periodically, calculates desired pod count using a formula comparing current and target metrics, and updates the deployment's replica count. The Kubernetes scheduler then creates or removes pods accordingly. HPA supports multiple metric sources via the Metrics API and custom adapters.
Why designed this way?
HPA was designed to automate scaling in a cloud-native way, reducing manual effort and errors. Using a control loop with metrics allows reactive and adaptive scaling. The separation of pod scaling (HPA) and node scaling (Cluster Autoscaler) keeps concerns modular and manageable. Alternatives like manual scaling or fixed schedules were less flexible and efficient.
┌───────────────┐       ┌───────────────┐       ┌───────────────┐
│ Metrics API   │──────▶│ HPA Controller│──────▶│ Deployment    │
│ (CPU, Custom) │       │ (Control Loop)│       │ (Pod Count)   │
└───────────────┘       └───────────────┘       └───────────────┘
                                   │
                                   ▼
                          ┌──────────────────┐
                          │ Kubernetes       │
                          │ Scheduler & Node │
                          │ Management       │
                          └──────────────────┘
Myth Busters - 4 Common Misconceptions
Quick: Does HPA instantly add many pods when load spikes? Commit to yes or no.
Common Belief: HPA immediately adds a large number of pods as soon as load increases.
Reality: HPA scales pods gradually to avoid instability and thrashing, not instantly.
Why it matters: Expecting instant scaling can lead to misjudging system responsiveness and making poor tuning decisions.
Quick: Can HPA scale pods even if the cluster has no free nodes? Commit to yes or no.
Common Belief: HPA can add pods regardless of cluster resource availability.
Reality: HPA can request more pods, but if no nodes are free, pods remain pending until Cluster Autoscaler adds nodes or resources free up.
Why it matters: Ignoring cluster capacity can leave pods unscheduled, leading to service degradation.
Quick: Does HPA only use CPU metrics for scaling? Commit to yes or no.
Common Belief: HPA only supports CPU utilization as a metric for scaling decisions.
Reality: HPA supports multiple metrics, including memory and custom application metrics via adapters.
Why it matters: Limiting scaling to CPU can miss important workload signals, causing inefficient scaling.
Quick: Does HPA guarantee zero downtime during scaling? Commit to yes or no.
Common Belief: HPA ensures no downtime or request loss during scaling events.
Reality: Scaling can cause brief disruptions due to pod startup time or termination delays; additional strategies are needed for zero downtime.
Why it matters: Assuming perfect uptime can lead to insufficient readiness and liveness checks, causing user impact.
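As a sketch of such strategies, a pod template can declare a readiness probe and a graceful termination window so traffic only reaches pods that are ready to serve; the endpoint, port, and timings below are illustrative:

```yaml
# Fragment of a pod template: reduce disruption while pods come and go.
spec:
  terminationGracePeriodSeconds: 30   # allow in-flight requests to finish
  containers:
  - name: web
    image: myapp:1.0                  # placeholder image
    readinessProbe:
      httpGet:
        path: /healthz                # hypothetical health endpoint
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5
```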
Expert Zone
1
HPA's control loop interval and stabilization windows can be tuned to balance responsiveness and stability, which is often overlooked.
2
Custom metrics require careful design and reliable exposure to avoid scaling on noisy or stale data.
3
HPA does not scale stateful workloads well without additional coordination, as pod identity matters.
When NOT to use
HPA is not suitable for workloads that require vertical scaling (changing pod resources) or stateful applications needing fixed pod identities. In such cases, use Vertical Pod Autoscaler or StatefulSets with manual scaling.
Production Patterns
In production, HPA is combined with Cluster Autoscaler for full-stack scaling, uses custom metrics for business-driven scaling, and integrates with monitoring tools like Prometheus for metric collection and alerting.
Connections
Control Systems Engineering
HPA uses a feedback control loop similar to control systems that adjust outputs based on sensor inputs.
Understanding control loops in engineering helps grasp how HPA maintains desired performance by continuously adjusting pod counts.
Cloud Cost Optimization
HPA directly impacts cloud resource usage and costs by scaling pods to match demand.
Knowing HPA helps optimize cloud spending by avoiding over-provisioning and under-provisioning.
Restaurant Management
Like adjusting tables and staff based on customer flow, HPA adjusts pods based on workload.
This analogy clarifies the dynamic resource allocation concept in a familiar setting.
Common Pitfalls
#1 Setting the minimum pod count too low, causing service unavailability during spikes.
Wrong approach:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa
spec:
  scaleTargetRef:
    kind: Deployment
    name: myapp
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
Correct approach:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa
spec:
  scaleTargetRef:
    kind: Deployment
    name: myapp
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
Root cause: Misjudging the minimum pod count needed to absorb sudden load spikes leaves insufficient baseline capacity.
#2 Using only the CPU metric when the application bottleneck is request queue length.
Wrong approach:
metrics:
- type: Resource
  resource:
    name: cpu
    target:
      type: Utilization
      averageUtilization: 60
Correct approach:
metrics:
- type: Pods
  pods:
    metric:
      name: queue_length
    target:
      type: AverageValue
      averageValue: 100
Root cause: Assuming CPU is always the best metric ignores application-specific performance indicators.
#3 Expecting HPA to scale pods beyond cluster capacity without enabling Cluster Autoscaler.
Wrong approach: Deploy HPA without Cluster Autoscaler when no free nodes are available.
Correct approach: Enable Cluster Autoscaler alongside HPA so nodes are added when pods cannot be scheduled.
Root cause: Ignoring infrastructure limits causes pods to remain pending, degrading the service.
Key Takeaways
Horizontal Pod Autoscaler automatically adjusts the number of pods based on workload metrics to keep services responsive and efficient.
HPA uses a control loop that monitors metrics like CPU or custom signals and scales pods gradually to avoid instability.
It works best when combined with Cluster Autoscaler to ensure cluster resources match pod demands.
Custom metrics enable scaling based on meaningful business or application signals beyond system resource usage.
Understanding HPA's design and limits helps avoid common pitfalls and build reliable, cost-effective scalable systems.