
Horizontal Pod Autoscaler in Microservices - Deep Dive

Overview - Horizontal Pod Autoscaler
What is it?
Horizontal Pod Autoscaler (HPA) is a Kubernetes feature that automatically adjusts the number of running copies of a service (called pods) based on how busy they are. It watches metrics like CPU use or custom signals and adds or removes pods to keep the service running smoothly. This helps services handle changes in demand without manual intervention. It is built into Kubernetes as a standard API resource and controller.
Why it matters
Without HPA, services would either be overwhelmed during busy times or waste resources when demand is low. Manually scaling services is slow and error-prone, leading to poor user experience or high costs. HPA ensures services stay responsive and efficient by automatically matching resources to workload changes in real time.
Where it fits
Before learning HPA, you should understand containers, pods, and basic Kubernetes concepts like deployments and services. After mastering HPA, you can explore advanced scaling techniques like Vertical Pod Autoscaler, Cluster Autoscaler, and custom metrics for fine-tuned scaling.
Mental Model
Core Idea
Horizontal Pod Autoscaler automatically adjusts the number of service instances to match workload demand by monitoring resource usage or custom metrics.
Think of it like...
Imagine a restaurant that adds or removes tables based on how many customers arrive. When more people come in, the manager sets up more tables to serve them quickly. When fewer customers are present, some tables are removed to save space and staff effort.
┌─────────────────────────────┐
│       Horizontal Pod         │
│        Autoscaler            │
├─────────────┬───────────────┤
│  Metrics    │   Controller  │
│ (CPU, etc) │  (Decision)   │
├─────────────┴───────────────┤
│   Adjust Pod Count (Scale)  │
│  ┌───────────────┐          │
│  │ Pod Instances │◄─────────┤
│  └───────────────┘          │
└─────────────────────────────┘
Build-Up - 7 Steps
1. Foundation: What is a Pod in Kubernetes
Concept: Introduce the basic unit of deployment called a pod, which runs one or more containers.
A pod is the smallest deployable unit in Kubernetes. It can contain one or more containers that share storage, network, and specifications. Pods are ephemeral and can be created or destroyed as needed.
Result
Understanding pods helps grasp what the Horizontal Pod Autoscaler scales — the number of these pod units.
Knowing pods as the basic building blocks clarifies what 'scaling pods' means in practice.
2. Foundation: Why Scale Pods Horizontally
Concept: Explain the need to increase or decrease pod count to handle varying workloads.
When a service gets more requests, a single pod may not handle all traffic efficiently. Adding more pods spreads the load, improving performance. Conversely, reducing pods saves resources when demand is low.
Result
Learners see the practical reason for horizontal scaling: balancing performance and cost.
Understanding the trade-off between resource use and responsiveness motivates autoscaling.
3. Intermediate: How HPA Monitors Metrics
Before reading on: do you think HPA only uses CPU usage to decide scaling, or can it use other metrics too? Commit to your answer.
Concept: HPA watches resource metrics like CPU or memory, and can also use custom metrics to decide when to scale pods.
HPA periodically checks metrics from pods or external sources. The default is CPU utilization, but it can be configured to use memory or custom application metrics. This flexibility allows scaling based on what matters most for the service.
Result
Learners understand that HPA is not limited to one metric but can adapt to different workload signals.
Knowing HPA's metric flexibility enables designing smarter scaling strategies tailored to service needs.
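CPU and memory utilization targets are expressed as a percentage of each container's resource request, so the workload being scaled needs requests set for the percentage to mean anything. A minimal sketch of such a Deployment, assuming a hypothetical app named `myapp` and a placeholder image (neither comes from a real system):

```yaml
# Hypothetical Deployment; the name, image, and resource values are illustrative assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 2
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: myapp
        image: example.com/myapp:1.0   # placeholder image reference
        resources:
          requests:
            cpu: 250m        # a 50% utilization target would mean about 125m average use per pod
            memory: 256Mi
```

With these requests in place, a CPU utilization target is measured against 250m per pod rather than against the node's total capacity.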
4. Intermediate: HPA Scaling Algorithm Basics
Before reading on: do you think HPA instantly adds many pods when load spikes, or scales gradually? Commit to your answer.
Concept: HPA uses a control loop that compares current metrics to target values and recomputes the pod count, with built-in damping to avoid instability.
On each pass, HPA multiplies the current pod count by the ratio of the current metric value to the target metric value and rounds up. It skips the change when that ratio is close to 1 and, by default, holds back scale-down for a stabilization window, which prevents thrashing (rapid scaling up and down). Scale-up is not instant either: metrics are sampled periodically, and new pods need time to be scheduled and become ready.
Result
Learners see how HPA balances responsiveness with stability in scaling decisions.
Understanding this damping prevents expecting instant reactions and helps design stable systems.
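The calculation HPA performs on each pass of its control loop can be written as a formula. This follows the replica calculation described in the Kubernetes documentation; the numbers in the worked example are made up for illustration.

```latex
\[
\text{desiredReplicas} = \left\lceil \text{currentReplicas} \times
  \frac{\text{currentMetricValue}}{\text{targetMetricValue}} \right\rceil
\]
% Illustrative example: 4 pods averaging 80% CPU against a 50% target
\[
\left\lceil 4 \times \frac{80}{50} \right\rceil = \lceil 6.4 \rceil = 7
\]
```

The result is then clamped to the range between the configured minimum and maximum pod counts.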
5. Intermediate: Configuring HPA in Kubernetes
Concept: Show how to define HPA using Kubernetes YAML manifests specifying target metrics and pod limits.
An HPA resource includes the target deployment, minimum and maximum pod counts, and metric targets. For example, you can set a CPU utilization target of 50% with a minimum of 2 and a maximum of 10 pods. Kubernetes then manages the pod count automatically.
Result
Learners can create and customize HPA resources to control scaling behavior.
Knowing configuration options empowers learners to tailor autoscaling to their applications.
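The 50% CPU, 2 to 10 pod example could look roughly like this as a manifest. It is a sketch that assumes a Deployment named `myapp` already exists; the HPA name `myapp-hpa` is likewise only illustrative.

```yaml
# Sketch of an HPA for a hypothetical Deployment called "myapp".
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa
spec:
  scaleTargetRef:            # the workload whose replica count HPA manages
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 2             # never scale below 2 pods
  maxReplicas: 10            # never scale above 10 pods
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50   # percent of each pod's CPU request, averaged across pods
```

Applying it with `kubectl apply -f` and then running `kubectl get hpa` shows the current and target utilization alongside the replica count.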
6. Advanced: Using Custom Metrics for Scaling
Before reading on: do you think HPA can scale based on business metrics like request rate, or only system metrics? Commit to your answer.
Concept: HPA supports custom metrics, allowing scaling based on application-specific signals like request count or queue length.
By integrating with metrics adapters, HPA can use any metric exposed by the application or monitoring system. This enables scaling on meaningful business or performance indicators beyond CPU or memory.
Result
Learners appreciate how to implement smarter, context-aware autoscaling.
Understanding custom metrics unlocks advanced scaling strategies aligned with real business needs.
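A custom-metric target might be sketched as follows. The metric name `http_requests_per_second` and the Deployment name `myapp` are hypothetical, and the manifest only works if a custom metrics adapter in the cluster actually exposes a per-pod metric with that name.

```yaml
# Sketch only: assumes a metrics adapter serves a per-pod metric named
# "http_requests_per_second" (hypothetical) through the custom metrics API.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-requests-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods                         # per-pod custom metric, averaged across pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"            # aim for about 100 requests per second per pod
```

The `Pods` type averages the metric across all pods of the target, so the replica count grows roughly in proportion to total request rate.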
7. Expert: HPA Interaction with Cluster Autoscaler
Before reading on: do you think HPA can add pods even if the cluster has no free nodes? Commit to your answer.
Concept: HPA scales pods, but if the cluster lacks resources, Cluster Autoscaler can add nodes to accommodate new pods, working together for full scaling.
HPA increases pod count based on metrics, but pods need nodes to run. If no node has enough free capacity, the new pods stay Pending; when Cluster Autoscaler is installed, it detects them and adds nodes automatically. This coordination ensures scaling works end-to-end from workload to infrastructure.
Result
Learners understand the layered scaling system in Kubernetes and how HPA fits into it.
Knowing HPA's limits and its cooperation with Cluster Autoscaler prevents surprises in scaling behavior.
Under the Hood
HPA runs a control loop inside the Kubernetes control plane. It queries metrics APIs periodically, calculates the desired pod count using a formula comparing current and target metrics, and updates the replica count of the target (for example, a Deployment). The Deployment's own controllers then create or remove pods, and the Kubernetes scheduler places new pods on nodes. HPA supports multiple metric sources via the Metrics API and custom adapters.
Why designed this way?
HPA was designed to automate scaling in a cloud-native way, reducing manual effort and errors. Using a control loop with metrics allows reactive and adaptive scaling. The separation of pod scaling (HPA) and node scaling (Cluster Autoscaler) keeps concerns modular and manageable. Alternatives like manual scaling or fixed schedules were less flexible and efficient.
┌───────────────┐       ┌───────────────┐       ┌───────────────┐
│ Metrics API   │──────▶│ HPA Controller│──────▶│ Deployment    │
│ (CPU, Custom) │       │ (Control Loop)│       │ (Pod Count)   │
└───────────────┘       └───────────────┘       └───────────────┘
                                   │
                                   ▼
                          ┌─────────────────┐
                          │ Kubernetes       │
                          │ Scheduler & Node │
                          │ Management       │
                          └─────────────────┘
Myth Busters - 4 Common Misconceptions
Quick: Does HPA instantly add many pods when load spikes? Commit to yes or no.
Common Belief: HPA immediately adds a large number of pods as soon as load increases.
Reality: HPA reacts on its next evaluation cycle, not instantly. Scale-up can be rate-limited by its scaling policies, scale-down is delayed by a stabilization window to avoid thrashing, and new pods still need time to start.
Why it matters: Expecting instant scaling can lead to misjudging system responsiveness and cause poor tuning decisions.
Quick: Can HPA scale pods even if the cluster has no free nodes? Commit to yes or no.
Common Belief: HPA can add pods regardless of cluster resource availability.
Reality: HPA can request more pods, but if no nodes are free, pods remain pending until Cluster Autoscaler adds nodes or resources free up.
Why it matters: Ignoring cluster capacity can cause pods to stay unscheduled, leading to service degradation.
Quick: Does HPA only use CPU metrics for scaling? Commit to yes or no.
Common Belief: HPA only supports CPU utilization as a metric for scaling decisions.
Reality: HPA supports multiple metrics including memory and custom application metrics via adapters.
Why it matters: Limiting scaling to CPU can miss important workload signals, causing inefficient scaling.
Quick: Does HPA guarantee zero downtime during scaling? Commit to yes or no.
Common Belief: HPA ensures no downtime or request loss during scaling events.
Reality: Scaling can cause brief disruptions due to pod startup time or termination delays; additional strategies are needed for zero downtime.
Why it matters: Assuming perfect uptime can lead to insufficient readiness and liveness checks, causing user impact.
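One common way to reduce disruption during scaling is to pair HPA with a readiness probe and a graceful shutdown window on the pods themselves. The fragment below is a sketch of a pod template; the `/healthz` path, port 8080, image, and timings are all assumptions chosen for illustration.

```yaml
# Pod template fragment; paths, ports, and timings are illustrative assumptions.
spec:
  terminationGracePeriodSeconds: 30     # time allowed for shutdown before the pod is killed
  containers:
  - name: myapp
    image: example.com/myapp:1.0        # placeholder image reference
    ports:
    - containerPort: 8080
    readinessProbe:                     # new pods receive traffic only after this passes
      httpGet:
        path: /healthz                  # hypothetical health endpoint
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5
    lifecycle:
      preStop:                          # short pause so in-flight requests can drain on scale-down
        exec:
          command: ["sh", "-c", "sleep 10"]   # assumes the image includes a shell
```

The readiness probe covers the scale-up side (no traffic to pods that are still starting), while the `preStop` pause and grace period cover the scale-down side.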
Expert Zone
1. HPA's control loop interval and stabilization windows can be tuned to balance responsiveness and stability, which is often overlooked.
2. Custom metrics require careful design and reliable exposure to avoid scaling on noisy or stale data.
3. HPA does not scale stateful workloads well without additional coordination, as pod identity matters.
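Stabilization windows and scaling rate limits are set per HPA through the `behavior` field of the `autoscaling/v2` API. The fragment below is a sketch that belongs under an HPA's `spec`; the specific numbers are illustrative, not recommendations.

```yaml
# Fragment of an autoscaling/v2 HPA spec; the numbers are illustrative, not recommendations.
behavior:
  scaleUp:
    stabilizationWindowSeconds: 0       # act on scale-up recommendations immediately
    policies:
    - type: Percent
      value: 100                        # at most double the pod count...
      periodSeconds: 60                 # ...within any 60-second period
  scaleDown:
    stabilizationWindowSeconds: 300     # use the highest recommendation from the last 5 minutes
    policies:
    - type: Pods
      value: 1                          # remove at most one pod...
      periodSeconds: 60                 # ...per 60-second period
```

The control loop interval, by contrast, is a cluster-wide setting on the controller rather than a field on an individual HPA.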
When NOT to use
HPA is not suitable for workloads that require vertical scaling (changing pod resources) or stateful applications needing fixed pod identities. In such cases, use Vertical Pod Autoscaler or StatefulSets with manual scaling.
Production Patterns
In production, HPA is combined with Cluster Autoscaler for full-stack scaling, uses custom metrics for business-driven scaling, and integrates with monitoring tools like Prometheus for metric collection and alerting.
Connections
Control Systems Engineering
HPA uses a feedback control loop similar to control systems that adjust outputs based on sensor inputs.
Understanding control loops in engineering helps grasp how HPA maintains desired performance by continuously adjusting pod counts.
Cloud Cost Optimization
HPA directly impacts cloud resource usage and costs by scaling pods to match demand.
Knowing HPA helps optimize cloud spending by avoiding over-provisioning and under-provisioning.
Restaurant Management
Like adjusting tables and staff based on customer flow, HPA adjusts pods based on workload.
This analogy clarifies the dynamic resource allocation concept in a familiar setting.
Common Pitfalls
#1 Setting the minimum pod count too low, causing service unavailability during spikes.
Wrong approach:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
Correct approach:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
Root cause: Misunderstanding the minimum number of pods needed to handle sudden load spikes leads to insufficient baseline capacity.
#2 Using only the CPU metric when the application bottleneck is request queue length.
Wrong approach:
metrics:
- type: Resource
  resource:
    name: cpu
    target:
      type: Utilization
      averageUtilization: 60
Correct approach:
metrics:
- type: Pods
  pods:
    metric:
      name: queue_length
    target:
      type: AverageValue
      averageValue: 100
Root cause: Assuming CPU is always the best metric ignores application-specific performance indicators.
#3 Expecting HPA to scale pods beyond cluster capacity without enabling Cluster Autoscaler.
Wrong approach: Deploy HPA without Cluster Autoscaler and with no free nodes available.
Correct approach: Enable Cluster Autoscaler alongside HPA to add nodes when pods cannot be scheduled.
Root cause: Not considering infrastructure limits causes pods to remain pending and the service to degrade.
Key Takeaways
Horizontal Pod Autoscaler automatically adjusts the number of pods based on workload metrics to keep services responsive and efficient.
HPA uses a control loop that monitors metrics like CPU or custom signals and damps its scaling decisions to avoid instability.
It works best when combined with Cluster Autoscaler to ensure cluster resources match pod demands.
Custom metrics enable scaling based on meaningful business or application signals beyond system resource usage.
Understanding HPA's design and limits helps avoid common pitfalls and build reliable, cost-effective scalable systems.

Practice

1. What is the primary purpose of a Horizontal Pod Autoscaler in a Kubernetes microservices environment?
easy
A. Store persistent data for pods
B. Manually restart pods when they fail
C. Balance network traffic between pods
D. Automatically adjust the number of pods based on CPU or custom metrics

Solution

  1. Step 1: Understand the role of Horizontal Pod Autoscaler

    It is designed to monitor resource usage like CPU or custom metrics and adjust pod count automatically.
  2. Step 2: Compare options with this role

    Only option D describes automatic scaling based on load, which matches the autoscaler's purpose.
  3. Final Answer:

    Automatically adjust the number of pods based on CPU or custom metrics -> Option D
  4. Quick Check:

    Autoscaler adjusts pods automatically = D [OK]
Hint: Autoscaler changes pod count automatically based on load [OK]
Common Mistakes:
  • Confusing autoscaler with manual pod management
  • Thinking it balances network traffic
  • Assuming it stores data persistently
2. Which of the following is the correct YAML snippet to define a Horizontal Pod Autoscaler targeting CPU utilization at 50% for a deployment named web-app?
easy
A.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
B.
apiVersion: v1
kind: Pod
metadata:
  name: web-app
spec:
  containers:
  - name: web-app
    image: web-app:latest
C.
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 50
D.
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 50

Solution

  1. Step 1: Identify correct API version and fields for CPU target

    autoscaling/v1 supports targetCPUUtilizationPercentage directly; v2 requires metrics array.
  2. Step 2: Check min/max replicas and target CPU utilization

    Option C uses autoscaling/v1 with minReplicas 2, maxReplicas 10, and targetCPUUtilizationPercentage 50, which is valid syntax and matches the 50% CPU target. Option A targets 70% instead of 50%, option B defines a Pod rather than an autoscaler, and option D targets memory instead of CPU.
  3. Final Answer:

    YAML with autoscaling/v1 and targetCPUUtilizationPercentage 50% -> Option C
  4. Quick Check:

    autoscaling/v1 + targetCPUUtilizationPercentage = C [OK]
Hint: autoscaling/v1 uses targetCPUUtilizationPercentage field [OK]
Common Mistakes:
  • Using wrong apiVersion for the fields
  • Confusing CPU with memory metrics
  • Setting minReplicas higher than maxReplicas
3. Given this Horizontal Pod Autoscaler configuration:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 2
  maxReplicas: 6
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60

If the current CPU usage is 90% and there are 3 pods running, how many pods will the autoscaler try to set?
medium
A. 5 pods
B. 3 pods
C. 6 pods
D. 4 pods

Solution

  1. Step 1: Understand scaling formula based on CPU utilization

    Desired replicas = current replicas * (current CPU / target CPU) = 3 * (90/60) = 4.5
  2. Step 2: Round up and check min/max limits

    4.5 rounds up to 5, which is between minReplicas 2 and maxReplicas 6, so 5 pods will be set.
  3. Final Answer:

    5 pods -> Option A
  4. Quick Check:

    3 * (90/60) = 4.5 -> 5 pods [OK]
Hint: Multiply current pods by (current CPU ÷ target CPU) [OK]
Common Mistakes:
  • Rounding down instead of up
  • Ignoring min/max replica limits
  • Using target CPU as current CPU
4. You configured a Horizontal Pod Autoscaler but notice it never scales pods beyond the minimum replicas even under high load. What is the most likely cause?
medium
A. The maxReplicas is set lower than minReplicas
B. The metrics server is not running or not providing metrics
C. The deployment has too many replicas already
D. The pods are using too little CPU

Solution

  1. Step 1: Check autoscaler dependency on metrics

    Horizontal Pod Autoscaler reads resource metrics such as CPU from the metrics server (custom metrics come from a separate metrics adapter) and uses them to decide scaling.
  2. Step 2: Understand effect of missing metrics

    If the metrics server is missing or not providing data, the autoscaler cannot measure load, so it never scales up and the deployment stays at its minimum replica count.
  3. Final Answer:

    The metrics server is not running or not providing metrics -> Option B
  4. Quick Check:

    Missing metrics = no scaling beyond minReplicas [OK]
Hint: Autoscaler needs metrics server to scale pods [OK]
Common Mistakes:
  • Assuming maxReplicas lower than minReplicas causes this
  • Thinking high load always triggers scaling
  • Ignoring metrics server setup
5. You want to design a microservices system that scales pods horizontally based on both CPU usage and custom queue length metrics. Which approach best uses Horizontal Pod Autoscaler to achieve this?
hard
A. Configure HPA with multiple metrics: CPU utilization and custom queue length, setting thresholds for both
B. Use two separate HPAs, one for CPU and one for queue length, targeting the same deployment
C. Scale pods manually based on CPU and queue length metrics collected externally
D. Configure HPA to scale only on CPU and ignore queue length metrics

Solution

  1. Step 1: Understand HPA multi-metric support

    Horizontal Pod Autoscaler supports multiple metrics in a single configuration. It computes a desired pod count for each metric and uses the largest one.
  2. Step 2: Evaluate options for best practice

    Option A uses multiple metrics in one HPA, which is efficient and avoids conflicts from multiple HPAs targeting the same deployment.
  3. Final Answer:

    Configure HPA with multiple metrics: CPU utilization and custom queue length, setting thresholds for both -> Option A
  4. Quick Check:

    Single HPA with multiple metrics = A [OK]
Hint: Use one HPA with multiple metrics for combined scaling [OK]
Common Mistakes:
  • Using multiple HPAs on same deployment causing conflicts
  • Ignoring custom metrics support
  • Relying only on CPU metrics
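For reference, a single HPA combining both signals from question 5 might be sketched like this. The Deployment name `worker` and the target values are hypothetical, and `queue_length` has to be exposed per pod by a custom metrics adapter for the second entry to work.

```yaml
# Sketch only: "worker" and the target values are illustrative; "queue_length"
# must be served by a custom metrics adapter in the cluster.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: worker
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource                 # metric 1: CPU utilization
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
  - type: Pods                     # metric 2: per-pod queue length
    pods:
      metric:
        name: queue_length
      target:
        type: AverageValue
        averageValue: "100"
```

HPA evaluates each metric separately and scales to the largest pod count any of them asks for, so either a CPU spike or a growing queue can trigger scale-up.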