Kubernetes Horizontal Pod Autoscaler (HPA) automatically adjusts the number of pod replicas in your Kubernetes cluster based on CPU or memory usage metrics. This capability ensures your application remains responsive and performs well under varying traffic loads. In this article, we will cover the basics of Kubernetes HPA, how it works, best practices, and a hands-on example to help you master this critical Kubernetes feature.
Key Takeaways
- The Horizontal Pod Autoscaler (HPA) automatically adjusts the number of pod replicas in Kubernetes based on CPU and memory metrics, ensuring stable application performance under varying loads.
- Proper setup of the Metrics Server is essential, as it provides the necessary resource metrics for HPA to make informed scaling decisions; accurate HPA configuration specifies minimum and maximum replicas along with target utilization.
- Integrating HPA with other autoscalers such as the Vertical Pod Autoscaler (VPA) and Cluster Autoscaler enhances scaling strategies, allowing for efficient resource management and responsiveness to varying application demands.
Understanding Horizontal Pod Autoscaler (HPA)
At its core, the Horizontal Pod Autoscaler (HPA) is a Kubernetes resource that automatically adjusts the number of pod replicas based on observed CPU and memory usage metrics. This dynamic adjustment ensures your application maintains stable performance even as traffic fluctuates, making it a critical component for production workloads. Horizontal pod autoscaling operates by continuously monitoring the specified metrics and making scaling decisions to match demand.
The HPA uses resource metrics such as CPU and memory, as well as custom metrics, to determine the appropriate number of replicas. For example, if average CPU utilization exceeds a predefined threshold, HPA increases the number of replicas to distribute the load. Conversely, when demand decreases, HPA reduces the number of replicas, optimizing resource usage and reducing costs. The custom metrics API can also be leveraged to scale on application-specific signals.
This automatic scaling mechanism helps maintain application performance and reliability without manual intervention.
How HPA Works
Understanding how the Horizontal Pod Autoscaler (HPA) operates is crucial for leveraging its full potential. HPA uses a control loop mechanism that typically runs every 15 seconds to manage scaling decisions. During each loop, HPA:
- Queries resource usage metrics such as CPU and memory utilization
- Uses metrics defined in the HorizontalPodAutoscaler configuration
- Refers to target utilization values specified in the configuration
Once HPA gathers the current resource metrics, it calculates the desired number of pod replicas (using the formula shown just after this list) by:
- Comparing the current metrics to the target metrics.
- Using the ratio of current to desired metric values to determine if scaling actions are needed.
- Increasing the number of replicas if the current CPU utilization exceeds the target.
- Decreasing the number of replicas if the current CPU utilization falls below the target.
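Concretely, the HPA controller applies the scaling formula documented for Kubernetes: desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). For example, two replicas averaging 90% CPU against a 45% target give ceil(2 × 90 / 45) = 4 replicas.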
After determining the desired number of replicas, HPA adjusts the actual replica count by adding or removing pods. This adjustment considers pod readiness and stabilization windows to prevent rapid fluctuations. By continuously monitoring and responding to resource metrics, HPA ensures that your application can handle varying loads while maintaining optimal performance.
Setting Up Metrics Server
Before diving into HPA configuration, setting up the Kubernetes Metrics Server is a crucial step. The Metrics Server provides the resource metrics HPA needs to make informed scaling decisions. The easiest way to install the Metrics Server is (a typical command sequence is sketched after this list):
- Download the YAML manifest (for example, from https://github.com/nonai/k8s-example-files/tree/main/metrics-server).
- Apply the manifest.
- Verify the Metrics Server pods are running using kubectl.
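For reference, a minimal install-and-verify sequence looks like the sketch below. It assumes you install the upstream kubernetes-sigs release manifest rather than a mirrored copy, and that the pod carries the standard k8s-app=metrics-server label.

```shell
# Install the Metrics Server from the upstream kubernetes-sigs release manifest
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Verify that the metrics-server pod is running (standard label assumed)
kubectl get pods -n kube-system -l k8s-app=metrics-server

# Once metrics start flowing, node and pod usage should be reported
kubectl top nodes
kubectl top pods --all-namespaces
```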
For the Metrics Server to function correctly, it must be able to reach the Kubelet on each node, from which it gathers resource metrics every 15 seconds. Additionally, ensure that the kube-apiserver has the aggregation layer enabled so the Metrics Server can register the metrics API. This setup allows for seamless integration and accurate metric gathering, which is crucial for HPA to function effectively.
After setting up the Metrics Server, it’s essential to verify its installation and configuration. This can be done by checking CPU and memory metrics with kubectl in your cluster. Proper installation and configuration of the Metrics Server lay the foundation for effective autoscaling with HPA, ensuring that your cluster has access to real-time resource metrics through the metrics API.
Configuring HPA in Your Kubernetes Cluster
Configuring the Horizontal Pod Autoscaler (HPA) in your Kubernetes cluster involves creating a HorizontalPodAutoscaler object, either from a YAML manifest or with the kubectl autoscale command. This configuration specifies critical parameters such as the minimum and maximum number of replicas and the target CPU utilization percentage. Accurate pod resource requests and limits are essential for effective scaling, since HPA computes utilization as a percentage of the requested resources.
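As a sketch, an autoscaling/v2 manifest for such an HPA might look like the following; the target Deployment name php-apache and the 50% CPU target are illustrative values that match the walkthrough later in this article.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: php-apache
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: php-apache          # the workload to scale
  minReplicas: 1              # never scale below this
  maxReplicas: 10             # never scale above this
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50   # target average CPU utilization across all pods
```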
The HPA increases the number of replicas when average CPU utilization exceeds the target and reduces them when it drops below the target range. This dynamic adjustment ensures that your application can handle varying CPU load efficiently, with the HPA controller maintaining an optimal replica count.
To verify the status of the HPA, you can use the kubectl get hpa command, which shows the current number of replicas and utilization metrics. By properly configuring HPA, you can ensure that your application scales automatically based on real-time resource usage, maintaining performance and optimizing resource utilization. This setup allows you to focus on other critical aspects of your application while HPA manages the scaling dynamically.
Practical Example: Implementing HPA
A practical example will help grasp the power of HPA. We’ll deploy a sample application, create a Kubernetes service to expose it, and apply HPA to manage its scaling. This hands-on approach will help you visualize how HPA functions in a real-world scenario.
We’ll start by deploying a php-apache web server, a simple yet effective application to demonstrate HPA’s functionality. Next, we’ll create a Kubernetes service to expose the application via a public endpoint.
Finally, we’ll configure and apply HPA to manage the pod replicas based on CPU utilization. This practical example will solidify your understanding of HPA and its benefits.
Deploy Sample Application
To begin, we’ll deploy a php-apache web server using a deployment file named ‘deployment.yml’. Key details of this deployment include:
- Scaling bounds of 1 to 10 replicas, enforced by the HPA we apply later.
- The application simulates high CPU utilization using a CPU stress test to demonstrate HPA’s functionality.
- The php-apache server is accessible on port 80, making it easy to visualize the scaling behavior.
The purpose of this deployment is to create a controlled environment where we can observe how HPA responds to changes in CPU usage. A stress test generates high CPU usage, triggering HPA to dynamically scale the number of replicas. This setup provides a clear demonstration of HPA’s capabilities and how it helps maintain application performance under varying loads.
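As a rough sketch, ‘deployment.yml’ could look like the standard php-apache example from the Kubernetes documentation; the image and the CPU request/limit values below are assumptions, but note that HPA measures utilization as a percentage of the CPU request.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: php-apache
spec:
  selector:
    matchLabels:
      run: php-apache
  template:
    metadata:
      labels:
        run: php-apache
    spec:
      containers:
        - name: php-apache
          image: registry.k8s.io/hpa-example   # small PHP page that burns CPU on every request
          ports:
            - containerPort: 80
          resources:
            requests:
              cpu: 200m    # HPA utilization is calculated relative to this request
            limits:
              cpu: 500m
```

It is created with kubectl apply -f deployment.yml.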
Create Kubernetes Service
Next, we’ll create a Kubernetes service to expose the php-apache application via a public endpoint. The process involves:
- Creating the service configuration file, ‘service.yaml’, which defines the necessary specifications to expose the application to external traffic.
- Creating the service using the configuration file.
- Listing the service and checking its status using kubectl commands.
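One possible ‘service.yaml’ is sketched below. The type: LoadBalancer field is an assumption that gives a public endpoint on cloud providers; the default ClusterIP type is enough if you only generate load from inside the cluster.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: php-apache
  labels:
    run: php-apache
spec:
  type: LoadBalancer   # assumption: expose a public endpoint; omit for an in-cluster ClusterIP service
  selector:
    run: php-apache
  ports:
    - port: 80
```

Applying it with kubectl apply -f service.yaml and then running kubectl get svc php-apache covers steps 2 and 3 of the list above.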
Creating a Kubernetes service is crucial for making the application accessible to users and for testing the HPA functionality. By exposing the application, we can simulate real-world traffic and observe how HPA scales the pod replicas in response to changes in load.
This step ensures that the application is ready for the next phase: applying the HPA.
Apply Horizontal Pod Autoscaler
With the sample application deployed and exposed via a Kubernetes service, it’s time to apply the Horizontal Pod Autoscaler (HPA). The HPA configuration specifies the target deployment and the desired CPU utilization percentage. This configuration ensures that the number of pod replicas scales dynamically based on observed metrics.
After configuring the HPA, it can be applied to the Kubernetes cluster using kubectl commands. This step enables HPA to manage the scaling of pod replicas automatically, ensuring that the application can handle varying loads efficiently.
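Assuming the HPA manifest shown earlier is saved as hpa.yaml (the filename is arbitrary), the commands look roughly like this:

```shell
# Declaratively, from the manifest shown earlier (hypothetical filename)
kubectl apply -f hpa.yaml

# Or imperatively, letting kubectl create the equivalent object
kubectl autoscale deployment php-apache --cpu-percent=50 --min=1 --max=10

# Confirm the HPA exists and see its current metrics and replica count
kubectl get hpa php-apache
```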
Applying the HPA lets you observe how it dynamically adjusts the number of replicas in response to changes in observed CPU utilization, demonstrating its effectiveness.
Testing HPA Functionality
Testing the functionality of HPA is crucial to ensure it behaves as expected under different load conditions. The process involves:
- Increasing the load on the application to observe how HPA scales the replicas.
- Monitoring the scaling events to track HPA’s response to the load changes.
- Decreasing the load to verify HPA’s ability to scale down the replicas.
By following these steps, we can validate that HPA is functioning correctly and dynamically adjusting the number of pod replicas based on CPU utilization. This testing process ensures that HPA can handle real-world traffic variations and maintain optimal application performance.
Increase Load
To simulate increased load on the application, start a separate pod that continuously sends requests to the php-apache service. This can be done by executing a specific command from a new terminal, generating an infinite loop of queries to the service. Additionally, load-testing tools can be used to continuously generate requests.
As the traffic to the application increases, use the kubectl get hpa command to observe HPA’s response. You should see an increase in the number of replicas, indicating that HPA is scaling the application to handle the increased load. This step demonstrates HPA’s ability to dynamically adjust the number of replicas based on CPU utilization.
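Putting the two steps together, the Kubernetes documentation uses a load-generator pattern along these lines; the busybox image tag and the 0.01-second sleep interval are illustrative.

```shell
# In one terminal: run a throwaway pod that continuously queries the php-apache service
kubectl run -i --tty load-generator --rm --image=busybox:1.28 --restart=Never \
  -- /bin/sh -c "while sleep 0.01; do wget -q -O- http://php-apache; done"

# In another terminal: watch the HPA react as CPU utilization climbs
kubectl get hpa php-apache --watch
```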
Monitor Scaling Events
Monitoring HPA scaling events is essential to ensure it adapts to load changes as expected. You can inspect the Events section of the HPA object with kubectl to track replica changes over time. This monitoring validates HPA’s functionality by observing the scaling actions taken in response to load changes.
Effective monitoring of HPA scaling events ensures that your application can handle varying loads without performance degradation. By tracking these events, you can confirm that HPA continuously monitors the number of replicas to maintain optimal performance and resource utilization.
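In practice, the Events section of the HPA object records each scaling decision, so a couple of commands along these lines are usually enough:

```shell
# Show current vs. target metrics plus an Events section listing scale-up and scale-down actions
kubectl describe hpa php-apache

# Watch the deployment's replica count change in real time
kubectl get deployment php-apache --watch
```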
Decrease Load
Once the load generation is stopped (press Ctrl+C in the terminal running the load generator), run “kubectl get deployment php-apache” to verify the final state of the deployment. Check the adjusted number of replicas to ensure that HPA has scaled the deployment down as expected.
As CPU utilization decreases, the deployment should scale back down to the configured minimum number of replicas; scale-down may take a few minutes because of HPA’s downscale stabilization window. Verify the reported resource usage to confirm that the scaling actions are appropriate.
This step confirms that HPA can effectively scale down the number of replicas when the load decreases, maintaining optimal resource utilization.
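A final check, assuming the same resource names used throughout this example, might be:

```shell
# CPU utilization should fall back toward zero and desired replicas toward the minimum
kubectl get hpa php-apache

# The deployment should eventually report the minimum replica count
# (scale-down waits out the downscale stabilization window, 5 minutes by default)
kubectl get deployment php-apache
```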
HPA Limitations
While HPA is a powerful tool, it does have its limitations:
- HPA does not initiate scaling actions for workloads that cannot be scaled, such as DaemonSets.
- Metrics Server is not intended for monitoring purposes but solely for providing resource metrics for autoscaling.
- HPA may not respond quickly enough to sudden spikes in demand, leading to temporary performance issues.
Over-scaling can also be a concern, as it may lead to pending pods if the cluster runs out of available resources. HPA does not measure performance based on IOPS, bandwidth, or storage metrics directly, which can limit its effectiveness in certain scenarios. Understanding these limitations is crucial for setting realistic expectations and ensuring your HPA implementation behaves as expected.
Best Practices for HPA Configuration
Maximizing HPA’s benefits requires adhering to best configuration practices:
- Introduce stabilization windows to reduce unnecessary scaling triggered by transient traffic spikes.
- Properly tune readiness and liveness probes to help maintain stability during scaling events.
- Define custom and external metrics to provide more advanced autoscaling strategies beyond just CPU utilization.
Setting realistic minimum and maximum replica values prevents underperformance and resource wastage. Regularly reviewing and adjusting scaling thresholds based on actual application performance is essential for optimization. By following these best practices, you can ensure that HPA maintains optimal performance and resource utilization for your applications.
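For the stabilization-window advice in particular, the autoscaling/v2 API exposes a behavior stanza under spec: in the HPA manifest; the values below are illustrative, not recommendations for every workload.

```yaml
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300   # require 5 minutes of sustained low usage before removing pods
  scaleUp:
    stabilizationWindowSeconds: 0     # react to load spikes immediately
    policies:
      - type: Percent
        value: 100                    # add at most 100% more replicas...
        periodSeconds: 60             # ...per 60-second window
```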
Integrating HPA with Other Autoscalers
Integrating the Horizontal Pod Autoscaler (HPA) with other autoscalers can create a more dynamic and efficient scaling solution. Combining HPA with the Vertical Pod Autoscaler (VPA) can be particularly beneficial. While HPA adjusts the number of pod replicas based on metrics like CPU utilization, VPA dynamically adjusts the resource requests and limits for containers within the pods. However, using both simultaneously requires careful configuration to avoid instability, especially if they scale based on the same metrics.
Another powerful combination is using HPA alongside the Cluster Autoscaler. This integration ensures that the cluster itself can expand to accommodate additional replicas when HPA scales out. Cluster Autoscaler works by adding or removing nodes in the cluster based on the resource demands created by the HPA. This synergy prevents delays due to resource shortages and ensures that the infrastructure can meet the demands of added pods during scaling.
Event-driven scaling tools like KEDA (Kubernetes Event-Driven Autoscaling) can further enhance Kubernetes autoscaling by allowing applications to adjust based on specific events from external sources. Integrating HPA, VPA, and KEDA enables a comprehensive autoscaling strategy that efficiently handles variable loads while optimizing resource usage. This multi-faceted approach ensures that your applications remain responsive and efficient under varying conditions.
Usage and Cost Reporting
Horizontal autoscaling introduces complexity and variability in usage and cost reporting. As the number of replicas dynamically changes, tracking resource usage becomes more challenging. Tools like Kubecost are invaluable in this scenario. Kubecost helps organizations track and analyze their spending patterns related to resource usage in Kubernetes.
By providing detailed usage-based allocation reports, teams can gain clarity on their spending and manage resources more efficiently. These reports offer insights into resource usage patterns, average and historical CPU and memory utilization, and per-workload consumption, enabling better resource management and cost optimization.
Understanding the actual usage and spending patterns allows teams to make informed decisions about scaling policies and resource allocation, ensuring that the benefits of autoscaling are realized without unexpected cost overruns.
Summary
In conclusion, the Horizontal Pod Autoscaler (HPA) is a vital component for managing dynamic workloads in Kubernetes. By automatically adjusting the number of pod replicas based on resource usage metrics, HPA ensures stable application performance and optimal resource utilization. Setting up the Metrics Server is crucial for providing the necessary metrics for HPA to function. Proper configuration of HPA, including defining accurate resource requests and limits, is essential for effective scaling.
Implementing HPA through practical examples, testing its functionality, understanding its limitations, and following best practices ensures that your applications can efficiently handle varying loads. Integrating HPA with other autoscalers like VPA and Cluster Autoscaler can further enhance your scaling strategy. By leveraging tools like Kubecost for usage and cost reporting, you can manage resources efficiently and optimize costs. Embrace the power of HPA to make your Kubernetes deployments more resilient and responsive.
Frequently Asked Questions
What is the primary purpose of the Horizontal Pod Autoscaler (HPA)?
The primary purpose of the Horizontal Pod Autoscaler (HPA) is to automatically adjust the number of pod replicas in a Kubernetes deployment based on observed resource usage metrics, such as CPU and memory. This helps maintain optimal performance and resource efficiency in your applications.
How does HPA determine the number of replicas to scale?
HPA determines the number of replicas to scale by employing a control loop that queries resource usage metrics and compares them against target metrics defined in its configuration. This process allows HPA to calculate the desired number of replicas based on the ratio of current to desired metric values.
What are some common limitations of HPA?
HPA is limited by its inability to scale certain workloads, such as DaemonSets, and may experience delays in responding to sudden demand spikes. Additionally, over-scaling can result in pending pods if the cluster resources are insufficient.
How can integrating HPA with other autoscalers improve scaling efficiency?
Integrating HPA with other autoscalers like VPA and Cluster Autoscaler significantly enhances scaling efficiency by dynamically adjusting resource requests and expanding cluster capacity to support additional replicas. This combination ensures optimal resource utilization and responsiveness to varying workload demands.
What tools can help track resource usage and manage costs in a Kubernetes cluster with HPA?
Kubecost is an effective tool for tracking resource usage and managing costs in a Kubernetes cluster with Horizontal Pod Autoscaler (HPA). It offers detailed usage reports that enhance resource management and facilitate cost optimization.