Optimizing Kubernetes Clusters for Cost & Performance: Part 1 - Resource Requests

Which side of the table are you currently on? On one side, you are wrestling with high costs that have the finance team raising eyebrows and giving you that weird look. But your Kubernetes applications in the cloud run at peak performance, and your customers are happy. On the other side, you have sacrificed application performance to achieve low costs and avoid being in a war room with the finance team. But you’re now on the frontline dealing with endless customer complaints.

Well, both sides are less-than-ideal. Whichever side you’re on, you risk losing financial resources and customers. Organizations adopting Kubernetes always strive to strike an optimal balance between cost and performance. They aim to minimize cost and maximize performance while respecting any Service Level Objectives (SLOs) defined.

Efficient resource management by right-sizing your Kubernetes workloads is at the heart of building a high-performing and cost-optimized cluster. So, let’s explore some of the industry's best actionable strategies for optimizing cost and performance in your Kubernetes cluster in this 3-article series.

This first article of the series explores the role of resource requests in optimizing a cluster’s cost and performance.

Key Takeaways

  • Cost and performance optimization in Kubernetes clusters begins with setting appropriate resource requests.
  • Always set memory limits equal to memory requests. 
  • Always set CPU requests.

Why Setting Appropriate Resource Requests is Important

Let’s start by understanding what a resource request is and why we need it.

A resource request defines the amount of a resource (CPU or memory) to reserve for a container. When you specify a resource request, the container is guaranteed to receive at least that amount. The scheduler also uses requests to verify that the node where the Pod will be assigned has enough unreserved resources to meet the requirements of the Pod’s containers.

Below is a sample manifest file that has defined resource requests.

# trimmed-down example of a YAML file showing container resource requests
        resources:
          requests:
            cpu: 200m
            memory: 2Gi
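
To make the scheduler’s check concrete, here is a minimal Python sketch, assuming a single hypothetical node and looking at CPU only. The helper names `parse_cpu` and `fits` and the node size are made up for illustration. The real scheduler also considers memory and many other constraints.

```python
def parse_cpu(value: str) -> int:
    """Convert a CPU quantity such as '200m' or '0.5' to millicores."""
    if value.endswith("m"):
        return int(value[:-1])
    return round(float(value) * 1000)


def fits(node_allocatable_m: int, scheduled_requests: list[str], new_request: str) -> bool:
    """Return True if the new CPU request fits next to the requests already on the node."""
    reserved = sum(parse_cpu(r) for r in scheduled_requests)
    return reserved + parse_cpu(new_request) <= node_allocatable_m


# Hypothetical node: 1000m allocatable CPU, four containers already requesting 200m each.
print(fits(1000, ["200m"] * 4, "200m"))  # True: 800m + 200m <= 1000m
print(fits(1000, ["200m"] * 4, "300m"))  # False: 800m + 300m > 1000m
```

Note that the check uses what containers have requested, not what they are using at the moment. That is why inaccurate requests distort every decision built on top of them.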

If you over-provision resources, it can lead to unnecessary expenses. However, if you under-provision resources, your application’s performance takes a hit. This trade-off is why you should prioritize defining appropriate resource requests that closely match your workload requirements. 

Resource requests play a crucial role in resource management. Kubernetes uses resource requests for tasks such as scheduling and bin packing, defining a Pod’s quality of service (QoS) class, cluster autoscaling, and horizontal Pod autoscaling. Let’s see how setting appropriate resource requests impacts these mechanisms and, ultimately, the cluster’s cost and performance.

Configuring Pods Correctly Based on Your Workload Requirements 

The QoS class of workloads can impact their performance when the nodes in the cluster are under resource pressure. Let’s now look at the three Pod QoS classes, how they are handled when the node is under resource pressure, and the recommended workload configuration based on your requirements.

  • BestEffort Pods: These are Pods with no requests and limits defined for their containers. These Pods can use any amount of resources available. The kubelet kills BestEffort Pods if the node is running out of memory. Hence, BestEffort Pods are meant for workloads that don’t need to run immediately, can tolerate disruptions, and have no specific performance guarantees.
  • Burstable Pods: These are Pods that don’t qualify as Guaranteed but have at least 1 container with a memory or CPU request or limit, for example, because a request doesn’t equal its limit. The kubelet kills Burstable Pods using more memory than they have requested if the node is under resource pressure. They are killed only after the BestEffort Pods have been killed. As the name suggests, Burstable Pods are meant for workloads that occasionally need additional resources to meet short-term bursts in demand.
  • Guaranteed Pods: These are Pods where every container has CPU and memory requests that are equal to its limits. You can set both requests and limits explicitly or only set the limits (Kubernetes copies the limit and uses that value for the request). These Pods can’t burst; hence, they’re guaranteed not to be killed before BestEffort or Burstable Pods. Guaranteed Pods are ideal for critical workloads where an expected level of performance is to be delivered based on defined service level objectives (SLOs).
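
The three classes above can be summarized as a small decision procedure. The following Python sketch is an illustration, not Kubernetes code: the function name `qos_class` and the plain-dict input format are assumptions, and quantities are compared as strings, so `500m` and `0.5` would not be treated as equal here.

```python
RESOURCES = ("cpu", "memory")


def qos_class(containers: list[dict]) -> str:
    """Classify a Pod from its containers' 'requests' and 'limits' dicts."""
    any_set = False
    all_guaranteed = True
    for container in containers:
        limits = container.get("limits", {})
        # A missing request defaults to the limit, as described above.
        requests = {**limits, **container.get("requests", {})}
        if requests or limits:
            any_set = True
        for resource in RESOURCES:
            if resource not in limits or requests.get(resource) != limits[resource]:
                all_guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if all_guaranteed else "Burstable"


print(qos_class([{}]))                                             # BestEffort
print(qos_class([{"requests": {"cpu": "200m", "memory": "2Gi"}}]))  # Burstable
print(qos_class([{"limits": {"cpu": "200m", "memory": "2Gi"}}]))    # Guaranteed
```

A container that sets a memory request equal to its memory limit but only a CPU request (no CPU limit) is classified as Burstable by this sketch.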

Below are some recommended best practices:

  • You should avoid using BestEffort Pods for workloads that require a minimum level of reliability.
  • You should set memory requests equal to memory limits for all containers in all Burstable Pods.
  • You should always set appropriate CPU requests for your workloads, especially workloads that need a guaranteed level of performance.

There are opportunities for cost savings and performance optimization when you don’t run a lot of BestEffort Pods in your cluster. Since BestEffort Pods have no requests or limits, nothing bounds what they consume, and more resources than necessary may end up allocated to them. This results in you being charged even for unused resources. Also, if BestEffort Pods and Burstable Pods with insufficient memory requests are terminated, it can disrupt your application, breach your SLO, and negatively impact your end-user experience.

Ensuring Efficient Bin Packing

Setting appropriate CPU and memory requests enables efficient, automatic bin packing in the cluster. Bin packing uses resource requests to pack (allocate) Pods in a way that minimizes the number of nodes used in a Kubernetes cluster. This leads to high node utilization, which optimizes cost while ensuring that workloads are guaranteed the amount of resources they need to run reliably without any disruption.

If you don’t set resource requests correctly, you can negatively impact bin packing. For instance, inflated requests lead to poor bin packing (under-utilized nodes), where more nodes than necessary are created and CPU and memory resources are wasted. This results in extra costs for idle resources.
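
To see the effect in numbers, here is a small Python sketch that packs CPU requests onto identical nodes using first-fit decreasing, a classic bin-packing heuristic. It is a simplification chosen for illustration and is not how the Kubernetes scheduler places Pods. The function name `nodes_needed`, the Pod sizes, and the node size are all hypothetical.

```python
def nodes_needed(requests_m: list[int], node_capacity_m: int) -> int:
    """Pack CPU requests (millicores) onto identical nodes with first-fit decreasing."""
    free = []  # remaining capacity of each node opened so far
    for req in sorted(requests_m, reverse=True):
        if req > node_capacity_m:
            raise ValueError(f"request {req}m exceeds node capacity")
        for i, remaining in enumerate(free):
            if req <= remaining:
                free[i] -= req
                break
        else:
            # No existing node has room, so open a new one.
            free.append(node_capacity_m - req)
    return len(free)


# Ten hypothetical Pods that each really need about 300m, on nodes with 2000m allocatable.
right_sized = [300] * 10  # requests match actual usage
over_sized = [1000] * 10  # requests inflated "to be safe"
print(nodes_needed(right_sized, 2000))  # 2
print(nodes_needed(over_sized, 2000))   # 5
```

The workloads are identical in both runs. Only the requests differ, yet the inflated requests need more than twice as many nodes.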

Achieving Effective Horizontal Workload and Cluster Autoscaling

Resource requests play a critical role in ensuring efficient Horizontal Pod Autoscaling (HPA) and Cluster Autoscaling (CA), which are important for cost and performance optimization. Let’s now see how CPU requests assist in autoscaling.

Horizontal Pod Autoscaler (HPA)

One of the ways of triggering the HPA is to use a target CPU utilization percentage, and utilization is calculated as a percentage of the CPU request. Now, imagine the HPA scaling based on an inappropriate CPU request. If the request is set far below actual usage, the measured utilization stays above the target even when the load is low, which prevents the HPA from scaling down at the right time. If the request is set far too high, utilization understates the real load, so the HPA scales out late while every replica reserves more CPU than it uses.

This leads to wasted resources and increased costs because the resources reserved are more than what’s actually needed. But if the HPA scales based on resource requests that closely match workload resource usage, applications perform reliably and you avoid paying for over-provisioned resources.
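
The arithmetic behind this is easy to check. The sketch below uses the replica formula from the HPA documentation, desired = ceil(current × currentUtilization / targetUtilization), and ignores the HPA’s tolerance band, stabilization window, and min/max replica bounds. The function names and all the numbers are assumptions for illustration.

```python
import math


def cpu_utilization_pct(usage_m: int, request_m: int) -> float:
    """CPU utilization as the HPA sees it: usage as a percentage of the request."""
    return 100 * usage_m / request_m


def desired_replicas(current_replicas: int, usage_m: int, request_m: int, target_pct: int) -> int:
    """Replica count the HPA would aim for, without tolerance or min/max bounds."""
    ratio = cpu_utilization_pct(usage_m, request_m) / target_pct
    return math.ceil(current_replicas * ratio)


# Hypothetical low-load situation: 4 replicas, each using about 150m, target 60%.
print(desired_replicas(4, usage_m=150, request_m=500, target_pct=60))   # 2
print(desired_replicas(4, usage_m=150, request_m=100, target_pct=60))   # 10
print(desired_replicas(4, usage_m=150, request_m=1500, target_pct=60))  # 1
```

The usage is the same in all three calls. Only the CPU request changes, and with it the replica count the HPA would aim for.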

Cluster Autoscaler (CA)

We saw in the previous section that setting the right resource requests can lead to efficient bin packing, which means that the resources on the available nodes are efficiently utilized. The CA takes advantage of this by removing nodes that aren’t needed from the cluster to save on cost without impacting performance.

At the same time, if the efficiently packed nodes get more traffic and the workloads scale out, new Pods may no longer fit on any existing node. The CA then creates more nodes so that those Pods can be scheduled. So, setting the right requests helps the CA optimize the cluster for cost and performance.
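
As a rough illustration of the scale-down side, the Python sketch below flags a node as a removal candidate when the sum of the requests on it, relative to the node’s allocatable resources, falls below a threshold. The 0.5 threshold matches what we understand to be the Cluster Autoscaler’s default scale-down utilization threshold, but treat it as an assumption and check your own configuration. The function names and node sizes are hypothetical, and the real CA performs further checks, such as whether the node’s Pods can be moved elsewhere.

```python
def node_utilization(pod_requests: list[tuple[int, int]], alloc_cpu_m: int, alloc_mem_mi: int) -> float:
    """Return the higher of the CPU and memory request-to-allocatable ratios."""
    cpu = sum(cpu_m for cpu_m, _ in pod_requests) / alloc_cpu_m
    mem = sum(mem_mi for _, mem_mi in pod_requests) / alloc_mem_mi
    return max(cpu, mem)


def is_removal_candidate(pod_requests: list[tuple[int, int]], alloc_cpu_m: int,
                         alloc_mem_mi: int, threshold: float = 0.5) -> bool:
    """A node is a scale-down candidate when its requested share is below the threshold."""
    return node_utilization(pod_requests, alloc_cpu_m, alloc_mem_mi) < threshold


# Hypothetical node with 4000m CPU and 16384Mi memory allocatable.
# Each Pod is (CPU request in millicores, memory request in Mi).
print(is_removal_candidate([(500, 1024), (300, 2048)], 4000, 16384))    # True
print(is_removal_candidate([(1500, 4096), (1000, 2048)], 4000, 16384))  # False
```

Because the calculation is based on requests rather than live usage, a node full of Pods with inflated requests looks busy and is kept, even if it is mostly idle.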

But how do you rightsize your workloads? How can you properly estimate the resource request values to minimize cost and maximize performance? How can you also always ensure that your workloads have resource requests set? Let’s explore these next.

Estimating Appropriate Resource Requests

Rightsizing your workload requests will help you achieve your cost and performance optimization objectives. It keeps you from over-provisioning, which minimizes infrastructure costs, and from under-provisioning, which protects application performance. But how do you get to the optimal point where cost justifies performance, and performance justifies cost?

One way is to perform load testing to understand how the application behaves under different loads and measure resource usage using observability tools like Grafana for a period of time. Then, you can take the peak or maximum, make it a bit higher by an x% margin for a safe buffer, and use that for your resource requests. For example, if peak usage is 300m (i.e. 0.3 CPU cores), for your resource request, you may add a safety net of 15%, which makes it 345m. Performing load testing makes sense for your most critical applications. However, it may not be feasible for every workload in your Kubernetes cluster.
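
The calculation above can be written as a tiny helper. This is a sketch under stated assumptions: the function name `recommended_request_m` and the sample values are hypothetical, and the usage samples would come from your own load tests or monitoring data.

```python
def recommended_request_m(usage_samples_m: list[int], margin_pct: int = 15) -> int:
    """Peak observed CPU usage (millicores) plus a safety margin, rounded up."""
    peak = max(usage_samples_m)
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-peak * (100 + margin_pct) // 100)


# Hypothetical CPU usage samples (millicores) collected during a load test.
samples = [120, 180, 300, 240]
print(recommended_request_m(samples))                 # 345
print(recommended_request_m(samples, margin_pct=20))  # 360
```

Whether to use the peak or a high percentile, and how large the margin should be, depends on how spiky the workload is and how strict its SLOs are.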

The Vertical Pod Autoscaler (VPA) and a monitoring solution (metrics server is the default, but Prometheus can be optionally configured) can help you choose appropriate resource requests to rightsize your workloads. VPA can automatically set Pod resource requests based on resource usage. However, a VPA has its limitations, and using it at scale can be challenging. There are also other open-source tools like Goldilocks that you can use to tackle this issue.

In addition, you can identify workloads in your Kubernetes cluster that have no memory and/or CPU resource requests set by running this CLI tool.

But it is also good to foster a proactive culture where engineers understand the impact of resource requests and work collaboratively to set them appropriately.

You can also implement guardrails like OPA Gatekeeper, an admission controller webhook that can enforce policies requiring resource requests to be set. Also, be cautious when implementing LimitRanges, because the defaults may not appropriately rightsize your workloads. While these defaults are useful for Pods that would otherwise be BestEffort because they have no requests, you still need to ensure your workload requirements are met to deliver defined SLOs to customers.
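
Whether it runs as an audit or as an admission policy, the underlying check is simple. The Python sketch below is only an illustration of that check, not the CLI tool or Gatekeeper itself (Gatekeeper policies are written in Rego). It assumes Pod manifests have already been loaded as dicts that follow the Pod structure (`metadata.name`, `spec.containers[].resources.requests`); the function name and sample Pods are hypothetical.

```python
def containers_missing_requests(pods: list[dict]) -> list[tuple[str, str, list[str]]]:
    """Return (pod name, container name, missing resources) for every offending container."""
    findings = []
    for pod in pods:
        for container in pod["spec"]["containers"]:
            requests = container.get("resources", {}).get("requests", {})
            # A container that sets only limits is also flagged here, although
            # Kubernetes would copy the limit into the request.
            missing = [r for r in ("cpu", "memory") if r not in requests]
            if missing:
                findings.append((pod["metadata"]["name"], container["name"], missing))
    return findings


pods = [
    {"metadata": {"name": "api"},
     "spec": {"containers": [
         {"name": "web", "resources": {"requests": {"cpu": "200m", "memory": "2Gi"}}}]}},
    {"metadata": {"name": "batch"},
     "spec": {"containers": [{"name": "worker"}]}},
]
print(containers_missing_requests(pods))  # [('batch', 'worker', ['cpu', 'memory'])]
```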

Wrapping up

Kelsey Hightower, who is famously known for his work on Kubernetes, compared Kubernetes to Tetris. Tetris is a classic video game where you aim to arrange falling blocks of different shapes. Like Tetris, Kubernetes aims to pack your varying workloads (blocks of different shapes and sizes) on nodes in an efficient way (the shapes of blocks help you arrange them well). If the blocks in Tetris are like your workloads in Kubernetes, then the different shapes and sizes of those blocks are like the requests you set for your varying workloads depending on their requirements.

So, how can you even play Tetris if your blocks have no shapes? In the same vein, when you set appropriate resource requests for your workloads, Kubernetes packs them efficiently, resulting in cost and performance optimization.

Just setting appropriate resource requests is not a silver bullet that solves all your cost and performance optimization challenges, but it is a good start in your journey.

In the next article of this 3-article series, we’ll explore:

Optimizing Kubernetes Clusters: Part 2 - Impact of CPU Limits