In parts 1 and 2 of this three-article series, we saw the impact of CPU limits and how resource requests affect cost optimization and performance. In this final article of the series, we will look at a few other industry best practices that can help minimize cost and maximize performance.
- Choosing the appropriate network topology and tuning your application for its workloads can save on costs while increasing performance.
- For effective observability, you can create architectural schemes for visibility at any level in your cluster to track down any cost-performance anomalies.
- Common metrics to monitor across the cluster to track application performance include disk pressure, throttling, kubelet status, and so on.
Best Practices for Optimizing Kubernetes Clusters for Cost & Performance
In the previous two articles, we focused on optimizing cost and performance by managing containers’ resource consumption. Below are some other best practices that will help you minimize the costs and maximize the performance of your cluster.
Utilize the Appropriate Cloud Provider Option
Below are some of the cloud service providers’ options you can utilize to optimize cost and performance:
- Use serverless compute: Depending on your workload requirements, you can consider running your workloads on serverless compute. This is cost-effective and performant, especially when you don’t want to manage node infrastructure yourself. However, there are provider-specific limitations to note; for example, on AWS EKS, DaemonSets aren’t supported on Fargate.
- Use GPUs only for the appropriate workload: To minimize cost while also maximizing performance, use GPUs only for workloads that require accelerated computing, e.g., deep learning models. Use cost-effective performant CPUs for other workloads based on your workload resource needs.
- Use spot instances for non-critical workloads: In AWS, for example, spot instances are the cheapest purchasing option for EC2 capacity. However, spot capacity can be reclaimed on short notice, so it’s not recommended for critical workloads or workloads that require a guaranteed level of performance. Running your BestEffort Pods on spot instances can be a significant cost-saving opportunity.
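As a sketch of the last point, the manifest below steers a BestEffort Pod (one with no resource requests or limits) onto spot capacity. It assumes an EKS managed node group, which labels its nodes with `eks.amazonaws.com/capacityType`; other platforms use different labels (e.g., `cloud.google.com/gke-spot` on GKE), and the Pod name, image, and command are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: batch-worker            # hypothetical name
spec:
  # No resource requests/limits are set, so Kubernetes assigns this Pod
  # the BestEffort QoS class: it is the first to be evicted under pressure,
  # which is acceptable for interruptible work on spot capacity.
  nodeSelector:
    eks.amazonaws.com/capacityType: SPOT   # label applied by EKS managed node groups
  restartPolicy: OnFailure
  containers:
    - name: worker
      image: busybox:1.36                  # placeholder image
      command: ["sh", "-c", "echo processing batch; sleep 30"]  # placeholder work
```

If your spot nodes are tainted, you would also add a matching `tolerations` entry so the scheduler can place the Pod there.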
Improve Networking Connectivity
Choose a network topology that optimally balances cost and performance. Kubernetes Pod Topology Spread Constraints let you control the placement of Pods across regions, zones, nodes, and other user-defined topology domains. Understand the networking requirements of your different workloads so you can co-locate applications with similar network needs where appropriate. This increases performance and can reduce data transfer charges.
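As a minimal sketch, the spec below uses a topology spread constraint to spread replicas of a hypothetical `app: web` workload evenly across availability zones (using the well-known `topology.kubernetes.io/zone` node label), while `ScheduleAnyway` keeps the constraint soft so scheduling is never blocked:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web-replica             # hypothetical name
  labels:
    app: web
spec:
  topologySpreadConstraints:
    - maxSkew: 1                # zones may differ by at most one matching Pod
      topologyKey: topology.kubernetes.io/zone   # well-known zone label
      whenUnsatisfiable: ScheduleAnyway          # soft constraint
      labelSelector:
        matchLabels:
          app: web
  containers:
    - name: web
      image: nginx:1.25
```

The trade-off is workload-specific: spreading across zones improves availability, while keeping chatty components in the same zone can avoid cross-zone data transfer charges, which is why understanding each workload’s networking needs matters.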
Optimize the Application and its Runtime Parameters
Optimize your application itself and use lightweight container images to reduce cost and improve performance. Also, consider fine-tuning application runtime parameters to match your application’s resource requirements. For example, tuning Java Virtual Machine (JVM) parameters in line with the resources allocated to your workload can improve both cost and performance.
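For instance, modern JVMs are container-aware, so you can size the heap relative to the container’s memory limit rather than hard-coding it. A sketch (Pod name, image, and the 75% heap fraction are illustrative assumptions, not recommendations for every workload):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: java-app                # hypothetical name
spec:
  containers:
    - name: app
      image: eclipse-temurin:21-jre
      env:
        - name: JAVA_TOOL_OPTIONS
          # Let the JVM size its heap as a percentage of the container's
          # memory limit, leaving headroom for non-heap memory (metaspace,
          # threads, native buffers) to avoid OOMKills.
          value: "-XX:MaxRAMPercentage=75.0"
      resources:
        requests:
          memory: "512Mi"
          cpu: "250m"
        limits:
          memory: "1Gi"
```

Aligning runtime parameters with the container’s resource settings this way keeps requests honest, which is exactly the cost-performance lever discussed in the earlier parts of this series.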
Continuous Monitoring for Cost and Performance
To effectively and continuously monitor your cluster for cost and performance optimization, you need visibility at all layers. A best practice to improve visibility is to track cost and performance based on architectural schemes. This approach helps you in attributing costs and performance to their origins.
Architectural schemes in this context refer to resource groupings, which can be based on a development team, business unit, project, environment, and so on. To create an architectural scheme, start by creating namespaces. Then, create further groupings within those namespaces based on your needs. This will help you track the cost and performance of any Kubernetes resources (e.g., Pods, Deployments) used by a development team or for a project.
To implement architectural schemes, you can use Labels in Kubernetes or Tags in cloud providers. These allow you to use key-value pairs to organize and track resources. This helps you achieve cost and performance visibility at any level of granularity at scale, which is important for tracking down the contributors to any cost or performance anomaly.
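A sketch of such a scheme is below: a namespace scoped to a team and environment, and a Deployment inside it carrying the same keys so costs and metrics can be sliced consistently (all names and label values are hypothetical):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: payments-dev            # hypothetical namespace per team + environment
  labels:
    team: payments
    environment: dev
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout                # hypothetical workload
  namespace: payments-dev
  labels:
    team: payments              # same keys as the namespace, so cost and
    project: checkout           # performance tooling can aggregate on them
    environment: dev
spec:
  replicas: 2
  selector:
    matchLabels:
      app: checkout
  template:
    metadata:
      labels:
        app: checkout
        team: payments
        project: checkout
    spec:
      containers:
        - name: checkout
          image: registry.example.com/checkout:1.0   # placeholder image
```

With a consistent key set, a query such as `kubectl get pods -A -l team=payments` shows everything a team runs, and cost tools can aggregate spend on the same dimensions.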
Additionally, cloud service providers have cost management, observability, and monitoring tools. These tools, in combination with the Tags and Labels that you have defined, help you gain granular visibility into the cost and performance metrics of your cluster. For example, AWS has Cost Explorer, CloudWatch, etc. Third-party tools such as Datadog can also be integrated with your cloud service provider to provide useful granular insights.
As mentioned earlier, an industry best practice is to track usage at all levels. The following are some examples of performance metrics you can start tracking today:
- Node level metrics: This includes CPU utilization, memory pressure (memory consumption percentage of a node), disk utilization/disk pressure (if node disk utilization is above a specified threshold, disk pressure can occur), node not ready state or status, etc.
- Pod level metrics: This includes CPU throttling, readiness probes failures, memory pressure, Pod non-running states or status (e.g., Pending, Unknown, OOMKilled), Pod restarts, etc.
- Control plane and cluster component metrics: This includes kubelet state or status (strictly a node component, but commonly monitored alongside the control plane), reconcile latency (e.g., the time between a Pod creation request and the actual Pod creation), etc.
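Two of the metrics above can be sketched as Prometheus alerting rules, assuming a stack that exposes cAdvisor metrics (for throttling) and kube-state-metrics (for node conditions); the alert names and thresholds are illustrative assumptions:

```yaml
groups:
  - name: cost-performance      # hypothetical rule group name
    rules:
      - alert: HighCPUThrottling
        # Fraction of CFS scheduling periods in which a container was
        # throttled (cAdvisor metrics); sustained throttling suggests
        # the CPU limit is too low for the workload.
        expr: |
          rate(container_cpu_cfs_throttled_periods_total[5m])
            / rate(container_cpu_cfs_periods_total[5m]) > 0.25
        for: 15m
        labels:
          severity: warning
      - alert: NodeDiskPressure
        # Node condition exported by kube-state-metrics; fires when the
        # kubelet reports DiskPressure on a node.
        expr: kube_node_status_condition{condition="DiskPressure",status="true"} == 1
        for: 5m
        labels:
          severity: critical
```

Thresholds like the 25% throttling ratio should be tuned per workload; the point is to alert on the same node-, Pod-, and component-level signals listed above.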
Do you have more actionable strategies to minimize cost and maximize performance? Or know more useful metrics to monitor? You can share them in the comment section below.
There is no magic silver bullet to optimize clusters for cost and performance. But with continuous observability of the right metrics at every level in your cluster and proper alerting mechanisms in place, you can configure your workloads appropriately, detect any issues early on, and act accordingly to deliver your defined SLOs.
Are you ready to take your skills to the next level? Become a subscriber today!