Pod eviction doesn't work as intended

Hi everyone,

I’m running Kubernetes v1.24.6 on a cgroups v1 system. I’ve configured my KubeletConfiguration to reserve resources for system processes and set eviction thresholds, but it doesn’t seem to be working as expected.

Here’s my current setup:

kubeReserved:
  cpu: "200m"
  memory: "512Mi"
  ephemeral-storage: "1Gi"
systemReserved:
  cpu: "200m"
  memory: "512Mi"
  ephemeral-storage: "1Gi"
kernelMemcgNotification: true
evictionHard:
  memory.available: "1Gi"    
  nodefs.available: "1Gi"
evictionSoft:
  memory.available: "1.5Gi" 
  nodefs.available: "2Gi"
evictionSoftGracePeriod:
  memory.available: "10s"    
  nodefs.available: "10s"
evictionMaxPodGracePeriod: 10  

I tested this setup by pinning extra pods to the node via nodeSelector, but the OOM killer kicks in first and essentially cripples the node. Any help or ideas?
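For reference, the test deployment pins its pods to the node with a plain nodeSelector, roughly like this (the name, label, node hostname, and stress image are just placeholders from my test setup):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: memory-stress                       # placeholder name
spec:
  replicas: 1                               # scaled up gradually during the test
  selector:
    matchLabels:
      app: memory-stress
  template:
    metadata:
      labels:
        app: memory-stress
    spec:
      nodeSelector:
        kubernetes.io/hostname: test-node-1 # pin every replica to the node under test
      containers:
      - name: stress
        image: polinux/stress               # any memory-hogging image works here
        command: ["stress"]
        args: ["--vm", "1", "--vm-bytes", "512M", "--vm-hang", "1"]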

Not a lot. First of all, 1.24.0 was released almost 3 years ago, so it’s been EOL about a year and a half now. That will immediately limit your options.

Also, it would help if you posted your configuration in a code block, since your YAML is garbled, and it’s hard for us to even know what’s wrong with it as a result. Use the </> button to create one, or use the Markdown format for a code block.

My bad. I’ve corrected the kubelet configuration snippet above.
Upgrading k8s is a wonderful idea, but I work for a financial institution, so it’s not easy, to say the least :)
What are my options now? I’ve been testing eviction behavior, and I’ve noticed something strange:

1. When I gradually scale up a test deployment, the node gets tainted with NoSchedule due to MemoryPressure, but I can still schedule more pods on it. I checked for tolerations, but the test deployment doesn’t have any, so I don’t understand why scheduling is still allowed.

2. Pod evictions do happen sometimes, but only after kubelet itself gets killed by the OOM killer and restarts. This makes me think eviction isn’t happening fast enough, and by the time kubelet reacts, the system is already unstable. The node has 16GB of RAM and allocatable is around 14.5GB (allocatable being capacity minus kube-reserved, system-reserved, and the hard eviction threshold). Can anyone share an example of a kubelet configuration whose eviction thresholds actually work as intended?

My current thinking

I’m considering creating a dedicated cgroup slice for kubelet (kube.slice) to separate it from system processes (system.slice). The idea is to prevent kubelet from competing with other system services for memory, which might help make evictions more predictable.
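Concretely, I had something along these lines in mind for the KubeletConfiguration. This is only a sketch: the slice names assume I pre-create kube.slice via systemd (kubelet won’t create the reserved cgroups itself), and the docs warn that enforcing kube-reserved/system-reserved with reservations that are too small can starve system daemons:

cgroupDriver: systemd                   # assuming the node uses the systemd cgroup driver
kubeletCgroups: "/kube.slice"           # run kubelet in its own slice, away from system.slice
kubeReservedCgroup: "/kube.slice"       # cgroup the kubeReserved values are enforced against
systemReservedCgroup: "/system.slice"   # cgroup the systemReserved values are enforced against
enforceNodeAllocatable:
  - pods
  - kube-reserved
  - system-reserved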

Would this be a viable approach in my case? Or are there better ways to force kubelet to evict pods before OOM kicks in and takes down the node?

Appreciate any insights! Thanks in advance!

That sounds pretty low level, and it’s probably not the approach you’d normally take in a K8s cluster. I’ll assume that your pods are owned by Deployments or similar – if you’re creating bare pods directly, you’re definitely doing it wrong :( If so, the first thing to look at is the resources block in the deployment template, to make sure the pods have adequate requests and limits set. I’m also curious what kind of load you have on your nodes, to make sure you’re not trying to run too many pods per node. Taking a look with metrics-server might be helpful there.
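For example, something roughly like this inside the pod template (the numbers here are made up; size them from what metrics-server actually shows for your workloads):

spec:
  template:
    spec:
      containers:
      - name: app                               # your container
        image: registry.example.com/app:1.0     # placeholder image
        resources:
          requests:
            cpu: "250m"                 # what the scheduler reserves on the node
            memory: "512Mi"
          limits:
            cpu: "500m"
            memory: "512Mi"             # a runaway container gets OOM-killed in its own
                                        # cgroup instead of dragging the whole node down

With memory limits in place (and ideally equal to the requests), the scheduler’s picture of the node stays realistic and a single misbehaving pod is much less likely to push the whole node into MemoryPressure.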

And if you’re working for a financial institution – still, 1.24?? This version is not getting security fixes, which puts you at risk for being hacked. Ouch!

It goes without saying that all the pods are owned by Deployments and StatefulSets. So what is the best approach if I want to prevent my nodes from going into a coma when resource utilization goes through the roof?

Have you considered using something like descheduler, which considers the cluster as a whole rather than leaving it all to kubelet? Since this component is created by the Kubernetes developers, it’s clearly a way they recommend for getting pods to move around.

Hi, Alistair!

Yes, we started using descheduler back in December. We’ve been testing it, but so far, only in dev clusters with some success—mainly using the overutilized nodes strategy. Right now, it’s running as a Deployment, but I’m considering switching it to a CronJob that runs every 5 minutes for better control.
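For what it’s worth, the dev policy we’ve been testing looks roughly like this (v1alpha1 format; the strategy that drains overutilized nodes is called LowNodeUtilization, and the thresholds are just the values we’re currently experimenting with):

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "LowNodeUtilization":
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        thresholds:              # nodes below all of these count as underutilized
          "cpu": 20
          "memory": 20
          "pods": 20
        targetThresholds:        # nodes above any of these count as overutilized
          "cpu": 50
          "memory": 70
          "pods": 50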

Upgrading Kubernetes is in progress, but we have a lot of workloads still relying on PodSecurityPolicy (PSP), which was removed in Kubernetes 1.25—so that’s adding some complexity.