After Ubuntu release upgrade, pods are crashing continuously

Hello! I recently upgraded a worker node’s operating system from Ubuntu 20.04 LTS to Ubuntu 22.04 LTS. Immediately afterwards, some pods on that node started going into CrashLoopBackOff: they run fine for a while, then suddenly crash, and the cycle keeps repeating. kube-flannel and kube-proxy are among the pods that keep crashing. Some of the crashed pods show the message “Pod sandbox changed, it will be killed and re-created”. Does anyone here have an idea how to solve this? Thanks in advance!

Hi @mca_75

What do the logs for the affected Pods show?

kubectl logs -n <namespace> <pod-name> --previous
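If a Pod has more than one container you may also need to name the container, and the recent events for the Pod are often just as telling; for example (the namespace, pod and container names below are placeholders):

kubectl logs -n <namespace> <pod-name> -c <container-name> --previous
kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name>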

Hi, @Santosh_KodeKloud! I’m not allowed to upload attachments and, besides, I get the following error when I try to paste the log contents: “Sorry, new users can only put 2 links in a post”. What I can say is that the logs were not of much help in identifying the cause of the problem. The most useful information came from running "kubectl describe pod [pod-name] -n [namespace]"; I’m pasting a sample of its output below:

Events:
  Type     Reason          Age                     From     Message
  ----     ------          ----                    ----     -------
  Normal   Killing         45m (x37 over 4h11m)    kubelet  Stopping container kube-proxy
  Normal   SandboxChanged  21m (x42 over 4h11m)    kubelet  Pod sandbox changed, it will be killed and re-created.
  Normal   Pulled          16m (x42 over 4h11m)    kubelet  Container image "registry.k8s.io/kube-proxy:v1.30.7" already present on machine
  Warning  BackOff         86s (x889 over 4h11m)   kubelet  Back-off restarting failed container kube-proxy in pod kube-proxy-9cxcj_kube-system(2a5a4b1c-ca15-4c03-b06b-28af38df6060)

What is the value set for SystemdCgroup in your node’s /etc/containerd/config.toml?
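A quick way to check, for example, is to look at both the file on disk and the effective configuration containerd is actually running with (the dump subcommand may not exist on very old containerd versions):

grep SystemdCgroup /etc/containerd/config.toml
sudo containerd config dump | grep SystemdCgroup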

This is my node’s /etc/containerd/config.toml:

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
    SystemdCgroup = true

I’m also getting some messages from "journalctl -xau kubelet":

dez 13 07:54:32 srval653 kubelet[925]: E1213 07:54:32.854352 925 pod_workers.go:1298] "Error syncing pod, skipping" err="[failed to "StartContainer" for "liveness-probe" with CrashLoopBackOff: "back-off 5m0s restarting failed container=liveness-probe pod=csi-smb-node-7xtl8_ku>
dez 13 07:54:37 srval653 kubelet[925]: I1213 07:54:37.871060 925 scope.go:117] "RemoveContainer" containerID="5eb5c05b63edd8dc93b6b083ae868e7a3493eecbb19781eaed1811dc2ddf9482"
dez 13 07:54:37 srval653 kubelet[925]: E1213 07:54:37.871694 925 pod_workers.go:1298] "Error syncing pod, skipping" err="failed to "StartContainer" for "node-exporter" with CrashLoopBackOff: "back-off 5m0s restarting failed container=node-exporter pod=prometheus-prometheus-no>
dez 13 07:54:42 srval653 kubelet[925]: I1213 07:54:42.853802 925 scope.go:117] "RemoveContainer" containerID="7142ac59cd5aa5cdd33e34fc8d3930d3ede0fc0e98687846d3e409776a172769"
dez 13 07:54:42 srval653 kubelet[925]: E1213 07:54:42.854588 925 pod_workers.go:1298] "Error syncing pod, skipping" err="failed to "StartContainer" for "speaker" with CrashLoopBackOff: "back-off 5m0s restarting failed container=speaker pod=speaker-v6tpq_metallb-system(e0857f5>
dez 13 07:54:42 srval653 kubelet[925]: I1213 07:54:42.868482 925 scope.go:117] "RemoveContainer" containerID="31674148404fbfc4a231b91ef164cb5f2917bdd16798217f62a7f0f70691000a"
dez 13 07:54:42 srval653 kubelet[925]: E1213 07:54:42.868986 925 pod_workers.go:1298] "Error syncing pod, skipping" err="failed to "StartContainer" for "kube-flannel" with CrashLoopBackOff: "back-off 5m0s restarting failed container=kube-flannel pod=kube-flannel-ds-744gl_kube>
dez 13 07:54:45 srval653 kubelet[925]: I1213 07:54:45.853253 925 scope.go:117] "RemoveContainer" containerID="2e0f31884a77ef2cdcb29d21bade2e804ddbf8f832f2667121e14468c1363b28"
dez 13 07:54:45 srval653 kubelet[925]: I1213 07:54:45.853313 925 scope.go:117] "RemoveContainer" containerID="70d90c1778ad22af3bc24ef940f56e17acf6e5a02cf3b49026a09bc7988b1e14"
dez 13 07:54:45 srval653 kubelet[925]: I1213 07:54:45.853330 925 scope.go:117] "RemoveContainer" containerID="34747a7b1f1e6418891b74b3586f8588a802783be642fdf2a6aa0e273411a161"
dez 13 07:54:45 srval653 kubelet[925]: E1213 07:54:45.854310 925 pod_workers.go:1298] "Error syncing pod, skipping" err="[failed to "StartContainer" for "liveness-probe" with CrashLoopBackOff: "back-off 5m0s restarting failed container=liveness-probe pod=csi-smb-node-7xtl8_ku>
dez 13 07:54:49 srval653 kubelet[925]: I1213 07:54:49.867604 925 scope.go:117] "RemoveContainer" containerID="5eb5c05b63edd8dc93b6b083ae868e7a3493eecbb19781eaed1811dc2ddf9482"
dez 13 07:54:49 srval653 kubelet[925]: E1213 07:54:49.868022 925 pod_workers.go:1298] "Error syncing pod, skipping" err="failed to "StartContainer" for "node-exporter" with CrashLoopBackOff: "back-off 5m0s restarting failed container=node-exporter pod=prometheus-prometheus-no

Hi, @Santosh_KodeKloud!

After you alerted me about the /etc/containerd/config.toml file, I decided to test the following commands:

containerd config default | sudo tee /etc/containerd/config.toml >/dev/null 2>&1
sudo sed -i 's/SystemdCgroup = false/SystemdCgroup = true/g' /etc/containerd/config.toml
sudo systemctl restart containerd

The new /etc/containerd/config.toml generated by the commands above has much more content than the previous one, which I posted earlier and which worked fine on Ubuntu 20.04 LTS. I’m now going to monitor whether this change has any positive effect on the pods that keep crashing. I’ll keep you posted. Thank you!
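For anyone hitting the same thing, a way to double-check that containerd and the kubelet agree on the systemd cgroup driver (assuming a kubeadm-style setup where the kubelet reads /var/lib/kubelet/config.yaml) would be something like:

grep SystemdCgroup /etc/containerd/config.toml   # should report "SystemdCgroup = true"
grep cgroupDriver /var/lib/kubelet/config.yaml   # kubeadm's default is "cgroupDriver: systemd"
sudo systemctl restart kubelet                   # restart so both ends pick up the change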

Additional detail: I realized afterwards that the problem affected only pods belonging to DaemonSets, which therefore had no choice but to run on the misconfigured node. The solution above is confirmed.
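In case it helps someone else narrow this down, the pods pinned to a particular node can be listed with something like the following (the node name is a placeholder):

kubectl get pods -A -o wide --field-selector spec.nodeName=<node-name>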

@Santosh_KodeKloud