After Ubuntu release upgrade, pods are crashing continuously

Hello! I recently upgraded a worker node’s operating system from Ubuntu 20.04 LTS to Ubuntu 22.04 LTS. Immediately afterwards, some pods on that node started going into CrashLoopBackOff: they run fine for a while, then suddenly crash, and the cycle keeps repeating. kube-flannel and kube-proxy are among the pods that keep crashing. Some of the crashed pods show the message “Pod sandbox changed, it will be killed and re-created”. Does anyone here have an idea how to solve this? Thanks in advance!

Hi @mca_75

What do the logs for the affected Pods show?

kubectl logs -n <namespace> <pod-name> --previous
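If a Pod has more than one container you may also need to name the container, and the recent events for the Pod are often just as telling; for example (the namespace, pod and container names below are placeholders):

kubectl logs -n <namespace> <pod-name> -c <container-name> --previous
kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name>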

Hi, @Santosh_KodeKloud! I’m not allowed to upload attachments and, besides, I get the following error when I try to paste the log contents: “Sorry, new users can only put 2 links in a post”. What I can say is that the logs were not of much help in identifying the cause of the problem. The most useful information came from running "kubectl describe pod [pod-name] -n [namespace]"; I’m pasting a sample of its output below:

Events:
  Type     Reason          Age                     From     Message
  ----     ------          ----                    ----     -------
  Normal   Killing         45m (x37 over 4h11m)    kubelet  Stopping container kube-proxy
  Normal   SandboxChanged  21m (x42 over 4h11m)    kubelet  Pod sandbox changed, it will be killed and re-created.
  Normal   Pulled          16m (x42 over 4h11m)    kubelet  Container image "registry.k8s.io/kube-proxy:v1.30.7" already present on machine
  Warning  BackOff         86s (x889 over 4h11m)   kubelet  Back-off restarting failed container kube-proxy in pod kube-proxy-9cxcj_kube-system(2a5a4b1c-ca15-4c03-b06b-28af38df6060)

What is the value set for SystemdCgroup in your node’s /etc/containerd/config.toml?
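A quick way to check, for example, is to look at both the file on disk and the effective configuration containerd is actually running with (the dump subcommand may not exist on very old containerd versions):

grep SystemdCgroup /etc/containerd/config.toml
sudo containerd config dump | grep SystemdCgroup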

This is my node’s /etc/containerd/config.toml:

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
    SystemdCgroup = true

I’m also getting some messages from "journalctl -xau kubelet":

dez 13 07:54:32 srval653 kubelet[925]: E1213 07:54:32.854352 925 pod_workers.go:1298] "Error syncing pod, skipping" err="[failed to "StartContainer" for "liveness-probe" with CrashLoopBackOff: "back-off 5m0s restarting failed container=liveness-probe pod=csi-smb-node-7xtl8_ku>
dez 13 07:54:37 srval653 kubelet[925]: I1213 07:54:37.871060 925 scope.go:117] "RemoveContainer" containerID="5eb5c05b63edd8dc93b6b083ae868e7a3493eecbb19781eaed1811dc2ddf9482"
dez 13 07:54:37 srval653 kubelet[925]: E1213 07:54:37.871694 925 pod_workers.go:1298] "Error syncing pod, skipping" err="failed to "StartContainer" for "node-exporter" with CrashLoopBackOff: "back-off 5m0s restarting failed container=node-exporter pod=prometheus-prometheus-no>
dez 13 07:54:42 srval653 kubelet[925]: I1213 07:54:42.853802 925 scope.go:117] "RemoveContainer" containerID="7142ac59cd5aa5cdd33e34fc8d3930d3ede0fc0e98687846d3e409776a172769"
dez 13 07:54:42 srval653 kubelet[925]: E1213 07:54:42.854588 925 pod_workers.go:1298] "Error syncing pod, skipping" err="failed to "StartContainer" for "speaker" with CrashLoopBackOff: "back-off 5m0s restarting failed container=speaker pod=speaker-v6tpq_metallb-system(e0857f5>
dez 13 07:54:42 srval653 kubelet[925]: I1213 07:54:42.868482 925 scope.go:117] "RemoveContainer" containerID="31674148404fbfc4a231b91ef164cb5f2917bdd16798217f62a7f0f70691000a"
dez 13 07:54:42 srval653 kubelet[925]: E1213 07:54:42.868986 925 pod_workers.go:1298] "Error syncing pod, skipping" err="failed to "StartContainer" for "kube-flannel" with CrashLoopBackOff: "back-off 5m0s restarting failed container=kube-flannel pod=kube-flannel-ds-744gl_kube>
dez 13 07:54:45 srval653 kubelet[925]: I1213 07:54:45.853253 925 scope.go:117] "RemoveContainer" containerID="2e0f31884a77ef2cdcb29d21bade2e804ddbf8f832f2667121e14468c1363b28"
dez 13 07:54:45 srval653 kubelet[925]: I1213 07:54:45.853313 925 scope.go:117] "RemoveContainer" containerID="70d90c1778ad22af3bc24ef940f56e17acf6e5a02cf3b49026a09bc7988b1e14"
dez 13 07:54:45 srval653 kubelet[925]: I1213 07:54:45.853330 925 scope.go:117] "RemoveContainer" containerID="34747a7b1f1e6418891b74b3586f8588a802783be642fdf2a6aa0e273411a161"
dez 13 07:54:45 srval653 kubelet[925]: E1213 07:54:45.854310 925 pod_workers.go:1298] "Error syncing pod, skipping" err="[failed to "StartContainer" for "liveness-probe" with CrashLoopBackOff: "back-off 5m0s restarting failed container=liveness-probe pod=csi-smb-node-7xtl8_ku>
dez 13 07:54:49 srval653 kubelet[925]: I1213 07:54:49.867604 925 scope.go:117] "RemoveContainer" containerID="5eb5c05b63edd8dc93b6b083ae868e7a3493eecbb19781eaed1811dc2ddf9482"
dez 13 07:54:49 srval653 kubelet[925]: E1213 07:54:49.868022 925 pod_workers.go:1298] "Error syncing pod, skipping" err="failed to "StartContainer" for "node-exporter" with CrashLoopBackOff: "back-off 5m0s restarting failed container=node-exporter pod=prometheus-prometheus-no

Hi, @Santosh_KodeKloud!

After you alerted me about the /etc/containerd/config.toml file, I decided to test the following commands:

containerd config default | sudo tee /etc/containerd/config.toml >/dev/null 2>&1
sudo sed -i 's/SystemdCgroup = false/SystemdCgroup = true/g' /etc/containerd/config.toml
sudo systemctl restart containerd

The new /etc/containerd/config.toml generated by the commands above has much more content than the previous one, which I posted earlier and which worked fine on Ubuntu 20.04 LTS. I’m now going to monitor whether this change has any positive effect on the pods that keep crashing. I’ll keep you posted. Thank you!
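For anyone hitting the same thing, a way to double-check that containerd and the kubelet agree on the systemd cgroup driver (assuming a kubeadm-style setup where the kubelet reads /var/lib/kubelet/config.yaml) would be something like:

grep SystemdCgroup /etc/containerd/config.toml   # should report "SystemdCgroup = true"
grep cgroupDriver /var/lib/kubelet/config.yaml   # kubeadm's default is "cgroupDriver: systemd"
sudo systemctl restart kubelet                   # restart so both ends pick up the change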

Additional detail: I realized afterwards that the problem affected only pods belonging to DaemonSets, which therefore had no choice but to run on the misconfigured node. The solution above is confirmed.
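In case it helps someone else narrow this down, the pods pinned to a particular node can be listed with something like the following (the node name is a placeholder):

kubectl get pods -A -o wide --field-selector spec.nodeName=<node-name>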

@Santosh_KodeKloud