Pods recovered from a restored etcd are superficial

I was able to restore etcd from a snapshot.
However, the recovered pod (which I had deleted earlier as a test) seems superficial.
As you can see from the screen capture, the pod exists, but trying to run a command inside it returns an error that the pod doesn't exist.
So what gives?
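
For reference, this is roughly what I ran (from memory; the screenshot has the exact output):

# the restored pod shows up just fine
k -n cka get po most-critical-podv1

# but exec'ing into it fails with an error that the pod does not exist
k -n cka exec -it most-critical-podv1 -- ls /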

TBH I’m not sure what’s going on there. k get says the pod is there. What does “delete-me” do? If it deletes the pod, then all bets are off.

And just to prove I wasn't smoking something I'm not supposed to, here's a screenshot from k9s: it recognizes that the pod exists, but streaming its logs returns an error that the pod cannot be found.
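
The kubectl equivalent of what k9s is doing there is roughly this, and it fails the same way:

# listing works, but streaming the logs does not
k -n cka logs -f most-critical-podv1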

Oh, that's just the name of the folder I'm currently in… you can ignore that part.

Still don't know what's up. If you run k -n cka describe po most-critical-podv1, what do you see now?

The details from the describe command make it look like the pod is legit, as you can see from the output below. But again, it's superficial: the describe output is just the pod state captured before the snapshot was taken, carried over after etcd was restored (clue: the events show 63m ago). I tried restoring again, but got the same thing.

Name:             most-critical-podv1
Namespace:        cka
Priority:         0
Service Account:  default
Node:             worker-1/192.168.1.125
Start Time:       Sat, 06 Apr 2024 09:45:24 +0800
Labels:           run=most-critical-podv1
Annotations:      <none>
Status:           Running
IP:               10.0.1.7
IPs:
  IP:  10.0.1.7
Containers:
  most-critical-podv1:
    Container ID:   containerd://f6e75e7d2d0f9985860ef09dc07e94cd281f50ffb8f0771192ab2c20da429ed8
    Image:          nginx
    Image ID:       docker.io/library/nginx@sha256:6db391d1c0cfb30588ba0bf72ea999404f2764febf0f1f196acd5867ac7efa7e
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Sat, 06 Apr 2024 09:45:27 +0800
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-cvxp4 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  kube-api-access-cvxp4:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  63m   default-scheduler  Successfully assigned cka/most-critical-podv1 to worker-1
  Normal  Pulling    63m   kubelet            Pulling image "nginx"
  Normal  Pulled     63m   kubelet            Successfully pulled image "nginx" in 2.410806133s (2.410810757s including waiting)
  Normal  Created    63m   kubelet            Created container most-critical-podv1
  Normal  Started    63m   kubelet            Started container most-critical-podv1
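
One way to double-check that this is the pre-snapshot object carried back by the restore (which is my read of the 63m-old events) is to compare the pod's creation timestamp against when the snapshot was taken:

# if this timestamp predates the snapshot, the object is the old record,
# not something the cluster re-created after the restore
k -n cka get po most-critical-podv1 -o jsonpath='{.metadata.creationTimestamp}{"\n"}'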

Steps I took and how I managed to get myself into this predicament (a rough command-level sketch follows the list):

  1. create a pod, aptly named most-critical-podv1
  2. create an etcd snapshot
  3. delete the pod
  4. restore etcd from the snapshot
  5. test whether the deleted pod was recovered from the restored snapshot by running a command inside it
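
Roughly the commands behind those steps; the snapshot path, restore data dir, and cert paths are from my kubeadm setup, so adjust for yours:

# 1. create the pod
k -n cka run most-critical-podv1 --image=nginx

# 2. snapshot etcd (run on the control-plane node)
ETCDCTL_API=3 etcdctl snapshot save /opt/etcd-backup.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# 3. delete the pod
k -n cka delete po most-critical-podv1

# 4. restore the snapshot into a new data dir, then point etcd at it
#    (for a kubeadm static-pod etcd, that means editing the data-dir in
#    /etc/kubernetes/manifests/etcd.yaml)
ETCDCTL_API=3 etcdctl snapshot restore /opt/etcd-backup.db \
  --data-dir=/var/lib/etcd-restored

# 5. test the "recovered" pod by running a command inside it
k -n cka exec -it most-critical-podv1 -- ls /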

If you have bandwidth at the moment, can you try to recreate these simple steps on your end and confirm whether the pod is indeed alive?

Can't really do the whole process. And you have something weird here; I doubt I'd replicate what's causing the weirdness. So here are some things to try (strung together in the sketch after the list):

  1. k -n cka get po most-critical-podv1 -o wide, to get the node.
  2. ssh THE-NODE
  3. crictl ps | grep most-critical-podv1 to make sure the container actually exists on the node.
  4. exit
  5. k -n cka get po most-critical-podv1 -o yaml > mcp.yaml to see if you get something that makes sense.
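
Assuming you can SSH to the worker and k is your kubectl alias, that's roughly:

# which node does the API server think the pod is running on?
NODE=$(k -n cka get po most-critical-podv1 -o jsonpath='{.spec.nodeName}')
k -n cka get po most-critical-podv1 -o wide

# does the container actually exist on that node?
ssh "$NODE" 'crictl ps | grep most-critical-podv1'

# does the stored pod object itself look sane?
k -n cka get po most-critical-podv1 -o yaml > mcp.yaml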

If all of these things work, the question is why exec does not work, because as we say in US English, if it walks like a duck, and talks like a duck, it’s a… duck.

Well well well, lo and behold, the issue also manifests in the practice exam… I was able to restore the damn thing, but it's also superficial, and the practice exam makes you believe the restoration is all well and good. As you can see in the screenshot, it's the same error I was getting in my local environment: the deployment now exists and the pod shows as Running after the etcd restoration, but trying to run a command in one of the pods errors out with "does not exist"!

… so there must be another step missing here, e.g., rollout-restart all deployments after restoration? I can see how that would work, but what about standalone pods? There's no way to restart them unless you generate a manifest YAML and re-create them, and that defeats the point of what the etcd backup and restore was all about.
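
If that really is the missing step, it would look something like this sketch; the forced replace for the bare pod is just my guess at a workaround, and the names are from my local repro:

# bounce every deployment in the namespace so fresh pods get created
k -n cka rollout restart deployment

# a standalone pod has no rollout; the closest thing is to delete it and
# re-create it from its own (restored) definition
k -n cka get po most-critical-podv1 -o yaml | k replace --force -f -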

No need to do the whole process; it can easily be reproduced from the practice exam, as you can see from my previous post.