How to restore etcd backup in stacked etcd HA cluster

I have a 5-control-plane stacked etcd cluster. I am trying to restore the etcd backup using the steps below:

mv /etc/kubernetes/manifests/etcd.yaml .   # stop the etcd static pod
rm -rf /var/lib/etcd                       # remove the old data directory
ETCDCTL_API=3 etcdctl snapshot restore --data-dir /var/lib/etcd snapshot.db
mv etcd.yaml /etc/kubernetes/manifests/    # start etcd again

The etcd restore succeeds, but the etcd cluster is not forming; each restored member acts as a separate single-node cluster.
Can someone please help me with this?

This is expected: a snapshot restore creates a new cluster. Please carefully follow the steps for restoring a multi-node cluster.
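To illustrate why the bare restore above produces separate clusters: each member is restored with the default identity, so no member knows about the others. A per-node restore passes each member its own name and peer URL plus the shared member list. A minimal sketch, assuming hypothetical node names cp1–cp3 and IPs 10.0.0.1–3 (substitute your own); the command is printed as a dry run, so drop the leading "echo" to actually execute it:

```shell
# Run on EACH control plane, changing HOST/HOST_IP to that node's identity.
# cp1..cp3 and 10.0.0.x are assumed placeholders, not values from the thread.
HOST=cp1
HOST_IP=10.0.0.1
CLUSTER="cp1=https://10.0.0.1:2380,cp2=https://10.0.0.2:2380,cp3=https://10.0.0.3:2380"

# Dry run: prints the restore command instead of running it.
echo ETCDCTL_API=3 etcdctl snapshot restore snapshot.db \
  --name "${HOST}" \
  --initial-cluster "${CLUSTER}" \
  --initial-advertise-peer-urls "https://${HOST_IP}:2380" \
  --data-dir /var/lib/etcd
```

Restoring the same snapshot on every node with matching `--initial-cluster` values is what lets the members recognize each other as one cluster.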

Hi, @velan987 and @Alistair_KodeKloud! I'm running into the same problem with my 3-control-plane stacked etcd cluster. The etcd on one of my control planes got corrupted, but my K8s cluster is still working with the etcd members on the other two control planes. I followed the steps suggested above by @Alistair_KodeKloud to try to rebuild an etcd cluster with 3 members again, but the problem got worse, as shown in the output of the command below, so I had to undo the steps to keep the cluster working at least as well as before:

kubectl get nodes
E0117 08:49:52.557053 1546795 memcache.go:265] couldn't get current server API group list: Get "https://XX.XX.X.XX:6443/api?timeout=32s": EOF
E0117 08:49:54.582832 1546795 memcache.go:265] couldn't get current server API group list: unknown
E0117 08:49:54.587464 1546795 memcache.go:265] couldn't get current server API group list: unknown
E0117 08:49:54.591215 1546795 memcache.go:265] couldn't get current server API group list: unknown
E0117 08:49:54.595253 1546795 memcache.go:265] couldn't get current server API group list: unknown
Error from server (Forbidden): unknown

Do either of you have any idea how to solve this kind of situation? Thank you very much!

A friend of mine helped me find the solution to this situation, and I tested it successfully. As said before, a new etcd cluster has to be formed.

A summary of the necessary steps to restore the etcd cluster is shown below.

1 - Stop etcd on all control plane servers
2 - Identify the first etcd server in the cluster
Run "grep -i 'initial-cluster=' etcd.yaml" on each control plane server
The first etcd server is the one that appears in all of the outputs
3 - On the first etcd server
3.1 - Restore the etcd backup
Use the following options: "--name", "--initial-cluster" and "--initial-advertise-peer-urls"
3.2 - Add the following options to etcd.yaml:
--force-new-cluster=true
--initial-cluster-state=new
3.3 - Start etcd
3.4 - After etcd starts, change etcd.yaml again as shown below and wait for etcd to restart automatically:
--force-new-cluster=false
--initial-cluster-state=existing
4 - Add the second etcd node to the etcd cluster with "etcdctl member add"
5 - Start etcd on the second etcd node
6 - Repeat steps 4 and 5 for each additional etcd node in the cluster
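The steps above can be sketched as shell commands. This is only a hedged outline, not the exact commands from the thread: node names (cp1–cp3), IPs (10.0.0.x), and the /root/etcd.yaml parking path are assumptions, and every command is printed as a dry run via the DRY variable, so remove DRY to execute for real on the right host:

```shell
# Assumed 3-node topology: cp1 (first etcd server), cp2, cp3 at 10.0.0.1-3.
DRY=echo   # dry run: each command below is printed, not executed
CLUSTER="cp1=https://10.0.0.1:2380,cp2=https://10.0.0.2:2380,cp3=https://10.0.0.3:2380"

# Step 1: on EVERY control plane, stop etcd by moving the static-pod manifest away.
$DRY mv /etc/kubernetes/manifests/etcd.yaml /root/etcd.yaml

# Step 3.1: on the first etcd server only, restore with an explicit identity.
$DRY etcdctl snapshot restore snapshot.db \
  --name cp1 \
  --initial-cluster "${CLUSTER}" \
  --initial-advertise-peer-urls https://10.0.0.1:2380 \
  --data-dir /var/lib/etcd

# Steps 3.2/3.3: add --force-new-cluster=true and --initial-cluster-state=new
# to /root/etcd.yaml by hand, then move it back so kubelet starts etcd.
$DRY mv /root/etcd.yaml /etc/kubernetes/manifests/etcd.yaml

# Step 4: after step 3.4 (flags flipped back, etcd restarted), register the
# second member from the first node.
$DRY etcdctl member add cp2 --peer-urls=https://10.0.0.2:2380

# Step 5: on cp2, wipe the stale data dir, set --initial-cluster-state=existing
# in its manifest, then move the manifest back so the member joins the cluster.
$DRY rm -rf /var/lib/etcd
$DRY mv /root/etcd.yaml /etc/kubernetes/manifests/etcd.yaml

# Step 6: repeat the member add / start sequence for cp3.
```

The key design point is the same one the steps describe: only the first node is restored from the snapshot; every other node starts empty with `--initial-cluster-state=existing` and syncs from the leader after `etcdctl member add`.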

@velan987 @Alistair_KodeKloud