Hi,
I have a Kubernetes HA cluster with three control-plane nodes. However, when I take down master1, the whole cluster experiences issues and downtime. master2 and master3 do not continue operating as expected.
The cluster was set up using kubeadm, with a load balancer in front of the three control-plane nodes, and it uses the stacked etcd architecture.
Please, I need guidance from experienced Kubernetes engineers. Thanks.
When master1 dies, you don't just lose an etcd member; you also lose one of the API servers, so the traffic pattern changes. The surviving two nodes must absorb both the etcd and the API server load increases at the same time.
- The kube-apiserver on the dead node vanishes.
- The API servers on the remaining nodes now:
  - retry connections to the missing etcd member,
  - block on timeouts,
  - see increased request latency,
  - hold open gRPC retries.
- This borderline DDoSes the remaining etcd members with retry loops (see the probe sketch after this list).
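If you want to see the stall for yourself, probing each etcd member with a short deadline makes it obvious which endpoint the survivors keep timing out against. This is only a rough sketch using the Go etcd v3 client: the master1/2/3 endpoint names are placeholders for your stacked members, and the client TLS certs that kubeadm's stacked etcd actually requires are omitted for brevity.

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Placeholder endpoints for the three stacked etcd members. A real stacked
	// etcd needs client TLS certs (kubeadm keeps them under
	// /etc/kubernetes/pki/etcd), omitted here to keep the sketch short.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints: []string{
			"https://master1:2379",
			"https://master2:2379",
			"https://master3:2379",
		},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// Probe every member with a short deadline: the dead member's probe hangs
	// until the deadline fires, while the healthy members answer quickly.
	// API servers that keep retrying against the dead member see the same stall.
	for _, ep := range cli.Endpoints() {
		ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
		if _, err := cli.Status(ctx, ep); err != nil {
			fmt.Printf("%s: unreachable or unhealthy: %v\n", ep, err)
		} else {
			fmt.Printf("%s: healthy\n", ep)
		}
		cancel()
	}
}
```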
Control-plane components (the scheduler, the controller manager, and any custom controllers you may have installed) perform leader election through the API server.
If API server responsiveness drops (because etcd is overloaded with retries):
- Elections take longer
- Leaders can time out
- Controllers become “flappy”
This manifests as:
- delayed pod scheduling
- delayed node heartbeats
- delayed controller actions
Even though etcd is technically alive, the control plane stalls.
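To make the timing concrete: the scheduler and controller manager hold their leadership by renewing a Lease through the API server, with defaults of a 15s lease duration, 10s renew deadline, and 2s retry period. The sketch below is not their code, just client-go's leader-election helper wired up with those same defaults for a hypothetical custom controller named example-controller; it shows where the "flap" comes from when API calls slow past the renew deadline.

```go
package main

import (
	"context"
	"log"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Hypothetical Lease name/namespace for an in-cluster custom controller.
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "example-controller", Namespace: "kube-system"},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: os.Getenv("POD_NAME")},
	}

	// Same timing defaults the scheduler and controller manager use. If API
	// calls slow down and a renewal misses RenewDeadline, the leader steps
	// down and OnStoppedLeading fires: that is the "flapping" described above.
	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:            lock,
		LeaseDuration:   15 * time.Second,
		RenewDeadline:   10 * time.Second,
		RetryPeriod:     2 * time.Second,
		ReleaseOnCancel: true,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) { log.Println("started leading") },
			OnStoppedLeading: func() { log.Println("stopped leading; another replica may take over") },
		},
	})
}
```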
Final note
Having three control-plane nodes is not a full guarantee of HA. What it does give you in your setup is load balancing across multiple API servers, so one of them does not have to do all the work, which helps for clusters with many nodes and workloads. Better resilience can be obtained with an external etcd topology, running the etcd members on different hosts from the API servers.
Either way, losing an etcd member is something that needs to be fixed quickly. If etcd is separate from the control plane, then it is only etcd you have to fix, not necessarily an entire control-plane node.
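And to spell out the quorum arithmetic behind that advice: etcd needs a majority of members to keep accepting writes, so with three members you can lose exactly one, and losing a second stops the control plane from committing anything. A tiny worked example:

```go
package main

import "fmt"

// etcd needs a majority (floor(n/2)+1) of members up to commit writes, so a
// cluster of n members tolerates n - (floor(n/2)+1) member failures.
func main() {
	for _, n := range []int{1, 3, 5} {
		quorum := n/2 + 1
		fmt.Printf("%d member(s): quorum %d, tolerates %d failure(s)\n", n, quorum, n-quorum)
	}
}
```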