K8s Master node join problem with ETCD

Hi, I have a cluster with 3 master and 6 worker nodes and I am testing everything for production readiness. Today I upgraded my k8s cluster from 1.30.11 to 1.31.8. After the upgrade of master-node-3 I saw errors in the etcd cluster: the etcd member on master-node-3 was not able to join because its requests were rejected by master-node-1 and master-node-2.
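For context, each control-plane node was upgraded with the standard kubeadm flow, roughly like the sketch below (package pins assume apt repos from pkgs.k8s.io; the exact revision suffix and any apt-mark hold handling may differ on your setup):

# first control-plane node
apt-get update && apt-get install -y kubeadm='1.31.8-*'
kubeadm upgrade plan
kubeadm upgrade apply v1.31.8

# remaining control-plane nodes (master-node-2, master-node-3)
apt-get update && apt-get install -y kubeadm='1.31.8-*'
kubeadm upgrade node

# afterwards on every node: upgrade kubelet/kubectl and restart the kubelet
apt-get install -y kubelet='1.31.8-*' kubectl='1.31.8-*'
systemctl daemon-reload && systemctl restart kubelet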

I had no idea what the issue was.

Finally I decided to drain master 3, clean it up with kubeadm reset, and re-join it.
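The cleanup was roughly the following (node name matches my hosts; treat this as a sketch of the steps, not an exact shell history):

# from a healthy master: evict workloads and remove the node object
kubectl drain k8s-master-3 --ignore-daemonsets --delete-emptydir-data
kubectl delete node k8s-master-3

# on k8s-master-3 itself: wipe the kubeadm-managed state
# (kubeadm reset does not clean CNI config or iptables rules)
kubeadm reset -f

After that cleanup, re-joining the node as a control-plane member fails like this: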

kubeadm join 11.111.111.11:6443 --token czryde.wjsshdjshdjs0ex --discovery-token-ca-cert-hash sha256:f758f5e307kdjhckdhkjehkrjhucsbcknskchf9b689f5831b95 --control-plane --certificate-key 5356710a533f4dc3191faf07e7d2jhfhjgcgfdghfxdfdsd5dfef7e1a02655706348c
[preflight] Running pre-flight checks
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[preflight] Running pre-flight checks before initializing the new control plane instance
        [WARNING FileAvailable--etc-kubernetes-manifests-kube-apiserver.yaml]: /etc/kubernetes/manifests/kube-apiserver.yaml already exists
        [WARNING FileAvailable--etc-kubernetes-manifests-kube-controller-manager.yaml]: /etc/kubernetes/manifests/kube-controller-manager.yaml already exists
        [WARNING FileAvailable--etc-kubernetes-manifests-kube-scheduler.yaml]: /etc/kubernetes/manifests/kube-scheduler.yaml already exists
        [WARNING FileAvailable--etc-kubernetes-manifests-etcd.yaml]: /etc/kubernetes/manifests/etcd.yaml already exists
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action beforehand using 'kubeadm config images pull'
W0426 15:50:13.531549 2189292 checks.go:846] detected that the sandbox image "registry.k8s.io/pause:3.8" of the container runtime is inconsistent with that used by kubeadm.It is recommended to use "registry.k8s.io/pause:3.10" as the CRI sandbox image.
[download-certs] Downloading the certificates in Secret "kubeadm-certs" in the "kube-system" Namespace
[download-certs] Saving the certificates to the folder: "/etc/kubernetes/pki"
[certs] Using certificateDir folder "/etc/kubernetes/pki"
[certs] Using the existing "etcd/server" certificate and key
[certs] Using the existing "etcd/peer" certificate and key
[certs] Using the existing "apiserver-etcd-client" certificate and key
[certs] Using the existing "etcd/healthcheck-client" certificate and key
[certs] Using the existing "apiserver-kubelet-client" certificate and key
[certs] Using the existing "apiserver" certificate and key
[certs] Using the existing "front-proxy-client" certificate and key
[certs] Valid certificates and keys now exist in "/etc/kubernetes/pki"
[certs] Using the existing "sa" key
[kubeconfig] Generating kubeconfig files
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
[kubeconfig] Using existing kubeconfig file: "/etc/kubernetes/admin.conf"
[kubeconfig] Using existing kubeconfig file: "/etc/kubernetes/controller-manager.conf"
[kubeconfig] Using existing kubeconfig file: "/etc/kubernetes/scheduler.conf"
[control-plane] Using manifest folder "/etc/kubernetes/manifests"
[control-plane] Creating static Pod manifest for "kube-apiserver"
[control-plane] Creating static Pod manifest for "kube-controller-manager"
[control-plane] Creating static Pod manifest for "kube-scheduler"
[check-etcd] Checking that the etcd cluster is healthy
error execution phase check-etcd: error syncing endpoints with etcd: context deadline exceeded
To see the stack trace of this error execute with --v=5 or higher
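For anyone hitting the same check-etcd failure, it is also worth looking at the etcd membership from a healthy master to see whether a stale master-node-3 member is still registered. A sketch, assuming the default kubeadm cert paths and the usual etcd-<node-name> pod naming:

# run etcdctl inside the etcd pod on master-node-1
kubectl -n kube-system exec etcd-k8s-master-1 -- etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  member list -w table

# if an old master-node-3 member is still listed, remove it by its ID
# before re-running kubeadm join
kubectl -n kube-system exec etcd-k8s-master-1 -- etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  member remove <MEMBER_ID>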

I exported the certificates from the working master-node-1, and one thing I noticed is this:

root@k8s-master-3:/# ls -l /etc/kubernetes/manifests/
total 20
-rw------- 1 root root 4588 Apr  26 15:50 kube-apiserver.yaml
-rw------- 1 root root 4100 Apr  26 15:50 kube-controller-manager.yaml
-rw------- 1 root root 2169 Apr  26 15:50 kube-scheduler.yaml
root@k8s-master-3:/# crictl ps
WARN[0000] runtime connect using default endpoints: [unix:///run/containerd/containerd.sock unix:///run/crio/crio.sock unix:///var/run/cri-dockerd.sock]. As the default settings are now deprecated, you should set the endpoint instead.
WARN[0000] image connect using default endpoints: [unix:///run/containerd/containerd.sock unix:///run/crio/crio.sock unix:///var/run/cri-dockerd.sock]. As the default settings are now deprecated, you should set the endpoint instead.
CONTAINER           IMAGE               CREATED             STATE               NAME                ATTEMPT             POD ID              POD
root@k8s-master-3:/# crictl ps -a
WARN[0000] runtime connect using default endpoints: [unix:///run/containerd/containerd.sock unix:///run/crio/crio.sock unix:///var/run/cri-dockerd.sock]. As the default settings are now deprecated, you should set the endpoint instead.
WARN[0000] image connect using default endpoints: [unix:///run/containerd/containerd.sock unix:///run/crio/crio.sock unix:///var/run/cri-dockerd.sock]. As the default settings are now deprecated, you should set the endpoint instead.
CONTAINER           IMAGE               CREATED             STATE               NAME                ATTEMPT             POD ID              POD
root@k8s-master-3:/#

There is no etcd.yaml file. I even copied one from a working master and changed its IP address, but I still face the same issue. I am not able to find the core problem here. ChatGPT failed, Copilot failed as well, and I need help with this. Please help me out.
Regards,
Tauqeer.A

Hi @Everyone,
I found the issue. When I ran the join command with --v=5 it still failed; I saw logs about a TLS error, and the final error was the same: error execution phase check-etcd: error syncing endpoints with etcd: context deadline exceeded
To find the real issue I checked the logs of the etcd pod on master-node-1, as that was the leader. In them I saw this error:

rejected connection on client endpoint
tls: failed to verify certificate: x509: certificate has expired or is not yet valid
current time 2025-04-28T06:53:44Z is before 2025-04-28T09:47:34Z
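For reference, I pulled those leader logs roughly like this (the pod name follows the usual etcd-<node-name> pattern on kubeadm clusters):

# from a node with a working admin.conf
kubectl -n kube-system logs etcd-k8s-master-1 --tail=100

# or directly on master-node-1 via the container runtime
crictl ps -a | grep etcd
crictl logs <etcd-container-id>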

That was very strange behaviour, as my certificates were valid, so the only thing left was a possible time difference between the VMs. When I ran date -u on each master I found that master-node-3 was 3 hours behind, which caused the TLS failure. After setting the time manually, the join worked fine.
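For anyone else debugging this, the skew check and one-off manual fix look roughly like this (hostnames match my cluster and assume SSH access between masters; the timestamp in the manual set is only a placeholder, and proper NTP sync is the better long-term fix):

# compare UTC time across the masters
for h in k8s-master-1 k8s-master-2 k8s-master-3; do echo "$h: $(ssh "$h" date -u)"; done

# one-off manual correction on the node that was behind
date -u -s "2025-04-28 09:50:00"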

I wonder how people set up K8s clusters with masters in different zones in the cloud. A good point to investigate.

Sounds like the time was set up incorrectly on that host. Linux expects to run with its clock synced to a valid NTP source; time zones are purely a display concept: TZ sets a zone that is used for presenting times, but the system clock itself is not affected by it.
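A minimal way to check and fix that on a systemd host, assuming chrony or systemd-timesyncd is available:

# see whether the clock is actually being synchronized
timedatectl status          # look for "System clock synchronized: yes"

# enable NTP sync (uses systemd-timesyncd or chrony, whichever is installed)
timedatectl set-ntp true

# if chrony is in use, verify it is tracking a source
chronyc tracking
chronyc sources -v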