K8s Master node join problem with ETCD

Hi, I have a cluster with 3 master and 6 worker nodes and I am testing everything for production readiness. Today I upgraded my k8s cluster from 1.30.11 to 1.31.8. After the upgrade of master-node-3 I saw errors in the etcd cluster: the etcd member on master-node-3 was not able to join because its requests were rejected by master-node-1 and master-node-2.
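For context, each control-plane node was upgraded with the standard kubeadm flow, roughly like the sketch below (package pins assume apt repos from pkgs.k8s.io; the exact revision suffix and any apt-mark hold handling may differ on your setup):

# first control-plane node
apt-get update && apt-get install -y kubeadm='1.31.8-*'
kubeadm upgrade plan
kubeadm upgrade apply v1.31.8

# remaining control-plane nodes (master-node-2, master-node-3)
apt-get update && apt-get install -y kubeadm='1.31.8-*'
kubeadm upgrade node

# afterwards on every node: upgrade kubelet/kubectl and restart the kubelet
apt-get install -y kubelet='1.31.8-*' kubectl='1.31.8-*'
systemctl daemon-reload && systemctl restart kubelet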

I had no idea what the issue was.

Finally I decided to drain master 3, clean it up with kubeadm reset, and re-join it.
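The cleanup was roughly the following (node name matches my hosts; treat this as a sketch of the steps, not an exact shell history):

# from a healthy master: evict workloads and remove the node object
kubectl drain k8s-master-3 --ignore-daemonsets --delete-emptydir-data
kubectl delete node k8s-master-3

# on k8s-master-3 itself: wipe the kubeadm-managed state
# (kubeadm reset does not clean CNI config or iptables rules)
kubeadm reset -f

After that cleanup, re-joining the node as a control-plane member fails like this: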

kubeadm join 11.111.111.11:6443 --token czryde.wjsshdjshdjs0ex --discovery-token-ca-cert-hash sha256:f758f5e307kdjhckdhkjehkrjhucsbcknskchf9b689f5831b95 --control-plane --certificate-key 5356710a533f4dc3191faf07e7d2jhfhjgcgfdghfxdfdsd5dfef7e1a02655706348c
[preflight] Running pre-flight checks
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[preflight] Running pre-flight checks before initializing the new control plane instance
        [WARNING FileAvailable--etc-kubernetes-manifests-kube-apiserver.yaml]: /etc/kubernetes/manifests/kube-apiserver.yaml already exists
        [WARNING FileAvailable--etc-kubernetes-manifests-kube-controller-manager.yaml]: /etc/kubernetes/manifests/kube-controller-manager.yaml already exists
        [WARNING FileAvailable--etc-kubernetes-manifests-kube-scheduler.yaml]: /etc/kubernetes/manifests/kube-scheduler.yaml already exists
        [WARNING FileAvailable--etc-kubernetes-manifests-etcd.yaml]: /etc/kubernetes/manifests/etcd.yaml already exists
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action beforehand using 'kubeadm config images pull'
W0426 15:50:13.531549 2189292 checks.go:846] detected that the sandbox image "registry.k8s.io/pause:3.8" of the container runtime is inconsistent with that used by kubeadm.It is recommended to use "registry.k8s.io/pause:3.10" as the CRI sandbox image.
[download-certs] Downloading the certificates in Secret "kubeadm-certs" in the "kube-system" Namespace
[download-certs] Saving the certificates to the folder: "/etc/kubernetes/pki"
[certs] Using certificateDir folder "/etc/kubernetes/pki"
[certs] Using the existing "etcd/server" certificate and key
[certs] Using the existing "etcd/peer" certificate and key
[certs] Using the existing "apiserver-etcd-client" certificate and key
[certs] Using the existing "etcd/healthcheck-client" certificate and key
[certs] Using the existing "apiserver-kubelet-client" certificate and key
[certs] Using the existing "apiserver" certificate and key
[certs] Using the existing "front-proxy-client" certificate and key
[certs] Valid certificates and keys now exist in "/etc/kubernetes/pki"
[certs] Using the existing "sa" key
[kubeconfig] Generating kubeconfig files
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
[kubeconfig] Using existing kubeconfig file: "/etc/kubernetes/admin.conf"
[kubeconfig] Using existing kubeconfig file: "/etc/kubernetes/controller-manager.conf"
[kubeconfig] Using existing kubeconfig file: "/etc/kubernetes/scheduler.conf"
[control-plane] Using manifest folder "/etc/kubernetes/manifests"
[control-plane] Creating static Pod manifest for "kube-apiserver"
[control-plane] Creating static Pod manifest for "kube-controller-manager"
[control-plane] Creating static Pod manifest for "kube-scheduler"
[check-etcd] Checking that the etcd cluster is healthy
error execution phase check-etcd: error syncing endpoints with etcd: context deadline exceeded
To see the stack trace of this error execute with --v=5 or higher
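For anyone hitting the same check-etcd failure, it is also worth looking at the etcd membership from a healthy master to see whether a stale master-node-3 member is still registered. A sketch, assuming the default kubeadm cert paths and the usual etcd-<node-name> pod naming:

# run etcdctl inside the etcd pod on master-node-1
kubectl -n kube-system exec etcd-k8s-master-1 -- etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  member list -w table

# if an old master-node-3 member is still listed, remove it by its ID
# before re-running kubeadm join
kubectl -n kube-system exec etcd-k8s-master-1 -- etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  member remove <MEMBER_ID>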

I exported the certificates from the working master-node-1, and one thing I noticed is this:

root@k8s-master-3:/# ls -l /etc/kubernetes/manifests/
total 20
-rw------- 1 root root 4588 Apr  26 15:50 kube-apiserver.yaml
-rw------- 1 root root 4100 Apr  26 15:50 kube-controller-manager.yaml
-rw------- 1 root root 2169 Apr  26 15:50 kube-scheduler.yaml
root@k8s-master-3:/# crictl ps
WARN[0000] runtime connect using default endpoints: [unix:///run/containerd/containerd.sock unix:///run/crio/crio.sock unix:///var/run/cri-dockerd.sock]. As the default settings are now deprecated, you should set the endpoint instead.
WARN[0000] image connect using default endpoints: [unix:///run/containerd/containerd.sock unix:///run/crio/crio.sock unix:///var/run/cri-dockerd.sock]. As the default settings are now deprecated, you should set the endpoint instead.
CONTAINER           IMAGE               CREATED             STATE               NAME                ATTEMPT             POD ID              POD
root@k8s-master-3:/# crictl ps -a
WARN[0000] runtime connect using default endpoints: [unix:///run/containerd/containerd.sock unix:///run/crio/crio.sock unix:///var/run/cri-dockerd.sock]. As the default settings are now deprecated, you should set the endpoint instead.
WARN[0000] image connect using default endpoints: [unix:///run/containerd/containerd.sock unix:///run/crio/crio.sock unix:///var/run/cri-dockerd.sock]. As the default settings are now deprecated, you should set the endpoint instead.
CONTAINER           IMAGE               CREATED             STATE               NAME                ATTEMPT             POD ID              POD
root@k8s-master-3:/#

There is no etcd.yaml file. I even copied one from a working master and changed its IP address, but I still face the same issue. I am not able to find the core problem here. ChatGPT failed, Copilot failed as well, and I need help with this. Please help me out.
Regards,
Tauqeer.A

Hi @Everyone,
I found the issue. When I ran the join command with --v=5 it still failed; I saw logs about a TLS error, and the final error was the same: error execution phase check-etcd: error syncing endpoints with etcd: context deadline exceeded
To find the real issue I checked the logs of the etcd pod on master-node-1, as that was the leader. In them I saw this error:

rejected connection on client endpoint
tls: failed to verify certificate: x509: certificate has expired or is not yet valid
current time 2025-04-28T06:53:44Z is before 2025-04-28T09:47:34Z
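For reference, I pulled those leader logs roughly like this (the pod name follows the usual etcd-<node-name> pattern on kubeadm clusters):

# from a node with a working admin.conf
kubectl -n kube-system logs etcd-k8s-master-1 --tail=100

# or directly on master-node-1 via the container runtime
crictl ps -a | grep etcd
crictl logs <etcd-container-id>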

That was very strange behaviour, as my certificates were valid, so the only thing left was a possible time difference between the VMs. When I ran date -u on each master I found that master-node-3 was 3 hours behind, which caused the TLS failure. After setting the time manually, the join worked fine.
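For anyone else debugging this, the skew check and one-off manual fix look roughly like this (hostnames match my cluster and assume SSH access between masters; the timestamp in the manual set is only a placeholder, and proper NTP sync is the better long-term fix):

# compare UTC time across the masters
for h in k8s-master-1 k8s-master-2 k8s-master-3; do echo "$h: $(ssh "$h" date -u)"; done

# one-off manual correction on the node that was behind
date -u -s "2025-04-28 09:50:00"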

I wonder how people set up K8s clusters with masters in different zones in the cloud. A good point to investigate.

Sounds like the time was set up incorrectly on that host. Linux expects to run with its clock synced to a valid NTP source; time zones are purely a display concept: TZ sets a zone that is used for presenting times, but the system clock itself is not affected by it.
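A minimal way to check and fix that on a systemd host, assuming chrony or systemd-timesyncd is available:

# see whether the clock is actually being synchronized
timedatectl status          # look for "System clock synchronized: yes"

# enable NTP sync (uses systemd-timesyncd or chrony, whichever is installed)
timedatectl set-ntp true

# if chrony is in use, verify it is tracking a source
chronyc tracking
chronyc sources -v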