Cluster failure after rebooting

alaamelnagy · July 7, 2024, 10:26am

Hey everyone, I’ve hit a roadblock and could really use your help. Our cluster was running smoothly until the machine rebooted. Now, we’re encountering the following error:

E0707 13:24:15.546885 42410 memcache.go:265] couldn't get current server API group list: Get "https://localhost:6444/api?timeout=32s": dial tcp [::1]:6444: connect: connection refused
E0707 13:24:15.548249 42410 memcache.go:265] couldn't get current server API group list: Get "https://localhost:6444/api?timeout=32s": dial tcp [::1]:6444: connect: connection refused
The connection to the server localhost:6444 was refused - did you specify the right host or port?

Does anyone know how to resolve this issue?

Alistair_KodeKloud · July 7, 2024, 3:22pm

Where did this log output come from?

alaamelnagy · July 8, 2024, 10:52am

Thank you! I found out the problem was the garbage collector deleting many Kubernetes images after a reboot. Since I am working offline, they didn’t pull again. I increased the time of the garbage collector. Thanks a lot for your help!

Alistair_KodeKloud · July 8, 2024, 6:44pm

If you’re using an airgapped cluster, you should consider hosting an image registry within the private network for the cluster to use. If you’re on a cloud like AWS, it can be the cloud provider’s registry; if you’re in a datacenter, run something like Artifactory or ProGet.

alaamelnagy · July 9, 2024, 8:14am

Thanks, Alistair, for your advice. I have a few more questions related to this topic.

What do you think about automation tools for installing Kubernetes, like “KURL”? Do you think they are reliable for use in a production environment?

Currently, I am using a private registry called Harbor, but I haven’t yet configured Kubernetes to pull images from it. Instead, I’ve increased the garbage collector time. Do you think this is a temporary solution, or can it be enough for the long term?

Alistair_KodeKloud · July 9, 2024, 6:47pm

Long term, your cluster should be using the Harbor repo for its images, both for the cluster itself and for the applications it is running. You should have automation in a CI/CD process for pushing app images there. You should have a process for pushing any cluster images you need too.
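
As a rough illustration of what such a push process might look like, here is a minimal Python sketch that works out where each image would live in Harbor and prints the pull/tag/push commands for it. The host `harbor.example.internal`, the project `k8s`, and the image list are all made-up placeholders, and the helper names are hypothetical. It only prints the commands; you would run them from a machine that can reach both the upstream registries and Harbor.

```python
def split_registry(image):
    """Split an image reference into (registry, remainder).

    Follows the Docker convention: the first path component is treated as a
    registry host only if it contains '.' or ':' or equals 'localhost'.
    """
    first, sep, rest = image.partition("/")
    if sep and ("." in first or ":" in first or first == "localhost"):
        return first, rest
    # No explicit registry in the reference, so it defaults to Docker Hub.
    return "docker.io", image


def harbor_target(image, harbor_host, project):
    """Return the reference the image would have inside a Harbor project."""
    _, remainder = split_registry(image)
    return f"{harbor_host}/{project}/{remainder}"


def mirror_commands(images, harbor_host, project):
    """Build pull/tag/push command lines for each image (nothing is executed)."""
    commands = []
    for image in images:
        target = harbor_target(image, harbor_host, project)
        commands.append(f"docker pull {image}")
        commands.append(f"docker tag {image} {target}")
        commands.append(f"docker push {target}")
    return commands


if __name__ == "__main__":
    # Placeholder values: substitute your own Harbor host, project, and images.
    images = ["registry.k8s.io/kube-apiserver:v1.30.0", "nginx:1.27"]
    for line in mirror_commands(images, "harbor.example.internal", "k8s"):
        print(line)
```

The nodes still need their container runtime pointed at Harbor before they can pull these retagged images; that part depends on your runtime and is not shown here.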

Increasing garbage collection time risks filling your nodes up with images and running them out of disk space.
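
To get a feel for how close a node is to that point, here is a small Python sketch of the threshold check, assuming the kubelet's default image GC settings (`imageGCHighThresholdPercent` of 85 and `imageGCLowThresholdPercent` of 80). It is a simplified model and not the kubelet's actual code, and the path it inspects is a placeholder; point it at whichever filesystem holds your container images.

```python
import shutil

# Kubelet defaults for image garbage collection; adjust if your
# KubeletConfiguration overrides them.
HIGH_THRESHOLD_PERCENT = 85
LOW_THRESHOLD_PERCENT = 80


def usage_percent(total, free):
    """Percentage of the filesystem that is currently in use."""
    return 100.0 * (total - free) / total


def bytes_to_free(total, free, high=HIGH_THRESHOLD_PERCENT, low=LOW_THRESHOLD_PERCENT):
    """Bytes image GC would aim to reclaim, or 0 if usage is below `high`.

    Simplified model: once usage reaches the high threshold, images are
    removed until usage would fall back to the low threshold.
    """
    if usage_percent(total, free) < high:
        return 0
    target_free = total * (100 - low) // 100
    return max(0, target_free - free)


if __name__ == "__main__":
    # Placeholder path: use the filesystem that backs your image store.
    usage = shutil.disk_usage("/")
    print(f"disk usage: {usage_percent(usage.total, usage.free):.1f}%")
    print(f"image GC would try to free: {bytes_to_free(usage.total, usage.free)} bytes")
```

If images are protected from collection for longer while the disk keeps filling, usage can stay above the high threshold with nothing eligible to remove, which is the out-of-disk risk described above.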

You should definitely automate the provisioning of the cluster itself: use Terraform to build the nodes it runs on. How you install Kubernetes itself largely depends on the type of cluster you run (kubeadm, "hard way" type with OS services, etc.).

alaamelnagy · July 9, 2024, 7:18pm

Thank you so much for your support