Certified Kubernetes Administrator Exam Series (Part-5): Cluster Maintenance

In the previous blog of this 10-part series, we discussed Application Lifecycle Management. This post takes the candidate through various cluster maintenance processes, including operating system upgrades, the implications of removing a node from the cluster, Kubernetes releases and versions, the cluster upgrade process, and upgrade best practices, among others.

Introduction

Kubernetes clusters require maintenance to ensure that nodes, Pods, and other resources operate optimally. Cluster maintenance practices fall under three broad categories:

  • Operating System Upgrades, 
  • Cluster Upgrades, and 
  • Backup & Restore Methodologies.  

OS Upgrades

When a node in a cluster goes down, the Pods running on it become inaccessible, and users may not be able to reach the cluster services hosted in those Pods. Workloads that also have Pod instances running on other nodes are unaffected. If the failed node comes back online quickly, the kubelet service starts, and the Pods become available again.

However, if the node stays unavailable for more than 5 minutes, its Pods are permanently terminated. Pods that were part of a ReplicaSet are recreated on other nodes.

The time it takes to wait before the Pods are terminated is known as the POD Eviction Timeout and is set on the Kube-Controller-Manager with a default value of 5 minutes:

kube-controller-manager --pod-eviction-timeout=5m0s

If a node comes back online after the POD Eviction Timeout, it is blank, with no Pods scheduled on it. This means that only quick updates can be performed on a node this way, and it should be back online before the timeout expires, or its Pods will be terminated.

For node updates that are expected to take a longer time, like an Operating System Upgrade, the drain command is used. This redistributes the Pods to other nodes in the cluster:

kubectl drain node-1

When a node is drained, its Pods are gracefully terminated and then recreated on other cluster nodes. The node being drained is marked as unschedulable, so no other Pods can be assigned to it. Even when it comes back online, the node stays unschedulable until an administrator removes the restriction using the uncordon command:

kubectl uncordon node-1

The node does not recover any Pods previously scheduled on it. Rather, new Pods can be scheduled on it after it is uncordoned.

The cordon command marks a node as unschedulable. However, unlike the drain command, it does not terminate existing Pods on the node. It just ensures that no new Pods are scheduled on the node:

kubectl cordon node-1
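
Before starting maintenance, it can help to confirm that a node really is cordoned and to see which Pods are still bound to it. The following is a minimal sketch, assuming kubectl is already configured for the cluster. The function names is_cordoned and pods_on_node are illustrative helpers, not kubectl features.

```shell
# Succeeds when the node is marked unschedulable (cordoned or drained).
# .spec.unschedulable is absent on schedulable nodes, so the output is empty then.
is_cordoned() {
  [ "$(kubectl get node "$1" -o jsonpath='{.spec.unschedulable}')" = "true" ]
}

# Lists the Pods currently bound to the node, across all namespaces.
pods_on_node() {
  kubectl get pods --all-namespaces -o wide --field-selector "spec.nodeName=$1"
}
```

For example, is_cordoned node-1 && pods_on_node node-1 shows what is still running on a cordoned node.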

Kubernetes Software Versions

Clusters run on specific versions of Kubernetes. The version running on each node can be seen in the output of the kubectl get nodes command:

NAME           STATUS   ROLES                  AGE     VERSION
controlplane   Ready    control-plane,master   9m52s   v1.17.3
node01         Ready    <none>                 9m11s   v1.17.3

Kubernetes versions follow an x.y.z pattern. The Kubernetes VERSION here is v1.17.3. The anatomy of the version shows how the Cloud Native Computing Foundation (CNCF) manages Kubernetes releases through a standard procedure. v1 is the major release version; the first major version (v1.0) was released in July 2015. .17 is the minor version; minor versions typically come with new features and improvements. At the time of writing, the latest stable Kubernetes version was v1.21. The .3 is the patch version, which carries the latest bug fixes.
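
The x.y.z anatomy can be made concrete with a few lines of shell. This is a small sketch using only POSIX parameter expansion. The function name parse_version is an assumption for illustration, and it handles plain x.y.z strings only, not pre-release suffixes.

```shell
# Split a Kubernetes version string such as v1.17.3 into its three parts.
parse_version() {
  ver="${1#v}"          # drop the leading "v"
  major="${ver%%.*}"    # text before the first dot
  rest="${ver#*.}"      # text after the first dot
  minor="${rest%%.*}"   # text before the second dot
  patch="${rest#*.}"    # text after the second dot
  echo "major=$major minor=$minor patch=$patch"
}

parse_version v1.17.3   # prints: major=1 minor=17 patch=3
```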

CNCF also releases Alpha and Beta versions of the software to initially test the effectiveness of new features and improvements. Alpha versions typically have new features disabled by default and come with a lot of bugs. Beta releases feature well-tested code plus new features and improvements. These features finally make their way to the stable release.

All stable releases can be found on the official releases page in the Kubernetes GitHub repository. The Kubernetes components can be downloaded as a tar.gz archive, which, when extracted, contains most of the control plane elements in the same version. The ETCD Cluster and CoreDNS Server have different version numbers since separate projects manage them.

Introduction to Cluster Upgrades

Core Control Plane Components do not have to run on the same Kubernetes release version. The kube-apiserver is the primary control plane component and is always in communication with other components. Ideally, no component should, therefore, be running on a version newer than kube-apiserver. Here is a summary of version compatibility:

  • The kube-scheduler and kube-controller-manager can run on a version that is at most one minor version older than the kube-apiserver. This means that if kube-apiserver is running on version v1.17, then kube-scheduler and kube-controller-manager can run on versions v1.17 and v1.16.
  • Both kubelet and kube-proxy can run on versions that are up to two minor versions older than the kube-apiserver.
  • kubectl can run on a version that is one minor version older or newer than the kube-apiserver. Together, these allowances make it possible to upgrade a cluster component by component as required.
💡
Kubernetes only supports the three latest minor versions, so clusters are best upgraded as new minor releases come out. It is also important to upgrade the components one minor version at a time rather than skipping straight to the latest release.
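
The compatibility rules above can be written down as a small check. This is a sketch under stated assumptions: both versions share the same major version, kubectl is allowed one minor version in either direction (as in the upstream version skew policy), and the function names minor_of and skew_ok are made up for this example.

```shell
# Extract the minor number from a version such as v1.17.3 or v1.17.
minor_of() {
  m="${1#v}"       # drop the leading "v"
  m="${m#*.}"      # drop the major version
  echo "${m%%.*}"  # keep everything up to the next dot
}

# skew_ok COMPONENT APISERVER_VERSION COMPONENT_VERSION
# Returns 0 if the component version is within the allowed skew, 1 if not,
# and 2 for a component name this sketch does not know.
skew_ok() {
  diff=$(( $(minor_of "$2") - $(minor_of "$3") ))   # positive: component is older
  case "$1" in
    kube-controller-manager|kube-scheduler)
      [ "$diff" -ge 0 ] && [ "$diff" -le 1 ] ;;
    kubelet|kube-proxy)
      [ "$diff" -ge 0 ] && [ "$diff" -le 2 ] ;;
    kubectl)
      [ "$diff" -ge -1 ] && [ "$diff" -le 1 ] ;;
    *)
      return 2 ;;
  esac
}
```

For instance, skew_ok kubelet v1.17.3 v1.15.0 succeeds, while skew_ok kube-scheduler v1.17.0 v1.15.0 fails because the scheduler would be two minor versions behind.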

The upgrade procedure generally depends on how the cluster was set up. If the cluster is running on a managed cloud service like Google Cloud, the upgrade can be performed with a few clicks. If the cluster was created from scratch, then every component has to be upgraded manually. If the cluster was set up using kubeadm, then cluster upgrades are performed by running the upgrade plan and upgrade apply commands. In a kubeadm cluster, the upgrade process follows two major steps:

  • Upgrading the master node
  • Upgrading the worker nodes

When upgrading the master node, all control plane components become briefly unavailable. Cluster management functions are inactive, but the worker nodes continue running workloads. During this window, cluster components and resources cannot be accessed through Kubernetes API tools such as kubectl.

There are different strategies with which the worker nodes can be upgraded:

  • In the first strategy, all worker nodes are taken down simultaneously, upgraded, and then brought back up. During this type of upgrade, the Pods are down, and users cannot access the application. After the upgrade, the nodes are brought back up so new Pods can be scheduled, and the application starts running.
  • The second strategy involves upgrading the worker nodes one at a time. In this case, when one node is being upgraded, its workload is shifted to the remaining nodes in the cluster. When it comes back up, workloads from the next node are moved into the upgraded node, allowing the next node to get an upgrade. This process is repeated until all nodes in the cluster have been upgraded.
  • The third strategy involves introducing newer worker nodes running a new version of software into the cluster. As a new node joins the cluster, workloads are assigned from an existing node. The old node then exits the cluster. This process is repeated until all cluster nodes have been replaced by new ones. This approach is particularly effective when the cluster is hosted on a managed cloud platform, which makes it easy to provision new instances and decommission existing ones.

In a kubeadm cluster, the Master node upgrade can be planned by first upgrading kubeadm using the command:

apt-get install -y kubeadm=1.17.0-00

The latest available upgrades are then checked by running the command:

kubeadm upgrade plan

This command outputs the current version information and the latest release available for all cluster components. The upgrade can then be performed using the command:

kubeadm upgrade apply v1.17.0

Running the kubectl get nodes command will display the master node still running the older version of Kubernetes. This is because the command reports the version of the kubelet running on each host. The kubelet is upgraded using the command below and may then need a restart (systemctl restart kubelet):

apt-get install -y kubelet=1.17.0-00
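
The control plane steps above can be collected into one function. This is a hedged sketch rather than a definitive procedure: it assumes a Debian or Ubuntu host, pins packages with apt-get install, passes the target version explicitly to kubeadm upgrade apply, and restarts the kubelet at the end. The run helper and the DRY_RUN variable are conventions invented for this example so the sequence can be previewed without changing anything.

```shell
# Run a command, or only print it when DRY_RUN is set (an assumed convention).
run() {
  if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi
}

# upgrade_control_plane VERSION   (e.g. 1.17.0, without the leading "v")
upgrade_control_plane() {
  pkg="$1-00"   # Debian package revision, matching the commands above
  run apt-get update &&
  run apt-get install -y "kubeadm=$pkg" &&
  run kubeadm upgrade plan &&
  run kubeadm upgrade apply "v$1" &&
  run apt-get install -y "kubelet=$pkg" &&
  run systemctl restart kubelet
}
```

Setting DRY_RUN=1 before calling upgrade_control_plane 1.17.0 prints the six commands in order. Without it, the commands run for real and the chain stops at the first failure.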

To upgrade a worker node, its workloads are first safely redistributed using the drain command:

kubectl drain node-1 --ignore-daemonsets

This command cordons the node, meaning no additional Pods can be scheduled on it during the upgrade, and evicts its Pods so that they are recreated on other worker nodes within the cluster.

kubeadm is then upgraded on the worker node using the command:

apt-get install -y kubeadm=1.17.0-00

The kubelet service is then updated using the command:

apt-get install -y kubelet=1.17.0-00

The node's configuration can then be upgraded using the command:

kubeadm upgrade node

After the kubelet is restarted (systemctl restart kubelet), the node runs the new version of Kubernetes software. For Pods to be scheduled on it, the node should be uncordoned using the command:

kubectl uncordon node-1

The node is now back online: it can accept workloads moved off the next node to be upgraded, and new Pods can be scheduled on it. This process is repeated until all worker nodes in the cluster have their software versions upgraded.
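
The one-node-at-a-time procedure can be sketched as a loop. The assumptions here are significant: the script runs from a machine with kubectl access, each worker is reachable over SSH as root under its node name, the hosts are Debian or Ubuntu based, and the node-side step is kubeadm upgrade node. The function names upgrade_worker and rolling_upgrade are illustrative.

```shell
# upgrade_worker NODE VERSION   (VERSION like 1.17.0)
upgrade_worker() {
  node="$1"; pkg="$2-00"
  # Move workloads off the node and mark it unschedulable.
  kubectl drain "$node" --ignore-daemonsets || return 1
  # Upgrade kubeadm, the node configuration, and the kubelet on the node itself.
  ssh "root@$node" "apt-get update && apt-get install -y kubeadm=$pkg && kubeadm upgrade node && apt-get install -y kubelet=$pkg && systemctl restart kubelet" || return 1
  # Let the scheduler place Pods on the node again.
  kubectl uncordon "$node"
}

# rolling_upgrade VERSION NODE...
# Upgrades one node at a time and stops at the first failure.
rolling_upgrade() {
  version="$1"; shift
  for n in "$@"; do
    upgrade_worker "$n" "$version" || { echo "upgrade of $n failed" >&2; return 1; }
  done
}
```

A call such as rolling_upgrade 1.17.0 node-1 node-2 would process node-1 completely before touching node-2. A node whose upgrade fails is left cordoned so it can be inspected.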

Backup and Restore Methods

Backup and Restore Using Resource Configuration

Kubernetes cluster resources are created by defining their configurations using the imperative or declarative approach.

  • Imperative management involves directly manipulating resources through commands such as kubectl run or kubectl create. This approach is useful for quick, one-off changes and for debugging purposes. However, it can be challenging to manage larger or more complex deployments using imperative commands alone.
  • Declarative management, on the other hand, involves defining the desired state of a deployment in a YAML or JSON file and using kubectl apply to ensure that the current state matches the desired state. This approach is more scalable and easier to manage, as it allows for version control and more easily reproducible deployments. It is preferred by developers who want to save and share their application’s configuration information.

It is important to store copies of resource manifest files. A good practice is to keep them in a source code repository where the development team can access and manage them. The repository should be configured with proper backup options. A good choice is a managed hosting service like GitHub or GitLab, where developers don’t have to worry about providing backup and restoration options. If a cluster is lost, it can then be recovered by applying the configuration files stored in the repository.

Not everyone in a team has to follow the declarative approach in creating resources. If resources were created imperatively, the cluster can be backed up by first querying the kube-apiserver on information regarding cluster resources. This configuration information can then be saved so that it can be referred to later when the resources need restoration. To query the Kubernetes API on resource configuration, the following command is used:

kubectl get all --all-namespaces -o yaml > darwin-service-config.yaml

This command retrieves the configuration of resources in all namespaces in YAML format and saves it in a file named darwin-service-config.yaml. Note that kubectl get all returns only a subset of resource types, such as Pods, Services, Deployments, and ReplicaSets; other types, such as ConfigMaps and Secrets, have to be exported separately. If the cluster needs to be restored, the resources can be rebuilt from the configuration in the file.
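
For repeated backups, the same query can be wrapped so that each run produces its own timestamped file. This is a minimal sketch. The function name backup_resources and the file naming scheme are assumptions, and it exports only what kubectl get all returns.

```shell
# backup_resources DIR
# Writes a timestamped YAML dump into DIR and prints the file name.
backup_resources() {
  dir="$1"
  file="$dir/cluster-resources-$(date +%Y%m%d-%H%M%S).yaml"
  mkdir -p "$dir" || return 1
  # Keep the file only if kubectl succeeds; remove partial output otherwise.
  if kubectl get all --all-namespaces -o yaml > "$file"; then
    echo "$file"
  else
    rm -f "$file"
    return 1
  fi
}
```

Calling backup_resources /var/backups/k8s (a hypothetical directory) from a scheduled job would build up a history of dumps that can be fed back to kubectl apply -f when needed.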

Tools like Velero can also help create backups by querying the API and storing configuration information periodically.

Backup and Restore Using ETCD

The ETCD Server stores data about cluster resources. Backing up the ETCD cluster can be an effective alternative to storing resource configuration information. The data directory where this information is stored is specified when configuring the ETCD service. Backup tools can be configured to access this location periodically for backup. 

The etcdctl tool includes a built-in snapshot utility that lets developers create a backup snapshot of the ETCD cluster instantaneously.  A snapshot is created using the command:

ETCDCTL_API=3 etcdctl --endpoints=https://[127.0.0.1]:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt \
     --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key \
     snapshot save /tmp/snapshot.db

This saves the current ETCD cluster data in a file named snapshot.db in the /tmp directory. To save the snapshot in another location, pass a different file path to snapshot save.

The status of the backup can be viewed using the command:

ETCDCTL_API=3 etcdctl --endpoints=https://[127.0.0.1]:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt \
     --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key \
     snapshot status /tmp/snapshot.db
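
The save and status steps are often run together, so they can be combined in one helper. This is a sketch assuming the certificate paths used in the commands above. ETCD_ENDPOINT, ETCD_PKI, and the function name etcd_snapshot are names introduced for this example, not etcdctl settings.

```shell
# Connection settings; defaults follow the paths used in the commands above.
ETCD_ENDPOINT="${ETCD_ENDPOINT:-https://127.0.0.1:2379}"
ETCD_PKI="${ETCD_PKI:-/etc/kubernetes/pki/etcd}"

# etcd_snapshot FILE
# Saves a snapshot to FILE and then prints its status.
etcd_snapshot() {
  ETCDCTL_API=3 etcdctl --endpoints="$ETCD_ENDPOINT" \
    --cacert="$ETCD_PKI/ca.crt" --cert="$ETCD_PKI/server.crt" --key="$ETCD_PKI/server.key" \
    snapshot save "$1" || return 1
  # snapshot status reads the file locally, so no connection flags are needed.
  ETCDCTL_API=3 etcdctl snapshot status "$1" --write-out=table
}
```

Running etcd_snapshot /tmp/snapshot.db on the control plane node saves the snapshot and prints a table describing it. If the save fails, the status step is skipped.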

The snapshot is restored by pointing etcdctl to the snapshot file:

ETCDCTL_API=3 etcdctl --endpoints=https://[127.0.0.1]:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt \
     --name=master \
     --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key \
     --data-dir /var/lib/etcd-from-backup \
     --initial-cluster=master=https://127.0.0.1:2380 \
     --initial-cluster-token etcd-cluster-1 \
     --initial-advertise-peer-urls=https://127.0.0.1:2380 \
     snapshot restore /tmp/snapshot.db

Modify /etc/kubernetes/manifests/etcd.yaml:
Update --data-dir to use the new target location:

--data-dir=/var/lib/etcd-from-backup

Update --initial-cluster-token to identify the new cluster:

--initial-cluster-token=etcd-cluster-1

Update the volumes and volumeMounts to point to the new path:

    volumeMounts:
    - mountPath: /var/lib/etcd-from-backup
      name: etcd-data
    - mountPath: /etc/kubernetes/pki/etcd
      name: etcd-certs
  hostNetwork: true
  priorityClassName: system-cluster-critical
  volumes:
  - hostPath:
      path: /var/lib/etcd-from-backup
      type: DirectoryOrCreate
    name: etcd-data
  - hostPath:
      path: /etc/kubernetes/pki/etcd
      type: DirectoryOrCreate
    name: etcd-certs

When ETCD restores from this backup, it initializes a new cluster configuration and configures its components to run as new members of the new cluster. This prevents newly created members from joining existing clusters. After the manifest file is modified, the ETCD Pod should restart automatically.

The cluster now runs using the configurations stored on the snapshot of the ETCD server.
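
The data directory edits in the manifest can also be scripted. The sketch below is an assumption-laden example: it expects a manifest in which the old data directory appears at the end of each relevant line (the --data-dir flag, the mountPath, and the hostPath path) with no trailing whitespace, and it does not touch --initial-cluster-token. The function name repoint_etcd_data_dir is hypothetical.

```shell
# repoint_etcd_data_dir MANIFEST OLD_DIR NEW_DIR
# Rewrites lines that end in OLD_DIR so that they end in NEW_DIR instead.
repoint_etcd_data_dir() {
  manifest="$1"; old="$2"; new="$3"
  tmp=$(mktemp) || return 1
  # Anchoring the match to the end of the line keeps the edit idempotent and
  # leaves other paths, such as the certificate directory, untouched.
  sed "s#${old}\$#${new}#" "$manifest" > "$tmp" && cat "$tmp" > "$manifest"
  status=$?
  rm -f "$tmp"
  return "$status"
}
```

A call such as repoint_etcd_data_dir /etc/kubernetes/manifests/etcd.yaml /var/lib/etcd /var/lib/etcd-from-backup applies the three path changes in one step. It is worth reviewing the resulting file before relying on it.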

Conclusion

This concludes the Cluster Maintenance section of the CKA certification exam series. If you have feedback or would like something changed within the course, you can send it to the course developers.

You can now proceed to the next part of this series: Certified Kubernetes Administrator Exam Series (Part-6): Security

Here is the previous part of the Certified Kubernetes Administrator Exam Series: Certified Kubernetes Administrator Exam Series (Part-4): Application Lifecycle Management

Research Questions

Here is a quick quiz with a few questions and tasks to help you assess your knowledge. Leave your answers in the comments below and tag us back. 

Quick Tip – Questions below may include a mix of DOMC and MCQ types.

1. We need to take node01 out for maintenance. Empty the node of all applications and mark it unschedulable. What command should we use?

Answer:

$ kubectl drain node01 --ignore-daemonsets

2. You are tasked to upgrade the cluster. Users accessing the applications must not be impacted. And you cannot provision new VMs. What strategy would you use to upgrade the cluster?

[A] Upgrade all nodes at once

[B] Users will be impacted since there is only one worker node

[C] Upgrade one node at a time while moving the workloads to the other

3. Upgrade the controlplane components to version v1.20.0

Answer:

On the controlplane node, run the following commands:
$ apt update
This will update the package lists from the software repository.


$ apt install kubeadm=1.20.0-00
This will install kubeadm version 1.20.0.


$ kubeadm upgrade apply v1.20.0
This will upgrade the Kubernetes control plane. Note that this can take a few minutes.


$ apt install kubelet=1.20.0-00 
This will update the kubelet with the version 1.20.


You may need to restart the kubelet after it has been upgraded.
Run: $ systemctl restart kubelet

4. Where is the ETCD server certificate file located?

[A] /etc/kubernetes/pki/server.crt

[B] /etc/kubernetes/pki/etcd/peer.crt

[C] /etc/kubernetes/pki/etcd/ca.crt

[D] /etc/kubernetes/pki/etcd/server.crt

5. We have a Webapp, a critical app in node01. For this reason, we do not want it to be removed and do not want to schedule any more pods on node01. How do you mark node01 as unschedulable so that no new pods are scheduled on this node?

Answer:

$ kubectl cordon node01

6. Where is the ETCD CA Certificate file located?

[A] /etc/kubernetes/pki/etcd/ca.crt

[B] /etc/kubernetes/pki/ca.crt

[C] /etc/kubernetes/pki/etcd/peer.crt

[D] /etc/kubernetes/pki/etcd/ca.key

7. The master nodes in our cluster are planned for a regular maintenance reboot tonight. While we do not anticipate anything to go wrong, we are required to take the necessary backups. Take a snapshot of the ETCD database using the built-in snapshot functionality. Store the backup file at location /opt/snapshot-pre-boot.db.

Try on your own to do this first, and then validate with the right approach as mentioned below.

ETCDCTL_API=3 etcdctl --endpoints=https://[127.0.0.1]:2379 \
     --cacert=/etc/kubernetes/pki/etcd/ca.crt \
     --cert=/etc/kubernetes/pki/etcd/server.crt \
     --key=/etc/kubernetes/pki/etcd/server.key \
     snapshot save /opt/snapshot-pre-boot.db

Conclusion

As Kubernetes clusters are highly distributed, cluster maintenance is a crucial task for any Kubernetes administrator. This post has explored the processes involved in maintaining nodes and cluster resources. It also covered OS upgrades and Kubernetes software versions to help candidates make sure applications run in production as expected.

Exam Preparation Course

Our CKA Exam Preparation course explains all the Kubernetes concepts included in the certification’s curriculum. After each topic, you get interactive quizzes to help you internalize the concepts learned. At the end of the course, we have mock exams that will help familiarize you with the exam format, time management, and question types.
