Certified Kubernetes Administrator Exam Series (Part-9): Troubleshooting

Introduction 

The Kubernetes environment typically spans multiple machines running different services and components. Anything can go wrong, and administrators need a way to track down and resolve errors when running Kubernetes applications in production. Failures can occur at different levels of the environment, giving rise to different classes of problems. This section introduces these failure types and the techniques that can be used to identify and resolve errors at each level. The section is packed with practical labs exploring failure and resolution scenarios, building applicable knowledge for debugging and troubleshooting running Kubernetes clusters.

Application Failure

Sometimes an application deployed on Kubernetes does not behave as expected. When a Kubernetes application is not running correctly, the cause usually lies in one of three components: PODs, Services, or Replication Controllers. This section explores how to debug and troubleshoot each of them.

Troubleshooting PODs

A POD may not be available to users for a number of reasons. The first step when checking for POD failure is to inspect its state. A POD typically goes through five phases: Pending, Running, Succeeded, Failed, and Unknown.

If a POD is in Pending state, it means that the Kubernetes system has accepted it, but one or more container images have not been created. A POD stays in Pending state when it is still being scheduled and when container images are being downloaded over a network. In this case, all one has to do is wait for the POD to come up completely or restart it if the failure persists.

A Running POD is healthy, bound to a node and has all containers active. 

  • If a POD is in the Succeeded phase, every container has been run successfully, and will not need a restart.
  • In the Failed phase, one or more containers have ended in failure.
  • If a POD is listed as Unknown, there’s an error connecting the Kubernetes API Server to the POD, and so its state cannot be established.

To check for POD state, the following command is used:

$ kubectl describe pods <pod-name>
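
To print just a POD's phase, a jsonpath query of the following form can be used:

$ kubectl get pod <pod-name> -o jsonpath='{.status.phase}'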

A POD may also fail because its application fails to start, for reasons such as exceeding resource limits or lacking permissions. In this case, the failure can be tracked through the events recorded in the POD's namespace. Events can be viewed using the command:

$ kubectl get events --namespace <namespace-name>
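
In a busy namespace, sorting the events by time makes the most recent failures easier to spot:

$ kubectl get events --namespace <namespace-name> --sort-by=.metadata.creationTimestamp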

If a POD is running but the application is not behaving as expected, the first step involves checking the affected container’s logs. This can be done using a command of the format:

$ kubectl logs <pod-name> <container-name>
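
To stream the logs as they are written, or to limit the output to the most recent lines, the -f and --tail flags can be added:

$ kubectl logs -f <pod-name> <container-name>
$ kubectl logs --tail=50 <pod-name> <container-name>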

If a POD previously crashed and was then restarted, the error may be tracked by checking the previous container's crash log using the command:

$ kubectl logs --previous <pod-name> <container-name>

Some container images come bundled with debugging utilities. Debugging commands can be run within the container using the kubectl exec utility:  

$ kubectl exec ${POD_NAME} -c ${CONTAINER_NAME} -- ${CMD} ${ARG1} ${ARG2} ... ${ARGN}
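
For example, to open an interactive shell inside a container (assuming its image bundles a shell such as sh) or to inspect its DNS configuration:

$ kubectl exec -it <pod-name> -c <container-name> -- sh
$ kubectl exec <pod-name> -c <container-name> -- cat /etc/resolv.conf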

Debugging Replication Controllers

Replication controllers may fail to create POD replicas as needed. To inspect events related to a specific replication controller, the following command is used:

$ kubectl describe rc ${CONTROLLER_NAME}
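
The controller's desired, current, and ready replica counts can also be compared at a glance:

$ kubectl get rc ${CONTROLLER_NAME}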

Troubleshooting Services

Services provide service discovery and load balancing across PODs. Several problems may cause a service to stop working correctly, making an application inaccessible.

It is important to make sure the service has its endpoints correctly configured and active. The number of endpoints should also match the number of PODs expected to be reached through the service. To view a service's endpoints, the following command is used:

$ kubectl get endpoints <service-name>

If a service is missing endpoints, the next step is to list the PODs that its selector is supposed to match. This is achieved using a command of the form:

$ kubectl get pods --selector=name=<selector-name>,type=<type-label>
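
To compare the labels the PODs actually carry against the selector the service uses, the following pair of commands can help:

$ kubectl describe service <service-name>
$ kubectl get pods --show-labels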

If PODs that are expected to appear on the list are missing, then their labels, or the service's selector, must be corrected so that the two match.

If network traffic is still not being forwarded, the administrator can check for faults in the DNS server, iptables rules, and kube-proxy.
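
As a starting point, DNS resolution and kube-proxy health can be checked with commands like the following (this assumes the POD's image includes nslookup, and that kube-proxy carries the k8s-app=kube-proxy label that kubeadm applies):

$ kubectl exec -it <pod-name> -- nslookup <service-name>
$ kubectl logs -n kube-system -l k8s-app=kube-proxy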

Control Plane Failure

The first step to debugging any cluster involves checking that all nodes are working correctly. This can be assessed using the command:

$ kubectl get nodes

Once the nodes are listed, each individual node's status can be checked using the command:

$ kubectl describe node <node-name>

A running node is described by several conditions (a command for printing them directly follows the list):

  • Ready: This condition indicates whether a node is healthy and ready to accept PODs. It is True when the node can accept PODs, False when the node is unhealthy, and Unknown when the node controller has not heard from the node within a certain time frame (the grace period).
  • DiskPressure: This is True if the disk capacity is too low for the PODs' needs, and False otherwise.
  • MemoryPressure: This is True if the workload's memory needs overwhelm the node's memory capacity, and False otherwise.
  • PIDPressure: This is True if there are too many processes running on the node, and False if not.
  • NetworkUnavailable: This is True if the network has been configured improperly, and False otherwise.
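
These conditions can be printed directly with a jsonpath query such as:

$ kubectl get node <node-name> -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'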

In some cases, one may have to look at each control plane service's specific logs for event and error information. For instance, the kube-apiserver writes its log file to /var/log/kube-apiserver.log and the Kube Scheduler writes to /var/log/kube-scheduler.log, while the logs of the Kube Controller Manager, which runs the replication controllers, can be found in /var/log/kube-controller-manager.log.

A POD's logs can be accessed explicitly using the command:

$ kubectl logs <pod-name>

The logs for the Kubernetes components running on worker nodes can also be accessed directly. Kubelet logs are located in /var/log/kubelet.log, while those of the Kube-Proxy service can be found in /var/log/kube-proxy.log.

On a node where Kubernetes was deployed using Kubeadm, the control plane components run as PODs in the kube-system namespace. To list the running control plane services, the PODs in that namespace are listed using the command:

$ kubectl get pods -n kube-system
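
On a kubeadm cluster whose control plane node is named controlplane, the listing typically includes static PODs such as etcd-controlplane, kube-apiserver-controlplane, kube-controller-manager-controlplane, and kube-scheduler-controlplane, alongside the coredns and kube-proxy PODs.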

These components can also be deployed directly as operating system services. Their status can be viewed by running a command similar to:

$ service <component-name> status

Event entries for every service in this namespace can also be accessed through the API server’s logs using a command of the form:

$ kubectl logs kube-apiserver-controlplane -n kube-system

If a service has been configured natively on the master node, its logs can be viewed through the host's logging facilities. One such facility is the systemd journal, whose entries can be viewed using the command:

$ sudo journalctl -u kube-apiserver
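
journalctl can also narrow the time window or follow new entries as they arrive:

$ sudo journalctl -u kube-apiserver --since "1 hour ago"
$ sudo journalctl -u kube-apiserver -f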

Worker Node Failure

Just like control plane nodes, worker nodes need to be checked for health and availability. The general procedure for troubleshooting worker nodes is: 

  • Check that worker nodes are running
  • Examine particular nodes that are not running as expected
  • If a node is unavailable, check its heartbeat for the last known communication
  • Check the CPU, memory, and disk resources required by the PODs against those provided by the nodes
  • Check Kubelet logs and certificates for networking and authentication issues (example commands follow this list).
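
As a sketch of the last check on a kubeadm-provisioned worker (the certificate path below is the typical kubeadm location and may differ on other setups):

$ sudo systemctl status kubelet
$ sudo journalctl -u kubelet
$ openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -text -noout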

Network Troubleshooting

Kubernetes supports multiple network plugins, each of which can fail in many possible ways. At its core, a Kubernetes network relies on a bridge and IP forwarding for network communication. If the IP forwarding setting is altered in any way, networking starts to fail. The first step in troubleshooting networks is therefore ensuring that IP forwarding and the bridge netfilter capability are on. To verify that IP forwarding is on, the following command is used:

$ sysctl net.ipv4.ip_forward

The Bridge-Netfilter capability allows iptables rules to be applied to bridged traffic. If this setting is disabled, PODs cannot access services outside the POD network since the destination hosts are unreachable. The setting can be checked using the command:

$ sysctl net.bridge.bridge-nf-call-iptables
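
If either setting is off, it can be enabled at runtime; note that the br_netfilter kernel module must be loaded for the bridge setting to exist:

$ sudo modprobe br_netfilter
$ sudo sysctl -w net.ipv4.ip_forward=1
$ sudo sysctl -w net.bridge.bridge-nf-call-iptables=1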

Several other networking issues exist for Kubernetes clusters. Some of these include:

  • POD CIDR conflicts (a quick check appears after this list)
  • Source Destination Checks
  • Firewall rules blocking overlay traffic  
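
To check for CIDR conflicts, each node's allocated POD CIDR can be printed and compared against the host and service networks:

$ kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.podCIDR}{"\n"}{end}'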

Summary

Kubernetes’ large, hyperconnected environments introduce many potential sources of error. This class covers the various ways clusters fail and how to detect them. KodeKloud’s highly practical lessons include labs and practice tests that exercise the candidate’s ability to debug Kubernetes clusters, helping the candidate build the knowledge needed to inspect cluster components and ensure workloads run as expected.

More details about KodeKloud’s CKA course with access to the lessons, labs, mock exams and demo can be found here – https://kodekloud.com/courses/certified-kubernetes-administrator-cka/.