Certified Kubernetes Administrator Exam Series (Part-9): Troubleshooting

In the previous blog of this 10-part series, we discussed Networking. This blog introduces you to the common types of errors that occur and the techniques you can use to identify and resolve them.

Introduction

Kubernetes is a powerful and complex system that allows developers to easily deploy, manage, and scale applications. However, with so many components working together, errors will inevitably arise. Troubleshooting is a critical skill for anyone working with Kubernetes, as it enables administrators to identify and resolve problems quickly and efficiently.

Common Errors in Kubernetes

This section is packed with practical labs that explore failure and resolution scenarios, giving you hands-on experience debugging and troubleshooting running Kubernetes clusters.

I. Application Failure

Sometimes, an application deployed on Kubernetes may not behave as expected. The cause usually lies in one of three components: Pods, Services, or Replication Controllers. Let’s see how to debug and troubleshoot each of these.

Troubleshooting PODs

A Pod may be unavailable to users for a number of reasons. The first step when checking for Pod failure is to inspect its state. A Pod typically falls into one of these five states: Pending, Running, Succeeded, Failed, or Unknown.

  • Pending state: This means the Kubernetes system has accepted the Pod, but one or more of its container images have not yet been created. A Pod stays in the Pending state while it is still being scheduled and while its container images are being downloaded over the network. In this case, all one has to do is wait for the Pod to come up completely, or restart it if the failure persists.
  • Running state: This means that all containers in the Pod are running properly. 
  • Succeeded state: This means every container in the Pod has been run successfully and will not need a restart.
  • Failed state: This means one or more containers in the Pod have failed.
  • Unknown state: This means the Pod’s state cannot be obtained, typically because of an error communicating with the node on which the Pod is supposed to be running.

Learn more about Pod states from this blog: Kubernetes Readiness Probe: A Simple Guide with Examples.

Checking a Pod's State

To check the state of a Pod, run the command below:

kubectl describe pods <pod-name>
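
For a quick overview of all Pods in a namespace, the STATUS column of kubectl get pods is also useful. A minimal sketch (the Pod names and output below are hypothetical):

kubectl get pods

NAME       READY   STATUS             RESTARTS   AGE
frontend   1/1     Running            0          2d
worker     0/1     ImagePullBackOff   0          10m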

A Pod may also fail because the application fails to start. This can happen for reasons such as exceeding resource limits or a lack of permissions. In this case, the failure can be tracked by inspecting the events that occur within the Pod's namespace. Events can be viewed using the command:

kubectl get events --namespace <namespace-name>
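
If the namespace contains many events, sorting them by creation time makes the most recent failures easier to spot, for example:

kubectl get events --namespace <namespace-name> --sort-by=.metadata.creationTimestamp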

If a Pod is running, but the application is not behaving as expected, the first step involves checking the affected container’s logs. This can be done using the command:

kubectl logs <pod-name> <container-name>
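
The logs can also be streamed live or limited to the most recent lines, which helps when the application writes a lot of output:

kubectl logs -f --tail=100 <pod-name> -c <container-name>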

If a Pod previously crashed and then restarted, the error may be tracked by checking the previous container’s crash log using the command:

kubectl logs --previous <pod-name> <container-name>

Some container images come bundled with debugging utilities. Debugging commands can be run within the container using the kubectl exec utility:  

kubectl exec ${POD_NAME} -c ${CONTAINER_NAME} -- ${CMD} ${ARG1} ${ARG2} ... ${ARGN}
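
As a concrete, hypothetical example, the commands below inspect the DNS configuration inside a container and open an interactive shell, assuming the image ships with cat and sh:

kubectl exec <pod-name> -c <container-name> -- cat /etc/resolv.conf
kubectl exec -it <pod-name> -c <container-name> -- sh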

Debugging Replication Controllers

Replication controllers may fail to create Pod replicas as needed. To inspect events related to a specific replication controller, use the command:

kubectl describe rc ${CONTROLLER_NAME}
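
Comparing the desired and current replica counts is another quick way to spot a controller that is stuck; kubectl get rc prints both. A minimal sketch with a hypothetical controller name and output:

kubectl get rc webapp-rc

NAME        DESIRED   CURRENT   READY   AGE
webapp-rc   3         1         1       5m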

Troubleshooting Services

Services allow for host discovery and load balancing across nodes. Several problems may cause a service to stop working correctly, making an application inaccessible. 

It is important to ensure the service has its endpoints correctly configured and active. The endpoints should match the Pods expected to be accessed using the service. To view a service’s endpoints, the following command is used:

kubectl get endpoints <service-name>
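
It can also help to see the Service’s selector and its current endpoints side by side; kubectl describe prints both:

kubectl describe service <service-name>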

To check if a service is missing endpoints, list the Pods using its labels. This is achieved using the command:

kubectl get pods --selector=name=<selector-name>,type=<type-label>

If Pods that should be on the list are missing, edit their labels, or the Service’s selector, so that they match and traffic reaches the intended Pods.

If network traffic is still not being forwarded, then the administrator can check for faults in the DNS Server, iptables rules, and kube-proxy. 
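
A quick way to test DNS is to run a lookup from inside a Pod, and kube-proxy health can be checked by listing its Pods in the kube-system namespace. A sketch, assuming the Pod’s image includes nslookup and that kube-proxy carries the usual k8s-app=kube-proxy label (as on kubeadm clusters):

kubectl exec -it <pod-name> -- nslookup <service-name>
kubectl get pods -n kube-system -l k8s-app=kube-proxy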

Control Plane Failure

The first step to debugging any cluster involves checking that all nodes are working correctly. This can be assessed using the command:

kubectl get nodes

Once you have seen the running nodes, you can check each individual node’s status using the command:

kubectl describe node <node-name>

A running node is described by several conditions, including:

  • Ready: This condition indicates whether a node is healthy and ready to accept Pods. When this condition is True, the node can accept Pods. When it is False, the node is unhealthy and can’t accept Pods. If the node controller has not heard from the node within a certain time frame (the grace period), the condition is set to Unknown.
  • DiskPressure: This is True if the disk capacity is too low for the Pods' needs; otherwise, it is False.
  • MemoryPressure: This is True if the workload’s memory overwhelms the node’s memory capacity; otherwise, it is False.
  • PIDPressure: This is True if too many processes are running on the node and False if not.
  • NetworkUnavailable: If the network has been configured improperly, this property is True; otherwise, it is False.
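
These conditions appear in the Conditions section of kubectl describe node. To print just the condition types and their statuses, a JSONPath query such as the one below can be used (the exact output format may vary between versions):

kubectl get node <node-name> -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'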

Sometimes, one may have to look at each control plane service’s logs for event and error information. For instance, the kube-apiserver writes to the log file /var/log/kube-apiserver.log, the Kube Scheduler writes to /var/log/kube-scheduler.log, and the Controller Manager (which manages the replication controllers) writes to /var/log/kube-controller-manager.log.

A Pod's logs can be accessed explicitly using the command: 

 kubectl logs <pod-name>

The logs for control plane components running on worker nodes can also be accessed directly. Kubelet logs are located in /var/log/kubelet.log, while those of the Kube-Proxy service can be found in /var/log/kube-proxy.log.

On a node where Kubernetes was deployed using kubeadm, the control plane components run as Pods in the kube-system namespace. To list the running control plane components, list the Pods in that namespace using the command:

kubectl get pods -n kube-system
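
On a typical kubeadm cluster, the output includes one Pod per control plane component, named after the component and the node it runs on; the names below are illustrative for a node called controlplane:

NAME                                   READY   STATUS    RESTARTS   AGE
etcd-controlplane                      1/1     Running   0          3d
kube-apiserver-controlplane            1/1     Running   0          3d
kube-controller-manager-controlplane   1/1     Running   0          3d
kube-scheduler-controlplane            1/1     Running   0          3d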

These components can also be deployed directly as services. Their status can be viewed by running this command:

sudo systemctl status <component-name>

Event entries for every service in this namespace can also be accessed through the API server’s logs using this command:

kubectl logs kube-apiserver-controlplane -n kube-system

If a service has been configured natively on the master node, its logs can be viewed with the host’s own logging tools. One such tool is journalctl, which reads the systemd journal:

sudo journalctl -u kube-apiserver

Learn how to navigate Kubernetes Logs from this blog:

Navigating the Everest of Logs - A Guide to Understanding Kubelet Logs
Use this guide to better understand Kubernetes logs: how to locate them, how to increase Kubelet log levels, and more.

Worker Node Failure

Just like control plane nodes, worker nodes need to be checked for health and availability. The general procedure for troubleshooting worker nodes is as follows: 

  • Check that worker nodes are running.
  • Examine particular nodes that are not running as expected.
  • If a node is unavailable, check its heartbeat for the last known communication.
  • Check for CPU, Memory, and Disk resources required by the Pods vs those provided by the nodes.
  • Check Kubelet logs and certificates for networking and authentication issues (see the example commands below).
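
On the affected worker node itself, the Kubelet’s status and recent logs can be checked with systemd tooling, assuming the Kubelet runs as a systemd service (as it does on kubeadm-provisioned nodes):

sudo systemctl status kubelet
sudo journalctl -u kubelet --since "1 hour ago"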

Network Troubleshooting

Kubernetes supports multiple network plugins, and each can fail in many ways. At its core, Kubernetes networking relies on a bridge and IP forwarding for communication between Pods and nodes. If the IP forwarding setting is altered, the network starts to fail. The first step in troubleshooting the network is ensuring that IP forwarding and the bridge netfilter capability are enabled. To verify that IP forwarding is on, the following command is used:

sysctl net.ipv4.ip_forward
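
The command prints net.ipv4.ip_forward = 1 when forwarding is enabled. If it returns 0, forwarding can be turned back on (and persisted in /etc/sysctl.conf or a file under /etc/sysctl.d/) with:

sudo sysctl -w net.ipv4.ip_forward=1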

The bridge netfilter capability allows iptables rules to be applied to bridged traffic. If this setting is disabled, Pods cannot reach services outside the Pod network because the destination hosts become unreachable. This can be diagnosed using the command:

sysctl net.bridge.bridge-nf-call-iptables
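
A value of 1 means bridged traffic is passed through iptables. If the setting is missing or 0, loading the br_netfilter module and re-enabling the flag usually restores it:

sudo modprobe br_netfilter
sudo sysctl -w net.bridge.bridge-nf-call-iptables=1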

Several other networking issues exist for Kubernetes clusters. Some of these include:

  • Pod CIDR conflicts
  • Source/destination checks on cloud instances
  • Firewall rules blocking overlay traffic  

This concludes the Troubleshooting section of the CKA certification exam.

You can now proceed to the next part of this series: Certified Kubernetes Administrator Exam Series (Part-10): Practice Topics

Here is the previous part of the series: Certified Kubernetes Administrator Exam Series (Part-8): Networking.

Summary

Kubernetes’ large, interconnected environments introduce many potential sources of error. This section covers the main types of cluster failures and how to detect and resolve them.

KodeKloud’s highly practical lessons include labs and practice tests that exercise the candidate’s ability to debug Kubernetes clusters. This section helps the candidate build the knowledge needed to inspect cluster components and ensure workloads run as expected.