Certified Kubernetes Administrator Exam Series (Part-9): Troubleshooting

Introduction
The Kubernetes environment typically spans multiple machines running different services and components. Many things can go wrong, and administrators need a way to track down and resolve errors when running Kubernetes applications in production. Failures can occur at different levels of the environment, giving rise to various failure types. This section introduces these failures and the techniques that can be used to identify and resolve errors at each level of the Kubernetes environment. The section is packed with practical labs exploring failure and resolution scenarios for hands-on experience debugging and troubleshooting running Kubernetes clusters.
Application Failure
Sometimes an application deployed into Kubernetes may not behave as expected. When a Kubernetes application is not running, it is usually because of a failure in one of three components: PODs, Services, or Replication Controllers. This section explores how to debug and troubleshoot each of these components.
Troubleshooting PODs
A POD may not be available to users for a number of reasons. The first step when checking for POD failure is to inspect its state. A POD typically goes through five phases: Pending, Running, Succeeded, Failed, and Unknown.
If a POD is in the Pending state, the Kubernetes system has accepted it, but one or more of its containers have not yet been created. A POD stays in the Pending state while it is being scheduled and while container images are being downloaded over the network. In this case, all one has to do is wait for the POD to come up completely, or restart it if the failure persists.
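To quickly confirm the phase without reading the full description, the phase field can be read directly from the POD's status (a minimal sketch; the POD name is a placeholder):
$ kubectl get pod <pod-name> -o jsonpath='{.status.phase}'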
A Running POD is healthy, bound to a node, and has all containers active.
- If a POD is in the Succeeded phase, every container has run successfully and will not need a restart.
- In the Failed phase, one or more containers have ended in failure.
- If a POD is listed as Unknown, there is an error connecting the Kubernetes API Server to the POD, and so its state cannot be established.
To check a POD's state, the following command is used:
$ kubectl describe pods <pod-name>
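The Events section at the end of the output is usually the most telling part. For illustration, a Pending POD that cannot be scheduled might show something like the following (hypothetical output):
Events:
  Type     Reason            Message
  ----     ------            -------
  Warning  FailedScheduling  0/3 nodes are available: 3 Insufficient cpu.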
A POD may also fail because the application fails to start, for reasons such as exceeding resource limits or lacking permissions. In this case, the failure can be tracked through the events that happen within the POD's namespace. Events can be viewed using the command:
$ kubectl get events --namespace <namespace-name>
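When a namespace contains many events, sorting them chronologically makes the most recent failures easier to spot (a usage sketch):
$ kubectl get events --namespace <namespace-name> --sort-by=.metadata.creationTimestamp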
If a POD is running but the application is not behaving as expected, the first step involves checking the affected container’s logs. This can be done using a command of the format:
$ kubectl logs <pod-name> <container-name>
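To stream the logs live while reproducing the problem, the -f flag can be added (a usage sketch):
$ kubectl logs -f <pod-name> -c <container-name>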
If a POD previously crashed and then restarted, the error may be tracked by checking the previous container's crash log using the command:
$ kubectl logs --previous <pod-name> <container-name>
Some container images come bundled with debugging utilities. Debugging commands can be run within the container using the kubectl exec utility:
$ kubectl exec ${POD_NAME} -c ${CONTAINER_NAME} -- ${CMD} ${ARG1} ${ARG2} ... ${ARGN}
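For example, an interactive shell can be opened inside a container to investigate from within, provided the image actually ships a shell (a sketch; the POD and container names are hypothetical):
$ kubectl exec -it my-app-pod -c app-container -- /bin/sh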
Debugging Replication Controllers
Replication controllers may fail to create POD replicas as needed. To inspect events related to a specific replication controller, the following command is used:
$ kubectl describe rc ${CONTROLLER_NAME}
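Comparing the desired and current replica counts is often the fastest first check; the DESIRED, CURRENT, and READY columns in the output reveal whether the controller is keeping up (a usage sketch):
$ kubectl get rc ${CONTROLLER_NAME}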
Troubleshooting Services
Services allow for host discovery and load balancing across nodes. Several problems may cause a service to stop working correctly, making an application inaccessible.
It is important to make sure the service has its endpoints correctly configured and active. The number of endpoints should also match the number of PODs expected to be accessed through the service. To view a service's endpoints, the following command is used:
$ kubectl get endpoints <service-name>
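The service's selector can also be inspected alongside its endpoints to see exactly which labels it matches on (a sketch with a placeholder name):
$ kubectl describe service <service-name>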
If a service is missing any endpoints, the next step is to list the PODs using the service's labels. This is achieved using the command:
$ kubectl get pods --selector=name=<selector-name>,type=<type-label>
If there are PODs expected to be on the list but missing, then their labels and selectors must be edited to point to the intended service, as sketched below.
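For instance, assuming the service selects on name=myapp, a mislabeled POD could be brought into the service as follows (a hypothetical relabeling sketch):
$ kubectl label pods <pod-name> name=myapp --overwrite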
If network traffic is still not being forwarded, the administrator can check for faults in the DNS server, iptables rules, and kube-proxy.
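A common way to test DNS is to resolve a service name from inside a temporary POD, and then to confirm that the kube-proxy PODs are running (a sketch following the standard busybox approach; the image tag and the k8s-app=kube-proxy label are kubeadm-typical assumptions):
$ kubectl run -it --rm --restart=Never dns-test --image=busybox:1.28 -- nslookup <service-name>
$ kubectl get pods -n kube-system -l k8s-app=kube-proxy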
Control Plane Failure
The first step to debugging any cluster involves checking that all nodes are working correctly. This can be assessed using the command:
$ kubectl get nodes
Once the nodes are listed, each individual node's status can be checked using the command:
$ kubectl describe node <node-name>
A running node is described by several conditions, including:
- Ready: Indicates whether the node is healthy and ready to accept PODs. The condition is True when the node can accept PODs, False when it is unhealthy, and Unknown when the node controller has not heard from the node within a certain time frame (the grace period).
- DiskPressure: True if the disk capacity is too low for the PODs' needs, and False otherwise.
- MemoryPressure: True if the workload's memory needs overwhelm the node's memory capacity, and False otherwise.
- PIDPressure: True if there are too many processes running on the node, and False if not.
- NetworkUnavailable: True if the network has been configured improperly, and False otherwise.
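To pull just these conditions for a node without the rest of the description, a jsonpath query can be used (a minimal sketch):
$ kubectl get node <node-name> -o jsonpath='{range .status.conditions[*]}{.type}{"="}{.status}{"\n"}{end}'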
In some cases, one may have to look at each control plane service's specific logs for event and error information. For instance, the log file for the kube-apiserver can be found at /var/log/kube-apiserver.log. The Kube Scheduler writes its logs to /var/log/kube-scheduler.log, while the Kube Controller Manager, which runs the replication controllers, logs to /var/log/kube-controller-manager.log.
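Where these files exist on the host, the most recent entries can be inspected with standard tools (a usage sketch):
$ sudo tail -n 50 /var/log/kube-apiserver.log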
A POD's logs can be accessed explicitly using the command:
$ kubectl logs <pod-name>
The logs for the Kubernetes components running on worker nodes can also be accessed directly. Kubelet logs are located in /var/log/kubelet.log, while those of the Kube-Proxy service can be found in /var/log/kube-proxy.log.
On a node in which Kubernetes was deployed using Kubeadm, control plane components are deployed as PODs in the kube-system namespace. To list the running control plane PODs, the PODs in that namespace are listed using the command:
$ kubectl get pods -n kube-system
These components can also be deployed directly as services. Their status can be viewed by running a command similar to:
$ service <component-name> status
Event entries for every service in this namespace can also be accessed through the API server’s logs using a command of the form:
$ kubectl logs kube-apiserver-controlplane -n kube-system
If a service has been configured natively on the master node, its logs can be viewed using a logging solution deployed on the host. One such solution is journalctl, whose entries can be viewed using the command:
$ sudo journalctl -u kube-apiserver
Worker Node Failure
Just like control plane nodes, worker nodes need to be checked for health and availability. The general procedure for troubleshooting worker nodes is:
- Check that worker nodes are running
- Examine particular nodes that are not running as expected
- If a node is unavailable, check its heartbeat for the last known communication
- Compare the CPU, memory, and disk resources required by the PODs against what the nodes provide
- Check Kubelet logs and certificates for networking and authentication issues, as sketched below.
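For the last step, the Kubelet's health and recent log entries can be checked directly on the node, and on kubeadm clusters the client certificate's expiry can be verified (a sketch; the certificate path is a kubeadm-typical assumption):
$ sudo systemctl status kubelet
$ sudo journalctl -u kubelet -n 100
$ sudo openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -enddate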
Network Troubleshooting
Kubernetes supports multiple network plugins that can fail in many possible ways. At its core, a Kubernetes network relies on a bridge and IP forwarding for network communication. If the IP forwarding setting is altered in any way, the network starts to fail. The first step in troubleshooting networks is ensuring that IP forwarding and the bridge netfilter capability are on. To verify that IP forwarding is on, the following command is used:
$ sysctl net.ipv4.ip_forward
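If the value comes back as 0, forwarding can be re-enabled at runtime (a sketch; persisting the setting in /etc/sysctl.conf is also advisable):
$ sudo sysctl -w net.ipv4.ip_forward=1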
The Bridge-Netfilter capability makes bridged traffic subject to iptables rules. If this setting is disabled, PODs cannot access services outside the POD network since the destination hosts are unreachable. This can be diagnosed using the command:
$ sysctl net.bridge.bridge-nf-call-iptables
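If this returns 0, the br_netfilter kernel module may need to be loaded before the setting can be enabled (a sketch):
$ sudo modprobe br_netfilter
$ sudo sysctl -w net.bridge.bridge-nf-call-iptables=1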
Several other networking issues exist for Kubernetes clusters. Some of these include:
- POD CIDR conflicts (a quick check is sketched after this list)
- Source Destination Checks
- Firewall rules blocking overlay traffic
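For the first of these, each node's POD CIDR can be listed and compared against the node and service network ranges (a minimal sketch):
$ kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDR}{"\n"}{end}'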
Summary
Kubernetes’ large, hyperconnected environments introduce various sources of potential error. This class covers various aspects of cluster failures and how to detect them. KodeKloud’s highly practical lessons include labs and practice tests that exercise the candidate’s ability to debug Kubernetes clusters. This class helps the candidate build the knowledge needed to inspect cluster components and ensure workloads run as expected.
More details about KodeKloud’s CKA course with access to the lessons, labs, mock exams and demo can be found here – https://kodekloud.com/courses/certified-kubernetes-administrator-cka/.