CKA Troubleshooting Master Reference
Domain 4 — Cluster and Node Troubleshooting
Mental Model
Troubleshooting flow: symptom → resource → logs → fix → verify
No pods at all → k describe rs -n <ns> — quota or SA errors surface here
Pods pending → k describe pod -n <ns> -l <label> — scheduler/SA/node issues
Pods running but broken → k logs + k describe pod events
Control plane broken → journalctl -u kubelet or crictl logs
Apiserver Troubleshooting
Break Types (progressive)
| Break | Symptom | Tool |
|---|---|---|
| Bad YAML at top of manifest | No container spawns | journalctl -u kubelet |
| Unknown flag / parse error | Container fails to start | crictl logs <id> |
| Bad cert/endpoint | Starts but can’t communicate | crictl logs <id> |
Key: Bad YAML = no container = crictl logs useless → go straight to journalctl -u kubelet
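The rule above can be sketched as a small shell helper. This is a minimal sketch rather than part of the original notes: the function name `pick_log_source` is hypothetical, and it assumes stdin is a `crictl ps -a` listing with the container ID in the first column.

```shell
# Hypothetical helper: read a `crictl ps -a` listing on stdin and print
# which log command to try first for kube-apiserver.
pick_log_source() {
  # Keep the first column (container ID) of the first line mentioning kube-apiserver.
  id=$(awk '/kube-apiserver/ && !found {print $1; found=1}')
  if [ -n "$id" ]; then
    # A container exists (running or exited), so it has logs to read.
    echo "crictl logs $id"
  else
    # No container was ever created: only the kubelet can say why.
    echo "journalctl -u kubelet"
  fi
}
```

On a control plane node it would be used as `crictl ps -a | pick_log_source`.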
Move Trick (force kubelet to restart static pod)
cd /etc/kubernetes/manifests
mv kube-apiserver.yaml ..
sleep 5
mv ../kube-apiserver.yaml .
Works for kube-controller-manager and kube-scheduler too.
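Since the same four steps apply to every control plane component, they can be wrapped in one function. The sketch below is an assumption-laden convenience, not something from the original notes: `bounce_static_pod` is a hypothetical name, and the directory and wait time are parameters so it can be tried on a scratch directory first.

```shell
# Hypothetical wrapper around the move trick.
bounce_static_pod() {
  name=$1                                  # e.g. kube-apiserver
  dir=${2:-/etc/kubernetes/manifests}      # static pod manifest directory
  wait_s=${3:-5}                           # seconds to leave the manifest out
  if [ ! -f "$dir/$name.yaml" ]; then
    echo "no manifest: $dir/$name.yaml" >&2
    return 1
  fi
  # Moving the file out of the directory makes the kubelet stop the static pod ...
  mv "$dir/$name.yaml" "$dir/../$name.yaml" || return 1
  sleep "$wait_s"
  # ... and moving it back makes the kubelet create the pod again.
  mv "$dir/../$name.yaml" "$dir/$name.yaml"
}
```

Example: `bounce_static_pod kube-scheduler` uses the default directory and the 5 second wait from the steps above.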
Log Paths
# Apiserver container logs (-a also lists exited containers; head keeps a single ID)
crictl logs $(crictl ps -a | grep apiserver | awk '{print $1}' | head -n 1)
/var/log/pods/kube-system_kube-apiserver-*/kube-apiserver/*
# Kubelet logs (when no container exists)
journalctl -u kubelet | tail -50
# Verify the cert file exists
find /etc/kubernetes/pki/ | grep apiserver.crt
Kubelet Troubleshooting
Two Config Files
| File | Contains |
|---|---|
| /var/lib/kubelet/kubeadm-flags.env | Runtime flags passed to kubelet |
| /var/lib/kubelet/config.yaml | kubelet configuration (apiVersion, etc.) |
Trap: unknown flag error with NO filename in logs → always kubeadm-flags.env
Trap: commented-out #apiVersion in config.yaml → remove the #
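The second trap is easy to check for mechanically. The two functions below are a hedged sketch: their names are hypothetical, and the fix keeps a `.bak` copy instead of editing in place so the original file can be restored.

```shell
# Hypothetical check: print any commented-out apiVersion line with its line number.
# Returns non-zero when nothing matches.
find_commented_api_version() {
  file=${1:-/var/lib/kubelet/config.yaml}
  grep -nE '^[[:space:]]*#[[:space:]]*apiVersion:' "$file"
}

# Hypothetical fix: strip the leading '#' from that line, keeping a .bak copy.
uncomment_api_version() {
  file=${1:-/var/lib/kubelet/config.yaml}
  cp "$file" "$file.bak" &&
    sed 's/^[[:space:]]*#[[:space:]]*\(apiVersion:\)/\1/' "$file.bak" > "$file"
}
```

After `uncomment_api_version`, restart the kubelet so it rereads the file.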
Kubelet Restart
systemctl daemon-reload
systemctl restart kubelet
systemctl status kubelet
Deployment Troubleshooting
Pods Not Starting
k get po -n <ns> # check status
k describe po -n <ns> -l app=<label> # use label selector, not hash
k describe rs -n <ns> # quota/SA errors surface here
Common Issues
| Symptom | Cause | Fix |
|---|---|---|
| No pods created | SA not found | k get sa -n <ns> → fix serviceAccountName in deploy |
| Pods Pending | Wrong nodeName | Remove nodeName from deploy spec |
| No pods created (or fewer than desired) | ResourceQuota exceeded | k describe rs → edit quota |
| ConfigMap env error | Wrong CM name | k describe po events → fix configMapKeyRef.name |
Rule: k describe pod first for running pod issues. k describe rs for “no pods created” issues.
Label selector: k describe po -n <ns> -l app=<label> matches pods by label, so you never need the generated pod-name hash.
HPA / ResourceQuota
# HPA blocked by quota
k describe rs -n <ns> # shows quota error, NOT k describe deploy
k get quota -n <ns>
k edit quota <name> -n <ns> # update pods AND limits.cpu
k rollout restart deploy <name> -n <ns>
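Why both `pods` and `limits.cpu` need raising comes down to simple arithmetic. The helper below is only a sketch under stated assumptions: `quota_needed` is a hypothetical name, every pod in the namespace is assumed to belong to this one deployment, and all pods are assumed to share the same CPU limit. The numbers in the usage line are illustrative.

```shell
# Hypothetical helper: minimum quota values for an HPA to reach its max replicas.
# $1 = HPA maxReplicas, $2 = CPU limit per pod in millicores.
quota_needed() {
  max=$1
  cpu_m=$2
  # Each replica counts once against `pods` and adds its CPU limit to `limits.cpu`.
  echo "pods=$max limits.cpu=$((max * cpu_m))m"
}

quota_needed 5 200   # prints: pods=5 limits.cpu=1000m
```

If either quota value is below what this prints, the ReplicaSet cannot create the extra pods and the HPA stalls.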
Multi-Container / Port Conflicts
- Containers in same pod share network namespace
- Two containers cannot bind the same port
- Fix: change image or args to use different port
k logs --all-containers deploy/<name> -n <ns> > /root/logs.log
- Port config varies by image: check args vs env vs config file
crictl Commands
crictl ps # list running containers
crictl ps -a # include stopped
crictl logs <container-id> # container stdout → stdout, container stderr → stderr
crictl logs <container-id> 2>&1 # merge stderr into stdout (needed for pipes and files)
crictl rm --force <container-id> # remove (triggers DaemonSet recreation)
crictl stop <container-id> # stop only (does NOT trigger recreation)
Trap: crictl logs keeps the container's stderr on its own stderr, so > file or | grep misses it unless you add 2>&1
Trap: Use crictl rm --force not crictl stop to trigger DaemonSet restart event
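The redirection trap can be reproduced without a cluster. In this sketch `fake_logs` is a hypothetical stand-in for `crictl logs <container-id>` that writes one line to each stream. Only the shell redirection behaviour is being shown.

```shell
# Stand-in for `crictl logs <container-id>`: one line per stream.
fake_logs() {
  echo "stdout line"
  echo "stderr line" >&2
}

out_file=$(mktemp)
both_file=$(mktemp)

fake_logs > "$out_file"         # file gets stdout only; stderr still goes to the terminal
fake_logs > "$both_file" 2>&1   # 2>&1 placed after the file redirect sends both streams to the file

grep -c line "$out_file"    # prints 1
grep -c line "$both_file"   # prints 2
```

The order matters: `2>&1 > file` would point stderr at the terminal before stdout is redirected, so the file would again contain stdout only.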
Node Troubleshooting
k get node # check Ready status
k describe node <name> # check conditions, taints, capacity
ssh <node>
systemctl status kubelet
journalctl -u kubelet | tail -50
kubectl auth can-i
# Test user permissions
k auth can-i <verb> <resource> --as <username> -n <ns>
# Test SA permissions (full format required)
k auth can-i <verb> <resource> --as system:serviceaccount:<ns>:<sa-name> -n <ns>
# Examples
k auth can-i delete pods --as smoke -n ops
k auth can-i create configmaps --as system:serviceaccount:operator:resource-manager -n default
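The full ServiceAccount identity is easy to mistype, so it can help to build it from its two parts. A minimal sketch follows. The helper name `sa_user` is hypothetical and not part of kubectl.

```shell
# Hypothetical helper: build the --as value for a ServiceAccount.
sa_user() {
  ns=$1   # namespace the ServiceAccount lives in
  sa=$2   # ServiceAccount name
  printf 'system:serviceaccount:%s:%s\n' "$ns" "$sa"
}

sa_user operator resource-manager   # prints: system:serviceaccount:operator:resource-manager
```

It slots into the command above as `k auth can-i create configmaps --as "$(sa_user operator resource-manager)" -n default`. Note that the namespace inside the identity (where the SA lives) can differ from the `-n` namespace being tested.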