CKA Troubleshooting Master Reference
THE UNIVERSAL DECISION TREE
kubectl works?
├── YES → use kubectl logs, describe, get events
└── NO → SSH to node → sudo crictl ps -a → sudo crictl logs <id>
    └── check /etc/kubernetes/manifests/ for static pod break
crictl ALWAYS needs sudo — no exceptions.
SCENARIO 1 — Application Failure
Mental model: Start at the user-facing end, work inward.
External request → Service → Pod → downstream Service → downstream Pod
# 1. Hit the service externally
curl http://<node-ip>:<nodeport>
# 2. Check service → pod wiring
k get endpoints <svc> -n <ns> # <none> = selector mismatch
k get pods -n <ns> --show-labels # compare labels vs svc selector
k edit svc <svc> -n <ns> # fix selector
# 3. Check pod health
k get pods -n <ns> # STATUS + RESTARTS
k describe pod <pod> -n <ns> # Events section
k logs <pod> -n <ns> # current logs
k logs <pod> -n <ns> --previous # crash logs (CrashLoopBackOff)
# 4. Check targetPort vs containerPort
k describe svc <svc> -n <ns> # TargetPort field
k describe pod <pod> -n <ns> # containerPort field
# 5. Cluster-wide events
k get events --sort-by=.metadata.creationTimestamp -n <ns>
Pod status decoder:
| Status | Meaning | Fix |
|---|---|---|
| CrashLoopBackOff | Keeps crashing | logs --previous |
| ImagePullBackOff | Can’t pull image | Check name/tag/registry secret |
| OOMKilled | Hit memory limit | Increase limit or fix leak |
| Pending | Not scheduled | Check resources, taints, PVC |
| Running | Started, NOT necessarily healthy | Check RESTARTS + logs |
Traps:
- Running ≠ healthy: always check RESTARTS
- Selector mismatch is the #1 service bug
- targetPort must match the container’s containerPort
- Use --previous after a crash: current logs start fresh
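The selector comparison from step 2 can be scripted rather than eyeballed. This is a minimal sketch, assuming you paste in the SELECTOR column from `k get svc <svc> -n <ns> -o wide` and one pod's label string from `k get pods -n <ns> --show-labels`. The function name `selector_matches` is hypothetical, not a kubectl feature.

```shell
# selector_matches SELECTOR LABELS
# Both arguments are comma-separated key=value lists.
# Succeeds only if every selector pair is present in the pod labels.
selector_matches() {
  local selector="$1" labels="$2" pair
  local IFS=','
  # An empty selector selects no pods, so treat it as a mismatch.
  [ -n "$selector" ] || { echo "empty selector"; return 1; }
  for pair in $selector; do
    case ",$labels," in
      *",$pair,"*) ;;                       # pair found in labels
      *) echo "missing: $pair"; return 1 ;;
    esac
  done
  return 0
}
```

For example, `selector_matches "app=web" "app=web,pod-template-hash=abc12"` succeeds, while a typo such as `app=wbe` in the selector prints the pair that no pod carries.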
SCENARIO 2 — Control Plane Failure
Mental model: Cluster management is broken. kubectl may not work. etcd sick = apiserver sick even if pod shows Running.
# 1. Check nodes and kube-system pods
k get nodes
k get pods -n kube-system
# 2. kubectl logs (kubeadm clusters — static pods)
k logs kube-apiserver-<node> -n kube-system
k logs kube-controller-manager-<node> -n kube-system
k logs kube-scheduler-<node> -n kube-system
k logs etcd-<node> -n kube-system
# 3. journalctl (hard-way / binary clusters)
journalctl -u kube-apiserver
journalctl -u kube-controller-manager
journalctl -u kube-scheduler
journalctl -u etcd
# 4. kubectl is DEAD — drop to crictl
sudo crictl ps -a
sudo crictl logs <container-id>
# 5. Static pod manifests (kubeadm) — edit here, kubelet auto-restarts
ls /etc/kubernetes/manifests/
# kube-apiserver.yaml
# kube-controller-manager.yaml
# kube-scheduler.yaml
# etcd.yaml
# 6. Fix the manifest
vim /etc/kubernetes/manifests/kube-apiserver.yaml
# 7. If kubelet doesn't pick up the change — move out and back
mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/
# 8. Watch recovery
sudo watch crictl ps # crictl needs root; wait for container to appear
k get pods -n kube-system # verify
# 9. Useful log path when crictl logs is unavailable
cat /var/log/pods/kube-system_kube-apiserver-*/kube-apiserver/*
find /etc/kubernetes/pki/ | grep apiserver.crt # verify cert paths
Three progressive break patterns (drilled on KillerCoda):
| Break | Symptom | Where to look |
|---|---|---|
| Bad flag/line at bottom of kube-apiserver.yaml | Container crashes and restarts | sudo crictl ps -a → sudo crictl logs <id> — shows flag/parse error |
| Bad etcd endpoint/cert line | apiserver starts but can’t reach etcd — “unable to communicate” | sudo crictl logs <id> — shows connection refused or cert error |
| Bad pod YAML at top | kubelet can’t parse manifest — NO container spawns at all, crictl shows nothing | journalctl -u kubelet — shows parse error; crictl is useless here |
Key insight: Bad YAML = no container = crictl logs won’t help. Go straight to journalctl -u kubelet.
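The decision tree and the three break patterns reduce to two yes/no questions. This sketch encodes them as a lookup. The function name `next_step` is hypothetical, and the two probe commands in the comments are one possible way to answer each question, not the only one.

```shell
# next_step KUBECTL_OK CONTAINER_LISTED   (each argument is "yes" or "no")
# Prints where to look next.
#
# One way to answer the questions on a control plane node:
#   kubectl get nodes >/dev/null 2>&1            -> KUBECTL_OK
#   sudo crictl ps -a --name kube-apiserver -q   -> non-empty output means CONTAINER_LISTED
next_step() {
  if [ "$1" = yes ]; then
    echo "kubectl logs / describe / get events"
  elif [ "$2" = yes ]; then
    echo "sudo crictl logs <container-id>"
  else
    # No container at all: the kubelet could not turn the manifest into a pod.
    echo "journalctl -u kubelet"
  fi
}
```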
Traps:
- kubeadm = static pods = kubectl logs; hard-way = systemd = journalctl
- etcd down → apiserver degraded even if apiserver container is Running
- kube-proxy is a DaemonSet pod, NOT a systemd service
- kubectl dead → crictl on the node, not on your laptop
SCENARIO 3 — Worker Node Failure
Mental model: Node goes NotReady or Unknown. Control plane intact — kubectl works. Unknown = lost heartbeat (check OS). NotReady = OS up but kubelet has a problem.
# 1. From control plane
k get nodes
k describe node <node> # Conditions + LastHeartbeatTime
# 2. SSH to the node
ssh <node>
# 3. Check resources
df -h && free -m && top
# 4. Kubelet
systemctl status kubelet
journalctl -u kubelet # look for the actual error
sudo systemctl restart kubelet
sudo systemctl enable kubelet # ← EXAM TRAP: must enable or it won't survive reboot
# 5. After fix — verify from control plane
k get nodes # should flip to Ready
Node conditions (concern when True):
MemoryPressure | DiskPressure | PIDPressure | NetworkUnavailable | Ready=False/Unknown
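The "concern when True, except Ready" rule is easy to misread in `k describe node` output. This sketch filters a `Type=Status` list down to the conditions worth attention. The function name `flag_bad_conditions` and the `Type=Status` line format are assumptions of this example; the jsonpath in the comment is one way to produce that format.

```shell
# flag_bad_conditions: reads "Type=Status" lines on stdin and prints only the
# worrying ones: Ready when it is not True, any other condition when it is True.
#
# Feed it with:
#   kubectl get node <node> \
#     -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
flag_bad_conditions() {
  awk -F= '
    $1 == "Ready" && $2 != "True" { print; next }
    $1 != "Ready" && $2 == "True" { print }
  '
}
```

Empty output means every condition is in its healthy state.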
Traps:
- Unknown = lost heartbeat: check if OS is even up first
- NotReady = OS up, kubelet broken: check logs
- Missing systemctl enable kubelet = node reboots, goes NotReady again
- Expired kubelet cert = can’t authenticate to apiserver → NotReady
Kubelet cert check:
openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -text -noout
# Check: Not After, O=system:nodes, CN=system:node:<nodename>
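Reading Not After by eye works, but `openssl x509` can also answer the expiry question directly with its `-checkend` option. A minimal sketch follows; the function name `cert_valid_for` and the 30-day default are assumptions of this example.

```shell
# cert_valid_for CERT [DAYS]
# Succeeds if the certificate is still valid DAYS from now (default 30).
# Fails if it has expired or will expire inside that window.
cert_valid_for() {
  local cert="$1" days="${2:-30}"
  openssl x509 -in "$cert" -noout -checkend "$((days * 86400))" >/dev/null
}

# Usage on a node (reading the file may need root):
#   cert_valid_for /var/lib/kubelet/pki/kubelet-client-current.pem 0 \
#     || echo "kubelet client cert has expired"
```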
SCENARIO 4 — Network Troubleshooting
Mental model: Three layers — CNI (pod-to-pod), kube-proxy (service routing), CoreDNS (name resolution). Test pod-to-pod first to isolate the layer.
# LAYER 1 — CNI (pod-to-pod)
k get pods -n kube-system | grep -i <cni> # weave, calico, flannel
k logs <cni-pod> -n kube-system
ls /etc/cni/net.d/
# Test pod-to-pod directly (bypass service)
k get pods -o wide # get pod IP
k run tmp -it --rm --restart=Never --image=busybox -- sh
# inside: wget -qO- <pod-ip>:<port>
# pod-to-pod FAILS → CNI problem
# pod-to-pod WORKS → service/kube-proxy problem
# LAYER 2 — kube-proxy (service routing)
k get pods -n kube-system -l k8s-app=kube-proxy
k logs <kube-proxy-pod> -n kube-system
k get configmap kube-proxy -n kube-system -o yaml # check mode, clusterCIDR
# LAYER 3 — CoreDNS
k get pods -n kube-system -l k8s-app=kube-dns
k get endpoints kube-dns -n kube-system # empty = selector mismatch
k describe configmap coredns -n kube-system # check Corefile
# Test DNS from a pod
k exec -it <pod> -- nslookup <svc>.<ns>.svc.cluster.local
k exec -it <pod> -- cat /etc/resolv.conf
Traps:
- No CNI = pods can’t communicate even if Running
- kube-proxy down = services don’t route even if pods healthy
- CoreDNS endpoints empty = selector mismatch on kube-dns service, or no CoreDNS pod is Ready
- Test pod-to-pod first — isolates the layer before assuming service bug
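For the DNS layer, a quick sanity check is whether the pod's resolv.conf points at the kube-dns Service at all. This sketch compares the two. The function names are hypothetical, and it assumes you have saved the pod's `/etc/resolv.conf` to a local file first.

```shell
# resolv_nameservers [FILE]: print every nameserver IP in a resolv.conf file.
resolv_nameservers() {
  awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# dns_points_at EXPECTED_IP [FILE]: succeed if EXPECTED_IP is one of the nameservers.
dns_points_at() {
  resolv_nameservers "${2:-/etc/resolv.conf}" | grep -qxF "$1"
}

# Usage:
#   kubectl exec <pod> -- cat /etc/resolv.conf > /tmp/pod-resolv.conf
#   dns_ip="$(kubectl get svc kube-dns -n kube-system -o jsonpath='{.spec.clusterIP}')"
#   dns_points_at "$dns_ip" /tmp/pod-resolv.conf || echo "pod is not using cluster DNS"
```

A mismatch here points at the kubelet's clusterDNS setting or the pod's dnsPolicy rather than at CoreDNS itself.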
SCENARIO 5 — HPA Not Scaling / Deployment Can’t Reach Replica Count
Mental model: HPA wants more pods but something is blocking new pod creation. Check ReplicaSet events — not the Deployment — that’s where the scheduler/quota errors surface.
# 1. Check HPA status
k get hpa -n <ns>
k describe hpa <name> -n <ns>
# 2. Check ReplicaSet events (NOT deployment)
k describe rs -n <ns> # events show the actual error
# 3. If ResourceQuota is the blocker
k get quota -n <ns>
k describe quota <name> -n <ns> # shows hard limits vs used
# 4. Edit the quota to allow enough pods
k edit quota <name> -n <ns>
# spec:
# hard:
# pods: "20" ← update to HPA maximum or higher
# limits.cpu: "10" ← may also need updating if CPU quota is blocking
# 5. Optionally restart deployment for faster scaling
k rollout restart deployment <name> -n <ns>
Traps:
- k describe deployment won’t show quota errors: check the ReplicaSet
- Quota blocks at namespace level: even if the node has capacity, the API server rejects the new pods, so they are never created
- Set quota pods to at least the HPA maxReplicas value
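The quota arithmetic is worth doing explicitly, because the pods quota counts every pod in the namespace, not only the ones the HPA manages. This is a sketch under that assumption. The function names are hypothetical, and the jsonpath expressions in the comments are one way to fetch the inputs.

```shell
# min_pods_quota OTHER_PODS MAX_REPLICAS
# Smallest "pods" quota that still lets the HPA reach its maximum.
min_pods_quota() {
  echo $(( $1 + $2 ))
}

# quota_blocks_hpa HARD_PODS OTHER_PODS MAX_REPLICAS
# Succeeds (exit 0) when the current hard limit is too low.
quota_blocks_hpa() {
  [ "$1" -lt "$(min_pods_quota "$2" "$3")" ]
}

# Inputs can be read with:
#   kubectl get quota <name> -n <ns> -o jsonpath='{.status.hard.pods}'
#   kubectl get hpa <name> -n <ns> -o jsonpath='{.spec.maxReplicas}'
# OTHER_PODS is the number of pods in the namespace outside the scaled deployment.
```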
Verify HPA is working — trigger CPU load:
# Exec into pods and run stress (image must support it e.g. polinux/stress)
k exec <pod1> -n <ns> -- stress --cpu 1 &
k exec <pod2> -n <ns> -- stress --cpu 1 &
k exec <pod3> -n <ns> -- stress --cpu 1 &
# Check CPU usage
k top pod -n <ns>
# Watch HPA scale
k get hpa <name> -n <ns> -w
SCENARIO 6 — Multi-Container Issues
Part A — Gather logs from all containers
# Per container (append to same file)
k logs deploy/<name> -c <container1> -n <ns> >> /root/logs.log
k logs deploy/<name> -c <container2> -n <ns> >> /root/logs.log
# Or all containers in one shot
k logs --all-containers deploy/<name> -n <ns> > /root/logs.log
Trap: > overwrites, >> appends. When gathering multiple containers separately use >>.
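One way to avoid the overwrite mistake is to truncate the file once and append inside a loop. A minimal sketch follows; `gather_logs` and `kube_fetch` are hypothetical helper names, and the `DEPLOY` and `NS` variables are assumed to be set by you.

```shell
# gather_logs OUTFILE FETCH_CMD CONTAINER...
# Truncates OUTFILE once, then appends the output of FETCH_CMD for each container.
gather_logs() {
  local out="$1" fetch="$2" c
  shift 2
  : > "$out"                    # single truncate, like one leading ">"
  for c in "$@"; do
    "$fetch" "$c" >> "$out"     # every container appends
  done
}

# Example fetch command for a deployment; expects DEPLOY and NS to be set.
kube_fetch() {
  kubectl logs "deploy/$DEPLOY" -c "$1" -n "$NS"
}

# Usage:
#   DEPLOY=<name> NS=<ns> gather_logs /root/logs.log kube_fetch <container1> <container2>
```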
Part B — Port conflict between containers
Mental model: Two containers in same pod share the network namespace — they share localhost. If both try to listen on port 80, one wins, one crashes. First to start wins.
# Diagnose
k logs deploy/<name> -c <container> -n <ns> # look for "address already in use"
# Fix — edit deployment, change conflicting container image/port
k edit deploy <name> -n <ns>
# spec.template.spec.containers[1]:
# image: traefik/whoami:v1.11
# args:
# - --port
# - "8080"
# Verify
k get deploy <name> -n <ns>
Trap: Containers in the same pod share the network namespace — port conflicts are possible and will cause CrashLoopBackOff.
SCENARIO 7 — Control Plane Component: Unknown Flag
Mental model: Static pod is crashing due to unknown/bad flag in the manifest. kubectl logs works if the container is cycling — check logs first, then edit manifest.
# 1. Check pod status
k get pods -n kube-system
# 2. Check logs (pod name includes node name)
k logs kube-controller-manager-<node> -n kube-system
# Error: unknown flag: --some-bad-flag
# 3. Fix the manifest — remove the bad flag
vim /etc/kubernetes/manifests/kube-controller-manager.yaml
# 4. Force restart if needed (move trick)
cd /etc/kubernetes/manifests
mv kube-controller-manager.yaml ..
sleep 5
mv ../kube-controller-manager.yaml .
# 5. Wait for pod to come back
sudo watch crictl ps
k get pods -n kube-system
Trap: Same move trick applies to any static pod — scheduler, controller-manager, etcd.
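The move trick can be wrapped in a helper so the manifest is never left outside the manifests directory by accident. This is a sketch, not a kubeadm feature. The function name `bounce_manifest` is hypothetical, and the 5-second default only mirrors the `sleep 5` above.

```shell
# bounce_manifest MANIFEST [WAIT_SECONDS]
# Moves a static pod manifest to a temporary directory, waits so the kubelet
# notices the removal, then moves it back to its original path.
bounce_manifest() {
  local manifest="$1" wait_secs="${2:-5}" parking name
  [ -f "$manifest" ] || { echo "no such manifest: $manifest" >&2; return 1; }
  name="$(basename "$manifest")"
  parking="$(mktemp -d)" || return 1
  mv "$manifest" "$parking/" || return 1
  sleep "$wait_secs"
  mv "$parking/$name" "$manifest" && rmdir "$parking"
}

# Usage (as root on the control plane node):
#   bounce_manifest /etc/kubernetes/manifests/kube-controller-manager.yaml
```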
SCENARIO 8 — Kubelet Broken: Two Errors in Different Files
Mental model: Kubelet won’t start — fix first error, restart, check logs again. There may be a second error in a DIFFERENT config file. Fix-restart-check loop until clean.
# 1. SSH to node
ssh node01
# 2. Check kubelet status
systemctl status kubelet
# 3. Check logs
journalctl -u kubelet
# Error 1: unknown flag in kubeadm flags env file
# Fix: edit /var/lib/kubelet/kubeadm-flags.env
vim /var/lib/kubelet/kubeadm-flags.env
# KUBELET_KUBEADM_ARGS="" ← remove the bad flag from this string
# 4. Restart and check again
systemctl restart kubelet
journalctl -u kubelet
# Error 2: apiVersion commented out in config.yaml
# Fix: edit /var/lib/kubelet/config.yaml
vim /var/lib/kubelet/config.yaml
# #apiVersion: kubelet.config.k8s.io/v1beta1 ← uncomment this
# 5. Restart and verify
systemctl restart kubelet
systemctl status kubelet
Kubelet config files:
| File | Purpose |
|---|---|
| /var/lib/kubelet/kubeadm-flags.env | Flags passed to kubelet by kubeadm |
| /var/lib/kubelet/config.yaml | Kubelet configuration (apiVersion, clusterDNS, etc.) |
Traps:
- One error fixed doesn’t mean kubelet is clean — always restart and check logs again
- Two errors can be in two different files
- #apiVersion commented out = kubelet can’t parse its own config
- unknown flag: --xxx with NO file name in the error → it’s in /var/lib/kubelet/kubeadm-flags.env (memorize this)
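Removing the bad flag by hand in vim is fine, but a scripted edit keeps a backup and avoids breaking the quoting of the `KUBELET_KUBEADM_ARGS` string. A minimal sketch: the function name `remove_kubelet_flag` is hypothetical, and it assumes the flag name contains no regex metacharacters and is not a prefix of another flag on the same line.

```shell
# remove_kubelet_flag FILE FLAG
# Deletes FLAG (with an optional =value) from the args string in FILE.
# The original is kept next to it as FILE.bak.
remove_kubelet_flag() {
  local file="$1" flag="$2"
  cp "$file" "$file.bak" || return 1
  sed -E "s/[[:space:]]*${flag}(=[^[:space:]\"]*)?//" "$file.bak" > "$file"
}

# Usage (as root on the node), with the flag name taken from the journalctl error:
#   remove_kubelet_flag /var/lib/kubelet/kubeadm-flags.env --some-bad-flag
#   systemctl restart kubelet
```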
SCENARIO 9 — Applications Misconfigured (3-Task Bundle)
Mental model: Three independent app-level misconfigurations bundled in one scenario. Cluster itself is healthy — each task is a separate pod/deployment spec error. Diagnose-fix-verify each independently.
# Task 1 — Wrong ConfigMap reference
k get cm -n <ns> # find actual CM name
k describe pod <pod> -n <ns> # Events show CreateContainerConfigError
# Fix: edit deploy → spec.template.spec.containers[].env[].valueFrom.configMapKeyRef.name
# Task 2 — Hardcoded nodeName blocking scheduling
k get deploy <name> -n <ns> -o yaml | grep nodeName
# Fix: remove spec.template.spec.nodeName entirely — let scheduler place it
# Task 3 — Wrong ServiceAccount name
k get sa -n <ns> # find actual SA name
k describe rs -n <ns> # no pod is created; RS Events show serviceaccount not found
# Fix: edit deploy → spec.template.spec.serviceAccountName
New diagnostic patterns locked:
- No pods created at all → k describe rs -n <ns>, not k describe pod
- Pods stuck Pending → k describe pod -n <ns> -l app=<label> (the label selector avoids the hash-suffix problem)
- For app-level issues: describe pod / describe rs > describe deploy, because deploy events rarely show the root cause
Traps:
- CreateContainerConfigError = ConfigMap/Secret reference mismatch: check actual names with k get cm / k get secret
- A hardcoded nodeName bypasses the scheduler entirely: if that node can’t run it, pod sits Pending forever with no scheduler events
- ServiceAccount typos produce no obvious error in k get pods: must check Events
LOG LOCATIONS QUICK REF
# On any node — readable even without crictl
ls /var/log/pods/
ls /var/log/containers/
# Kubelet
journalctl -u kubelet
grep kubelet /var/log/syslog # fallback
# Control plane (kubeadm)
k logs kube-apiserver-<node> -n kube-system
k logs kube-scheduler-<node> -n kube-system
k logs kube-controller-manager-<node> -n kube-system
k logs etcd-<node> -n kube-system
# When kubectl is dead
sudo crictl ps -a
sudo crictl logs <container-id>
PATCH COMMANDS
# Generic patch — strategic merge
k patch <resource> <name> -n <ns> --patch '{"spec": {"field": "value"}}'
# Patch deployment image
k patch deployment <name> -n <ns> \
--patch '{"spec":{"template":{"spec":{"containers":[{"name":"<cname>","image":"<image>"}]}}}}'
# Patch with type flag
k patch deployment <name> -n <ns> --type=merge \
--patch '{"spec":{"replicas":3}}'
# Patch strategic (default) — merges arrays by key
k patch deployment <name> -n <ns> --type=strategic \
--patch '{"spec":{"template":{"spec":{"containers":[{"name":"<cname>","resources":{"requests":{"cpu":"100m"}}}]}}}}'
# Patch JSON patch — precise array index ops
k patch deployment <name> -n <ns> --type=json \
--patch '[{"op":"replace","path":"/spec/replicas","value":3}]'
# Patch a node label
k patch node <node> --type=merge --patch '{"metadata":{"labels":{"disk":"ssd"}}}'
# Patch service type
k patch svc <svc> -n <ns> --type=merge \
--patch '{"spec":{"type":"ClusterIP"}}'
Patch types:
| Type | Use case |
|---|---|
| strategic (default) | Deployments, pods — merges arrays by name key |
| merge | Simple field updates — replaces arrays entirely |
| json | Precise path-based ops — add, remove, replace, move |
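Most patch failures under time pressure are quoting or brace-matching mistakes in the JSON. One option is to generate the patch body with `printf` and pass it in. This sketch builds the strategic-merge image patch shown above. The function name `image_patch` is hypothetical, and it assumes the container name and image contain no double quotes or backslashes.

```shell
# image_patch CONTAINER_NAME IMAGE
# Prints a strategic-merge patch that sets the image of one named container.
image_patch() {
  printf '{"spec":{"template":{"spec":{"containers":[{"name":"%s","image":"%s"}]}}}}' "$1" "$2"
}

# Usage:
#   kubectl patch deployment <name> -n <ns> --patch "$(image_patch <cname> <image>)"
```

Because the default strategic type merges the containers array by name, other containers in the pod template are left untouched.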
EXAM REFLEXES — CHEAT SHEET
kubectl dead? → SSH → sudo crictl ps -a → sudo crictl logs <id>
Node NotReady? → SSH → systemctl status kubelet → journalctl -u kubelet → restart + enable
App not reachable? → k get endpoints → check selector → check targetPort
Pod CrashLoopBackOff? → k logs --previous
Static pod broken? → /etc/kubernetes/manifests/<component>.yaml → edit → sudo watch crictl ps
DNS broken? → k get endpoints kube-dns -n kube-system → check CoreDNS pods
Pod-to-pod fails? → CNI layer → k get pods -n kube-system | grep -i <cni> (weave, calico, flannel)
Time-box rule: Any question > 10 minutes → flag and move. Return at end.