CKA Troubleshooting Master Reference


THE UNIVERSAL DECISION TREE

kubectl works?
├── YES → use kubectl logs, describe, get events
└── NO → SSH to node → sudo crictl ps -a → sudo crictl logs <id>
 └── check /etc/kubernetes/manifests/ for static pod break

crictl needs root to reach the container runtime socket, so run it with sudo unless you are already root.
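
The first branch of the tree can be scripted. This is a minimal sketch, assuming the API server exposes the standard /readyz endpoint; the function name which_path is hypothetical and not part of any tool.

```shell
# Hypothetical helper: picks a branch of the decision tree above.
# Probes the API server with a short timeout so a dead control
# plane does not hang the shell.
which_path() {
  if kubectl --request-timeout=5s get --raw='/readyz' >/dev/null 2>&1; then
    echo "kubectl works: use kubectl logs, describe, get events"
  else
    echo "kubectl dead: SSH to the node, then sudo crictl ps -a"
  fi
}
```

Call which_path with no arguments. The short timeout matters because a broken apiserver often hangs rather than refusing the connection.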


SCENARIO 1 — Application Failure

Mental model: Start at the user-facing end, work inward. External request → Service → Pod → downstream Service → downstream Pod

# 1. Hit the service externally
curl http://<node-ip>:<nodeport>

# 2. Check service → pod wiring
k get endpoints <svc> -n <ns> # <none> = selector mismatch
k get pods -n <ns> --show-labels # compare labels vs svc selector
k edit svc <svc> -n <ns> # fix selector

# 3. Check pod health
k get pods -n <ns> # STATUS + RESTARTS
k describe pod <pod> -n <ns> # Events section
k logs <pod> -n <ns> # current logs
k logs <pod> -n <ns> --previous # crash logs (CrashLoopBackOff)

# 4. Check targetPort vs containerPort
k describe svc <svc> -n <ns> # TargetPort field
k describe pod <pod> -n <ns> # containerPort field

# 5. Events in the namespace, oldest first (use -A instead of -n <ns> for cluster-wide)
k get events --sort-by=.metadata.creationTimestamp -n <ns>

Pod status decoder:

Status           | Meaning                          | Fix
CrashLoopBackOff | Keeps crashing                   | logs --previous
ImagePullBackOff | Can’t pull image                 | Check name/tag/registry secret
OOMKilled        | Hit memory limit                 | Increase limit or fix leak
Pending          | Not scheduled                    | Check resources, taints, PVC
Running          | Started, NOT necessarily healthy | Check RESTARTS + logs

Traps:

  • Running ≠ healthy — always check RESTARTS
  • Selector mismatch is the #1 service bug
  • targetPort must match container’s containerPort
  • Use --previous after a crash — current logs start fresh
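
The selector check in step 2 can be done without reading labels by eye. This is a rough sketch: it turns the Service's selector into a label query and lists the pods that match. The function name svc_pods is an assumption made for illustration.

```shell
# Hypothetical helper: list the pods a Service's selector actually matches.
# An empty pod list means the selector and the pod labels disagree.
svc_pods() {
  svc=$1; ns=$2
  # Render the selector map as key=value pairs joined by commas.
  sel=$(kubectl get svc "$svc" -n "$ns" \
    -o go-template='{{range $k,$v := .spec.selector}}{{$k}}={{$v}},{{end}}')
  sel=${sel%,}                      # drop the trailing comma
  if [ -z "$sel" ]; then
    echo "service has no selector"
    return 1
  fi
  echo "selector: $sel"
  kubectl get pods -n "$ns" -l "$sel" --show-labels
}
```

Usage: svc_pods <svc> <ns>. If no pods come back, compare the printed selector against k get pods --show-labels and fix whichever side is wrong.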

SCENARIO 2 — Control Plane Failure

Mental model: Cluster management is broken. kubectl may not work. etcd sick = apiserver sick even if pod shows Running.

# 1. Check nodes and kube-system pods
k get nodes
k get pods -n kube-system

# 2. kubectl logs (kubeadm clusters — static pods)
k logs kube-apiserver-<node> -n kube-system
k logs kube-controller-manager-<node> -n kube-system
k logs kube-scheduler-<node> -n kube-system
k logs etcd-<node> -n kube-system

# 3. journalctl (hard-way / binary clusters)
journalctl -u kube-apiserver
journalctl -u kube-controller-manager
journalctl -u kube-scheduler
journalctl -u etcd

# 4. kubectl is DEAD — drop to crictl
sudo crictl ps -a
sudo crictl logs <container-id>

# 5. Static pod manifests (kubeadm) — edit here, kubelet auto-restarts
ls /etc/kubernetes/manifests/
# kube-apiserver.yaml
# kube-controller-manager.yaml
# kube-scheduler.yaml
# etcd.yaml

# 6. Fix the manifest
vim /etc/kubernetes/manifests/kube-apiserver.yaml

# 7. If kubelet doesn't pick up the change — move out and back
mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/

# 8. Watch recovery
sudo watch crictl ps # wait for container to appear
k get pods -n kube-system # verify

# 9. Useful log path when crictl logs is unavailable
cat /var/log/pods/kube-system_kube-apiserver-*/kube-apiserver/*
find /etc/kubernetes/pki/ | grep apiserver.crt # verify cert paths

Three progressive break patterns (drilled on KillerCoda):

Break | Symptom | Where to look
Bad flag/line at bottom of kube-apiserver.yaml | Container crashes and restarts | sudo crictl ps -a, then sudo crictl logs <id>: shows flag/parse error
Bad etcd endpoint/cert line | apiserver starts but can’t reach etcd (“unable to communicate”) | sudo crictl logs <id>: shows connection refused or cert error
Bad pod YAML at top | kubelet can’t parse manifest: NO container spawns at all, crictl shows nothing | journalctl -u kubelet: shows parse error; crictl is useless here

Key insight: Bad YAML = no container = crictl logs won’t help. Go straight to journalctl -u kubelet.
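
That rule (container exists → crictl logs, no container → kubelet journal) can be captured in one function. This is a sketch only: static_pod_triage is a hypothetical name, and taking the first ID assumes crictl lists the most recent container first.

```shell
# Hypothetical helper: decide where to read errors for a static pod.
# Run on the control-plane node.
static_pod_triage() {
  name=$1
  # -a includes exited containers; -q prints IDs only.
  id=$(sudo crictl ps -a --name "$name" -q | head -n 1)
  if [ -n "$id" ]; then
    echo "container $id found: reading its logs"
    sudo crictl logs --tail 20 "$id"
  else
    echo "no container for $name: kubelet could not start it"
    sudo journalctl -u kubelet --no-pager -n 50
  fi
}
```

Usage: static_pod_triage kube-apiserver. The second branch corresponds to the "bad pod YAML" row, where the manifest never becomes a container.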

Traps:

  • kubeadm = static pods = kubectl logs; hard-way = systemd = journalctl
  • etcd down → apiserver degraded even if apiserver container is Running
  • kube-proxy is a DaemonSet pod, NOT a systemd service
  • kubectl dead → crictl on the node, not on your laptop

SCENARIO 3 — Worker Node Failure

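Mental model: Node goes NotReady or Unknown. Control plane intact, so kubectl works. Unknown = lost heartbeat (check OS). NotReady = OS up but kubelet has a problem. k get nodes prints NotReady in both cases; the Ready condition in k describe node tells them apart (False vs Unknown).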

# 1. From control plane
k get nodes
k describe node <node> # Conditions + LastHeartbeatTime

# 2. SSH to the node
ssh <node>

# 3. Check resources
df -h && free -m && top

# 4. Kubelet
systemctl status kubelet
journalctl -u kubelet # look for the actual error
sudo systemctl restart kubelet
sudo systemctl enable kubelet # ← EXAM TRAP: must enable or it won't survive reboot

# 5. After fix — verify from control plane
k get nodes # should flip to Ready

Node conditions (concern when True): MemoryPressure | DiskPressure | PIDPressure | NetworkUnavailable. Ready is the reverse: concern when False or Unknown.

Traps:

  • Unknown = lost heartbeat — check if OS is even up first
  • NotReady = OS up, kubelet broken — check logs
  • Missing systemctl enable kubelet = node reboots, goes NotReady again
  • Expired kubelet cert = can’t authenticate to apiserver → NotReady

Kubelet cert check:

openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -text -noout
# Check: Not After, O=system:nodes, CN=system:node:<nodename>
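
Reading Not After by eye is easy to get wrong under time pressure. A small sketch using openssl's -checkend option gives a yes/no answer instead; cert_status is a hypothetical helper name.

```shell
# Hypothetical helper: reports whether a certificate is still valid.
# -checkend 0 exits non-zero if the certificate has already expired.
cert_status() {
  cert=$1
  if openssl x509 -in "$cert" -noout -checkend 0 >/dev/null 2>&1; then
    echo "valid"
  else
    echo "expired or unreadable"
  fi
  # Print the subject (O, CN) and expiry date for a manual check.
  openssl x509 -in "$cert" -noout -subject -enddate 2>/dev/null
}
```

Usage, as root on the node: cert_status /var/lib/kubelet/pki/kubelet-client-current.pem.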

SCENARIO 4 — Network Troubleshooting

Mental model: Three layers — CNI (pod-to-pod), kube-proxy (service routing), CoreDNS (name resolution). Test pod-to-pod first to isolate the layer.

# LAYER 1 — CNI (pod-to-pod)
k get pods -n kube-system | grep -i <cni> # weave, calico, flannel
k logs <cni-pod> -n kube-system
ls /etc/cni/net.d/

# Test pod-to-pod directly (bypass service)
k get pods -o wide # get pod IP
k run tmp -it --rm --restart=Never --image=busybox -- sh
# inside: wget -qO- <pod-ip>:<port>
# pod-to-pod FAILS → CNI problem
# pod-to-pod WORKS → service/kube-proxy problem

# LAYER 2 — kube-proxy (service routing)
k get pods -n kube-system -l k8s-app=kube-proxy
k logs <kube-proxy-pod> -n kube-system
k get configmap kube-proxy -n kube-system -o yaml # check mode, clusterCIDR

# LAYER 3 — CoreDNS
k get pods -n kube-system -l k8s-app=kube-dns
k get endpoints kube-dns -n kube-system # empty = selector mismatch
k describe configmap coredns -n kube-system # check Corefile

# Test DNS from a pod
k exec -it <pod> -- nslookup <svc>.<ns>.svc.cluster.local
k exec -it <pod> -- cat /etc/resolv.conf

Traps:

  • No CNI = pods can’t communicate even if Running
  • kube-proxy down = services don’t route even if pods healthy
  • CoreDNS endpoints empty = selector mismatch on kube-dns service
  • Test pod-to-pod first — isolates the layer before assuming service bug
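
The layer-by-layer order can be sketched as a single function to paste into the temporary busybox shell. isolate_layer is a hypothetical name, and the diagnosis is only a first guess: a failed pod-to-pod probe can also mean the app is not listening or a NetworkPolicy is blocking traffic.

```shell
# Hypothetical helper, meant to be pasted into the busybox shell.
# Walks the three layers in order and names the first one that fails.
isolate_layer() {
  pod_ip=$1; svc_ip=$2; svc_dns=$3; port=$4
  # -T 3 keeps a dead target from hanging the probe.
  probe() { wget -qO- -T 3 "http://$1:$port" >/dev/null 2>&1; }
  probe "$pod_ip"  || { echo "pod-to-pod failed: suspect CNI"; return 1; }
  probe "$svc_ip"  || { echo "ClusterIP failed: suspect Service or kube-proxy"; return 1; }
  probe "$svc_dns" || { echo "DNS name failed: suspect CoreDNS"; return 1; }
  echo "all three layers OK"
}
```

Usage: isolate_layer <pod-ip> <cluster-ip> <svc>.<ns>.svc.cluster.local <port>.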

SCENARIO 5 — HPA Not Scaling / Deployment Can’t Reach Replica Count

Mental model: HPA wants more pods but something is blocking new pod creation. Check ReplicaSet events, not the Deployment: pod-creation failures such as quota rejections surface there.

# 1. Check HPA status
k get hpa -n <ns>
k describe hpa <name> -n <ns>

# 2. Check ReplicaSet events (NOT deployment)
k describe rs -n <ns> # events show the actual error

# 3. If ResourceQuota is the blocker
k get quota -n <ns>
k describe quota <name> -n <ns> # shows hard limits vs used

# 4. Edit the quota to allow enough pods
k edit quota <name> -n <ns>
# spec:
#   hard:
#     pods: "20"        ← update to HPA maximum or higher
#     limits.cpu: "10"  ← may also need updating if CPU quota is blocking

# 5. Optionally restart deployment for faster scaling
k rollout restart deployment <name> -n <ns>

Traps:

  • k describe deployment won’t show quota errors — check the ReplicaSet
  • Quota blocks at namespace level: even if nodes have capacity, the API server rejects new pods, so they are never created
  • Set quota pods to at least the HPA maxReplicas value
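
The last trap can be checked mechanically. A minimal sketch, assuming the quota limits pods under the plain "pods" key and that its value is a bare integer; quota_vs_hpa is a hypothetical name.

```shell
# Hypothetical helper: compare a ResourceQuota's pod limit with an
# HPA's maxReplicas. The quota counts every pod in the namespace,
# so "covers" is necessary but not always sufficient.
quota_vs_hpa() {
  hpa=$1; quota=$2; ns=$3
  max=$(kubectl get hpa "$hpa" -n "$ns" -o jsonpath='{.spec.maxReplicas}')
  hard=$(kubectl get quota "$quota" -n "$ns" -o jsonpath='{.spec.hard.pods}')
  if [ -z "$hard" ] || [ -z "$max" ]; then
    echo "no pod limit or no maxReplicas found"
    return 0
  fi
  if [ "$hard" -lt "$max" ]; then
    echo "quota pods=$hard < maxReplicas=$max: raise the quota"
  else
    echo "quota pods=$hard covers maxReplicas=$max"
  fi
}
```

Usage: quota_vs_hpa <hpa-name> <quota-name> <ns>.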

Verify HPA is working — trigger CPU load:

# Exec into pods and run stress (image must support it e.g. polinux/stress)
k exec <pod1> -n <ns> -- stress --cpu 1 &
k exec <pod2> -n <ns> -- stress --cpu 1 &
k exec <pod3> -n <ns> -- stress --cpu 1 &

# Check CPU usage
k top pod -n <ns>

# Watch HPA scale
k get hpa <name> -n <ns> -w

SCENARIO 6 — Multi-Container Issues

Part A — Gather logs from all containers

# Per container (append to same file)
k logs deploy/<name> -c <container1> -n <ns> >> /root/logs.log
k logs deploy/<name> -c <container2> -n <ns> >> /root/logs.log

# Or all containers in one shot
k logs --all-containers deploy/<name> -n <ns> > /root/logs.log

Trap: > overwrites, >> appends. When gathering multiple containers separately use >>.
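
When the container names are not known up front, the per-container approach can be looped. This is a sketch under assumptions: gather_logs is a hypothetical name, and the "=== name ===" headers are an addition that should be dropped if the task wants raw logs only.

```shell
# Hypothetical helper: append the logs of every container in a
# Deployment's pod template to one file, with a header per container.
gather_logs() {
  deploy=$1; ns=$2; out=$3
  names=$(kubectl get deploy "$deploy" -n "$ns" \
    -o jsonpath='{.spec.template.spec.containers[*].name}')
  : > "$out"                        # truncate once, then only append
  for c in $names; do
    echo "=== $c ===" >> "$out"
    kubectl logs "deploy/$deploy" -c "$c" -n "$ns" >> "$out"
  done
}
```

Usage: gather_logs <name> <ns> /root/logs.log. The file is emptied once at the start, so the > versus >> trap cannot bite mid-loop.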

Part B — Port conflict between containers

Mental model: Two containers in same pod share the network namespace — they share localhost. If both try to listen on port 80, one wins, one crashes. First to start wins.

# Diagnose
k logs deploy/<name> -c <container> -n <ns> # look for "address already in use"

# Fix — edit deployment, change conflicting container image/port
k edit deploy <name> -n <ns>
# spec.template.spec.containers[1]:
#   image: traefik/whoami:v1.11
#   args:
#   - --port
#   - "8080"

# Verify
k get deploy <name> -n <ns>

Trap: Containers in the same pod share the network namespace — port conflicts are possible and will cause CrashLoopBackOff.


SCENARIO 7 — Control Plane Component: Unknown Flag

Mental model: Static pod is crashing due to unknown/bad flag in the manifest. kubectl logs works if the container is cycling — check logs first, then edit manifest.

# 1. Check pod status
k get pods -n kube-system

# 2. Check logs (pod name includes node name)
k logs kube-controller-manager-<node> -n kube-system
# Error: unknown flag: --some-bad-flag

# 3. Fix the manifest — remove the bad flag
vim /etc/kubernetes/manifests/kube-controller-manager.yaml

# 4. Force restart if needed (move trick)
cd /etc/kubernetes/manifests
mv kube-controller-manager.yaml ..
sleep 5
mv ../kube-controller-manager.yaml .

# 5. Wait for pod to come back
sudo watch crictl ps
k get pods -n kube-system

Trap: Same move trick applies to any static pod — scheduler, controller-manager, etcd.


SCENARIO 8 — Kubelet Broken: Two Errors in Different Files

Mental model: Kubelet won’t start — fix first error, restart, check logs again. There may be a second error in a DIFFERENT config file. Fix-restart-check loop until clean.

# 1. SSH to node
ssh node01

# 2. Check kubelet status
systemctl status kubelet

# 3. Check logs
journalctl -u kubelet

# Error 1: unknown flag in kubeadm flags env file
# Fix: edit /var/lib/kubelet/kubeadm-flags.env
vim /var/lib/kubelet/kubeadm-flags.env
# KUBELET_KUBEADM_ARGS="" ← remove the bad flag from this string

# 4. Restart and check again
systemctl restart kubelet
journalctl -u kubelet

# Error 2: apiVersion commented out in config.yaml
# Fix: edit /var/lib/kubelet/config.yaml
vim /var/lib/kubelet/config.yaml
# #apiVersion: kubelet.config.k8s.io/v1beta1 ← uncomment this

# 5. Restart and verify
systemctl restart kubelet
systemctl status kubelet

Kubelet config files:

File                               | Purpose
/var/lib/kubelet/kubeadm-flags.env | Flags passed to kubelet by kubeadm
/var/lib/kubelet/config.yaml       | Kubelet configuration (apiVersion, clusterDNS, etc.)

Traps:

  • One error fixed doesn’t mean kubelet is clean — always restart and check logs again
  • Two errors can be in two different files
  • #apiVersion commented out = kubelet can’t parse its own config
  • unknown flag: --xxx with NO file name in the error → it’s in /var/lib/kubelet/kubeadm-flags.env — memorize this
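
One pass of the fix-restart-check loop can be wrapped so it is cheap to repeat after every edit. A sketch only: kubelet_check is a hypothetical name, and the 5-second default wait is an arbitrary assumption.

```shell
# Hypothetical helper: one pass of the fix-restart-check loop.
# Returns 0 once kubelet is active, else prints recent log lines.
kubelet_check() {
  sudo systemctl restart kubelet
  sleep "${1:-5}"                   # give a crash-looping kubelet time to fail
  if sudo systemctl is-active --quiet kubelet; then
    echo "kubelet active"
  else
    echo "kubelet still failing, recent logs:"
    sudo journalctl -u kubelet --no-pager -n 20
    return 1
  fi
}
```

Run kubelet_check after each config edit. If it prints logs again, the next error may live in the other file.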

SCENARIO 9 — Applications Misconfigured (3-Task Bundle)

Mental model: Three independent app-level misconfigurations bundled in one scenario. Cluster itself is healthy — each task is a separate pod/deployment spec error. Diagnose-fix-verify each independently.

# Task 1 — Wrong ConfigMap reference
k get cm -n <ns>                          # find actual CM name
k describe pod <pod> -n <ns>              # Events show CreateContainerConfigError
# Fix: edit deploy → spec.template.spec.containers[].env[].valueFrom.configMapKeyRef.name

# Task 2 — Hardcoded nodeName blocking scheduling
k get deploy <name> -n <ns> -o yaml | grep nodeName
# Fix: remove spec.template.spec.nodeName entirely — let scheduler place it

# Task 3 — Wrong ServiceAccount name
k get sa -n <ns>                          # find actual SA name
k describe rs -n <ns>                     # Events show SA not found (pod may never be created)
# Fix: edit deploy → spec.template.spec.serviceAccountName

New diagnostic patterns locked:

  • No pods created at all → k describe rs -n <ns>, not k describe pod
  • Pods stuck Pending → k describe pod -n <ns> -l app=<label> — label selector avoids the hash-suffix problem
  • For app-level issues: describe pod/describe rs > describe deploy — deploy events rarely show the root cause

Traps:

  • CreateContainerConfigError = ConfigMap/Secret reference mismatch — check actual names with k get cm/k get secret
  • A hardcoded nodeName bypasses the scheduler entirely — if that node can’t run it, pod sits Pending forever with no scheduler events
  • A wrong ServiceAccount name shows no obvious error in k get pods (the pod may never be created): check Events on the pod or its ReplicaSet
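
For Task 2, the nodeName field can also be removed without opening an editor, using a JSON patch. This is a sketch, not the only fix: remove_node_name is a hypothetical name, and the remove op fails if the field is not set.

```shell
# Hypothetical helper: drop a hardcoded nodeName from a Deployment's
# pod template so the scheduler places the pods again.
remove_node_name() {
  kubectl patch deployment "$1" -n "$2" --type=json \
    --patch '[{"op":"remove","path":"/spec/template/spec/nodeName"}]'
}
```

Usage: remove_node_name <name> <ns>, then k get pods -n <ns> -o wide to confirm the new pods were scheduled.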

LOG LOCATIONS QUICK REF

# On any node — readable even without crictl
ls /var/log/pods/
ls /var/log/containers/

# Kubelet
journalctl -u kubelet
grep kubelet /var/log/syslog # fallback

# Control plane (kubeadm)
k logs kube-apiserver-<node> -n kube-system
k logs kube-scheduler-<node> -n kube-system
k logs kube-controller-manager-<node> -n kube-system
k logs etcd-<node> -n kube-system

# When kubectl is dead
sudo crictl ps -a
sudo crictl logs <container-id>

PATCH COMMANDS

# Generic patch — strategic merge
k patch <resource> <name> -n <ns> --patch '{"spec": {"field": "value"}}'

# Patch deployment image
k patch deployment <name> -n <ns> \
 --patch '{"spec":{"template":{"spec":{"containers":[{"name":"<cname>","image":"<image>"}]}}}}'

# Patch with type flag
k patch deployment <name> -n <ns> --type=merge \
 --patch '{"spec":{"replicas":3}}'

# Patch strategic (default) — merges arrays by key
k patch deployment <name> -n <ns> --type=strategic \
 --patch '{"spec":{"template":{"spec":{"containers":[{"name":"<cname>","resources":{"requests":{"cpu":"100m"}}}]}}}}'

# Patch JSON patch — precise array index ops
k patch deployment <name> -n <ns> --type=json \
 --patch '[{"op":"replace","path":"/spec/replicas","value":3}]'

# Patch a node label
k patch node <node> --type=merge --patch '{"metadata":{"labels":{"disk":"ssd"}}}'

# Patch service type
k patch svc <svc> -n <ns> --type=merge \
 --patch '{"spec":{"type":"ClusterIP"}}'

Patch types:

Type                | Use case
strategic (default) | Deployments, pods; merges arrays by name key
merge               | Simple field updates; replaces arrays entirely
json                | Precise path-based ops: add, remove, replace, move

EXAM REFLEXES — CHEAT SHEET

kubectl dead? → SSH → sudo crictl ps -a → sudo crictl logs <id>
Node NotReady? → SSH → systemctl status kubelet → journalctl -u kubelet → restart + enable
App not reachable? → k get endpoints → check selector → check targetPort
Pod CrashLoopBackOff? → k logs --previous
Static pod broken? → /etc/kubernetes/manifests/<component>.yaml → edit → sudo watch crictl ps
DNS broken? → k get endpoints kube-dns -n kube-system → check CoreDNS pods
Pod-to-pod fails? → CNI layer → k get pods -n kube-system | grep -iE 'weave|calico|flannel'

Time-box rule: Any question > 10 minutes → flag and move. Return at end.