CKA Troubleshooting Master Reference

Domain 4: Cluster and Node Troubleshooting

Mental Model

Troubleshooting flow: symptom → resource → logs → fix → verify

  • No pods at all: k describe rs -n <ns> (quota or SA errors surface here)
  • Pods pending: k describe pod -n <ns> -l <label> (scheduler/SA/node issues)
  • Pods running but broken: k logs + k describe pod events
  • Control plane broken: journalctl -u kubelet or crictl logs


Apiserver Troubleshooting

Break Types (progressive)

| Cause | Symptom | Tool |
| --- | --- | --- |
| Bad YAML at top of manifest | No container spawns | journalctl -u kubelet |
| Unknown flag / parse error | Container fails to start | crictl logs <id> |
| Bad cert/endpoint | Starts but can’t communicate | crictl logs <id> |

Key: Bad YAML = no container = crictl logs useless → go straight to journalctl -u kubelet
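The rule above can be sketched as a small check. This is a minimal sketch, assuming crictl is already configured for the node's container runtime; the function name `apiserver_log_hint` is hypothetical and not part of any tool.

```shell
# Decide where to look first: if the runtime knows any kube-apiserver container
# (running or exited), read its logs; otherwise only the kubelet can explain
# why the static pod never produced a container.
apiserver_log_hint() {
  # -a includes exited containers; column 1 of the listing is the container ID.
  # head -n 1 simply takes the first match.
  id=$(crictl ps -a 2>/dev/null | grep kube-apiserver | awk '{print $1}' | head -n 1)
  if [ -n "$id" ]; then
    echo "crictl logs $id"
  else
    echo "journalctl -u kubelet"
  fi
}
```

Calling `apiserver_log_hint` on the control plane node prints the command to run next rather than running it, so it is safe to use while the apiserver is down.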

Move Trick (force kubelet to restart static pod)

cd /etc/kubernetes/manifests
mv kube-apiserver.yaml ..
sleep 5
mv ../kube-apiserver.yaml .

Works for kube-controller-manager and kube-scheduler too.
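Since the same move works for every static pod, it can be wrapped in one helper. This is a sketch, not a kubeadm feature: `bounce_static_pod`, `MANIFEST_DIR` and `PAUSE` are names invented here, and the 5 second default simply mirrors the `sleep 5` above.

```shell
# Move a static pod manifest out of the watched directory and back again so the
# kubelet tears the pod down and recreates it.
bounce_static_pod() {
  name=$1                                          # e.g. kube-scheduler
  dir=${MANIFEST_DIR:-/etc/kubernetes/manifests}   # directory the kubelet watches
  pause=${PAUSE:-5}                                # seconds to leave the manifest out
  [ -f "$dir/$name.yaml" ] || { echo "no manifest: $dir/$name.yaml" >&2; return 1; }
  mv "$dir/$name.yaml" "$dir/../$name.yaml" || return 1
  sleep "$pause"
  mv "$dir/../$name.yaml" "$dir/$name.yaml"
}
```

For example, `bounce_static_pod kube-controller-manager` run as root on the control plane node. Parking the file one level up keeps it on the same filesystem, so both moves are plain renames.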

Log Paths

# Apiserver container logs (use crictl ps -a if the container has already exited)
crictl logs $(crictl ps | grep apiserver | awk '{print $1}')
# Same logs on disk
tail -n 50 /var/log/pods/kube-system_kube-apiserver-*/kube-apiserver/*

# Kubelet logs (when no container exists)
journalctl -u kubelet | tail -50

# Verify cert
find /etc/kubernetes/pki/ | grep apiserver.crt
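For the bad cert case, one quick check is whether every certificate or key path named in the manifest actually exists. The helper below is a rough sketch: `missing_pki_files` is a hypothetical name, it only matches flag values of the form `=/absolute/path` ending in `.crt`, `.key` or `.pub`, and it assumes the paths are the same on the host as inside the container (as with the kubeadm default mounts).

```shell
# Print every .crt/.key/.pub path referenced by a manifest that is missing on disk.
missing_pki_files() {
  manifest=${1:-/etc/kubernetes/manifests/kube-apiserver.yaml}
  # grep -o keeps only the "=/path/file.crt" part of each flag; cut drops the "=".
  grep -Eo '=/[^[:space:]]+\.(crt|key|pub)' "$manifest" | cut -c2- | sort -u |
  while read -r f; do
    [ -e "$f" ] || echo "$f"
  done
}
```

Empty output means all referenced files exist; it says nothing about whether the certificate contents are valid.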

Kubelet Troubleshooting

Two Config Files

| File | Contains |
| --- | --- |
| /var/lib/kubelet/kubeadm-flags.env | Runtime flags passed to kubelet |
| /var/lib/kubelet/config.yaml | kubelet configuration (apiVersion, etc.) |

Trap: unknown flag error with NO filename in logs → always kubeadm-flags.env

Trap: commented-out #apiVersion in config.yaml → remove the #
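Both traps can be checked from the shell before restarting the kubelet. The two helpers below are a sketch with made-up names (`commented_api_version`, `list_kubelet_flags`); they assume the usual kubeadm layout where kubeadm-flags.env holds a single `KUBELET_KUBEADM_ARGS="..."` line.

```shell
# Print "line:content" for a commented-out apiVersion line; exit status 1 if none.
commented_api_version() {
  grep -n '^[[:space:]]*#[[:space:]]*apiVersion' "${1:-/var/lib/kubelet/config.yaml}"
}

# Print the kubelet flags one per line so a misspelled flag is easy to spot.
list_kubelet_flags() {
  # KUBELET_KUBEADM_ARGS="--flag=a --flag=b"  ->  --flag=a (newline) --flag=b
  sed -n 's/^KUBELET_KUBEADM_ARGS="\(.*\)"$/\1/p' "${1:-/var/lib/kubelet/kubeadm-flags.env}" |
    tr ' ' '\n'
}
```

Compare the flag named in the journalctl error against the output of `list_kubelet_flags`; both functions accept an alternative file path as their first argument.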

Kubelet Restart

systemctl daemon-reload
systemctl restart kubelet
systemctl status kubelet

Deployment Troubleshooting

Pods Not Starting

k get po -n <ns>                          # check status
k describe po -n <ns> -l app=<label>      # use label selector, not hash
k describe rs -n <ns>                     # quota/SA errors surface here

Common Issues

| Symptom | Cause | Fix |
| --- | --- | --- |
| No pods created | SA not found | k get sa -n <ns> → fix serviceAccountName in deploy |
| Pods Pending | Wrong nodeName | Remove nodeName from deploy spec |
| Pods missing (FailedCreate on the rs) | ResourceQuota | k describe rs → edit quota |
| ConfigMap env error | Wrong CM name | k describe po events → fix configMapKeyRef.name |

Rule: k describe pod first for running pod issues; k describe rs for “no pods created” issues.

Label selector: k describe po -n <ns> -l app=<label> avoids having to type the pod's generated hash suffix.
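The "no pods created" failures are recorded as events on the ReplicaSet, so they can also be listed for a whole namespace at once. This is a sketch of an alternative view of what `k describe rs` shows; `failed_creates` is a hypothetical helper name, and it assumes the failures are reported with the reason `FailedCreate`.

```shell
# List creation failures in a namespace (quota exceeded, missing ServiceAccount).
# These never appear on a pod because no pod exists yet.
failed_creates() {
  # $1 = namespace
  kubectl get events -n "$1" --field-selector reason=FailedCreate \
    -o custom-columns=OBJECT:.involvedObject.name,MESSAGE:.message
}
```

For example, `failed_creates ops` prints one row per failing ReplicaSet event with the full error message.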


HPA / ResourceQuota

# HPA blocked by quota
k describe rs -n <ns>    # shows quota error, NOT k describe deploy
k get quota -n <ns>
k edit quota <name> -n <ns>    # update pods AND limits.cpu
k rollout restart deploy <name> -n <ns>
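Why both `pods` and `limits.cpu` need raising: the HPA can only reach its maximum if the quota admits that many pods and their combined CPU limits. The arithmetic can be sketched as below, assuming every pod sets the same CPU limit and the deployment is the only consumer of the quota; `quota_fits` and its arguments are illustrative, with values you read off the cluster yourself.

```shell
# Check whether a quota leaves room for the HPA's maxReplicas.
#   $1 maxReplicas            $2 per-pod CPU limit in millicores
#   $3 quota "pods" hard      $4 quota "limits.cpu" hard in millicores
quota_fits() {
  max=$1 cpu_m=$2 q_pods=$3 q_cpu_m=$4
  need_cpu=$((max * cpu_m))          # total CPU limit at full scale-out
  rc=0
  [ "$max" -le "$q_pods" ] || { echo "pods: need $max, quota $q_pods"; rc=1; }
  [ "$need_cpu" -le "$q_cpu_m" ] || { echo "limits.cpu: need ${need_cpu}m, quota ${q_cpu_m}m"; rc=1; }
  return $rc
}
```

`quota_fits 4 500 10 1000` reports that `limits.cpu` needs 2000m against a quota of 1000m, even though the pod count alone would fit.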

Multi-Container / Port Conflicts

  • Containers in same pod share network namespace
  • Two containers cannot bind the same port
  • Fix: change image or args to use different port
  • k logs --all-containers deploy/<name> -n <ns> > /root/logs.log
  • Port config varies by image: check args vs env vs config file
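Once the logs are collected into a file as in the list above, a port clash usually shows up as an "address already in use" message. The helper below is a heuristic sketch: `port_conflicts` is a made-up name, and the pattern covers the common wording rather than every server's error text.

```shell
# Print matching lines (with line numbers) from a collected log file.
# Exit status is 1 when nothing matches.
port_conflicts() {
  grep -in 'address already in use' "${1:-/root/logs.log}"
}
```

The matching line normally names the port and the container that lost the race, which tells you which container's image, args or env to change.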

crictl Commands

crictl ps                          # list running containers
crictl ps -a                       # include stopped
crictl logs <container-id>         # container stdout → stdout, container stderr → stderr
crictl logs <container-id> 2>&1    # merge stderr into stdout (needed when piping or writing to a file)
crictl rm --force <container-id>   # remove (triggers DaemonSet recreation)
crictl stop <container-id>         # stop only (does NOT trigger recreation)

Trap: crictl logs stderr requires 2>&1 when redirecting to a file

Trap: Use crictl rm --force, not crictl stop, to trigger the DaemonSet restart event
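The order of the redirections matters for the first trap. The snippet below demonstrates it without a cluster; `emit_both` is a hypothetical stand-in for `crictl logs <container-id>` that writes one line to each stream.

```shell
# Stand-in for `crictl logs <id>`: one line on stdout, one on stderr.
emit_both() {
  echo "from stdout"
  echo "from stderr" >&2
}

out=$(mktemp)

emit_both > "$out" 2>/dev/null   # stderr discarded: the file holds 1 line
wc -l < "$out"

emit_both > "$out" 2>&1          # file first, then stderr joins it: 2 lines
wc -l < "$out"

emit_both 2>&1 > "$out"          # wrong order: stderr still goes to the terminal, 1 line in file
wc -l < "$out"
```

The working form for the real command is therefore `crictl logs <container-id> > /path/to/file 2>&1`, with `2>&1` after the file redirection.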


Node Troubleshooting

k get node                         # check Ready status
k describe node <name>             # check conditions, taints, capacity
ssh <node>
systemctl status kubelet
journalctl -u kubelet | tail -50
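On a cluster with many nodes, the first step can be narrowed to just the nodes that need an ssh session. A minimal sketch, assuming the default column layout of `kubectl get node`; `not_ready_nodes` is a hypothetical filter name.

```shell
# Read `kubectl get node --no-headers` on stdin and print the names of nodes
# whose STATUS column does not start with "Ready".
# "Ready,SchedulingDisabled" (cordoned) is treated as healthy here.
not_ready_nodes() {
  awk '$2 !~ /^Ready/ {print $1}'
}
```

Usage: `kubectl get node --no-headers | not_ready_nodes`, then ssh to each printed node and check the kubelet as shown above.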

kubectl auth can-i

# Test user permissions
k auth can-i <verb> <resource> --as <username> -n <ns>

# Test SA permissions (full format required)
k auth can-i <verb> <resource> --as system:serviceaccount:<ns>:<sa-name> -n <ns>

# Examples
k auth can-i delete pods --as smoke -n ops
k auth can-i create configmaps --as system:serviceaccount:operator:resource-manager -n default
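The long ServiceAccount identity is easy to mistype, so it can be assembled by a helper. This is a convenience sketch; `sa_user` and `can_sa` are invented names, not kubectl subcommands.

```shell
# Build the impersonation name for a ServiceAccount.
sa_user() {
  echo "system:serviceaccount:$1:$2"     # $1 = SA namespace, $2 = SA name
}

# usage: can_sa <verb> <resource> <sa-namespace> <sa-name> <target-namespace>
can_sa() {
  kubectl auth can-i "$1" "$2" --as "$(sa_user "$3" "$4")" -n "$5"
}
```

`can_sa create configmaps operator resource-manager default` runs the same check as the last example above. Note that the ServiceAccount's own namespace and the namespace being tested are separate arguments and often differ.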