What Is Kubernetes Troubleshooting? A Guide to Fixing Common K8s Problems for DevOps (2026)

Kubernetes (K8s) is one of the most capable container orchestration platforms available today, and also one of the most complex. When something breaks (a Pod will not start, a Service does not respond, an Ingress returns 502, a Node goes NotReady), debugging K8s takes a working understanding of its architecture and the right kubectl commands.

This article teaches a systematic approach to troubleshooting Kubernetes. It covers the most common problems: Pod issues, Service issues, Ingress issues, networking, storage, Node issues, and RBAC.

Systematic Troubleshooting Methodology

When a problem occurs in K8s, work through these steps in order:

1. Check Pod status: kubectl get pods shows which state the Pod is in
2. Describe the resource: kubectl describe pod <name> shows Events and Conditions
3. Check logs: kubectl logs <pod> shows application logs
4. Check events: kubectl get events --sort-by=.lastTimestamp shows recent events in the current namespace (add -A for the whole cluster)
5. Exec into the Pod: kubectl exec -it <pod> -- sh lets you look around inside the container
6. Check resources: kubectl top pods shows CPU/Memory usage (requires metrics-server)
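
The non-interactive steps above can be chained into one helper. The following is a minimal sketch, not a definitive tool: the function name triage_pod and its argument order are hypothetical, and it only calls kubectl subcommands already listed in the steps. The exec step is left out because it is interactive.

```shell
# triage_pod <pod> [namespace]: run the triage steps in order for one Pod.
# The function name is hypothetical; each kubectl call is one of the steps above.
triage_pod() {
  pod=$1
  ns=${2:-default}

  echo "== 1. Pod status =="
  kubectl get pod "$pod" -n "$ns" -o wide

  echo "== 2. Describe (Events and Conditions) =="
  kubectl describe pod "$pod" -n "$ns"

  echo "== 3. Logs (current, then previous instance) =="
  kubectl logs "$pod" -n "$ns" --tail=50
  # --previous fails if the container never restarted, so tolerate the error
  kubectl logs "$pod" -n "$ns" --previous --tail=50 2>/dev/null || echo "(no previous container)"

  echo "== 4. Recent events in the namespace =="
  kubectl get events -n "$ns" --sort-by=.lastTimestamp

  echo "== 5. Resource usage (requires metrics-server) =="
  kubectl top pod "$pod" -n "$ns" || echo "(metrics not available)"
}
```

Usage, assuming a Pod named web-app-xyz in the default namespace: triage_pod web-app-xyz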

Pod Issues: Common Pod Problems

CrashLoopBackOff

The Pod starts, crashes, and is restarted over and over. The kubelet keeps restarting the container with an exponentially increasing back-off delay (10s, 20s, 40s, ... capped at 5 minutes).
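
To make the back-off sequence concrete, the sketch below computes the approximate wait before each restart. It assumes exactly the schedule described above (10s base, doubling, 5-minute cap); the function name backoff_delay is hypothetical, and real timings also include image pull and container start time.

```shell
# backoff_delay <restart_number>: approximate delay in seconds before that restart,
# assuming a 10s base that doubles each time and is capped at 300s (5 minutes).
backoff_delay() {
  n=$1
  delay=10
  i=1
  while [ "$i" -lt "$n" ] && [ "$delay" -lt 300 ]; do
    delay=$((delay * 2))
    i=$((i + 1))
  done
  [ "$delay" -gt 300 ] && delay=300
  echo "$delay"
}

for restart in 1 2 3 4 5 6 7; do
  echo "restart $restart: wait ~$(backoff_delay "$restart")s"
done
```

This shows why a Pod with several restarts appears to "hang": from the sixth restart onward, each attempt is about five minutes apart.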

# Check the status
kubectl get pods
# NAME          READY   STATUS             RESTARTS   AGE
# web-app-xyz   0/1     CrashLoopBackOff   5          3m

# View the logs of the crashing container
kubectl logs web-app-xyz
kubectl logs web-app-xyz --previous    # Logs from the previous (crashed) container instance

# Common causes:
# 1. Application error (runtime exception, missing config)
# 2. Missing environment variables
# 3. Cannot connect to database/external service
# 4. Wrong entrypoint/command in the Dockerfile or Pod spec
# 5. Failing health check (the liveness probe kills the container)
# 6. Permission issues (file/directory not writable)

# Investigate:
kubectl describe pod web-app-xyz
# Check the "Events" section for the restart reason
# Check "Containers" → "Last State" → "Reason" / "Exit Code"

# Exit codes:
# 0 = Normal exit (but the restart policy expects the container to keep running, so it is restarted)
# 1 = Application error
# 137 = SIGKILL (128 + 9), typically OOMKilled
# 139 = SIGSEGV (128 + 11), segmentation fault
# 143 = SIGTERM (128 + 15), graceful shutdown requested
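
The exit codes above follow the usual convention that a value over 128 means the process was terminated by signal (code minus 128). The sketch below decodes a code on that assumption; the function name explain_exit_code is hypothetical and only the signals from the list above are named.

```shell
# explain_exit_code <code>: describe a container exit code.
# Codes above 128 conventionally mean "terminated by signal (code - 128)".
explain_exit_code() {
  code=$1
  if [ "$code" -eq 0 ]; then
    echo "exited normally"
  elif [ "$code" -gt 128 ]; then
    sig=$((code - 128))
    case $sig in
      9)  name=SIGKILL ;;
      11) name=SIGSEGV ;;
      15) name=SIGTERM ;;
      *)  name=unknown ;;
    esac
    echo "killed by signal $sig ($name)"
  else
    echo "application error (exit status $code)"
  fi
}

explain_exit_code 137

# Against a live cluster, the last exit code of the first container can be read with:
# kubectl get pod web-app-xyz -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
```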

Quick fix: if the CrashLoopBackOff is caused by a failing liveness probe (the application has not finished starting yet), increase initialDelaySeconds on the liveness probe, or add a startup probe so that liveness checks begin only after startup succeeds.
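
As an illustration of the startup probe approach, the sketch below writes an example Pod manifest to a file. Everything specific in it is an assumption for illustration: the image name, the /healthz path, port 8080, and all timing values should be replaced with what the application actually needs.

```shell
# Write an example Pod manifest with a startup probe in front of the liveness probe.
# Image name, /healthz path, port 8080, and all timing values are placeholders.
cat > web-app-probes.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: web-app
spec:
  containers:
    - name: web-app
      image: registry.example.com/web-app:1.0
      ports:
        - containerPort: 8080
      startupProbe:            # liveness checks are held off until this succeeds
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 5
        failureThreshold: 24   # up to 24 x 5s = 120s allowed for startup
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 10
        failureThreshold: 3
EOF

# Validate locally without creating anything (needs kubectl on the PATH):
# kubectl apply --dry-run=client -f web-app-probes.yaml
```

The startup budget is failureThreshold multiplied by periodSeconds, so a slow-starting application gets a long grace period without loosening the liveness probe that runs afterwards.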

ImagePullBackOff

K8s cannot pull the container image:

# Causes:
# 1. Wrong image name/tag (typo)
# 2. The image does not exist in the registry
# 3. The registry requires authentication (private registry)
# 4. Network issue (the node cannot reach the registry)

# Investigate:
kubectl describe pod web-app-xyz
# Check Events: "Failed to pull image" + error message

# Fix:
# 1. Check the image name: kubectl get pod web-app-xyz -o jsonpath='{.spec.containers[0].image}'
# 2. Test the pull: docker pull <image-name> (or crictl pull <image-name> on a containerd node)
# 3. For a private registry, create a Secret in the Pod's namespace:
kubectl create secret docker-registry regcred \
  --docker-server=registry.example.com \
  --docker-username=user \
  --docker-password=pass
# Then reference it in the Pod spec:
# imagePullSecrets:
#   - name: regcred

Pending

The Pod is stuck in Pending and has not been scheduled onto any Node:

# Causes:
# 1. Insufficient resources (no Node has enough free CPU/Memory)
# 2. Node selector/affinity does not match
# 3. Taints/Tolerations are misconfigured
# 4. PVC pending (the volume cannot be provisioned yet)
# 5. ResourceQuota exhausted (the Pod is rejected at creation, so check the ReplicaSet/Deployment events)

# Investigate:
kubectl describe pod pending-pod
# Check Events: "FailedScheduling" states the reason clearly
# e.g. "Insufficient cpu" / "Insufficient memory"
# or "0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector"

# Check Node resources:
kubectl describe nodes | grep -A 5 "Allocated resources"
kubectl top nodes

# Fix:
# - Lower the Pod's resource requests
# - Add Nodes (scale up the cluster)
# - Review node selector / affinity / tolerations

OOMKilled

The container was killed because it used more memory than its limit:

# Investigate:
kubectl describe pod oom-pod
# Last State: Terminated - Reason: OOMKilled - Exit Code: 137

# Current memory usage:
kubectl top pod oom-pod

# Configured limits:
kubectl get pod oom-pod -o jsonpath='{.spec.containers[0].resources}'

# Fix:
# 1. Raise the memory limit (if the app genuinely needs it)
# 2. Fix the memory leak in the application
# 3. Tune the JVM heap size (for Java)
#    -Xmx should be 20-30% below the container memory limit
# 4. Check for goroutine leaks (for Go)
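
The JVM sizing rule above is simple arithmetic, sketched below. The function name jvm_heap_mb is hypothetical, and the 25% default headroom is just the midpoint of the 20-30% range; the right value depends on how much non-heap memory (metaspace, threads, direct buffers) the application uses.

```shell
# jvm_heap_mb <limit_mib> [headroom_percent]: suggest an -Xmx value in MiB.
# Defaults to 25% headroom, the midpoint of the 20-30% range mentioned above.
jvm_heap_mb() {
  limit=$1
  headroom=${2:-25}
  echo $(( limit * (100 - headroom) / 100 ))
}

echo "-Xmx$(jvm_heap_mb 1024)m"      # 1024 MiB limit, 25% headroom
echo "-Xmx$(jvm_heap_mb 2048 30)m"   # 2048 MiB limit, 30% headroom
```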

Evicted

The Pod was evicted from its Node because the Node is under resource pressure:

# Causes:
# DiskPressure: node disk is running low (kubelet defaults: nodefs.available < 10% or imagefs.available < 15%)
# MemoryPressure: memory is running low
# PIDPressure: process IDs are running out

# Investigate:
kubectl get pods --field-selector=status.phase=Failed
kubectl describe pod evicted-pod
# Reason: Evicted
# Message: "The node was low on resource: ephemeral-storage"

# Fix:
# 1. Delete evicted Pods: kubectl delete pods --field-selector=status.phase=Failed
#    (note: this removes every Failed Pod in the namespace, not only evicted ones)
# 2. Check disk usage on the Node
# 3. Set an ephemeral-storage limit in the Pod spec
# 4. Make sure the Pod is not writing excessive logs/data to local disk
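
When only evicted Pods should be cleaned up, the STATUS column can be filtered instead of deleting every Failed Pod. The sketch below assumes the default column layout of kubectl get pods (NAME, READY, STATUS, ...); the function name evicted_pod_names is hypothetical.

```shell
# evicted_pod_names: read `kubectl get pods --no-headers` output on stdin and
# print the names of Pods whose STATUS column (3rd field) is "Evicted".
evicted_pod_names() {
  awk '$3 == "Evicted" { print $1 }'
}

# Typical use against a live cluster (current namespace; -r is a GNU xargs flag
# that skips the delete when there is nothing to remove):
# kubectl get pods --no-headers | evicted_pod_names | xargs -r kubectl delete pod
```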

Service Issues

No Endpoints

# The Service has no Endpoints (no Pod matches)
kubectl get endpoints my-service
# NAME         ENDPOINTS   AGE
# my-service   <none>      5m

# Cause: the Service selector does not match the Pod labels
# (Pods that match but are not Ready are also left out of the Endpoints)

# Investigate:
kubectl describe service my-service
# Check "Selector", e.g. app=web

kubectl get pods --show-labels
# Check whether the Pods carry the label "app=web"

# Fix: make the Service selector match the Pod labels
# Service: selector: app: web
# Pod: labels: app: web   ← these must match!
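
The matching rule is that every key=value pair in the Service selector must be present in the Pod's labels; extra Pod labels are fine. The sketch below models that rule on comma-separated strings, the format shown in the SELECTOR column of kubectl get service -o wide and the LABELS column of kubectl get pods --show-labels. The function name selector_matches is hypothetical.

```shell
# selector_matches <selector> <labels>: succeed if every key=value pair in the
# comma-separated selector also appears in the comma-separated label list.
selector_matches() {
  selector=$1
  labels=$2
  old_ifs=$IFS
  IFS=,
  for pair in $selector; do
    case ",$labels," in
      *",$pair,"*) ;;                     # this pair is present, keep checking
      *) IFS=$old_ifs; return 1 ;;        # one missing pair means no match
    esac
  done
  IFS=$old_ifs
  return 0
}

selector_matches "app=web" "app=web,pod-template-hash=5d4f8c" && echo "match"
selector_matches "app=web,tier=frontend" "app=web" || echo "no match: tier=frontend is missing"
```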

ClusterIP Not Reachable

# Test from inside the cluster:
kubectl run test --rm -it --image=busybox -- sh
# From inside the test Pod:
wget -qO- http://my-service:80
nslookup my-service
# If nslookup fails → DNS issue
# If nslookup works but wget times out → the Pod is not responding or the port is wrong

# Check the ports:
kubectl describe service my-service
# Port: 80 → TargetPort: 8080
# Verify that the Pod's container actually listens on port 8080

Ingress Issues

404 Not Found

# Check the Ingress rules:
kubectl describe ingress my-ingress
# Look at Rules → Host, Path, Backend service, Port

# Common causes:
# 1. Path does not match (e.g. with pathType: Exact, a rule for /api does not match a request for /api/)
# 2. Backend service does not exist, or the port is wrong
# 3. Ingress class does not match the controller (ingressClassName)

# Check the Ingress controller logs (names below assume a default ingress-nginx install):
kubectl logs -n ingress-nginx deploy/ingress-nginx-controller | tail -50
502 Bad Gateway

# The Ingress controller cannot get a valid response from the backend
# Causes:
# 1. Backend Pods are not healthy (e.g. CrashLoopBackOff)
# 2. Service endpoints are empty (ingress-nginx usually reports this as 503 rather than 502)
# 3. Backend Pods take a long time to start (not Ready yet)
# 4. Health check path is wrong

# Check:
kubectl get pods -l app=my-app         # Pod status
kubectl get endpoints my-service       # Are there any Endpoints?
kubectl describe ingress my-ingress    # Are the annotations correct?

SSL Errors

# Check the certificate (these resources exist only if cert-manager is installed):
kubectl describe certificate my-cert -n my-namespace
kubectl get certificaterequests -n my-namespace
kubectl describe order -n my-namespace

# cert-manager logs:
kubectl logs -n cert-manager deploy/cert-manager

# Common causes:
# 1. DNS has not propagated yet
# 2. The cert-manager ACME challenge failed
# 3. The Secret that stores the certificate does not exist
# 4. The Ingress does not reference the TLS secret
Networking Issues

DNS Resolution Failures

# Test DNS from inside a Pod:
kubectl exec -it my-pod -- nslookup kubernetes.default
kubectl exec -it my-pod -- nslookup my-service.my-namespace.svc.cluster.local

# If DNS is not working:
# 1. Check the CoreDNS Pods:
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns

# 2. Check resolv.conf in the Pod:
kubectl exec -it my-pod -- cat /etc/resolv.conf
# Expect: nameserver 10.96.0.10 (the cluster DNS Service IP; the exact address varies by cluster)

# 3. Check the CoreDNS configmap:
kubectl get configmap coredns -n kube-system -o yaml
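
Because the cluster DNS address differs between clusters, it is safer to compare the Pod's nameserver with the actual DNS Service IP than with a hard-coded value. The sketch below assumes the DNS Service is named kube-dns in kube-system, which is the usual default but not guaranteed; both function names are hypothetical.

```shell
# first_nameserver: read resolv.conf content on stdin, print the first nameserver IP.
first_nameserver() {
  awk '$1 == "nameserver" { print $2; exit }'
}

# check_pod_dns <pod>: compare the Pod's nameserver with the kube-dns Service IP.
# Assumes the cluster DNS Service is named kube-dns in kube-system.
check_pod_dns() {
  pod_dns_ip=$(kubectl exec "$1" -- cat /etc/resolv.conf | first_nameserver)
  dns_ip=$(kubectl get svc kube-dns -n kube-system -o jsonpath='{.spec.clusterIP}')
  if [ "$pod_dns_ip" = "$dns_ip" ]; then
    echo "OK: Pod uses cluster DNS $dns_ip"
  else
    echo "MISMATCH: Pod uses $pod_dns_ip, cluster DNS is $dns_ip"
  fi
}
```

A mismatch is not always a fault: a Pod with a non-default dnsPolicy, or a cluster running a node-local DNS cache, legitimately points at a different address.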

Network Policy Blocking

# Check Network Policies:
kubectl get networkpolicies -A
kubectl describe networkpolicy my-policy

# If a default "deny all" Network Policy is in place:
# you must create a policy that allows the traffic you need
# Test: temporarily delete the network policy and see whether traffic gets through
# (Caution: do this only in a test environment!)
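
As an example of an allow rule layered on top of a default deny, the sketch below writes a NetworkPolicy manifest to a file. The namespace, the labels app=web and app=frontend, and port 8080 are all placeholders chosen for illustration, not values from a real cluster.

```shell
# Write an example policy: Pods labelled app=frontend may reach Pods labelled
# app=web on TCP 8080. Names, labels, namespace, and port are placeholders.
cat > allow-frontend-to-web.yaml <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-web
  namespace: my-namespace
spec:
  podSelector:
    matchLabels:
      app: web
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
EOF

# Apply it once the labels and port match your workload:
# kubectl apply -f allow-frontend-to-web.yaml
```

Adding a narrow allow rule like this is generally safer than deleting the deny policy, because traffic that should stay blocked remains blocked.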

Storage Issues

PVC Pending

# PersistentVolumeClaim stuck in Pending:
kubectl get pvc
# NAME      STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
# my-pvc    Pending                                       standard       2m

kubectl describe pvc my-pvc
# Events: "waiting for first consumer to be created before binding"
# or "no persistent volumes available for this claim"

# Causes:
# 1. No matching StorageClass
# 2. The StorageClass provisioner is not running
# 3. Volume binding mode is WaitForFirstConsumer (expected: the claim binds once a Pod uses it)
# 4. Not enough capacity

# Investigate:
kubectl get storageclass
kubectl describe storageclass standard
kubectl get pv  # List available PVs

Mount Errors

# The Pod cannot start because mounting a volume failed:
# "Unable to attach or mount volumes"
# "MountVolume.SetUp failed"

# Causes:
# 1. The volume is still attached to another Node (ReadWriteOnce access mode)
# 2. The disk no longer exists (the cloud disk was deleted)
# 3. Permission denied (file system permissions)
# 4. NFS server unreachable

# Fix:
# For RWO, every Pod using the volume must run on the same Node
# Use nodeAffinity, or switch to RWX (ReadWriteMany) if the storage backend supports it
Node Issues

NotReady

# Check Node status:
kubectl get nodes
# NAME     STATUS     ROLES    AGE    VERSION
# node-1   NotReady   worker   30d    v1.28.0

kubectl describe node node-1
# Check Conditions:
# Ready          False/Unknown → kubelet is unhealthy or has stopped reporting
# MemoryPressure True → memory is running low
# DiskPressure   True → disk is almost full

# Causes:
# 1. kubelet has stopped
# 2. Node is offline / network unreachable
# 3. Container runtime (Docker/containerd) has stopped
# 4. Resource exhaustion (disk full, OOM)

# Fix: SSH into the Node
systemctl status kubelet
systemctl status containerd
journalctl -u kubelet --since "10 minutes ago"
df -h   # check disk
free -m # check memory

DiskPressure / MemoryPressure

# The Node is under resource pressure → the kubelet evicts Pods

# DiskPressure:
# Default threshold: nodefs.available < 10%
# Fix: remove unused images, old logs, and data left by terminated Pods
# Docker: docker system prune -a
# containerd: crictl rmi --prune

# MemoryPressure:
# Default threshold: memory.available < 100Mi
# Fix: reduce Pod memory limits, add Node memory, or scale out
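
A quick way to see how close a Node is to the disk threshold is to derive the available percentage from df. The sketch below is an approximation under stated assumptions: it uses the Capacity column of df -P rather than the kubelet's own accounting, and /var/lib/kubelet is assumed to be the kubelet root directory. Both function names are hypothetical.

```shell
# disk_available_percent: read `df -P <path>` output on stdin and print the
# percentage of space still available (100 minus the Capacity column).
disk_available_percent() {
  awk 'NR == 2 { sub(/%/, "", $5); print 100 - $5 }'
}

# check_disk_pressure <available_percent> [threshold]: warn below the threshold.
# 10 mirrors the default nodefs.available eviction threshold quoted above.
check_disk_pressure() {
  avail=$1
  threshold=${2:-10}
  if [ "$avail" -lt "$threshold" ]; then
    echo "WARNING: only ${avail}% available (threshold ${threshold}%)"
  else
    echo "OK: ${avail}% available"
  fi
}

# On the Node:
# check_disk_pressure "$(df -P /var/lib/kubelet | disk_available_percent)"
```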

RBAC Issues: Access Permission Problems

# Error: "forbidden" (RBAC denied the action) / "unauthorized" (authentication failed)
# The User/ServiceAccount lacks permission for the requested action

# Check:
kubectl auth can-i create pods --as=system:serviceaccount:default:my-sa
# yes/no

kubectl auth can-i list secrets --as=user@example.com
# no (permission denied)

# Find related RoleBindings (-o wide adds the subject columns so grep can match):
kubectl get rolebindings -A -o wide | grep my-sa
kubectl get clusterrolebindings -o wide | grep my-sa

# Create a Role + RoleBinding:
kubectl create role pod-reader --verb=get,list,watch --resource=pods
kubectl create rolebinding my-sa-read-pods \
  --role=pod-reader \
  --serviceaccount=default:my-sa

# Show the ServiceAccount a Pod runs as:
kubectl get pod my-pod -o jsonpath='{.spec.serviceAccountName}'
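
Checking one verb at a time gets tedious, so the can-i check above can be looped over the common verbs. The sketch below is a minimal example: the function name audit_sa_permissions and the verb list are assumptions, and the caller needs permission to impersonate the ServiceAccount for --as to work.

```shell
# audit_sa_permissions <namespace> <serviceaccount> <resource>:
# ask the API server whether the ServiceAccount may perform each common verb.
audit_sa_permissions() {
  ns=$1
  sa=$2
  resource=$3
  for verb in get list watch create update delete; do
    # can-i prints "yes" or "no"; any failure is reported as "no"
    answer=$(kubectl auth can-i "$verb" "$resource" \
      --as="system:serviceaccount:${ns}:${sa}" -n "$ns" 2>/dev/null) || answer=no
    echo "$verb $resource: $answer"
  done
}

# Example: audit_sa_permissions default my-sa pods
```

With the pod-reader Role from above bound to my-sa, the expected pattern is "yes" for get, list, and watch and "no" for the rest.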

Essential kubectl Commands for Debugging

Command	Purpose
`kubectl get pods -o wide`	Show Pod status plus Node and IP
`kubectl describe pod <name>`	Show details and Events for a Pod
`kubectl logs <pod>`	Show container logs
`kubectl logs <pod> --previous`	Show logs from the previous (crashed) container instance
`kubectl logs <pod> -c <container>`	Show logs for a specific container (multi-container Pod)
`kubectl exec -it <pod> -- sh`	Open a shell inside the container
`kubectl port-forward <pod> 8080:80`	Forward a Pod port to the local machine
`kubectl top pods`	Show CPU/Memory usage of Pods
`kubectl top nodes`	Show CPU/Memory usage of Nodes
`kubectl get events -A --sort-by=.lastTimestamp`	Show the most recent events across the cluster
`kubectl get pods --field-selector=status.phase!=Running`	List Pods that are not in the Running phase
`kubectl rollout status deploy/<name>`	Show Deployment rollout status
`kubectl rollout undo deploy/<name>`	Roll back a Deployment

Ephemeral Debug Containers

Ephemeral debug containers (stable since K8s 1.25) let you debug Pods whose images ship without a shell (distroless images):

# Add a debug container to a running Pod
kubectl debug -it my-pod --image=busybox --target=my-container

# The debug container shares:
# - Process namespace (sees the target container's processes, if the runtime supports --target)
# - Network namespace (sees the same network as the target container)
# but has its own filesystem (the busybox filesystem)

# Debug a Node (creates a Pod with access to the host filesystem):
kubectl debug node/my-node -it --image=ubuntu
# The host root filesystem is mounted at /host
# chroot /host  → work inside the host filesystem

crictl — Container Runtime Debugging

# crictl talks to containerd/CRI-O directly (run it over SSH on the Node)
# Use it when kubectl is unavailable or you need runtime-level detail

crictl pods                    # List all Pods on this Node
crictl ps                      # List running containers
crictl ps -a                   # List all containers (including stopped)
crictl logs <container-id>     # Show container logs
crictl inspect <container-id>  # Show container details
crictl images                  # List images on the Node
crictl rmi <image-id>          # Remove an image
crictl stats                   # Show resource usage

Common Mistakes and Fixes Checklist

Problem	Common causes	Diagnostic command	Fix
Pod CrashLoopBackOff	App error, Missing env var	`kubectl logs --previous`	Fix app code, Add configmap/secret
Pod ImagePullBackOff	Wrong image name, No auth	`kubectl describe pod`	Fix image name, Add imagePullSecrets
Pod Pending	Insufficient resources	`kubectl describe pod` (Events)	Reduce requests, Add nodes
Pod OOMKilled	Memory limit too low	`kubectl describe pod` (Exit 137)	Increase memory limit, Fix memory leak
Service no endpoints	Selector mismatch	`kubectl get endpoints`	Match service selector with pod labels
Ingress 502	Backend pods not ready	`kubectl get pods,endpoints`	Fix backend pods, Check readiness probe
DNS not working	CoreDNS down	`kubectl get pods -n kube-system`	Restart CoreDNS, Check configmap
PVC Pending	No StorageClass/Provisioner	`kubectl describe pvc`	Create StorageClass, Fix provisioner
Node NotReady	kubelet down, Disk full	`kubectl describe node`	SSH to node, Fix kubelet/disk
RBAC Forbidden	Missing Role/RoleBinding	`kubectl auth can-i`	Create Role + RoleBinding

Summary

Kubernetes troubleshooting is a skill that takes regular practice: the more problems you work through, the better you get. The most important thing is a systematic approach. Do not guess; check step by step: Status → Describe → Logs → Events → Exec.

The two most frequently used debugging commands are kubectl describe (its Events usually state the problem plainly) and kubectl logs --previous (logs from the container instance that already crashed). Between them they resolve the large majority of problems; the rest require kubectl exec to look inside the container, or node-level debugging with crictl.

Remember the checklist: CrashLoop → check logs --previous; ImagePull → check the image name and registry auth; Pending → check resources and scheduler events; Ingress 502 → check endpoints and Pod readiness. Practice until it becomes second nature and you will resolve K8s problems quickly.