PagerDuty Pod Scheduling
| Pod Status | Cause | PagerDuty Severity | Auto-remediation | SLA |
|---|---|---|---|---|
| Pending (Resource) | CPU/Memory insufficient | High | Cluster Autoscaler | 15 min |
| Pending (Scheduling) | Affinity/Taint mismatch | High | Fix labels or tolerations | 30 min |
| CrashLoopBackOff | App crash, config error | Critical | Rollback deployment | 10 min |
| ImagePullBackOff | Image not found, auth fail | High | Check registry, fix secret | 15 min |
| OOMKilled | Memory limit exceeded | High | Increase memory limit | 15 min |
| Evicted | Node disk pressure | Warning | Cleanup disk, expand PV | 30 min |
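The status-to-remediation mapping above can be sketched as a small triage helper. A minimal sketch; the dictionary keys and the `triage` function are hypothetical names, and the fallback for unknown statuses is an assumption:

```python
# Hypothetical triage table mirroring the mapping above:
# pod status -> (PagerDuty severity, first remediation step).
POD_TRIAGE = {
    "Pending-Resource": ("high", "Cluster Autoscaler"),
    "Pending-Scheduling": ("high", "Fix labels or tolerations"),
    "CrashLoopBackOff": ("critical", "Rollback deployment"),
    "ImagePullBackOff": ("high", "Check registry, fix secret"),
    "OOMKilled": ("high", "Increase memory limit"),
    "Evicted": ("warning", "Cleanup disk, expand PV"),
}

def triage(status: str) -> tuple[str, str]:
    """Return (severity, remediation) for a pod status.

    Unknown statuses fall back to a low-urgency manual review
    (an assumption, not part of the table above)."""
    return POD_TRIAGE.get(status, ("warning", "Manual review"))

print(triage("CrashLoopBackOff"))  # -> ('critical', 'Rollback deployment')
```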
Alert Configuration
# === Prometheus + PagerDuty Setup ===
# Alertmanager config (alertmanager.yml)
# global:
#   resolve_timeout: 5m
# route:
#   receiver: 'pagerduty-critical'
#   routes:
#     - match:
#         severity: critical
#       receiver: 'pagerduty-critical'
#       repeat_interval: 5m
#     - match:
#         severity: warning
#       receiver: 'pagerduty-warning'
#       repeat_interval: 30m
# receivers:
#   - name: 'pagerduty-critical'
#     pagerduty_configs:
#       - routing_key: 'YOUR_INTEGRATION_KEY'
#         severity: critical
#   - name: 'pagerduty-warning'
#     pagerduty_configs:
#       - routing_key: 'YOUR_INTEGRATION_KEY'
#         severity: warning
# Prometheus Alert Rules (pod-alerts.yml)
# groups:
#   - name: kubernetes-pods
#     rules:
#       - alert: PodPending
#         expr: kube_pod_status_phase{phase="Pending"} > 0
#         for: 5m
#         labels: { severity: high }
#         annotations:
#           summary: "Pod {{ $labels.pod }} pending for 5m"
#       - alert: PodCrashLooping
#         expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
#         for: 5m
#         labels: { severity: critical }
#       - alert: PodOOMKilled
#         expr: kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} > 0
#         labels: { severity: high }
#       - alert: NodeNotReady
#         expr: kube_node_status_condition{condition="Ready", status="true"} == 0
#         for: 2m
#         labels: { severity: critical }
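Alertmanager handles delivery, but PagerDuty can also receive events directly via its Events API v2, which is useful for testing an integration key before wiring up Alertmanager. A minimal sketch; note the Events API accepts only the severities `critical`, `error`, `warning`, and `info`, and the sample summary text is made up:

```python
import json
import urllib.request

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"

def build_event(routing_key: str, summary: str, source: str,
                severity: str = "critical") -> dict:
    """Build a PagerDuty Events API v2 'trigger' payload."""
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {"summary": summary, "source": source, "severity": severity},
    }

def send_event(event: dict) -> None:
    """POST the event to PagerDuty (requires a real integration key)."""
    req = urllib.request.Request(
        PAGERDUTY_EVENTS_URL,
        data=json.dumps(event).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)  # raises urllib.error.HTTPError on failure

# Build (but do not send) a sample event:
event = build_event("YOUR_INTEGRATION_KEY",
                    "Pod api-7f9c is CrashLoopBackOff", "prod-cluster")
print(json.dumps(event, indent=2))
```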
from dataclasses import dataclass

@dataclass
class AlertRule:
    alert: str
    expr: str
    duration: str
    severity: str
    action: str

alerts = [
    AlertRule("PodPending", "kube_pod_status_phase{phase='Pending'} > 0",
              "5m", "high", "Check resources, node capacity, scheduling constraints"),
    AlertRule("PodCrashLooping", "rate(kube_pod_container_status_restarts_total[15m]) > 0",
              "5m", "critical", "Check logs, rollback if recent deploy"),
    AlertRule("PodOOMKilled", "kube_pod_container_status_last_terminated_reason{reason='OOMKilled'} > 0",
              "0m", "high", "Increase memory limits, check for memory leaks"),
    AlertRule("NodeNotReady", "kube_node_status_condition{condition='Ready', status='true'} == 0",
              "2m", "critical", "Drain node, investigate kubelet, check hardware"),
    AlertRule("HighPodRestarts", "increase(kube_pod_container_status_restarts_total[1h]) > 5",
              "10m", "warning", "Check app health, config, dependencies"),
    AlertRule("PVCPending", "kube_persistentvolumeclaim_status_phase{phase='Pending'} > 0",
              "5m", "high", "Check StorageClass, PV availability"),
]

print("=== Alert Rules ===")
for a in alerts:
    print(f"  [{a.severity.upper()}] {a.alert}")
    print(f"    Expr: {a.expr} | For: {a.duration}")
    print(f"    Action: {a.action}")
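The `AlertRule` records above can also be serialized into the rule shape Prometheus loads from a rule file, so the Python inventory and `pod-alerts.yml` stay in sync. A sketch under the assumption that you would dump the resulting dict with PyYAML or similar; the `runbook_action` annotation key is a hypothetical name:

```python
from dataclasses import dataclass

@dataclass
class AlertRule:
    alert: str
    expr: str
    duration: str
    severity: str
    action: str

def to_prometheus_rule(r: AlertRule) -> dict:
    """Convert an AlertRule to the dict shape used in a Prometheus rule file."""
    rule = {
        "alert": r.alert,
        "expr": r.expr,
        "labels": {"severity": r.severity},
        "annotations": {"runbook_action": r.action},  # hypothetical annotation key
    }
    if r.duration != "0m":  # omit 'for' when the alert should fire immediately
        rule["for"] = r.duration
    return rule

rule = to_prometheus_rule(AlertRule(
    "PodPending", "kube_pod_status_phase{phase='Pending'} > 0",
    "5m", "high", "Check resources, node capacity, scheduling constraints"))
print(rule)
```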
On-call and Escalation
# === On-call Management ===
@dataclass
class EscalationLevel:
    level: int
    who: str
    timeout_min: int
    notification: str
    action: str

escalation = [
    EscalationLevel(1, "Primary On-call Engineer", 5,
                    "SMS + Phone Call + Push",
                    "Acknowledge, start diagnosis, follow runbook"),
    EscalationLevel(2, "Secondary On-call Engineer", 10,
                    "SMS + Phone Call",
                    "Take over if primary unavailable"),
    EscalationLevel(3, "Team Lead / Engineering Manager", 15,
                    "SMS + Phone Call + Email",
                    "Coordinate response, decide escalation"),
    EscalationLevel(4, "VP Engineering / CTO", 30,
                    "Phone Call",
                    "Major incident, customer impact, executive decision"),
]

print("=== Escalation Policy ===")
for e in escalation:
    print(f"  Level {e.level}: {e.who}")
    print(f"    Timeout: {e.timeout_min} min | Notify: {e.notification}")
    print(f"    Action: {e.action}")
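The per-level timeouts accumulate: if the primary does not acknowledge within 5 minutes the secondary is paged, after another 10 minutes the team lead, and so on. That logic can be sketched as a small function (a minimal sketch; `level_to_page` is a hypothetical name):

```python
def level_to_page(elapsed_min: int, timeouts: list[int]) -> int:
    """Given minutes since the alert fired and the per-level ack timeouts,
    return which escalation level should currently be paged (1-based)."""
    deadline = 0
    for level, timeout in enumerate(timeouts, start=1):
        deadline += timeout
        if elapsed_min < deadline:
            return level
    return len(timeouts)  # all timeouts exhausted: stay at the top level

timeouts = [5, 10, 15, 30]            # matches the four levels above
print(level_to_page(3, timeouts))     # 3 min: still in primary's window -> 1
print(level_to_page(12, timeouts))    # 12 min: past 5, before 5+10=15  -> 2
```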
# On-call schedule
@dataclass
class OnCallSchedule:
    rotation: str
    schedule: str
    handoff: str
    override: str

schedules = [
    OnCallSchedule("Primary", "Weekly rotation (Mon 09:00 - Mon 09:00)",
                   "30 min handoff meeting, review open incidents",
                   "PagerDuty override for vacation/sick"),
    OnCallSchedule("Secondary", "Weekly rotation (offset by 1 week)",
                   "Backup for primary, same handoff process",
                   "Auto-assign if primary overridden"),
    OnCallSchedule("Weekend", "Separate weekend rotation (Fri 18:00 - Mon 09:00)",
                   "Friday EOD briefing on current issues",
                   "Volunteers first, then round-robin"),
]

print("\n=== On-call Schedules ===")
for s in schedules:
    print(f"  [{s.rotation}] {s.schedule}")
    print(f"    Handoff: {s.handoff}")
    print(f"    Override: {s.override}")
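A weekly Mon-to-Mon rotation is easy to compute from a fixed reference Monday. A sketch with hypothetical engineer names and a hypothetical epoch date (2024-01-01 is a Monday):

```python
from datetime import date

ENGINEERS = ["alice", "bob", "carol"]  # hypothetical rotation members
EPOCH = date(2024, 1, 1)               # a Monday: rotation week 0 starts here

def on_call_for(day: date) -> str:
    """Weekly rotation: each engineer holds the pager Mon 09:00 - Mon 09:00.
    (Hour-of-day handoff is ignored in this sketch.)"""
    weeks = (day - EPOCH).days // 7
    return ENGINEERS[weeks % len(ENGINEERS)]

print(on_call_for(date(2024, 1, 3)))   # week 0 -> alice
print(on_call_for(date(2024, 1, 10)))  # week 1 -> bob
```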
Runbooks and Metrics
# === Incident Metrics ===
@dataclass
class IncidentMetric:
    metric: str
    target: str
    calculation: str
    improve: str

metrics = [
    IncidentMetric("MTTA (Mean Time to Acknowledge)",
                   "< 5 minutes",
                   "Average time from alert to acknowledgment",
                   "Better on-call tools, clear notification, reduce noise"),
    IncidentMetric("MTTR (Mean Time to Resolve)",
                   "< 30 minutes (P1), < 4 hours (P2)",
                   "Average time from alert to resolution",
                   "Better runbooks, auto-remediation, training"),
    IncidentMetric("Incident Count",
                   "Decreasing trend",
                   "Number of incidents per week/month",
                   "Fix root causes, improve reliability"),
    IncidentMetric("Escalation Rate",
                   "< 10%",
                   "% of incidents escalated beyond L1",
                   "Better training, clearer runbooks"),
    IncidentMetric("False Positive Rate",
                   "< 5%",
                   "% of alerts that are not real incidents",
                   "Tune alert thresholds, better conditions"),
    IncidentMetric("Postmortem Completion",
                   "100% for P1/P2",
                   "% of major incidents with completed postmortem",
                   "Process enforcement, templates"),
]

print("=== Incident Metrics ===")
for m in metrics:
    print(f"  [{m.metric}] Target: {m.target}")
    print(f"    Calculation: {m.calculation}")
    print(f"    Improve: {m.improve}")
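MTTA and MTTR fall out directly from incident timestamps: MTTA averages alert-to-acknowledge, MTTR averages alert-to-resolve. A minimal sketch over made-up incident records:

```python
from datetime import datetime

# Hypothetical incident records: (alerted, acknowledged, resolved) timestamps.
incidents = [
    (datetime(2024, 5, 1, 10, 0), datetime(2024, 5, 1, 10, 3), datetime(2024, 5, 1, 10, 25)),
    (datetime(2024, 5, 2, 2, 0),  datetime(2024, 5, 2, 2, 7),  datetime(2024, 5, 2, 2, 40)),
]

def mean_minutes(pairs):
    """Average gap between (start, end) timestamp pairs, in minutes."""
    return sum((end - start).total_seconds() for start, end in pairs) / len(pairs) / 60

mtta = mean_minutes([(alerted, acked) for alerted, acked, _ in incidents])
mttr = mean_minutes([(alerted, resolved) for alerted, _, resolved in incidents])
print(f"MTTA: {mtta:.1f} min")  # (3 + 7) / 2 = 5.0
print(f"MTTR: {mttr:.1f} min")  # (25 + 40) / 2 = 32.5
```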
Tips
- Noise: Reduce alert noise. Send only actionable alerts, never informational ones.
- Runbook: Write a runbook for every alert so the on-call engineer can act on it immediately.
- Auto: Use auto-remediation for problems that recur frequently and have a clear, well-understood fix.
- Postmortem: Run a blameless postmortem after every major incident.
- Training: Have new on-call engineers shadow the rotation for 1-2 weeks before joining it.
Running Systems in Production
Good production operations start with comprehensive monitoring. Use tools such as Prometheus + Grafana for metrics collection and dashboards, or the ELK Stack for log management, and configure alerts to fire when CPU exceeds 80%, RAM is nearly full, or disk usage is high.
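The threshold checks just described can be expressed in a few lines. A sketch; the 80% CPU limit comes from the text above, while the RAM and disk limits are assumed values:

```python
# Alert thresholds: CPU limit from the text above; RAM/disk limits are assumptions.
THRESHOLDS = {"cpu_percent": 80, "ram_percent": 90, "disk_percent": 85}

def breached(sample: dict) -> list[str]:
    """Return the metrics in `sample` whose value exceeds its alert threshold."""
    return [k for k, limit in THRESHOLDS.items() if sample.get(k, 0) > limit]

print(breached({"cpu_percent": 92, "ram_percent": 70, "disk_percent": 88}))
# -> ['cpu_percent', 'disk_percent']
```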
Plan your backup strategy carefully. Follow the 3-2-1 rule: keep at least 3 copies of your data, on 2 different types of storage, with 1 copy off-site. Test restores regularly, at least once a month, because a backup that cannot be restored is as good as no backup at all.
Security hardening must start from day one. Close unnecessary ports, use SSH keys instead of passwords, set up Fail2ban to block brute-force attempts, apply security patches regularly, and run vulnerability scans at least monthly. Follow the principle of least privilege: grant only the minimum permissions required.
What is PagerDuty?
PagerDuty is an incident management platform. It receives alerts from monitoring tools such as Prometheus or Datadog, notifies the on-call engineer via SMS, phone call, push, or email, escalates incidents that are not acknowledged in time, and tracks the incident timeline, postmortems, and analytics such as MTTA and MTTR.
What is a Pod Scheduling problem?
A Kubernetes pod that cannot be scheduled onto a node or keeps failing after it is: insufficient CPU or memory, affinity/taint/toleration mismatches, a Pending PersistentVolume, priority or quota limits, a NotReady node, or failure statuses such as ImagePullBackOff, CrashLoopBackOff, and OOMKilled.
How do I configure alerts?
Create a PagerDuty service, copy its integration key into the Alertmanager receiver configuration, and write Prometheus alert rules with severity labels. Route critical alerts to the critical receiver with a short repeat interval (for example, 5 minutes) and route warnings to a lower-urgency receiver.
How do I set up auto-remediation?
Use a tool such as Rundeck to run remediation scripts: scale nodes with the Cluster Autoscaler, delete a stuck pod so its ReplicaSet recreates it, drain an unhealthy node, clean up old images and logs, or renew expiring certificates. Document every step in a runbook so the on-call engineer can also perform it manually.
Summary
PagerDuty turns Kubernetes pod scheduling problems into actionable incidents: Prometheus and Alertmanager detect and route alerts, on-call schedules and escalation policies guarantee a timely response, auto-remediation and runbooks reduce MTTA and MTTR, and postmortems keep production reliability improving.
