SiamCafe.net Blog

PagerDuty Incident Pod Scheduling

2025-07-03 · อ. บอม — SiamCafe.net · 10,714 words



Pod Status           | Cause                      | PagerDuty Severity | Auto-remediation            | SLA
Pending (Resource)   | CPU/Memory insufficient    | High               | Cluster Autoscaler          | 15 min
Pending (Scheduling) | Affinity/Taint mismatch    | High               | Fix labels or tolerations   | 30 min
CrashLoopBackOff     | App crash, config error    | Critical           | Rollback deployment         | 10 min
ImagePullBackOff     | Image not found, auth fail | High               | Check registry, fix secret  | 15 min
OOMKilled            | Memory limit exceeded      | High               | Increase memory limit       | 15 min
Evicted              | Node disk pressure         | Warning            | Cleanup disk, expand PV     | 30 min
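The table above can be wired directly into PagerDuty's Events API v2. A minimal sketch in Python (the `POD_SEVERITY` mapping, pod name, and `routing_key` placeholder are illustrative; Events API severities are limited to critical/error/warning/info, so the table's High maps to `error`):

```python
# Map pod failure reasons (from the table above) to Events API v2 severities.
POD_SEVERITY = {
    "Pending-Resource": "error",      # High
    "Pending-Scheduling": "error",    # High
    "CrashLoopBackOff": "critical",
    "ImagePullBackOff": "error",
    "OOMKilled": "error",
    "Evicted": "warning",
}

def build_pd_event(pod: str, reason: str, routing_key: str = "YOUR_INTEGRATION_KEY") -> dict:
    """Build an Events API v2 payload; POST it to
    https://events.pagerduty.com/v2/enqueue to trigger the incident."""
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "dedup_key": f"{pod}/{reason}",   # dedupe repeated alerts for the same pod
        "payload": {
            "summary": f"Pod {pod} in state {reason}",
            "source": "kubernetes",
            "severity": POD_SEVERITY.get(reason, "warning"),
        },
    }

event = build_pd_event("api-7f9c", "CrashLoopBackOff")
print(event["payload"]["severity"])   # critical
```

Using a `dedup_key` keeps a flapping pod from opening a new incident on every evaluation; subsequent triggers with the same key attach to the open incident.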

Alert Configuration

# === Prometheus + PagerDuty Setup ===

# Alertmanager config (alertmanager.yml)
# global:
#   resolve_timeout: 5m
# route:
#   receiver: 'pagerduty-critical'
#   routes:
#     - match:
#         severity: critical
#       receiver: 'pagerduty-critical'
#       repeat_interval: 5m
#     - match:
#         severity: high
#       receiver: 'pagerduty-critical'
#       repeat_interval: 15m
#     - match:
#         severity: warning
#       receiver: 'pagerduty-warning'
#       repeat_interval: 30m
# receivers:
#   - name: 'pagerduty-critical'
#     pagerduty_configs:
#       - routing_key: 'YOUR_INTEGRATION_KEY'
#         severity: critical
#   - name: 'pagerduty-warning'
#     pagerduty_configs:
#       - routing_key: 'YOUR_INTEGRATION_KEY'
#         severity: warning

# Prometheus Alert Rules (pod-alerts.yml)
# groups:
#   - name: kubernetes-pods
#     rules:
#       - alert: PodPending
#         expr: kube_pod_status_phase{phase="Pending"} > 0
#         for: 5m
#         labels: { severity: high }
#         annotations:
#           summary: "Pod {{ $labels.pod }} pending for 5m"
#       - alert: PodCrashLooping
#         expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
#         for: 5m
#         labels: { severity: critical }
#       - alert: PodOOMKilled
#         expr: kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} > 0
#         labels: { severity: high }
#       - alert: NodeNotReady
#         expr: kube_node_status_condition{condition="Ready", status="true"} == 0
#         for: 2m
#         labels: { severity: critical }

from dataclasses import dataclass

@dataclass
class AlertRule:
    alert: str
    expr: str
    duration: str
    severity: str
    action: str

alerts = [
    AlertRule("PodPending", "kube_pod_status_phase{phase='Pending'} > 0",
        "5m", "high", "Check resources, node capacity, scheduling constraints"),
    AlertRule("PodCrashLooping", "rate(kube_pod_container_status_restarts_total[15m]) > 0",
        "5m", "critical", "Check logs, rollback if recent deploy"),
    AlertRule("PodOOMKilled", "kube_pod_container_status_last_terminated_reason{reason='OOMKilled'} > 0",
        "0m", "high", "Increase memory limits, check for memory leaks"),
    AlertRule("NodeNotReady", "kube_node_status_condition{condition='Ready', status='true'} == 0",
        "2m", "critical", "Drain node, investigate kubelet, check hardware"),
    AlertRule("HighPodRestarts", "increase(kube_pod_container_status_restarts_total[1h]) > 5",
        "10m", "warning", "Check app health, config, dependencies"),
    AlertRule("PVCPending", "kube_persistentvolumeclaim_status_phase{phase='Pending'} > 0",
        "5m", "high", "Check StorageClass, PV availability"),
]

print("=== Alert Rules ===")
for a in alerts:
    print(f"  [{a.severity.upper()}] {a.alert}")
    print(f"    Expr: {a.expr} | For: {a.duration}")
    print(f"    Action: {a.action}")
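Rules kept as `AlertRule` objects can also be rendered back into a Prometheus rule file instead of being maintained by hand. A minimal sketch using plain string formatting (the `render_rules` helper and the demo rule are illustrative, not part of any library):

```python
from dataclasses import dataclass

@dataclass
class AlertRule:          # same shape as the AlertRule above
    alert: str
    expr: str
    duration: str
    severity: str
    action: str

def render_rules(rules, group: str = "kubernetes-pods") -> str:
    """Render AlertRule objects as a Prometheus rule-file snippet."""
    lines = ["groups:", f"  - name: {group}", "    rules:"]
    for r in rules:
        lines.append(f"      - alert: {r.alert}")
        lines.append(f"        expr: {r.expr}")
        if r.duration != "0m":              # omit the default 'for: 0m'
            lines.append(f"        for: {r.duration}")
        lines.append(f"        labels: {{ severity: {r.severity} }}")
        lines.append("        annotations:")
        lines.append(f'          summary: "{r.action}"')
    return "\n".join(lines)

demo = [AlertRule("PodPending", "kube_pod_status_phase{phase='Pending'} > 0",
                  "5m", "high", "Check resources, node capacity, scheduling constraints")]
print(render_rules(demo))
```

Generating the file from one source of truth keeps the runbook actions (the `action` field) and the deployed rules from drifting apart.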

On-call and Escalation

# === On-call Management ===

@dataclass
class EscalationLevel:
    level: int
    who: str
    timeout_min: int
    notification: str
    action: str

escalation = [
    EscalationLevel(1, "Primary On-call Engineer",
        5, "SMS + Phone Call + Push",
        "Acknowledge, start diagnosis, follow runbook"),
    EscalationLevel(2, "Secondary On-call Engineer",
        10, "SMS + Phone Call",
        "Take over if primary unavailable"),
    EscalationLevel(3, "Team Lead / Engineering Manager",
        15, "SMS + Phone Call + Email",
        "Coordinate response, decide escalation"),
    EscalationLevel(4, "VP Engineering / CTO",
        30, "Phone Call",
        "Major incident, customer impact, executive decision"),
]

print("=== Escalation Policy ===")
for e in escalation:
    print(f"  Level {e.level}: {e.who}")
    print(f"    Timeout: {e.timeout_min} min | Notify: {e.notification}")
    print(f"    Action: {e.action}")
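The cumulative timeouts above determine who is being paged at any moment. A minimal sketch (roster names and policy tuples are hypothetical) that resolves the active escalation level from the minutes an alert has gone unacknowledged:

```python
# (who, timeout_min) pairs mirroring the escalation policy above (hypothetical sample).
POLICY = [("Primary On-call", 5), ("Secondary On-call", 10),
          ("Team Lead", 15), ("VP Engineering", 30)]

def level_at(minutes_unacked: int, policy=POLICY) -> str:
    """Who is being paged after the alert has gone unacknowledged this long.
    Timeouts are cumulative: level N+1 fires only after level N's timeout expires."""
    elapsed = 0
    for who, timeout in policy:
        elapsed += timeout
        if minutes_unacked < elapsed:
            return who
    return policy[-1][0]   # past all timeouts: stay at the top level

print(level_at(3))    # Primary On-call
print(level_at(12))   # Secondary On-call
print(level_at(45))   # VP Engineering
```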

# On-call schedule
@dataclass
class OnCallSchedule:
    rotation: str
    schedule: str
    handoff: str
    override: str

schedules = [
    OnCallSchedule("Primary", "Weekly rotation (Mon 09:00 - Mon 09:00)",
        "30 min handoff meeting, review open incidents",
        "PagerDuty override for vacation/sick"),
    OnCallSchedule("Secondary", "Weekly rotation (offset by 1 week)",
        "Backup for primary, same handoff process",
        "Auto-assign if primary overridden"),
    OnCallSchedule("Weekend", "Separate weekend rotation (Fri 18:00 - Mon 09:00)",
        "Friday EOD briefing on current issues",
        "Volunteers first, then round-robin"),
]

print("\n=== On-call Schedules ===")
for s in schedules:
    print(f"  [{s.rotation}] {s.schedule}")
    print(f"    Handoff: {s.handoff}")
    print(f"    Override: {s.override}")
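A weekly rotation like this can be computed rather than maintained by hand. A minimal sketch using only the standard library (the roster and the epoch Monday are hypothetical; `offset_weeks=1` gives the secondary rotation that trails by one week):

```python
from datetime import datetime, timedelta

ENGINEERS = ["alice", "bob", "carol"]      # hypothetical roster
EPOCH = datetime(2025, 6, 30, 9, 0)        # a Monday 09:00 handoff

def on_call(now: datetime, roster=ENGINEERS, epoch=EPOCH, offset_weeks: int = 0) -> str:
    """Primary on-call engineer for the given moment; rotation flips weekly
    at the Monday 09:00 handoff."""
    weeks = (now - epoch) // timedelta(weeks=1)
    return roster[(weeks + offset_weeks) % len(roster)]

print(on_call(datetime(2025, 7, 3, 12, 0)))                  # alice (week 0)
print(on_call(datetime(2025, 7, 8, 12, 0)))                  # bob   (week 1)
print(on_call(datetime(2025, 7, 3, 12, 0), offset_weeks=1))  # bob
```

PagerDuty overrides still apply on top of this for vacation and sick cover, as noted in the schedule table.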

Runbooks and Metrics

# === Incident Metrics ===

@dataclass
class IncidentMetric:
    metric: str
    target: str
    calculation: str
    improve: str

metrics = [
    IncidentMetric("MTTA (Mean Time to Acknowledge)",
        "< 5 minutes",
        "Average time from alert to acknowledgment",
        "Better on-call tools, clear notification, reduce noise"),
    IncidentMetric("MTTR (Mean Time to Resolve)",
        "< 30 minutes (P1), < 4 hours (P2)",
        "Average time from alert to resolution",
        "Better runbooks, auto-remediation, training"),
    IncidentMetric("Incident Count",
        "Decreasing trend",
        "Number of incidents per week/month",
        "Fix root causes, improve reliability"),
    IncidentMetric("Escalation Rate",
        "< 10%",
        "% of incidents escalated beyond L1",
        "Better training, clearer runbooks"),
    IncidentMetric("False Positive Rate",
        "< 5%",
        "% of alerts that are not real incidents",
        "Tune alert thresholds, better conditions"),
    IncidentMetric("Postmortem Completion",
        "100% for P1/P2",
        "% of major incidents with completed postmortem",
        "Process enforcement, templates"),
]

print("=== Incident Metrics ===")
for m in metrics:
    print(f"  [{m.metric}] Target: {m.target}")
    print(f"    Calculation: {m.calculation}")
    print(f"    Improve: {m.improve}")
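MTTA and MTTR fall straight out of incident timestamps. A minimal sketch (the incident records are made up for illustration):

```python
from datetime import datetime

# Hypothetical incident records: (triggered, acknowledged, resolved).
incidents = [
    (datetime(2025, 7, 1, 10, 0), datetime(2025, 7, 1, 10, 3), datetime(2025, 7, 1, 10, 25)),
    (datetime(2025, 7, 2, 14, 0), datetime(2025, 7, 2, 14, 7), datetime(2025, 7, 2, 14, 40)),
]

def mtta_mttr(rows):
    """Mean time to acknowledge / resolve, in minutes."""
    mtta = sum((ack - trig).total_seconds() for trig, ack, _ in rows) / len(rows) / 60
    mttr = sum((res - trig).total_seconds() for trig, _, res in rows) / len(rows) / 60
    return mtta, mttr

mtta, mttr = mtta_mttr(incidents)
print(f"MTTA: {mtta:.1f} min, MTTR: {mttr:.1f} min")  # MTTA: 5.0 min, MTTR: 32.5 min
```

Against the targets in the table, this sample team meets the MTTA target (< 5 minutes is borderline) but misses the 30-minute P1 MTTR target.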

Tips

Maintaining Systems in a Production Environment

Good production operations require comprehensive monitoring. Use tools such as Prometheus + Grafana for metrics collection and dashboards, or the ELK Stack for log management, and configure alerts to fire when CPU exceeds 80%, RAM is nearly full, or disk usage is high.
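The threshold advice above can be checked in a few lines. A minimal sketch (the threshold values and metric names are illustrative):

```python
# Hypothetical alert thresholds matching the monitoring advice above.
THRESHOLDS = {"cpu_percent": 80, "ram_percent": 90, "disk_percent": 85}

def breached(metrics: dict) -> list:
    """Return the names of metrics that crossed their alert threshold."""
    return [name for name, limit in THRESHOLDS.items()
            if metrics.get(name, 0) > limit]

print(breached({"cpu_percent": 92, "ram_percent": 70, "disk_percent": 88}))
# ['cpu_percent', 'disk_percent']
```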

Plan your backup strategy carefully. Follow the 3-2-1 rule: keep at least 3 copies of your backups, on 2 different types of storage, with 1 copy off-site. Test restores regularly, at least once a month, because a backup that cannot be restored is as good as no backup at all.

Security hardening must start on day one: close unnecessary ports, use SSH keys instead of passwords, set up Fail2ban to block brute-force attempts, apply security patches consistently, and run vulnerability scans at least monthly. Apply the Principle of Least Privilege, granting only the minimum permissions required.

What is PagerDuty?

PagerDuty is an incident management platform: it ingests alerts from monitoring tools such as Prometheus and Datadog, notifies on-call engineers by SMS, phone call, and email, and escalates incidents that go unacknowledged. It also keeps an incident timeline, supports postmortems, and reports analytics such as MTTA and MTTR.

What is a Pod Scheduling Problem?

A pod scheduling problem occurs when Kubernetes cannot place a pod on a node: insufficient CPU or memory, affinity rules or taints without matching tolerations, unavailable PVs, or priority and quota limits. Related failure modes include NotReady nodes, ImagePullBackOff, CrashLoopBackOff, and OOMKilled.

How Do You Configure Alerts?

Create a PagerDuty Service and copy its Integration Key, then point Alertmanager at it. Route Prometheus alert rules by severity (Critical, High, Low) to the appropriate receiver, with escalation policies and a short repeat interval (around 5 minutes) for critical alerts.

How Does Auto-remediation Work?

Trigger remediation scripts (for example via Rundeck) from alerts: scale nodes with the Cluster Autoscaler, delete a stuck pod so its ReplicaSet recreates it, drain an unhealthy node, clean up old images and logs, or renew expiring certificates. Document each step in a runbook so the on-call engineer can follow it or automate it safely.
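The remediation actions listed above can be encoded as an alert-to-command lookup. A minimal sketch (the kubectl invocations are illustrative and should be reviewed before any automated execution):

```python
# Map alert names to remediation commands (illustrative kubectl invocations;
# a real setup would run these via Rundeck or a controller, not blindly).
REMEDIATION = {
    "PodCrashLooping": "kubectl rollout undo deployment/{name} -n {ns}",
    "PodOOMKilled":    "kubectl set resources deployment/{name} -n {ns} --limits=memory=1Gi",
    "NodeNotReady":    "kubectl drain {name} --ignore-daemonsets --delete-emptydir-data",
    "PodPending":      "kubectl describe pod {name} -n {ns}",   # diagnose first
}

def remediation_for(alert: str, name: str, ns: str = "default") -> str:
    """Look up and fill in the remediation command for an alert, or point
    the on-call engineer at the runbook when there is no automation."""
    cmd = REMEDIATION.get(alert)
    return cmd.format(name=name, ns=ns) if cmd else f"# no automation for {alert}; follow runbook"

print(remediation_for("PodCrashLooping", "api", "prod"))
# kubectl rollout undo deployment/api -n prod
```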

Summary

Managing Kubernetes pod scheduling incidents with PagerDuty combines several pieces: Prometheus and Alertmanager rules routed to PagerDuty, on-call schedules with escalation policies, auto-remediation scripts, and runbooks. Track MTTA and MTTR, and complete a postmortem for every major incident to keep improving production reliability.

📖 Related Articles

PagerDuty Incident Chaos Engineering
PagerDuty Incident Message Queue Design
PagerDuty Incident Batch Processing Pipeline
PagerDuty Incident Freelance IT Career
Fivetran Connector Pod Scheduling
