PagerDuty Pod Scheduling
| Pod Status | Cause | PagerDuty Severity | Auto-remediation | SLA |
|---|---|---|---|---|
| Pending (Resource) | CPU/Memory insufficient | High | Cluster Autoscaler | 15 min |
| Pending (Scheduling) | Affinity/Taint mismatch | High | Fix labels or tolerations | 30 min |
| CrashLoopBackOff | App crash, config error | Critical | Rollback deployment | 10 min |
| ImagePullBackOff | Image not found, auth fail | High | Check registry, fix secret | 15 min |
| OOMKilled | Memory limit exceeded | High | Increase memory limit | 15 min |
| Evicted | Node disk pressure | Warning | Cleanup disk, expand PV | 30 min |
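The status-to-remediation mapping above can be sketched as a small triage helper. A minimal sketch; the dictionary keys and the `triage` function are hypothetical names, and the fallback for unknown statuses is an assumption:

```python
# Hypothetical triage table mirroring the mapping above:
# pod status -> (PagerDuty severity, first remediation step).
POD_TRIAGE = {
    "Pending-Resource": ("high", "Cluster Autoscaler"),
    "Pending-Scheduling": ("high", "Fix labels or tolerations"),
    "CrashLoopBackOff": ("critical", "Rollback deployment"),
    "ImagePullBackOff": ("high", "Check registry, fix secret"),
    "OOMKilled": ("high", "Increase memory limit"),
    "Evicted": ("warning", "Cleanup disk, expand PV"),
}

def triage(status: str) -> tuple[str, str]:
    """Return (severity, remediation) for a pod status.

    Unknown statuses fall back to a low-urgency manual review
    (an assumption, not part of the table above)."""
    return POD_TRIAGE.get(status, ("warning", "Manual review"))

print(triage("CrashLoopBackOff"))  # -> ('critical', 'Rollback deployment')
```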
Alert Configuration
# === Prometheus + PagerDuty Setup ===
# Alertmanager config (alertmanager.yml)
# global:
#   resolve_timeout: 5m
# route:
#   receiver: 'pagerduty-critical'
#   routes:
#     - match:
#         severity: critical
#       receiver: 'pagerduty-critical'
#       repeat_interval: 5m
#     - match:
#         severity: warning
#       receiver: 'pagerduty-warning'
#       repeat_interval: 30m
# receivers:
#   - name: 'pagerduty-critical'
#     pagerduty_configs:
#       - routing_key: 'YOUR_INTEGRATION_KEY'
#         severity: critical
#   - name: 'pagerduty-warning'
#     pagerduty_configs:
#       - routing_key: 'YOUR_INTEGRATION_KEY'
#         severity: warning
# Prometheus Alert Rules (pod-alerts.yml)
# groups:
#   - name: kubernetes-pods
#     rules:
#       - alert: PodPending
#         expr: kube_pod_status_phase{phase="Pending"} > 0
#         for: 5m
#         labels: { severity: high }
#         annotations:
#           summary: "Pod {{ $labels.pod }} pending for 5m"
#       - alert: PodCrashLooping
#         expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
#         for: 5m
#         labels: { severity: critical }
#       - alert: PodOOMKilled
#         expr: kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} > 0
#         labels: { severity: high }
#       - alert: NodeNotReady
#         expr: kube_node_status_condition{condition="Ready", status="true"} == 0
#         for: 2m
#         labels: { severity: critical }
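Alertmanager handles delivery, but PagerDuty can also receive events directly via its Events API v2, which is useful for testing an integration key before wiring up Alertmanager. A minimal sketch; note the Events API accepts only the severities `critical`, `error`, `warning`, and `info`, and the sample summary text is made up:

```python
import json
import urllib.request

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"

def build_event(routing_key: str, summary: str, source: str,
                severity: str = "critical") -> dict:
    """Build a PagerDuty Events API v2 'trigger' payload."""
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {"summary": summary, "source": source, "severity": severity},
    }

def send_event(event: dict) -> None:
    """POST the event to PagerDuty (requires a real integration key)."""
    req = urllib.request.Request(
        PAGERDUTY_EVENTS_URL,
        data=json.dumps(event).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)  # raises urllib.error.HTTPError on failure

# Build (but do not send) a sample event:
event = build_event("YOUR_INTEGRATION_KEY",
                    "Pod api-7f9c is CrashLoopBackOff", "prod-cluster")
print(json.dumps(event, indent=2))
```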
from dataclasses import dataclass

@dataclass
class AlertRule:
    alert: str
    expr: str
    duration: str
    severity: str
    action: str

alerts = [
    AlertRule("PodPending", "kube_pod_status_phase{phase='Pending'} > 0",
              "5m", "high", "Check resources, node capacity, scheduling constraints"),
    AlertRule("PodCrashLooping", "rate(kube_pod_container_status_restarts_total[15m]) > 0",
              "5m", "critical", "Check logs, rollback if recent deploy"),
    AlertRule("PodOOMKilled", "kube_pod_container_status_last_terminated_reason{reason='OOMKilled'} > 0",
              "0m", "high", "Increase memory limits, check for memory leaks"),
    AlertRule("NodeNotReady", "kube_node_status_condition{condition='Ready', status='true'} == 0",
              "2m", "critical", "Drain node, investigate kubelet, check hardware"),
    AlertRule("HighPodRestarts", "increase(kube_pod_container_status_restarts_total[1h]) > 5",
              "10m", "warning", "Check app health, config, dependencies"),
    AlertRule("PVCPending", "kube_persistentvolumeclaim_status_phase{phase='Pending'} > 0",
              "5m", "high", "Check StorageClass, PV availability"),
]

print("=== Alert Rules ===")
for a in alerts:
    print(f"  [{a.severity.upper()}] {a.alert}")
    print(f"    Expr: {a.expr} | For: {a.duration}")
    print(f"    Action: {a.action}")
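The `AlertRule` records above can also be serialized into the rule shape Prometheus loads from a rule file, so the Python inventory and `pod-alerts.yml` stay in sync. A sketch under the assumption that you would dump the resulting dict with PyYAML or similar; the `runbook_action` annotation key is a hypothetical name:

```python
from dataclasses import dataclass

@dataclass
class AlertRule:
    alert: str
    expr: str
    duration: str
    severity: str
    action: str

def to_prometheus_rule(r: AlertRule) -> dict:
    """Convert an AlertRule to the dict shape used in a Prometheus rule file."""
    rule = {
        "alert": r.alert,
        "expr": r.expr,
        "labels": {"severity": r.severity},
        "annotations": {"runbook_action": r.action},  # hypothetical annotation key
    }
    if r.duration != "0m":  # omit 'for' when the alert should fire immediately
        rule["for"] = r.duration
    return rule

rule = to_prometheus_rule(AlertRule(
    "PodPending", "kube_pod_status_phase{phase='Pending'} > 0",
    "5m", "high", "Check resources, node capacity, scheduling constraints"))
print(rule)
```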
On-call and Escalation
# === On-call Management ===
@dataclass
class EscalationLevel:
    level: int
    who: str
    timeout_min: int
    notification: str
    action: str

escalation = [
    EscalationLevel(1, "Primary On-call Engineer", 5,
                    "SMS + Phone Call + Push",
                    "Acknowledge, start diagnosis, follow runbook"),
    EscalationLevel(2, "Secondary On-call Engineer", 10,
                    "SMS + Phone Call",
                    "Take over if primary unavailable"),
    EscalationLevel(3, "Team Lead / Engineering Manager", 15,
                    "SMS + Phone Call + Email",
                    "Coordinate response, decide escalation"),
    EscalationLevel(4, "VP Engineering / CTO", 30,
                    "Phone Call",
                    "Major incident, customer impact, executive decision"),
]

print("=== Escalation Policy ===")
for e in escalation:
    print(f"  Level {e.level}: {e.who}")
    print(f"    Timeout: {e.timeout_min} min | Notify: {e.notification}")
    print(f"    Action: {e.action}")
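The per-level timeouts accumulate: if the primary does not acknowledge within 5 minutes the secondary is paged, after another 10 minutes the team lead, and so on. That logic can be sketched as a small function (a minimal sketch; `level_to_page` is a hypothetical name):

```python
def level_to_page(elapsed_min: int, timeouts: list[int]) -> int:
    """Given minutes since the alert fired and the per-level ack timeouts,
    return which escalation level should currently be paged (1-based)."""
    deadline = 0
    for level, timeout in enumerate(timeouts, start=1):
        deadline += timeout
        if elapsed_min < deadline:
            return level
    return len(timeouts)  # all timeouts exhausted: stay at the top level

timeouts = [5, 10, 15, 30]            # matches the four levels above
print(level_to_page(3, timeouts))     # 3 min: still in primary's window -> 1
print(level_to_page(12, timeouts))    # 12 min: past 5, before 5+10=15  -> 2
```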
# On-call schedule
@dataclass
class OnCallSchedule:
    rotation: str
    schedule: str
    handoff: str
    override: str

schedules = [
    OnCallSchedule("Primary", "Weekly rotation (Mon 09:00 - Mon 09:00)",
                   "30 min handoff meeting, review open incidents",
                   "PagerDuty override for vacation/sick"),
    OnCallSchedule("Secondary", "Weekly rotation (offset by 1 week)",
                   "Backup for primary, same handoff process",
                   "Auto-assign if primary overridden"),
    OnCallSchedule("Weekend", "Separate weekend rotation (Fri 18:00 - Mon 09:00)",
                   "Friday EOD briefing on current issues",
                   "Volunteers first, then round-robin"),
]

print("\n=== On-call Schedules ===")
for s in schedules:
    print(f"  [{s.rotation}] {s.schedule}")
    print(f"    Handoff: {s.handoff}")
    print(f"    Override: {s.override}")
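A weekly Mon-to-Mon rotation is easy to compute from a fixed reference Monday. A sketch with hypothetical engineer names and a hypothetical epoch date (2024-01-01 is a Monday):

```python
from datetime import date

ENGINEERS = ["alice", "bob", "carol"]  # hypothetical rotation members
EPOCH = date(2024, 1, 1)               # a Monday: rotation week 0 starts here

def on_call_for(day: date) -> str:
    """Weekly rotation: each engineer holds the pager Mon 09:00 - Mon 09:00.
    (Hour-of-day handoff is ignored in this sketch.)"""
    weeks = (day - EPOCH).days // 7
    return ENGINEERS[weeks % len(ENGINEERS)]

print(on_call_for(date(2024, 1, 3)))   # week 0 -> alice
print(on_call_for(date(2024, 1, 10)))  # week 1 -> bob
```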
Runbooks and Metrics
# === Incident Metrics ===
@dataclass
class IncidentMetric:
    metric: str
    target: str
    calculation: str
    improve: str

metrics = [
    IncidentMetric("MTTA (Mean Time to Acknowledge)",
                   "< 5 minutes",
                   "Average time from alert to acknowledgment",
                   "Better on-call tools, clear notification, reduce noise"),
    IncidentMetric("MTTR (Mean Time to Resolve)",
                   "< 30 minutes (P1), < 4 hours (P2)",
                   "Average time from alert to resolution",
                   "Better runbooks, auto-remediation, training"),
    IncidentMetric("Incident Count",
                   "Decreasing trend",
                   "Number of incidents per week/month",
                   "Fix root causes, improve reliability"),
    IncidentMetric("Escalation Rate",
                   "< 10%",
                   "% of incidents escalated beyond L1",
                   "Better training, clearer runbooks"),
    IncidentMetric("False Positive Rate",
                   "< 5%",
                   "% of alerts that are not real incidents",
                   "Tune alert thresholds, better conditions"),
    IncidentMetric("Postmortem Completion",
                   "100% for P1/P2",
                   "% of major incidents with completed postmortem",
                   "Process enforcement, templates"),
]

print("=== Incident Metrics ===")
for m in metrics:
    print(f"  [{m.metric}] Target: {m.target}")
    print(f"    Calculation: {m.calculation}")
    print(f"    Improve: {m.improve}")
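MTTA and MTTR fall out directly from incident timestamps: MTTA averages alert-to-acknowledge, MTTR averages alert-to-resolve. A minimal sketch over made-up incident records:

```python
from datetime import datetime

# Hypothetical incident records: (alerted, acknowledged, resolved) timestamps.
incidents = [
    (datetime(2024, 5, 1, 10, 0), datetime(2024, 5, 1, 10, 3), datetime(2024, 5, 1, 10, 25)),
    (datetime(2024, 5, 2, 2, 0),  datetime(2024, 5, 2, 2, 7),  datetime(2024, 5, 2, 2, 40)),
]

def mean_minutes(pairs):
    """Average gap between (start, end) timestamp pairs, in minutes."""
    return sum((end - start).total_seconds() for start, end in pairs) / len(pairs) / 60

mtta = mean_minutes([(alerted, acked) for alerted, acked, _ in incidents])
mttr = mean_minutes([(alerted, resolved) for alerted, _, resolved in incidents])
print(f"MTTA: {mtta:.1f} min")  # (3 + 7) / 2 = 5.0
print(f"MTTR: {mttr:.1f} min")  # (25 + 40) / 2 = 32.5
```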
Tips
- Noise: Reduce alert noise. Send only actionable alerts, never informational ones.
- Runbook: Write a runbook for every alert so the on-call engineer can act on it immediately.
- Auto: Use auto-remediation for problems that recur frequently and have a clear, well-understood fix.
- Postmortem: Run a blameless postmortem after every major incident.
- Training: Have new on-call engineers shadow the rotation for 1-2 weeks before joining it.
Running Systems in Production
Good production operations start with comprehensive monitoring. Use tools such as Prometheus + Grafana for metrics collection and dashboards, or the ELK Stack for log management, and configure alerts to fire when CPU exceeds 80%, RAM is nearly full, or disk usage is high.
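The threshold checks just described can be expressed in a few lines. A sketch; the 80% CPU limit comes from the text above, while the RAM and disk limits are assumed values:

```python
# Alert thresholds: CPU limit from the text above; RAM/disk limits are assumptions.
THRESHOLDS = {"cpu_percent": 80, "ram_percent": 90, "disk_percent": 85}

def breached(sample: dict) -> list[str]:
    """Return the metrics in `sample` whose value exceeds its alert threshold."""
    return [k for k, limit in THRESHOLDS.items() if sample.get(k, 0) > limit]

print(breached({"cpu_percent": 92, "ram_percent": 70, "disk_percent": 88}))
# -> ['cpu_percent', 'disk_percent']
```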
Plan your backup strategy carefully. Follow the 3-2-1 rule: keep at least 3 copies of your data, on 2 different types of storage, with 1 copy off-site. Test restores regularly, at least once a month, because a backup that cannot be restored is as good as no backup at all.
Security hardening must start from day one. Close unnecessary ports, use SSH keys instead of passwords, set up Fail2ban to block brute-force attempts, apply security patches regularly, and run vulnerability scans at least monthly. Follow the principle of least privilege: grant only the minimum permissions required.
What is PagerDuty?
PagerDuty is an incident management platform. It receives alerts from monitoring tools such as Prometheus or Datadog, notifies the on-call engineer via SMS, phone call, push, or email, escalates incidents that are not acknowledged in time, and tracks the incident timeline, postmortems, and analytics such as MTTA and MTTR.
What is a Pod Scheduling problem?
A Kubernetes pod that cannot be scheduled onto a node or keeps failing after it is: insufficient CPU or memory, affinity/taint/toleration mismatches, a Pending PersistentVolume, priority or quota limits, a NotReady node, or failure statuses such as ImagePullBackOff, CrashLoopBackOff, and OOMKilled.
How do I configure alerts?
Create a PagerDuty service, copy its integration key into the Alertmanager receiver configuration, and write Prometheus alert rules with severity labels. Route critical alerts to the critical receiver with a short repeat interval (for example, 5 minutes) and route warnings to a lower-urgency receiver.
How do I set up auto-remediation?
Use a tool such as Rundeck to run remediation scripts: scale nodes with the Cluster Autoscaler, delete a stuck pod so its ReplicaSet recreates it, drain an unhealthy node, clean up old images and logs, or renew expiring certificates. Document every step in a runbook so the on-call engineer can also perform it manually.
Summary
PagerDuty turns Kubernetes pod scheduling problems into actionable incidents: Prometheus and Alertmanager detect and route alerts, on-call schedules and escalation policies guarantee a timely response, auto-remediation and runbooks reduce MTTA and MTTR, and postmortems keep production reliability improving.
