# PagerDuty + IDP
| Severity | Response SLA | Escalation | Notification | Example |
|---|---|---|---|---|
| P1 Critical | 5 min acknowledge | All levels immediately | Phone + SMS + Push | Service down, data loss |
| P2 High | 15 min acknowledge | Level 1 → Level 2 | Phone + Push | Degraded performance |
| P3 Medium | 1 hour acknowledge | Level 1 only | Push + Slack | Non-critical error spike |
| P4 Low | Next business day | No escalation | Slack + Email | Warning, tech debt |
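The matrix above can be encoded as a small lookup table, for example in custom routing or reporting scripts. This is an illustrative sketch, not a PagerDuty API; the names and structure are invented here:

```python
# Hypothetical sketch: encode the severity/escalation matrix above as data.
SEVERITY_POLICY = {
    "P1": {"ack_sla_min": 5, "escalation": "all_levels", "channels": ["phone", "sms", "push"]},
    "P2": {"ack_sla_min": 15, "escalation": "l1_then_l2", "channels": ["phone", "push"]},
    "P3": {"ack_sla_min": 60, "escalation": "l1_only", "channels": ["push", "slack"]},
    "P4": {"ack_sla_min": None, "escalation": "none", "channels": ["slack", "email"]},
}

def notification_channels(severity: str) -> list[str]:
    """Return the notification channels configured for a severity level."""
    return SEVERITY_POLICY[severity]["channels"]
```

Keeping the policy as data (rather than if/else chains) makes it easy to audit against the table and to change SLAs in one place.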
## Service Catalog Integration
```yaml
# PagerDuty + Backstage integration: catalog-info.yaml
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: payment-service
  annotations:
    pagerduty.com/service-id: PXXXXXX
    pagerduty.com/integration-key: xxxxxxxx
  tags: [python, fastapi, payments]
spec:
  type: service
  lifecycle: production
  owner: team-payments
  system: e-commerce
  dependsOn:
    - component:database-service
    - component:notification-service
```
```bash
# PagerDuty service setup via the REST API
curl -X POST https://api.pagerduty.com/services \
  -H "Authorization: Token token=YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "service": {
      "name": "Payment Service",
      "escalation_policy": { "id": "PXXXXXX" },
      "alert_creation": "create_alerts_and_incidents",
      "auto_resolve_timeout": 14400,
      "acknowledgement_timeout": 600
    }
  }'
```
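The same service setup can be driven from Python. A sketch using only the standard library, mirroring the curl call above (`YOUR_API_KEY` and the escalation policy ID remain placeholders):

```python
# Sketch: build the same service-creation request programmatically.
# The request is built but not sent, so it can be inspected or tested offline.
import json
import urllib.request

def build_service_request(api_key: str, escalation_policy_id: str) -> urllib.request.Request:
    """Build (but do not send) the POST request for PagerDuty service creation."""
    body = {
        "service": {
            "name": "Payment Service",
            "escalation_policy": {"id": escalation_policy_id},
            "alert_creation": "create_alerts_and_incidents",
            "auto_resolve_timeout": 14400,      # 4 hours
            "acknowledgement_timeout": 600,     # 10 minutes
        }
    }
    return urllib.request.Request(
        "https://api.pagerduty.com/services",
        data=json.dumps(body).encode(),
        headers={
            "Authorization": f"Token token={api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# To actually create the service:
# urllib.request.urlopen(build_service_request("YOUR_API_KEY", "PXXXXXX"))
```

Separating request construction from sending keeps the payload unit-testable without network access.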
```python
from dataclasses import dataclass

@dataclass
class ServiceEntry:
    service: str
    team: str
    tier: str
    on_call: str
    escalation: str
    runbook: str

services = [
    ServiceEntry("payment-service", "team-payments", "Tier 1",
                 "Weekly rotation (4 engineers)", "P1: All levels | P2: L1→L2",
                 "runbooks/payment-service.md"),
    ServiceEntry("user-service", "team-identity", "Tier 1",
                 "Weekly rotation (3 engineers)", "P1: All levels | P2: L1→L2",
                 "runbooks/user-service.md"),
    ServiceEntry("search-service", "team-discovery", "Tier 2",
                 "Weekly rotation (3 engineers)", "P1: L1→L2 | P2: L1",
                 "runbooks/search-service.md"),
    ServiceEntry("notification-service", "team-comms", "Tier 2",
                 "Bi-weekly rotation", "P1: L1→L2 | P2: L1",
                 "runbooks/notification-service.md"),
    ServiceEntry("analytics-pipeline", "team-data", "Tier 3",
                 "Business hours only", "P1: L1 | P2: Next day",
                 "runbooks/analytics.md"),
]

print("=== Service Catalog ===")
for s in services:
    print(f"  [{s.service}] Team: {s.team} | Tier: {s.tier}")
    print(f"    On-call: {s.on_call}")
    print(f"    Escalation: {s.escalation}")
    print(f"    Runbook: {s.runbook}")
```
## Event Orchestration
```hcl
# Terraform (PagerDuty provider): route events to the right service.
# Router rules can only route; severity, suppression, and auto-remediation
# live in the service-level orchestration below.
resource "pagerduty_event_orchestration_router" "main" {
  event_orchestration = pagerduty_event_orchestration.main.id

  set {
    id = "start"

    rule {
      label = "Route payment alerts"
      condition {
        expression = "event.custom_details.service == 'payment'"
      }
      actions {
        route_to = pagerduty_service.payment.id
      }
    }
  }

  catch_all {
    actions {
      route_to = pagerduty_service.default.id
    }
  }
}

# Service-level rules: severity, suppression, and auto-remediation with Rundeck
resource "pagerduty_event_orchestration_service" "payment" {
  service = pagerduty_service.payment.id

  set {
    id = "start"

    rule {
      label = "Payment alerts are critical"
      condition {
        expression = "event.custom_details.service == 'payment'"
      }
      actions {
        severity = "critical"
      }
    }

    rule {
      label = "Suppress known flaky alerts"
      condition {
        expression = "event.summary matches 'DNS timeout' and event.custom_details.env == 'staging'"
      }
      actions {
        suppress = true
      }
    }

    rule {
      label = "Auto-restart on OOM"
      condition {
        expression = "event.summary matches 'OOMKilled'"
      }
      actions {
        automation_action {
          name      = "Restart Pod"
          url       = "https://rundeck.internal/api/job/restart-pod"
          auto_send = true
        }
      }
    }
  }

  catch_all {
    actions {}
  }
}
```
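The orchestration rules match on fields of events sent to the PagerDuty Events API v2 (`https://events.pagerduty.com/v2/enqueue`). A sketch of building a payload that the payment-routing rule above would match; the routing key and details are placeholders:

```python
# Sketch: construct an Events API v2 payload (built, not sent).
def build_trigger_event(routing_key: str, summary: str, service: str, env: str) -> dict:
    """Payload for POST https://events.pagerduty.com/v2/enqueue."""
    return {
        "routing_key": routing_key,          # integration key of the service
        "event_action": "trigger",           # trigger | acknowledge | resolve
        "payload": {
            "summary": summary,
            "source": f"{service}.{env}",
            "severity": "critical",
            # The router expression matches on these fields:
            "custom_details": {"service": service, "env": env},
        },
    }

event = build_trigger_event("xxxxxxxx", "OOMKilled: payment pod", "payment", "production")
# The "Route payment alerts" rule matches: custom_details.service == 'payment'
```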
```python
from dataclasses import dataclass

@dataclass
class OrchestrationRule:
    rule: str
    condition: str
    action: str
    result: str

rules = [
    OrchestrationRule("Route by service", "event.service == 'payment'",
                      "Route to payment-service PD service", "Correct team gets alert"),
    OrchestrationRule("Suppress staging", "env == 'staging' AND known_flaky",
                      "Suppress alert", "Reduce noise 40%"),
    OrchestrationRule("Auto-restart OOM", "event matches 'OOMKilled'",
                      "Trigger Rundeck restart-pod job", "Auto-fix 80% of OOM incidents"),
    OrchestrationRule("Escalate payment P1", "service == 'payment' AND severity == 'critical'",
                      "Page all levels + create war room", "Fastest response for critical"),
    OrchestrationRule("Deduplicate alerts", "Same alert within 5 min",
                      "Merge into single incident", "Reduce alert fatigue"),
    OrchestrationRule("Enrich with context", "Any alert",
                      "Add runbook URL, recent deploys, dashboard link", "Faster diagnosis"),
]

print("\n=== Orchestration Rules ===")
for r in rules:
    print(f"  [{r.rule}] Condition: {r.condition}")
    print(f"    Action: {r.action}")
    print(f"    Result: {r.result}")
```
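The deduplication rule ("same alert within 5 min") can be sketched as a small window-based filter. This is illustrative only; PagerDuty performs deduplication server-side via dedup keys:

```python
# Sketch: suppress repeats of the same alert key within a 5-minute window.
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)

def deduplicate(events: list[tuple[str, datetime]]) -> list[tuple[str, datetime]]:
    """Keep an event only if the same key was not seen within WINDOW."""
    last_seen: dict[str, datetime] = {}
    kept = []
    for key, ts in events:
        prev = last_seen.get(key)
        if prev is None or ts - prev > WINDOW:
            kept.append((key, ts))
        last_seen[key] = ts  # repeats extend the window
    return kept
```

Note the design choice that a suppressed repeat still refreshes `last_seen`, so a continuously flapping alert stays collapsed until it goes quiet for the full window.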
## Metrics and Improvement
```python
from dataclasses import dataclass

# === Incident Metrics Dashboard ===
@dataclass
class IncidentMetric:
    metric: str
    current: str
    target: str
    trend: str
    action: str

metrics = [
    IncidentMetric("MTTA (Mean Time to Acknowledge)", "3.2 min", "< 5 min", "↓ improving",
                   "Maintain current on-call practices"),
    IncidentMetric("MTTR (Mean Time to Resolve)", "42 min", "< 30 min", "→ stable",
                   "Improve runbooks, add auto-remediation"),
    IncidentMetric("Incidents per week", "12", "< 8", "↓ improving",
                   "Fix recurring issues, reduce noise"),
    IncidentMetric("P1 incidents per month", "2.5", "< 1", "→ stable",
                   "Invest in reliability, chaos engineering"),
    IncidentMetric("Alert noise ratio", "35%", "< 20%", "↓ improving",
                   "Tune thresholds, suppress flaky alerts"),
    IncidentMetric("On-call interruptions (off-hours)", "8/week", "< 4/week", "↓ improving",
                   "Auto-remediation, better alerting"),
    IncidentMetric("Post-mortem completion rate", "85%", "100% for P1/P2", "↑ improving",
                   "Enforce post-mortem policy"),
]

print("Incident Metrics:")
for m in metrics:
    print(f"  [{m.metric}] Current: {m.current} | Target: {m.target} | Trend: {m.trend}")
    print(f"    Action: {m.action}")

# On-call health
health = {
    "Interruption Score": "3.2/10 (good) — target < 5",
    "Sleep Impact": "Low — 90% of pages during business hours",
    "Coverage": "100% — no gaps in schedule",
    "Handoff Quality": "Good — handoff notes updated 95% of time",
    "Burnout Risk": "Low — rotation every week, 4 engineers per team",
}

print("\nOn-call Health:")
for k, v in health.items():
    print(f"  [{k}]: {v}")
```
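MTTA and MTTR in the dashboard above are simple averages over incident timestamps. A minimal sketch with illustrative data:

```python
# Sketch: compute MTTA and MTTR from (triggered, acknowledged, resolved) timestamps.
from datetime import datetime

incidents = [
    # (triggered, acknowledged, resolved) — illustrative data
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 3), datetime(2024, 1, 1, 10, 40)),
    (datetime(2024, 1, 2, 14, 0), datetime(2024, 1, 2, 14, 4), datetime(2024, 1, 2, 14, 44)),
]

def mean_minutes(deltas) -> float:
    """Average a list of timedeltas, in minutes."""
    return sum(d.total_seconds() for d in deltas) / len(deltas) / 60

mtta = mean_minutes([ack - trig for trig, ack, _ in incidents])   # 3.5 min
mttr = mean_minutes([res - trig for trig, _, res in incidents])   # 42.0 min
print(f"MTTA: {mtta:.1f} min, MTTR: {mttr:.1f} min")
```

Both are measured from the trigger time, so a slow acknowledgement inflates MTTR as well as MTTA.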
## Tips
- Noise: reduce alert noise with Event Orchestration suppress and deduplicate rules
- Runbook: every service must have its runbook link attached automatically to each incident
- Auto-remediation: start with common issues such as restarting a pod or scaling up
- Post-mortem: require a post-mortem for every P1/P2 and share it with the whole team
- On-call: watch on-call health to prevent burnout; rotate the schedule consistently
## Best Practices for Developers
Good code does more than make the program run: it must be readable, maintainable, and able to scale. The SOLID principles are fundamentals every developer should understand: Single Responsibility (each class does one thing), Open-Closed (open for extension, closed for modification), Liskov Substitution (a subclass must be usable in place of its parent), Interface Segregation (keep interfaces small and focused), and Dependency Inversion (depend on abstractions, not implementations).
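A minimal sketch showing Single Responsibility and Dependency Inversion together; the class names here are illustrative, not part of any framework:

```python
# Sketch: the service depends on the Notifier abstraction, not on Slack directly.
from abc import ABC, abstractmethod

class Notifier(ABC):
    @abstractmethod
    def send(self, message: str) -> None: ...

class SlackNotifier(Notifier):
    def send(self, message: str) -> None:
        print(f"[slack] {message}")

class PaymentService:
    """Single responsibility: process payments; notification is delegated."""
    def __init__(self, notifier: Notifier):
        self.notifier = notifier

    def process(self, amount: float) -> bool:
        # ... payment logic would go here ...
        self.notifier.send(f"Processed payment of {amount:.2f}")
        return True
```

Because `PaymentService` only sees the `Notifier` interface, swapping Slack for email (or a test double) requires no change to the service itself.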
Testing is equally essential. Write unit tests covering at least 80% of the code base, use integration tests to verify that modules work together, and add E2E tests for critical user flows. Popular tools such as Jest, Pytest, and JUnit make writing tests straightforward.
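A sketch of a unit test as Pytest would collect it; the function under test is illustrative:

```python
# Sketch: a small pure function plus a pytest-style test for it.
def apply_discount(price: float, percent: float) -> float:
    """Return the price after a percentage discount."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return price * (1 - percent / 100)

def test_apply_discount():
    # pytest collects functions named test_* and reports failing asserts
    assert apply_discount(100.0, 10) == 90.0
    assert apply_discount(100.0, 0) == 100.0
```

Keeping business logic in pure functions like this is what makes the 80% unit-coverage target cheap to reach: no mocks or fixtures are needed.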
For version control with Git, choose a branch strategy that fits the team: Git Flow for large projects, or trunk-based development for teams that deploy frequently. Review every pull request, and use a CI/CD pipeline for automated testing and deployment.
## What is PagerDuty?
PagerDuty is an incident management platform. It receives alerts from monitoring tools such as Prometheus and Datadog, notifies the on-call engineer by phone, SMS, or push, applies escalation policies, and maintains a service catalog, incident timelines, and post-mortem templates.
## What is an Internal Developer Platform?
An IDP gives developers self-service access to everything they need: a service catalog, infrastructure, CI/CD, monitoring, incident management, and documentation. Common tools include Backstage, Port, Humanitec, and Cortex.
## How do you set up an escalation policy?
Run a weekly on-call schedule: Level 1 is the on-call engineer (5 minutes to respond), Level 2 the team lead (15 minutes), Level 3 the manager (30 minutes). P1 pages all levels, P2 escalates L1→L2, P3 notifies Slack, and P4 becomes a ticket.
## What can automation do?
Auto-acknowledge and auto-resolve incidents, auto-remediate (restart pods, scale up, clear caches), create JIRA tickets, update the status page, notify stakeholders in Slack, draft post-mortems, and apply Event Orchestration rules.
## Summary
PagerDuty combined with an Internal Developer Platform gives teams a service catalog, escalation policies, on-call automation, Event Orchestration, runbooks, and incident metrics (MTTA, MTTR, post-mortems) for reliable production operations.
