PagerDuty Incident Internal Developer Platform

2025-12-20 · อ. บอม, SiamCafe.net · 8,525 words

PagerDuty + IDP

This article covers how to connect PagerDuty incident management to an Internal Developer Platform (IDP): the service catalog, escalation policies, on-call schedules, automation, runbooks, Backstage integration, and production operations.

Severity    | Response SLA       | Escalation             | Notification       | Example
------------|--------------------|------------------------|--------------------|--------------------------
P1 Critical | 5 min acknowledge  | All levels immediately | Phone + SMS + Push | Service down, data loss
P2 High     | 15 min acknowledge | Level 1 → Level 2      | Phone + Push       | Degraded performance
P3 Medium   | 1 hour acknowledge | Level 1 only           | Push + Slack       | Non-critical error spike
P4 Low      | Next business day  | No escalation          | Slack + Email      | Warning, tech debt
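
The severity table above can be expressed as a small lookup, which makes it easy to check whether an unacknowledged incident has passed its acknowledge SLA. This is a minimal sketch, not a PagerDuty API: the names `SeverityPolicy`, `POLICIES`, and `ack_sla_breached` are hypothetical, and P4 is modelled as having no fixed timer because "next business day" is not a fixed duration.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional, Tuple


@dataclass(frozen=True)
class SeverityPolicy:
    ack_sla: Optional[timedelta]  # None = "next business day", no fixed timer
    escalation: str
    channels: Tuple[str, ...]


# Values copied from the severity table; the structure itself is an assumption.
POLICIES = {
    "P1": SeverityPolicy(timedelta(minutes=5), "All levels immediately",
                         ("phone", "sms", "push")),
    "P2": SeverityPolicy(timedelta(minutes=15), "Level 1 -> Level 2",
                         ("phone", "push")),
    "P3": SeverityPolicy(timedelta(hours=1), "Level 1 only",
                         ("push", "slack")),
    "P4": SeverityPolicy(None, "No escalation", ("slack", "email")),
}


def ack_sla_breached(severity: str, unacknowledged_for: timedelta) -> bool:
    """Return True when an incident has waited longer than its acknowledge SLA."""
    policy = POLICIES[severity]
    if policy.ack_sla is None:
        return False
    return unacknowledged_for > policy.ack_sla


print(ack_sla_breached("P1", timedelta(minutes=7)))   # True
print(ack_sla_breached("P3", timedelta(minutes=20)))  # False
```

Keeping the policy in one data structure means the paging logic and any dashboard that reports SLA breaches read the same numbers.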

Service Catalog Integration

# === PagerDuty + Backstage Integration ===

# Backstage catalog-info.yaml
# apiVersion: backstage.io/v1alpha1
# kind: Component
# metadata:
#   name: payment-service
#   annotations:
#     pagerduty.com/service-id: PXXXXXX
#     pagerduty.com/integration-key: xxxxxxxx
#   tags: [python, fastapi, payments]
# spec:
#   type: service
#   lifecycle: production
#   owner: team-payments
#   system: e-commerce
#   dependsOn:
#     - component:database-service
#     - component:notification-service

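# PagerDuty Service Setup via API
# auto_resolve_timeout: 14400 s = 4 hours; acknowledgement_timeout: 600 s = 10 minutes
# curl -X POST https://api.pagerduty.com/services \
#   -H "Authorization: Token token=YOUR_API_KEY" \
#   -H "Accept: application/vnd.pagerduty+json;version=2" \
#   -H "Content-Type: application/json" \
#   -d '{
#     "service": {
#       "type": "service",
#       "name": "Payment Service",
#       "escalation_policy": { "id": "PXXXXXX", "type": "escalation_policy_reference" },
#       "alert_creation": "create_alerts_and_incidents",
#       "auto_resolve_timeout": 14400,
#       "acknowledgement_timeout": 600
#     }
#   }'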

from dataclasses import dataclass

@dataclass
class ServiceEntry:
    service: str
    team: str
    tier: str
    on_call: str
    escalation: str
    runbook: str

services = [
    ServiceEntry("payment-service", "team-payments", "Tier 1",
        "Weekly rotation (4 engineers)", "P1: All levels | P2: L1→L2",
        "runbooks/payment-service.md"),
    ServiceEntry("user-service", "team-identity", "Tier 1",
        "Weekly rotation (3 engineers)", "P1: All levels | P2: L1→L2",
        "runbooks/user-service.md"),
    ServiceEntry("search-service", "team-discovery", "Tier 2",
        "Weekly rotation (3 engineers)", "P1: L1→L2 | P2: L1",
        "runbooks/search-service.md"),
    ServiceEntry("notification-service", "team-comms", "Tier 2",
        "Bi-weekly rotation", "P1: L1→L2 | P2: L1",
        "runbooks/notification-service.md"),
    ServiceEntry("analytics-pipeline", "team-data", "Tier 3",
        "Business hours only", "P1: L1 | P2: Next day",
        "runbooks/analytics.md"),
]

print("=== Service Catalog ===")
for s in services:
    print(f"  [{s.service}] Team: {s.team} | Tier: {s.tier}")
    print(f"    On-call: {s.on_call}")
    print(f"    Escalation: {s.escalation}")
    print(f"    Runbook: {s.runbook}")
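
Once a service has an integration key, monitoring tools raise incidents by posting events to the PagerDuty Events API v2. The sketch below only builds the JSON body and does not send it. The P1 to P4 mapping onto the API's `critical`/`error`/`warning`/`info` severities, the `build_trigger_event` helper, the `dedup_key` scheme, and the sample summary are all assumptions made for illustration.

```python
import json

# Events API v2 endpoint; events are sent as an HTTP POST with a JSON body.
EVENTS_API_URL = "https://events.pagerduty.com/v2/enqueue"

# Assumed mapping from this article's priorities to Events API severities.
SEVERITY_MAP = {"P1": "critical", "P2": "error", "P3": "warning", "P4": "info"}


def build_trigger_event(routing_key: str, service: str, priority: str,
                        summary: str, runbook: str) -> dict:
    """Build the body of a 'trigger' event for one catalog service."""
    return {
        "routing_key": routing_key,          # the service's integration key
        "event_action": "trigger",
        "dedup_key": f"{service}:{summary}",  # hypothetical scheme: same key = same incident
        "payload": {
            "summary": summary,
            "source": service,
            "severity": SEVERITY_MAP[priority],
            "custom_details": {"service": service, "runbook": runbook},
        },
    }


event = build_trigger_event(
    routing_key="xxxxxxxx",  # placeholder, as in catalog-info.yaml
    service="payment-service",
    priority="P1",
    summary="Checkout error rate above 5%",  # hypothetical alert text
    runbook="runbooks/payment-service.md",
)
print(json.dumps(event, indent=2))
```

Putting the runbook path into `custom_details` means the responder sees it in the incident without having to open the catalog first.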

Event Orchestration

# === PagerDuty Event Orchestration ===

# Terraform PagerDuty Provider
# A router can only route events (route_to), and its single rule set must
# have id "start". Severity and suppress actions are set on the service
# orchestration further down.
# resource "pagerduty_event_orchestration_router" "main" {
#   event_orchestration = pagerduty_event_orchestration.main.id
#   set {
#     id = "start"
#     rule {
#       label = "Route payment alerts"
#       condition {
#         expression = "event.custom_details.service matches 'payment'"
#       }
#       actions {
#         route_to = pagerduty_service.payment.id
#       }
#     }
#   }
#   catch_all {
#     actions {
#       route_to = pagerduty_service.default.id
#     }
#   }
# }

# Service-level rules: suppression, severity, and auto-remediation with Rundeck
# resource "pagerduty_event_orchestration_service" "payment" {
#   service = pagerduty_service.payment.id
#   enable_event_orchestration_for_service = true
#   set {
#     id = "start"
#     rule {
#       label = "Suppress known flaky alerts"
#       condition {
#         expression = "event.summary matches part 'DNS timeout' and event.custom_details.env matches 'staging'"
#       }
#       actions {
#         suppress = true
#       }
#     }
#     rule {
#       label = "Auto-restart on OOM"
#       condition {
#         expression = "event.summary matches part 'OOMKilled'"
#       }
#       actions {
#         severity = "critical"
#         automation_action {
#           name = "Restart Pod"
#           url = "https://rundeck.internal/api/job/restart-pod"
#           auto_send = true
#         }
#       }
#     }
#   }
#   catch_all {
#     actions {
#       severity = "critical"
#     }
#   }
# }

from dataclasses import dataclass  # repeated so this block runs on its own

@dataclass
class OrchestrationRule:
    rule: str
    condition: str
    action: str
    result: str

rules = [
    OrchestrationRule("Route by service", "event.service == 'payment'",
        "Route to payment-service PD service", "Correct team gets alert"),
    OrchestrationRule("Suppress staging", "env == 'staging' AND known_flaky",
        "Suppress alert", "Reduce noise 40%"),
    OrchestrationRule("Auto-restart OOM", "event matches 'OOMKilled'",
        "Trigger Rundeck restart-pod job", "Auto-fix 80% of OOM incidents"),
    OrchestrationRule("Escalate payment P1", "service == 'payment' AND severity == 'critical'",
        "Page all levels + create war room", "Fastest response for critical"),
    OrchestrationRule("Deduplicate alerts", "Same alert within 5 min",
        "Merge into single incident", "Reduce alert fatigue"),
    OrchestrationRule("Enrich with context", "Any alert",
        "Add runbook URL, recent deploys, dashboard link", "Faster diagnosis"),
]

print("\n=== Orchestration Rules ===")
for r in rules:
    print(f"  [{r.rule}] Condition: {r.condition}")
    print(f"    Action: {r.action}")
    print(f"    Result: {r.result}")
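
The rule list above can be read as an ordered set in which the first matching rule decides what happens to an event. The following is a minimal sketch of that evaluation order in plain Python. It is not how PagerDuty implements Event Orchestration, and the `Rule` class, the `evaluate` function, the action strings, and the event fields are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Event = Dict[str, str]


@dataclass
class Rule:
    label: str
    matches: Callable[[Event], bool]
    action: str


# Order matters: suppression is checked before routing.
RULES: List[Rule] = [
    Rule("Suppress staging",
         lambda e: e.get("env") == "staging" and "DNS timeout" in e.get("summary", ""),
         "suppress"),
    Rule("Auto-restart OOM",
         lambda e: "OOMKilled" in e.get("summary", ""),
         "restart-pod"),
    Rule("Route by service",
         lambda e: e.get("service") == "payment",
         "route:payment-service"),
]


def evaluate(event: Event, rules: List[Rule], default: str = "route:default") -> str:
    """Return the action of the first rule that matches, else the catch-all."""
    for rule in rules:
        if rule.matches(event):
            return rule.action
    return default


print(evaluate({"service": "payment", "summary": "High latency", "env": "prod"}, RULES))
print(evaluate({"service": "search", "summary": "DNS timeout", "env": "staging"}, RULES))
print(evaluate({"service": "search", "summary": "Disk full", "env": "prod"}, RULES))
```

The three calls print `route:payment-service`, `suppress`, and `route:default`. The last one shows the catch-all taking over when no rule matches.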

Metrics and Improvement

# === Incident Metrics Dashboard ===

from dataclasses import dataclass  # repeated so this block runs on its own

@dataclass
class IncidentMetric:
    metric: str
    current: str
    target: str
    trend: str
    action: str

metrics = [
    IncidentMetric("MTTA (Mean Time to Acknowledge)", "3.2 min", "< 5 min", "↓ improving",
        "Maintain current on-call practices"),
    IncidentMetric("MTTR (Mean Time to Resolve)", "42 min", "< 30 min", "→ stable",
        "Improve runbooks, add auto-remediation"),
    IncidentMetric("Incidents per week", "12", "< 8", "↓ improving",
        "Fix recurring issues, reduce noise"),
    IncidentMetric("P1 incidents per month", "2.5", "< 1", "→ stable",
        "Invest in reliability, chaos engineering"),
    IncidentMetric("Alert noise ratio", "35%", "< 20%", "↓ improving",
        "Tune thresholds, suppress flaky alerts"),
    IncidentMetric("On-call interruptions (off-hours)", "8/week", "< 4/week", "↓ improving",
        "Auto-remediation, better alerting"),
    IncidentMetric("Post-mortem completion rate", "85%", "100% for P1/P2", "↑ improving",
        "Enforce post-mortem policy"),
]

print("Incident Metrics:")
for m in metrics:
    print(f"  [{m.metric}] Current: {m.current} | Target: {m.target} | Trend: {m.trend}")
    print(f"    Action: {m.action}")

# On-call Health
health = {
    "Interruption Score": "3.2/10 (good) — target < 5",
    "Sleep Impact": "Low — 90% of pages during business hours",
    "Coverage": "100% — no gaps in schedule",
    "Handoff Quality": "Good — handoff notes updated 95% of time",
    "Burnout Risk": "Low — rotation every week, 4 engineers per team",
}

print("\n\nOn-call Health:")
for k, v in health.items():
    print(f"  [{k}]: {v}")
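
The MTTA and MTTR figures in the dashboard are averages over incident timestamps. Below is a minimal sketch of that arithmetic, assuming MTTR is measured from trigger to resolution (some teams measure from acknowledgement instead). The `Incident` class and the two sample incidents are hypothetical data, not results from the article.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean
from typing import List


@dataclass
class Incident:
    triggered: datetime
    acknowledged: datetime
    resolved: datetime


def minutes_between(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 60


def mtta(incidents: List[Incident]) -> float:
    """Mean time from trigger to acknowledgement, in minutes."""
    return mean(minutes_between(i.triggered, i.acknowledged) for i in incidents)


def mttr(incidents: List[Incident]) -> float:
    """Mean time from trigger to resolution, in minutes."""
    return mean(minutes_between(i.triggered, i.resolved) for i in incidents)


# Hypothetical sample incidents.
sample = [
    Incident(datetime(2025, 12, 1, 9, 0), datetime(2025, 12, 1, 9, 2),
             datetime(2025, 12, 1, 9, 30)),
    Incident(datetime(2025, 12, 3, 14, 0), datetime(2025, 12, 3, 14, 4),
             datetime(2025, 12, 3, 14, 50)),
]

print(f"MTTA: {mtta(sample):.1f} min")  # MTTA: 3.0 min
print(f"MTTR: {mttr(sample):.1f} min")  # MTTR: 40.0 min
```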

Tips

Best Practices for Developers

Writing good code is not only about making the program work. The code must also be easy to read, easy to maintain, and able to scale. The SOLID principles are an essential foundation that every developer should understand: Single Responsibility (each class does one job), Open-Closed (open for extension, closed for modification), Liskov Substitution (a subclass must be usable in place of its parent), Interface Segregation (keep interfaces small), and Dependency Inversion (depend on abstractions, not implementations).

Testing is just as essential. Write unit tests that cover at least 80% of the code base, use integration tests to verify that modules work together, and use E2E tests for critical user flows. Popular tools such as Jest, Pytest, and JUnit make writing tests easy.

For version control with Git, use a branch strategy that fits the team, such as Git Flow for large projects or Trunk-Based Development for teams that deploy often. Review every pull request, and use a CI/CD pipeline for automated testing and deployment.

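What is PagerDuty?

PagerDuty is an incident management platform. It receives alerts from monitoring tools such as Prometheus and Datadog and notifies the on-call engineer by phone, SMS, or push notification according to an escalation policy. It also provides a service catalog, an incident timeline, and post-mortem templates.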

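What is an Internal Developer Platform?

An Internal Developer Platform (IDP) gives developers self-service access to a service catalog, infrastructure, CI/CD, monitoring, incident management, and documentation in one place. Common tools include Backstage, Port, Humanitec, and Cortex.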

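How do you set up an escalation policy?

Start with a weekly on-call schedule and three escalation levels: Level 1 with a 5-minute timer, Level 2 (team lead) with a 15-minute timer, and Level 3 (manager) with a 30-minute timer. Then tie escalation to severity: P1 pages all levels, P2 escalates from Level 1 to Level 2, P3 goes to Slack, and P4 becomes a ticket.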
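
The escalation levels above can be sketched as a function that returns who is currently being paged for an unacknowledged incident. The original does not say whether the 5, 15, and 30 minutes are measured from the trigger or per level. This sketch assumes each number is that level's own timeout before the incident moves to the next level, so Level 2 is paged after 5 minutes and Level 3 after 20. `LEVELS` and `current_level` are hypothetical names.

```python
from typing import List, Tuple

# (level name, minutes that level has to acknowledge before escalation)
LEVELS: List[Tuple[str, int]] = [
    ("Level 1", 5),
    ("Level 2 (team lead)", 15),
    ("Level 3 (manager)", 30),
]


def current_level(minutes_unacknowledged: float) -> str:
    """Return the level being paged after the given unacknowledged time."""
    elapsed = 0
    for name, timeout in LEVELS:
        elapsed += timeout
        if minutes_unacknowledged < elapsed:
            return name
    # Past the last timeout the incident stays with the final level.
    return LEVELS[-1][0]


for waited in (3, 12, 25, 60):
    print(f"{waited:>2} min unacknowledged -> {current_level(waited)}")
```

If the timers are meant as absolute thresholds from the trigger, replace the running sum with a direct comparison against each number.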

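What can automation do?

Event Orchestration rules can auto-acknowledge and auto-resolve incidents and trigger auto-remediation such as restarting a service, scaling out, or clearing a cache. Automation can also open a JIRA ticket, update the status page, notify stakeholders in Slack, and start the post-mortem.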

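Summary

Integrating PagerDuty incident management into an Internal Developer Platform brings the service catalog, escalation policies, on-call schedules, automation through Event Orchestration, and runbooks together in one place. Track MTTA, MTTR, and post-mortem completion to keep improving production operations.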
