# PagerDuty + IDP
| Severity | Response SLA | Escalation | Notification | Example |
|---|---|---|---|---|
| P1 Critical | 5 min acknowledge | All levels immediately | Phone + SMS + Push | Service down, data loss |
| P2 High | 15 min acknowledge | Level 1 → Level 2 | Phone + Push | Degraded performance |
| P3 Medium | 1 hour acknowledge | Level 1 only | Push + Slack | Non-critical error spike |
| P4 Low | Next business day | No escalation | Slack + Email | Warning, tech debt |
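The matrix above can be encoded as a small lookup table, for example in custom routing or reporting scripts. This is an illustrative sketch, not a PagerDuty API; the names and structure are invented here:

```python
# Hypothetical sketch: encode the severity/escalation matrix above as data.
SEVERITY_POLICY = {
    "P1": {"ack_sla_min": 5, "escalation": "all_levels", "channels": ["phone", "sms", "push"]},
    "P2": {"ack_sla_min": 15, "escalation": "l1_then_l2", "channels": ["phone", "push"]},
    "P3": {"ack_sla_min": 60, "escalation": "l1_only", "channels": ["push", "slack"]},
    "P4": {"ack_sla_min": None, "escalation": "none", "channels": ["slack", "email"]},
}

def notification_channels(severity: str) -> list[str]:
    """Return the notification channels configured for a severity level."""
    return SEVERITY_POLICY[severity]["channels"]
```

Keeping the policy as data (rather than if/else chains) makes it easy to audit against the table and to change SLAs in one place.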
## Service Catalog Integration
```yaml
# PagerDuty + Backstage integration: catalog-info.yaml
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: payment-service
  annotations:
    pagerduty.com/service-id: PXXXXXX
    pagerduty.com/integration-key: xxxxxxxx
  tags: [python, fastapi, payments]
spec:
  type: service
  lifecycle: production
  owner: team-payments
  system: e-commerce
  dependsOn:
    - component:database-service
    - component:notification-service
```
```bash
# PagerDuty service setup via the REST API
curl -X POST https://api.pagerduty.com/services \
  -H "Authorization: Token token=YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "service": {
      "name": "Payment Service",
      "escalation_policy": { "id": "PXXXXXX" },
      "alert_creation": "create_alerts_and_incidents",
      "auto_resolve_timeout": 14400,
      "acknowledgement_timeout": 600
    }
  }'
```
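The same service setup can be driven from Python. A sketch using only the standard library, mirroring the curl call above (`YOUR_API_KEY` and the escalation policy ID remain placeholders):

```python
# Sketch: build the same service-creation request programmatically.
# The request is built but not sent, so it can be inspected or tested offline.
import json
import urllib.request

def build_service_request(api_key: str, escalation_policy_id: str) -> urllib.request.Request:
    """Build (but do not send) the POST request for PagerDuty service creation."""
    body = {
        "service": {
            "name": "Payment Service",
            "escalation_policy": {"id": escalation_policy_id},
            "alert_creation": "create_alerts_and_incidents",
            "auto_resolve_timeout": 14400,      # 4 hours
            "acknowledgement_timeout": 600,     # 10 minutes
        }
    }
    return urllib.request.Request(
        "https://api.pagerduty.com/services",
        data=json.dumps(body).encode(),
        headers={
            "Authorization": f"Token token={api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# To actually create the service:
# urllib.request.urlopen(build_service_request("YOUR_API_KEY", "PXXXXXX"))
```

Separating request construction from sending keeps the payload unit-testable without network access.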
```python
from dataclasses import dataclass

@dataclass
class ServiceEntry:
    service: str
    team: str
    tier: str
    on_call: str
    escalation: str
    runbook: str

services = [
    ServiceEntry("payment-service", "team-payments", "Tier 1",
                 "Weekly rotation (4 engineers)", "P1: All levels | P2: L1→L2",
                 "runbooks/payment-service.md"),
    ServiceEntry("user-service", "team-identity", "Tier 1",
                 "Weekly rotation (3 engineers)", "P1: All levels | P2: L1→L2",
                 "runbooks/user-service.md"),
    ServiceEntry("search-service", "team-discovery", "Tier 2",
                 "Weekly rotation (3 engineers)", "P1: L1→L2 | P2: L1",
                 "runbooks/search-service.md"),
    ServiceEntry("notification-service", "team-comms", "Tier 2",
                 "Bi-weekly rotation", "P1: L1→L2 | P2: L1",
                 "runbooks/notification-service.md"),
    ServiceEntry("analytics-pipeline", "team-data", "Tier 3",
                 "Business hours only", "P1: L1 | P2: Next day",
                 "runbooks/analytics.md"),
]

print("=== Service Catalog ===")
for s in services:
    print(f"  [{s.service}] Team: {s.team} | Tier: {s.tier}")
    print(f"    On-call: {s.on_call}")
    print(f"    Escalation: {s.escalation}")
    print(f"    Runbook: {s.runbook}")
```
## Event Orchestration
```hcl
# Terraform (PagerDuty provider): route events to the right service.
# Router rules can only route; severity, suppression, and auto-remediation
# live in the service-level orchestration below.
resource "pagerduty_event_orchestration_router" "main" {
  event_orchestration = pagerduty_event_orchestration.main.id

  set {
    id = "start"

    rule {
      label = "Route payment alerts"
      condition {
        expression = "event.custom_details.service == 'payment'"
      }
      actions {
        route_to = pagerduty_service.payment.id
      }
    }
  }

  catch_all {
    actions {
      route_to = pagerduty_service.default.id
    }
  }
}

# Service-level rules: severity, suppression, and auto-remediation with Rundeck
resource "pagerduty_event_orchestration_service" "payment" {
  service = pagerduty_service.payment.id

  set {
    id = "start"

    rule {
      label = "Payment alerts are critical"
      condition {
        expression = "event.custom_details.service == 'payment'"
      }
      actions {
        severity = "critical"
      }
    }

    rule {
      label = "Suppress known flaky alerts"
      condition {
        expression = "event.summary matches 'DNS timeout' and event.custom_details.env == 'staging'"
      }
      actions {
        suppress = true
      }
    }

    rule {
      label = "Auto-restart on OOM"
      condition {
        expression = "event.summary matches 'OOMKilled'"
      }
      actions {
        automation_action {
          name      = "Restart Pod"
          url       = "https://rundeck.internal/api/job/restart-pod"
          auto_send = true
        }
      }
    }
  }

  catch_all {
    actions {}
  }
}
```
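The orchestration rules match on fields of events sent to the PagerDuty Events API v2 (`https://events.pagerduty.com/v2/enqueue`). A sketch of building a payload that the payment-routing rule above would match; the routing key and details are placeholders:

```python
# Sketch: construct an Events API v2 payload (built, not sent).
def build_trigger_event(routing_key: str, summary: str, service: str, env: str) -> dict:
    """Payload for POST https://events.pagerduty.com/v2/enqueue."""
    return {
        "routing_key": routing_key,          # integration key of the service
        "event_action": "trigger",           # trigger | acknowledge | resolve
        "payload": {
            "summary": summary,
            "source": f"{service}.{env}",
            "severity": "critical",
            # The router expression matches on these fields:
            "custom_details": {"service": service, "env": env},
        },
    }

event = build_trigger_event("xxxxxxxx", "OOMKilled: payment pod", "payment", "production")
# The "Route payment alerts" rule matches: custom_details.service == 'payment'
```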
```python
from dataclasses import dataclass

@dataclass
class OrchestrationRule:
    rule: str
    condition: str
    action: str
    result: str

rules = [
    OrchestrationRule("Route by service", "event.service == 'payment'",
                      "Route to payment-service PD service", "Correct team gets alert"),
    OrchestrationRule("Suppress staging", "env == 'staging' AND known_flaky",
                      "Suppress alert", "Reduce noise 40%"),
    OrchestrationRule("Auto-restart OOM", "event matches 'OOMKilled'",
                      "Trigger Rundeck restart-pod job", "Auto-fix 80% of OOM incidents"),
    OrchestrationRule("Escalate payment P1", "service == 'payment' AND severity == 'critical'",
                      "Page all levels + create war room", "Fastest response for critical"),
    OrchestrationRule("Deduplicate alerts", "Same alert within 5 min",
                      "Merge into single incident", "Reduce alert fatigue"),
    OrchestrationRule("Enrich with context", "Any alert",
                      "Add runbook URL, recent deploys, dashboard link", "Faster diagnosis"),
]

print("\n=== Orchestration Rules ===")
for r in rules:
    print(f"  [{r.rule}] Condition: {r.condition}")
    print(f"    Action: {r.action}")
    print(f"    Result: {r.result}")
```
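The deduplication rule ("same alert within 5 min") can be sketched as a small window-based filter. This is illustrative only; PagerDuty performs deduplication server-side via dedup keys:

```python
# Sketch: suppress repeats of the same alert key within a 5-minute window.
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)

def deduplicate(events: list[tuple[str, datetime]]) -> list[tuple[str, datetime]]:
    """Keep an event only if the same key was not seen within WINDOW."""
    last_seen: dict[str, datetime] = {}
    kept = []
    for key, ts in events:
        prev = last_seen.get(key)
        if prev is None or ts - prev > WINDOW:
            kept.append((key, ts))
        last_seen[key] = ts  # repeats extend the window
    return kept
```

Note the design choice that a suppressed repeat still refreshes `last_seen`, so a continuously flapping alert stays collapsed until it goes quiet for the full window.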
## Metrics and Improvement
```python
from dataclasses import dataclass

# === Incident Metrics Dashboard ===
@dataclass
class IncidentMetric:
    metric: str
    current: str
    target: str
    trend: str
    action: str

metrics = [
    IncidentMetric("MTTA (Mean Time to Acknowledge)", "3.2 min", "< 5 min", "↓ improving",
                   "Maintain current on-call practices"),
    IncidentMetric("MTTR (Mean Time to Resolve)", "42 min", "< 30 min", "→ stable",
                   "Improve runbooks, add auto-remediation"),
    IncidentMetric("Incidents per week", "12", "< 8", "↓ improving",
                   "Fix recurring issues, reduce noise"),
    IncidentMetric("P1 incidents per month", "2.5", "< 1", "→ stable",
                   "Invest in reliability, chaos engineering"),
    IncidentMetric("Alert noise ratio", "35%", "< 20%", "↓ improving",
                   "Tune thresholds, suppress flaky alerts"),
    IncidentMetric("On-call interruptions (off-hours)", "8/week", "< 4/week", "↓ improving",
                   "Auto-remediation, better alerting"),
    IncidentMetric("Post-mortem completion rate", "85%", "100% for P1/P2", "↑ improving",
                   "Enforce post-mortem policy"),
]

print("Incident Metrics:")
for m in metrics:
    print(f"  [{m.metric}] Current: {m.current} | Target: {m.target} | Trend: {m.trend}")
    print(f"    Action: {m.action}")

# On-call health
health = {
    "Interruption Score": "3.2/10 (good) — target < 5",
    "Sleep Impact": "Low — 90% of pages during business hours",
    "Coverage": "100% — no gaps in schedule",
    "Handoff Quality": "Good — handoff notes updated 95% of time",
    "Burnout Risk": "Low — rotation every week, 4 engineers per team",
}

print("\nOn-call Health:")
for k, v in health.items():
    print(f"  [{k}]: {v}")
```
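MTTA and MTTR in the dashboard above are simple averages over incident timestamps. A minimal sketch with illustrative data:

```python
# Sketch: compute MTTA and MTTR from (triggered, acknowledged, resolved) timestamps.
from datetime import datetime

incidents = [
    # (triggered, acknowledged, resolved) — illustrative data
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 3), datetime(2024, 1, 1, 10, 40)),
    (datetime(2024, 1, 2, 14, 0), datetime(2024, 1, 2, 14, 4), datetime(2024, 1, 2, 14, 44)),
]

def mean_minutes(deltas) -> float:
    """Average a list of timedeltas, in minutes."""
    return sum(d.total_seconds() for d in deltas) / len(deltas) / 60

mtta = mean_minutes([ack - trig for trig, ack, _ in incidents])   # 3.5 min
mttr = mean_minutes([res - trig for trig, _, res in incidents])   # 42.0 min
print(f"MTTA: {mtta:.1f} min, MTTR: {mttr:.1f} min")
```

Both are measured from the trigger time, so a slow acknowledgement inflates MTTR as well as MTTA.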
## Tips
- Noise: reduce alert noise with Event Orchestration suppress and deduplicate rules
- Runbook: every service must have its runbook link attached automatically to each incident
- Auto-remediation: start with common issues such as restarting a pod or scaling up
- Post-mortem: require a post-mortem for every P1/P2 and share it with the whole team
- On-call: watch on-call health to prevent burnout; rotate the schedule consistently
## Best Practices for Developers
Good code does more than make the program run: it must be readable, maintainable, and able to scale. The SOLID principles are fundamentals every developer should understand: Single Responsibility (each class does one thing), Open-Closed (open for extension, closed for modification), Liskov Substitution (a subclass must be usable in place of its parent), Interface Segregation (keep interfaces small and focused), and Dependency Inversion (depend on abstractions, not implementations).
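A minimal sketch showing Single Responsibility and Dependency Inversion together; the class names here are illustrative, not part of any framework:

```python
# Sketch: the service depends on the Notifier abstraction, not on Slack directly.
from abc import ABC, abstractmethod

class Notifier(ABC):
    @abstractmethod
    def send(self, message: str) -> None: ...

class SlackNotifier(Notifier):
    def send(self, message: str) -> None:
        print(f"[slack] {message}")

class PaymentService:
    """Single responsibility: process payments; notification is delegated."""
    def __init__(self, notifier: Notifier):
        self.notifier = notifier

    def process(self, amount: float) -> bool:
        # ... payment logic would go here ...
        self.notifier.send(f"Processed payment of {amount:.2f}")
        return True
```

Because `PaymentService` only sees the `Notifier` interface, swapping Slack for email (or a test double) requires no change to the service itself.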
Testing is equally essential. Write unit tests covering at least 80% of the code base, use integration tests to verify that modules work together, and add E2E tests for critical user flows. Popular tools such as Jest, Pytest, and JUnit make writing tests straightforward.
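A sketch of a unit test as Pytest would collect it; the function under test is illustrative:

```python
# Sketch: a small pure function plus a pytest-style test for it.
def apply_discount(price: float, percent: float) -> float:
    """Return the price after a percentage discount."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return price * (1 - percent / 100)

def test_apply_discount():
    # pytest collects functions named test_* and reports failing asserts
    assert apply_discount(100.0, 10) == 90.0
    assert apply_discount(100.0, 0) == 100.0
```

Keeping business logic in pure functions like this is what makes the 80% unit-coverage target cheap to reach: no mocks or fixtures are needed.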
For version control with Git, choose a branch strategy that fits the team: Git Flow for large projects, or trunk-based development for teams that deploy frequently. Review every pull request, and use a CI/CD pipeline for automated testing and deployment.
## What is PagerDuty?
PagerDuty is an incident management platform. It receives alerts from monitoring tools such as Prometheus and Datadog, notifies the on-call engineer by phone, SMS, or push, applies escalation policies, and maintains a service catalog, incident timelines, and post-mortem templates.
## What is an Internal Developer Platform?
An IDP gives developers self-service access to everything they need: a service catalog, infrastructure, CI/CD, monitoring, incident management, and documentation. Common tools include Backstage, Port, Humanitec, and Cortex.
## How do you set up an escalation policy?
Run a weekly on-call schedule: Level 1 is the on-call engineer (5 minutes to respond), Level 2 the team lead (15 minutes), Level 3 the manager (30 minutes). P1 pages all levels, P2 escalates L1→L2, P3 notifies Slack, and P4 becomes a ticket.
## What can automation do?
Auto-acknowledge and auto-resolve incidents, auto-remediate (restart pods, scale up, clear caches), create JIRA tickets, update the status page, notify stakeholders in Slack, draft post-mortems, and apply Event Orchestration rules.
## Summary
PagerDuty combined with an Internal Developer Platform gives teams a service catalog, escalation policies, on-call automation, Event Orchestration, runbooks, and incident metrics (MTTA, MTTR, post-mortems) for reliable production operations.
