PagerDuty Incident Internal Developer Platform
PagerDuty + IDP

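This article covers PagerDuty incident management inside an Internal Developer Platform: the service catalog, escalation, on-call, automation, runbooks, Backstage integration, and production operations. The table below maps each severity level to its response SLA, escalation path, and notification channels.
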
| Severity | Response SLA | Escalation | Notification | Example |
|---|---|---|---|---|
| P1 Critical | 5 min acknowledge | All levels immediately | Phone + SMS + Push | Service down, data loss |
| P2 High | 15 min acknowledge | Level 1 → Level 2 | Phone + Push | Degraded performance |
| P3 Medium | 1 hour acknowledge | Level 1 only | Push + Slack | Non-critical error spike |
| P4 Low | Next business day | No escalation | Slack + Email | Warning, tech debt |
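
The severity table can be encoded as a lookup so tooling can check whether an incident has gone past its acknowledgement window. This is a minimal sketch: `SeverityPolicy`, `POLICIES`, and `ack_breached` are hypothetical names, and P4 ("next business day") is treated as having no fixed deadline in minutes.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class SeverityPolicy:
    ack_minutes: Optional[int]  # None = no fixed minute deadline (P4)
    escalation: str
    channels: Tuple[str, ...]


# Values copied from the severity table; names are illustrative only.
POLICIES = {
    "P1": SeverityPolicy(5, "All levels immediately", ("phone", "sms", "push")),
    "P2": SeverityPolicy(15, "Level 1 -> Level 2", ("phone", "push")),
    "P3": SeverityPolicy(60, "Level 1 only", ("push", "slack")),
    "P4": SeverityPolicy(None, "No escalation", ("slack", "email")),
}


def ack_breached(severity: str, minutes_unacknowledged: float) -> bool:
    """Return True when the acknowledgement SLA for this severity is exceeded."""
    policy = POLICIES[severity]
    if policy.ack_minutes is None:
        return False
    return minutes_unacknowledged > policy.ack_minutes


print(ack_breached("P1", 7))
print(ack_breached("P3", 45))
```

Keeping the policy in one table-like structure means the SLA values live in a single place instead of being scattered across alerting rules.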
Service Catalog Integration
# === PagerDuty + Backstage Integration ===

# Backstage catalog-info.yaml
# apiVersion: backstage.io/v1alpha1
# kind: Component
# metadata:
#   name: payment-service
#   annotations:
#     pagerduty.com/service-id: PXXXXXX
#     pagerduty.com/integration-key: xxxxxxxx
#   tags: [python, fastapi, payments]
# spec:
#   type: service
#   lifecycle: production
#   owner: team-payments
#   system: e-commerce
#   dependsOn:
#     - component:database-service
#     - component:notification-service

# PagerDuty Service Setup via API
# curl -X POST https://api.pagerduty.com/services \
#   -H "Authorization: Token token=YOUR_API_KEY" \
#   -H "Content-Type: application/json" \
#   -d '{
#     "service": {
#       "name": "Payment Service",
#       "escalation_policy": { "id": "PXXXXXX" },
#       "alert_creation": "create_alerts_and_incidents",
#       "auto_resolve_timeout": 14400,
#       "acknowledgement_timeout": 600
#     }
#   }'

from dataclasses import dataclass


@dataclass
class ServiceEntry:
    service: str
    team: str
    tier: str
    on_call: str
    escalation: str
    runbook: str


# One catalog entry per service: owner, tier, on-call rotation, escalation, runbook
services = [
    ServiceEntry("payment-service", "team-payments", "Tier 1",
                 "Weekly rotation (4 engineers)", "P1: All levels | P2: L1→L2",
                 "runbooks/payment-service.md"),
    ServiceEntry("user-service", "team-identity", "Tier 1",
                 "Weekly rotation (3 engineers)", "P1: All levels | P2: L1→L2",
                 "runbooks/user-service.md"),
    ServiceEntry("search-service", "team-discovery", "Tier 2",
                 "Weekly rotation (3 engineers)", "P1: L1→L2 | P2: L1",
                 "runbooks/search-service.md"),
    ServiceEntry("notification-service", "team-comms", "Tier 2",
                 "Bi-weekly rotation", "P1: L1→L2 | P2: L1",
                 "runbooks/notification-service.md"),
    ServiceEntry("analytics-pipeline", "team-data", "Tier 3",
                 "Business hours only", "P1: L1 | P2: Next day",
                 "runbooks/analytics.md"),
]

print("=== Service Catalog ===")
for s in services:
    print(f"  [{s.service}] Team: {s.team} | Tier: {s.tier}")
    print(f"    On-call: {s.on_call}")
    print(f"    Escalation: {s.escalation}")
    print(f"    Runbook: {s.runbook}")
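
The request body from the curl example can also be built in Python, which helps when services are created from catalog data rather than by hand. The sketch below only constructs and prints the JSON and does not call the API; `build_service_payload` is a hypothetical helper, and the field names and values are the ones shown in the curl example.

```python
import json


def build_service_payload(name: str, escalation_policy_id: str,
                          auto_resolve_timeout: int = 14400,
                          acknowledgement_timeout: int = 600) -> dict:
    """Build the same JSON body as the curl example (hypothetical helper)."""
    return {
        "service": {
            "name": name,
            "escalation_policy": {"id": escalation_policy_id},
            "alert_creation": "create_alerts_and_incidents",
            # Timeouts are in seconds: 14400 s = 4 h, 600 s = 10 min
            "auto_resolve_timeout": auto_resolve_timeout,
            "acknowledgement_timeout": acknowledgement_timeout,
        }
    }


payload = build_service_payload("Payment Service", "PXXXXXX")
print(json.dumps(payload, indent=2))
```

The placeholder ID `PXXXXXX` is kept as in the original; a real call would need an actual escalation policy ID and API key.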
Event Orchestration
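# === PagerDuty Event Orchestration ===

# Terraform PagerDuty Provider
# Global rules run before routing: suppress known noise for every service.
# resource "pagerduty_event_orchestration_global" "main" {
#   event_orchestration = pagerduty_event_orchestration.main.id
#   set {
#     id = "start"
#     rule {
#       label = "Suppress known flaky alerts"
#       condition {
#         expression = "event.summary matches part 'DNS timeout' and event.custom_details.env matches 'staging'"
#       }
#       actions {
#         suppress = true
#       }
#     }
#   }
#   catch_all {
#     actions {}
#   }
# }

# Router: rules live inside a set and can only choose the target service.
# resource "pagerduty_event_orchestration_router" "main" {
#   event_orchestration = pagerduty_event_orchestration.main.id
#   set {
#     id = "start"
#     rule {
#       label = "Route payment alerts"
#       condition {
#         expression = "event.custom_details.service matches 'payment'"
#       }
#       actions {
#         route_to = pagerduty_service.payment.id
#       }
#     }
#   }
#   catch_all {
#     actions {
#       route_to = pagerduty_service.default.id
#     }
#   }
# }

# Service rules for payment: critical severity and auto-remediation with Rundeck
# resource "pagerduty_event_orchestration_service" "payment" {
#   service = pagerduty_service.payment.id
#   enable_event_orchestration_for_service = true
#   set {
#     id = "start"
#     rule {
#       label = "Auto-restart on OOM"
#       condition {
#         expression = "event.summary matches part 'OOMKilled'"
#       }
#       actions {
#         severity = "critical"
#         automation_action {
#           name      = "Restart Pod"
#           url       = "https://rundeck.internal/api/job/restart-pod"
#           auto_send = true
#         }
#       }
#     }
#   }
#   catch_all {
#     actions {
#       severity = "critical"
#     }
#   }
# }
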
from dataclasses import dataclass


@dataclass
class OrchestrationRule:
    rule: str
    condition: str
    action: str
    result: str


# Summary of the orchestration rules: what triggers each one and what it achieves
rules = [
    OrchestrationRule("Route by service", "event.service == 'payment'",
                      "Route to payment-service PD service", "Correct team gets alert"),
    OrchestrationRule("Suppress staging", "env == 'staging' AND known_flaky",
                      "Suppress alert", "Reduce noise 40%"),
    OrchestrationRule("Auto-restart OOM", "event matches 'OOMKilled'",
                      "Trigger Rundeck restart-pod job", "Auto-fix 80% of OOM incidents"),
    OrchestrationRule("Escalate payment P1", "service == 'payment' AND severity == 'critical'",
                      "Page all levels + create war room", "Fastest response for critical"),
    OrchestrationRule("Deduplicate alerts", "Same alert within 5 min",
                      "Merge into single incident", "Reduce alert fatigue"),
    OrchestrationRule("Enrich with context", "Any alert",
                      "Add runbook URL, recent deploys, dashboard link", "Faster diagnosis"),
]

print("\n=== Orchestration Rules ===")
for r in rules:
    print(f"  [{r.rule}] Condition: {r.condition}")
    print(f"    Action: {r.action}")
    print(f"    Result: {r.result}")
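
To make the suppress, deduplicate, and route rules concrete, here is a small local simulation of how they might interact. It is a sketch only and not PagerDuty's implementation: `Event`, `LocalOrchestrator`, and the returned labels are hypothetical, and deduplication is assumed to mean "same service and summary seen again within 5 minutes of the previous occurrence".

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

DEDUP_WINDOW_SECONDS = 5 * 60  # "Same alert within 5 min"


@dataclass
class Event:
    summary: str
    service: str
    env: str
    timestamp: float  # seconds since an arbitrary start


@dataclass
class LocalOrchestrator:
    # Last time each (service, summary) pair was seen
    last_seen: Dict[Tuple[str, str], float] = field(default_factory=dict)

    def handle(self, event: Event) -> str:
        # 1. Suppress known flaky staging alerts before anything else
        if event.env == "staging" and "DNS timeout" in event.summary:
            return "suppressed"

        # 2. Merge repeats of the same alert inside the dedup window
        key = (event.service, event.summary)
        previous = self.last_seen.get(key)
        self.last_seen[key] = event.timestamp
        if previous is not None and event.timestamp - previous <= DEDUP_WINDOW_SECONDS:
            return "merged"

        # 3. Route by service, with a default for everything else
        if event.service == "payment":
            return "routed:payment-service"
        return "routed:default"


orchestrator = LocalOrchestrator()
for e in [
    Event("DNS timeout", "search", "staging", 0),
    Event("OOMKilled", "payment", "production", 0),
    Event("OOMKilled", "payment", "production", 120),
    Event("High latency", "search", "production", 130),
]:
    print(e.summary, "->", orchestrator.handle(e))
```

Evaluating suppression first keeps noisy events out of the deduplication state, so they never influence later routing decisions.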
Metrics and Improvement
# === Incident Metrics Dashboard ===
from dataclasses import dataclass


@dataclass
class IncidentMetric:
    metric: str
    current: str
    target: str
    trend: str
    action: str


# Current value, target, trend, and next action for each incident metric
metrics = [
    IncidentMetric("MTTA (Mean Time to Acknowledge)", "3.2 min", "< 5 min", "↓ improving",
                   "Maintain current on-call practices"),
    IncidentMetric("MTTR (Mean Time to Resolve)", "42 min", "< 30 min", "→ stable",
                   "Improve runbooks, add auto-remediation"),
    IncidentMetric("Incidents per week", "12", "< 8", "↓ improving",
                   "Fix recurring issues, reduce noise"),
    IncidentMetric("P1 incidents per month", "2.5", "< 1", "→ stable",
                   "Invest in reliability, chaos engineering"),
    IncidentMetric("Alert noise ratio", "35%", "< 20%", "↓ improving",
                   "Tune thresholds, suppress flaky alerts"),
    IncidentMetric("On-call interruptions (off-hours)", "8/week", "< 4/week", "↓ improving",
                   "Auto-remediation, better alerting"),
    IncidentMetric("Post-mortem completion rate", "85%", "100% for P1/P2", "↑ improving",
                   "Enforce post-mortem policy"),
]

print("Incident Metrics:")
for m in metrics:
    print(f"  [{m.metric}] Current: {m.current} | Target: {m.target} | Trend: {m.trend}")
    print(f"    Action: {m.action}")

# On-call Health
health = {
    "Interruption Score": "3.2/10 (good); target < 5",
    "Sleep Impact": "Low; 90% of pages during business hours",
    "Coverage": "100%; no gaps in schedule",
    "Handoff Quality": "Good; handoff notes updated 95% of time",
    "Burnout Risk": "Low; rotation every week, 4 engineers per team",
}

print("\n\nOn-call Health:")
for k, v in health.items():
    print(f"  [{k}]: {v}")
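
MTTA and MTTR in the dashboard are averages over incident timestamps. The sketch below shows one way to compute them, assuming MTTA is measured from trigger to acknowledgement and MTTR from trigger to resolution. The incident records are made-up sample data, not figures from the dashboard, and `minutes_between`, `mtta`, and `mttr` are hypothetical helpers.

```python
from datetime import datetime
from statistics import mean

# Made-up sample incidents (ISO 8601 timestamps), for illustration only
incidents = [
    {"triggered": "2024-01-01T10:00:00", "acknowledged": "2024-01-01T10:03:00",
     "resolved": "2024-01-01T10:40:00"},
    {"triggered": "2024-01-02T14:00:00", "acknowledged": "2024-01-02T14:02:00",
     "resolved": "2024-01-02T14:30:00"},
    {"triggered": "2024-01-03T09:00:00", "acknowledged": "2024-01-03T09:04:00",
     "resolved": "2024-01-03T09:50:00"},
]


def minutes_between(start: str, end: str) -> float:
    """Elapsed minutes between two ISO 8601 timestamps."""
    delta = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    return delta.total_seconds() / 60


def mtta(records) -> float:
    """Mean time from trigger to acknowledgement, in minutes."""
    return mean(minutes_between(r["triggered"], r["acknowledged"]) for r in records)


def mttr(records) -> float:
    """Mean time from trigger to resolution, in minutes."""
    return mean(minutes_between(r["triggered"], r["resolved"]) for r in records)


print(f"MTTA: {mtta(incidents):.1f} min")
print(f"MTTR: {mttr(incidents):.1f} min")
```

Some teams measure MTTR from acknowledgement instead of trigger; whichever definition is used, it should stay fixed so the trend column remains comparable week to week.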
Tips
- Noise: Reduce alert noise with Event Orchestration suppression and deduplication.
- Runbook: Every service must have its runbook linked automatically in each incident.
- Auto-remediation: Start with common issues, such as restarting a pod or scaling up.
- Post-mortem: Require a post-mortem for every P1 and P2, and share it with the whole team.
- On-call: Look after on-call health to prevent burnout, and rotate regularly.
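
The runbook tip can be applied when the alert is created: a PagerDuty Events API v2 trigger event accepts a `links` list, so the runbook URL can travel with the incident. The sketch below only builds the event body and does not send it; `RUNBOOKS`, `RUNBOOK_BASE_URL`, and `build_trigger_event` are hypothetical, and the base URL is a placeholder.

```python
import json

# Runbook paths from the service catalog; the base URL is a made-up placeholder
RUNBOOKS = {
    "payment-service": "runbooks/payment-service.md",
    "user-service": "runbooks/user-service.md",
}
RUNBOOK_BASE_URL = "https://docs.example.internal/"


def build_trigger_event(routing_key: str, service: str, summary: str,
                        severity: str = "critical") -> dict:
    """Build an Events API v2 trigger event with the runbook attached as a link."""
    event = {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": service,
            "severity": severity,  # one of: critical, error, warning, info
        },
        "links": [],
    }
    runbook = RUNBOOKS.get(service)
    if runbook:
        event["links"].append({"href": RUNBOOK_BASE_URL + runbook, "text": "Runbook"})
    return event


event = build_trigger_event("YOUR_INTEGRATION_KEY", "payment-service",
                            "Payment API 5xx spike")
print(json.dumps(event, indent=2))
```

Services without a catalog entry simply get an empty `links` list, which makes missing runbooks easy to spot in review.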
Best Practices for Developers

Good code is not just code that runs; it must also be readable, maintainable, and able to scale. The SOLID principles are a foundation every developer should understand: Single Responsibility (each class does one job), Open-Closed (open for extension, closed for modification), Liskov Substitution (a subclass must be usable in place of its parent), Interface Segregation (keep interfaces small), and Dependency Inversion (depend on abstractions, not implementations).
Testing is just as essential. Write unit tests that cover at least 80% of the code base, use integration tests to check that modules work together, and use E2E tests for critical user flows. Popular tools such as Jest, Pytest, and JUnit make tests easier to write.
For version control with Git, use a branching strategy that fits the team, such as Git Flow for large projects or trunk-based development for teams that deploy often. Review every pull request, and use a CI/CD pipeline for automated testing and deployment.
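What is PagerDuty?
PagerDuty is an incident management platform. It receives alerts from monitoring tools such as Prometheus and Datadog, notifies the on-call engineer by phone, SMS, or push notification, and provides escalation policies, a service catalog, incident timelines, and post-mortem templates.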
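What is an Internal Developer Platform?
An Internal Developer Platform (IDP) gives developers self-service access to a service catalog, infrastructure, CI/CD, monitoring, incident management, and documentation. Tools in this space include Backstage, Port, Humanitec, and Cortex.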
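How do you set up an escalation policy?
Start from a weekly on-call schedule with three levels: Level 1 is the on-call engineer (5 minutes), Level 2 is the team lead (15 minutes), and Level 3 is the manager (30 minutes). P1 pages all levels, P2 escalates from Level 1 to Level 2, P3 goes to Slack, and P4 becomes a ticket.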
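
The escalation levels can be expressed as a small function that returns who should be handling an unacknowledged incident at a given moment. This sketch assumes the minute values are cumulative deadlines, so an incident moves to Level 2 after 5 minutes and to Level 3 after 15 minutes; the source does not say whether they are cumulative or per-level, so treat that as an assumption. `ESCALATION_LEVELS` and `current_level` are hypothetical names.

```python
# (level name, cumulative deadline in minutes); reading the timings as
# cumulative deadlines is an assumption, not something the source states
ESCALATION_LEVELS = [
    ("Level 1 (on-call engineer)", 5),
    ("Level 2 (team lead)", 15),
    ("Level 3 (manager)", 30),
]


def current_level(minutes_unacknowledged: float) -> str:
    """Return the level responsible for an incident that is still unacknowledged."""
    for name, deadline in ESCALATION_LEVELS:
        if minutes_unacknowledged < deadline:
            return name
    # Past the last deadline the incident stays with the highest level
    return ESCALATION_LEVELS[-1][0]


for minutes in (0, 5, 20, 45):
    print(f"{minutes:>2} min unacknowledged -> {current_level(minutes)}")
```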
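What can automation do?
Automation can auto-acknowledge and auto-resolve incidents, run auto-remediation (restarting a service, scaling up, clearing a cache), create JIRA tickets, update the status page, notify stakeholders in Slack, and start a post-mortem. These actions are driven by Event Orchestration rules.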
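Summary
Running PagerDuty incident management through an Internal Developer Platform ties together the service catalog, escalation policies, on-call schedules, automation with Event Orchestration, and runbooks. Metrics such as MTTA and MTTR, together with post-mortems, show whether production operations are improving.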





