SiamCafe.net Blog
Cybersecurity

PagerDuty Incident Monitoring and Alerting: Manage Incidents Professionally

2025-06-02 · อ. บอม — SiamCafe.net · 1,371 words

What Is PagerDuty?

PagerDuty is an incident management platform built for DevOps and SRE teams. It manages the entire incident lifecycle, from alert through escalation and response to resolution, and acts as a central hub that receives alerts from your monitoring tools and routes them to the right responder according to the on-call schedule.

Core features of PagerDuty include:

- Intelligent Alert Grouping: folds related alerts into a single incident, cutting alert noise
- On-call Management: builds on-call schedules and rotates them automatically
- Escalation Policies: defines who is escalated to, and in what order, when nobody responds
- Automation Actions: runs scripts automatically in response to an incident
- Analytics: tracks MTTA (Mean Time to Acknowledge) and MTTR (Mean Time to Resolve)

PagerDuty ships integrations for more than 700 tools, including Prometheus, Grafana, AWS CloudWatch, Datadog, New Relic, Slack, and Jira, making it a single pane of glass for incident management.

Getting Started with PagerDuty

Set up PagerDuty from the command line:

# === PagerDuty Setup ===

# 1. Create Service via API
curl -X POST "https://api.pagerduty.com/services" \
  -H "Authorization: Token token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "service": {
      "name": "Production API",
      "description": "Main production API service",
      "escalation_policy": {
        "id": "POLICY_ID",
        "type": "escalation_policy_reference"
      },
      "alert_creation": "create_alerts_and_incidents",
      "alert_grouping_parameters": {
        "type": "intelligent"
      },
      "incident_urgency_rule": {
        "type": "constant",
        "urgency": "high"
      },
      "auto_resolve_timeout": 14400,
      "acknowledgement_timeout": 1800
    }
  }'

# 2. Create Integration Key
curl -X POST "https://api.pagerduty.com/services/SERVICE_ID/integrations" \
  -H "Authorization: Token token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "integration": {
      "type": "events_api_v2_inbound_integration",
      "name": "Prometheus Integration"
    }
  }'

# 3. Send Test Event (Events API v2)
curl -X POST "https://events.pagerduty.com/v2/enqueue" \
  -H "Content-Type: application/json" \
  -d '{
    "routing_key": "YOUR_INTEGRATION_KEY",
    "event_action": "trigger",
    "dedup_key": "test-incident-001",
    "payload": {
      "summary": "CPU usage above 90% on prod-api-01",
      "source": "prod-api-01",
      "severity": "critical",
      "component": "api-server",
      "group": "production",
      "class": "cpu",
      "custom_details": {
        "cpu_percent": 95.2,
        "load_average": 8.5,
        "memory_percent": 78.3
      }
    },
    "links": [
      {"href": "https://grafana.example.com/d/xxx", "text": "Grafana Dashboard"}
    ]
  }'

# 4. Resolve Event
curl -X POST "https://events.pagerduty.com/v2/enqueue" \
  -H "Content-Type: application/json" \
  -d '{
    "routing_key": "YOUR_INTEGRATION_KEY",
    "event_action": "resolve",
    "dedup_key": "test-incident-001"
  }'

echo "PagerDuty configured"
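The Events API calls above are fire-and-forget curl commands; in an application you would usually add retry logic, since PagerDuty rate-limits the endpoint (HTTP 429) and a 202 means the event was accepted. A minimal sketch of that pattern — the `post` callable is injected here so the logic can be exercised without a network; in production you would pass `requests.post`:

```python
import time

EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"

def send_event(event, post, retries=3, backoff=1.0):
    """POST an Events API v2 payload, retrying on 429/5xx with exponential backoff."""
    for attempt in range(retries):
        resp = post(EVENTS_URL, json=event)
        if resp.status_code == 202:                # accepted by PagerDuty
            return resp.json()
        if resp.status_code == 429 or resp.status_code >= 500:
            time.sleep(backoff * (2 ** attempt))   # back off: 1s, 2s, 4s, ...
            continue
        resp.raise_for_status()                    # other 4xx: don't retry
    raise RuntimeError("event not accepted after retries")
```

Calling `send_event(payload, post=requests.post)` with the trigger payload from step 3 would then survive transient rate limiting.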

Creating Escalation Policies and Schedules

Configure on-call rotations and escalation:

#!/usr/bin/env python3
# pagerduty_setup.py - PagerDuty Configuration
import json
import logging
import requests
from datetime import datetime

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("pagerduty")

class PagerDutyManager:
    def __init__(self, api_token):
        self.base_url = "https://api.pagerduty.com"
        self.headers = {
            "Authorization": f"Token token={api_token}",
            "Content-Type": "application/json",
        }
    
    def create_schedule(self, name, users, rotation_days=7):
        """Create on-call rotation schedule"""
        schedule = {
            "schedule": {
                "name": name,
                "type": "schedule",
                "time_zone": "Asia/Bangkok",
                "schedule_layers": [
                    {
                        "name": "Primary On-Call",
                        "start": datetime.utcnow().isoformat() + "Z",
                        "rotation_virtual_start": datetime.utcnow().isoformat() + "Z",
                        "rotation_turn_length_seconds": rotation_days * 86400,
                        "users": [{"user": {"id": uid, "type": "user_reference"}} for uid in users],
                    },
                ],
            },
        }
        resp = requests.post(f"{self.base_url}/schedules", headers=self.headers, json=schedule)
        return resp.json()
    
    def create_escalation_policy(self, name, schedule_ids, escalation_minutes=30):
        """Create escalation policy"""
        policy = {
            "escalation_policy": {
                "name": name,
                "type": "escalation_policy",
                "escalation_rules": [
                    {
                        "escalation_delay_in_minutes": escalation_minutes,
                        "targets": [
                            {"id": sid, "type": "schedule_reference"} for sid in schedule_ids
                        ],
                    },
                    {
                        "escalation_delay_in_minutes": 15,
                        "targets": [
                            {"id": "MANAGER_USER_ID", "type": "user_reference"},
                        ],
                    },
                ],
                "repeat_enabled": True,
                "num_loops": 3,
            },
        }
        resp = requests.post(f"{self.base_url}/escalation_policies", headers=self.headers, json=policy)
        return resp.json()
    
    def get_oncall(self):
        """Get current on-call users"""
        resp = requests.get(f"{self.base_url}/oncalls", headers=self.headers, params={"limit": 25})
        oncalls = resp.json().get("oncalls", [])
        result = []
        for oc in oncalls:
            result.append({
                "user": oc.get("user", {}).get("summary"),
                "schedule": oc.get("schedule", {}).get("summary"),
                "escalation_level": oc.get("escalation_level"),
                "start": oc.get("start"),
                "end": oc.get("end"),
            })
        return result
    
    def get_incidents(self, status="triggered,acknowledged"):
        """Get active incidents"""
        resp = requests.get(
            f"{self.base_url}/incidents",
            headers=self.headers,
            # strip spaces so "triggered, acknowledged" doesn't produce bad values
            params={"statuses[]": [s.strip() for s in status.split(",")], "limit": 25},
        )
        return resp.json().get("incidents", [])

pd = PagerDutyManager("demo_token")
print("PagerDuty Manager initialized")
print("Methods: create_schedule, create_escalation_policy, get_oncall, get_incidents")
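The manager above only reads incidents; to acknowledge or resolve one, the REST API takes a `PUT` to `/incidents/{id}` with a `From` header carrying the email of a valid PagerDuty user. A sketch of how that could sit alongside `PagerDutyManager` — the HTTP callable is injected (pass `requests.put` in practice) so the payload builder can be checked offline:

```python
def build_status_update(status):
    """Request body for PUT /incidents/{id}."""
    if status not in ("acknowledged", "resolved"):
        raise ValueError("status must be 'acknowledged' or 'resolved'")
    return {"incident": {"type": "incident_reference", "status": status}}

def set_incident_status(manager, incident_id, status, from_email, put):
    """Acknowledge or resolve an incident. `put` is an HTTP PUT callable."""
    # The REST API requires a From header identifying the acting user.
    headers = dict(manager.headers, From=from_email)
    resp = put(f"{manager.base_url}/incidents/{incident_id}",
               headers=headers, json=build_status_update(status))
    return resp.json()
```

For example, `set_incident_status(pd, "PINCIDENT1", "acknowledged", "oncall@example.com", requests.put)` would ack an incident from a script.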

Integration with Monitoring Tools

Connect PagerDuty to your monitoring stack:

# === PagerDuty Integrations ===

# 1. Prometheus Alertmanager -> PagerDuty
cat > /etc/alertmanager/alertmanager.yml << 'EOF'
global:
  resolve_timeout: 5m
  pagerduty_url: "https://events.pagerduty.com/v2/enqueue"

route:
  receiver: "pagerduty-critical"
  group_by: ["alertname", "cluster"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: "pagerduty-critical"
    - match:
        severity: warning
      receiver: "pagerduty-warning"
    - match:
        severity: info
      receiver: "slack-info"

receivers:
  - name: "pagerduty-critical"
    pagerduty_configs:
      - routing_key: "CRITICAL_INTEGRATION_KEY"
        severity: critical
        description: "{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}"
        details:
          firing: "{{ .Alerts.Firing | len }}"
          resolved: "{{ .Alerts.Resolved | len }}"
          cluster: "{{ .GroupLabels.cluster }}"

  - name: "pagerduty-warning"
    pagerduty_configs:
      - routing_key: "WARNING_INTEGRATION_KEY"
        severity: warning

  - name: "slack-info"
    slack_configs:
      - channel: "#monitoring"
        text: "{{ .CommonAnnotations.summary }}"
EOF

# 2. Grafana -> PagerDuty
# In Grafana UI:
# Settings -> Notification Channels -> Add Channel
# Type: PagerDuty
# Integration Key: YOUR_INTEGRATION_KEY
# Severity: critical/warning/info
# Auto-resolve: enabled

# 3. AWS CloudWatch -> PagerDuty
# Use CloudWatch -> SNS -> PagerDuty integration
# aws sns create-topic --name pagerduty-alerts
# Subscribe PagerDuty endpoint to SNS topic

# 4. Custom Application -> PagerDuty (Python)
cat > send_alert.py << 'PYEOF'
#!/usr/bin/env python3
import requests
import json

def send_pagerduty_alert(routing_key, summary, severity="critical", source="app", dedup_key=None):
    payload = {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            "severity": severity,
        },
    }
    if dedup_key:
        payload["dedup_key"] = dedup_key
    
    resp = requests.post("https://events.pagerduty.com/v2/enqueue", json=payload)
    return resp.json()

# Usage:
# send_pagerduty_alert("ROUTING_KEY", "Database connection pool exhausted", "critical", "db-pool-monitor")
PYEOF

echo "Integrations configured"

Incident Response Automation

Automate incident response

#!/usr/bin/env python3
# incident_automation.py - Incident Response Automation
import json
import logging
from datetime import datetime
from typing import Dict, List

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("automation")

class IncidentAutomation:
    def __init__(self):
        self.runbooks = {}
    
    def auto_remediation_rules(self):
        return {
            "high_cpu": {
                "trigger": "CPU usage > 90% for 5 minutes",
                "actions": [
                    "1. Scale up: kubectl scale deployment/api --replicas=<current+2>",
                    "2. Check top processes: ssh node 'ps aux --sort=-%cpu | head -20'",
                    "3. Alert on-call if still high after 10 minutes",
                ],
                "automation_script": "scripts/remediate_high_cpu.sh",
                "auto_resolve": True,
            },
            "disk_full": {
                "trigger": "Disk usage > 90%",
                "actions": [
                    "1. Clean old logs: find /var/log -type f -name '*.log' -mtime +7 -delete",
                    "2. Clean Docker: docker system prune -f",
                    "3. Clean temp files: rm -rf /tmp/cache/*",
                    "4. Alert if still > 85% after cleanup",
                ],
                "automation_script": "scripts/remediate_disk_full.sh",
                "auto_resolve": True,
            },
            "service_down": {
                "trigger": "Health check fails 3 consecutive times",
                "actions": [
                    "1. Restart service: systemctl restart myapp",
                    "2. Check logs: journalctl -u myapp --since '5 min ago'",
                    "3. If restart fails, failover to standby",
                    "4. Page on-call engineer",
                ],
                "automation_script": "scripts/remediate_service_down.sh",
                "auto_resolve": False,
            },
            "ssl_expiry": {
                "trigger": "SSL certificate expires in < 7 days",
                "actions": [
                    "1. Renew certificate: certbot renew",
                    "2. Reload nginx: nginx -s reload",
                    "3. Verify: openssl s_client -connect host:443",
                ],
                "automation_script": "scripts/renew_ssl.sh",
                "auto_resolve": True,
            },
        }
    
    def incident_metrics(self):
        return {
            "period": "last_30_days",
            "total_incidents": 156,
            "by_severity": {"critical": 12, "high": 34, "warning": 78, "info": 32},
            "mtta_minutes": 3.5,
            "mttr_minutes": 28.4,
            "auto_resolved_pct": 45,
            "escalated_pct": 15,
            "top_services": [
                {"service": "Production API", "incidents": 42, "mttr": 22},
                {"service": "Payment Gateway", "incidents": 28, "mttr": 35},
                {"service": "Database Cluster", "incidents": 18, "mttr": 45},
            ],
            "noise_reduction": {
                "total_alerts": 2400,
                "grouped_incidents": 156,
                "noise_reduction_pct": 93.5,
            },
        }

automation = IncidentAutomation()
rules = automation.auto_remediation_rules()
for name, rule in rules.items():
    print(f"{name}: {rule['trigger']}")
    print(f"  Auto-resolve: {rule['auto_resolve']}")

metrics = automation.incident_metrics()
print(f"\nMTTA: {metrics['mtta_minutes']} min, MTTR: {metrics['mttr_minutes']} min")
print(f"Noise reduction: {metrics['noise_reduction']['noise_reduction_pct']}%")
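A webhook consumer can dispatch incoming alerts to the remediation rules above. The sketch below keys off the Events API `class` field used earlier in this article; the class-to-rule table is an assumption for illustration, not part of PagerDuty itself:

```python
RULE_FOR_CLASS = {       # maps Events API "class" -> rule name (assumed mapping)
    "cpu": "high_cpu",
    "disk": "disk_full",
    "health": "service_down",
    "ssl": "ssl_expiry",
}

def pick_runbook(alert_payload, rules):
    """Return (rule_name, automation script) for an alert, or (None, None)."""
    rule_name = RULE_FOR_CLASS.get(alert_payload.get("class", ""))
    if rule_name is None or rule_name not in rules:
        return None, None
    return rule_name, rules[rule_name]["automation_script"]
```

Wired to `IncidentAutomation.auto_remediation_rules()`, a CPU alert would resolve to `scripts/remediate_high_cpu.sh`; unmatched classes fall through to the human on-call.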

Analytics and Post-Incident Review

Analyze incidents and run post-mortems:

# === Post-Incident Review Template ===

cat > post_incident_template.md << 'EOF'
# Post-Incident Review: [Incident Title]

## Summary
- **Date**: YYYY-MM-DD HH:MM - HH:MM (UTC+7)
- **Duration**: X hours Y minutes
- **Severity**: Critical / High / Medium
- **Impact**: X users affected, Y% error rate
- **Services**: [affected services]

## Timeline
| Time | Event |
|------|-------|
| HH:MM | Alert triggered: [description] |
| HH:MM | On-call acknowledged |
| HH:MM | Root cause identified |
| HH:MM | Mitigation applied |
| HH:MM | Service restored |
| HH:MM | Incident resolved |

## Root Cause
[Detailed explanation of what caused the incident]

## Impact
- Users affected: X
- Revenue impact: $Y
- SLA impact: Z minutes of downtime

## What Went Well
- Alert fired within 30 seconds
- On-call responded in 3 minutes
- Runbook was accurate and helpful

## What Went Wrong
- Initial diagnosis was incorrect
- Escalation took too long
- Communication to stakeholders was delayed

## Action Items
| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|
| Add monitoring for X | @engineer | YYYY-MM-DD | Pending |
| Update runbook for Y | @sre | YYYY-MM-DD | Pending |
| Fix root cause Z | @team | YYYY-MM-DD | Pending |

## Lessons Learned
[Key takeaways for the team]
EOF

# Prometheus Alert Rules for PagerDuty
cat > /etc/prometheus/alerts/pagerduty.yml << 'EOF'
groups:
  - name: pagerduty_alerts
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate: {{ $value | humanizePercentage }}"

      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency > 1s: {{ $value }}s"

      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.pod }} crash looping"
EOF

echo "Analytics and post-incident review configured"

FAQ: Frequently Asked Questions

Q: PagerDuty or OpsGenie: which should I choose?

A: PagerDuty is the market leader, with the most complete feature set: intelligent alert grouping, the broadest catalogue of integrations (700+), and deep analytics. It is more expensive and geared toward enterprises. OpsGenie (Atlassian) is cheaper and integrates tightly with Jira/Confluence; its features cover the essentials of incident management, and it is a natural fit if you are already in the Atlassian ecosystem. Both handle on-call scheduling, escalation, and alerting well, so decide based on budget and ecosystem.

Q: How does PagerDuty reduce alert fatigue?

A: PagerDuty attacks alert fatigue on several fronts. Intelligent Alert Grouping folds related alerts into a single incident, cutting noise by 90%+. Event Rules let you suppress, route, or transform alerts before they become incidents. Urgency Settings separate high from low urgency, so low-urgency alerts do not page anyone at night. Transient Alerts suppresses alerts that resolve themselves within a set window. Service Dependencies alert on the root cause only, not on every dependent service. Used together, these can reduce alert noise by up to 90%.
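The grouping behaviour is easiest to see through `dedup_key`: trigger events that share a key while an incident is open are folded into that incident rather than paging again. A toy in-memory model of that rule (not PagerDuty's actual grouping engine, which also applies ML-based similarity):

```python
def fold_events(events):
    """Group trigger events by dedup_key; a resolve closes the open incident."""
    open_incidents = {}   # dedup_key -> list of folded trigger events
    closed = []
    for ev in events:
        key = ev["dedup_key"]
        if ev["event_action"] == "trigger":
            open_incidents.setdefault(key, []).append(ev)
        elif ev["event_action"] == "resolve" and key in open_incidents:
            closed.append(open_incidents.pop(key))
    return open_incidents, closed
```

Two CPU triggers with the same key yield one incident with two folded events; a resolve closes it, and a later trigger on that key would open a fresh incident.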

Q: How should an on-call schedule be designed?

A: Best practices for on-call: rotate weekly (two weeks at most); keep the team at 4-5 people or more, so each person is on-call roughly one week per month; run both a primary and a secondary on-call; use follow-the-sun if the team spans timezones, so nobody gets paged in the middle of the night; set escalation so that if the primary does not acknowledge within 5-15 minutes it escalates to the secondary, then to the manager; and compensate on-call fairly, especially for off-hours pages.
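The weekly primary/secondary rotation described above reduces to simple modular arithmetic: with N engineers, week w maps to engineer w mod N as primary and the next in line as secondary. A sketch:

```python
def oncall_for_week(engineers, week):
    """Primary and secondary for a weekly rotation over the engineer list."""
    n = len(engineers)
    primary = engineers[week % n]
    secondary = engineers[(week + 1) % n]   # next in line backs up the primary
    return primary, secondary
```

With four engineers, each serves as primary one week in four, matching the "about one week per month" guideline.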

Q: What are good targets for MTTA and MTTR?

A: MTTA (Mean Time to Acknowledge): target under 5 minutes for critical incidents; the industry average is 3-10 minutes, and anything over 15 minutes is a sign the escalation policy needs review. MTTR (Mean Time to Resolve) depends on severity: Critical should be under 30 minutes, High under 2 hours, Medium under 8 hours; the industry average for critical incidents is 30-60 minutes. To bring MTTR down: write good runbooks, automate remediation of common issues, train the on-call rotation, and run post-incident reviews to eliminate recurring incidents.
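Both metrics are plain averages over incident timestamps (PagerDuty's Analytics computes them for you; the sketch below just shows the arithmetic on ISO-8601 times):

```python
from datetime import datetime

def mtta_mttr_minutes(incidents):
    """incidents: dicts with 'triggered', 'acknowledged', 'resolved' ISO-8601 times."""
    def minutes(a, b):
        return (datetime.fromisoformat(b) - datetime.fromisoformat(a)).total_seconds() / 60
    mtta = sum(minutes(i["triggered"], i["acknowledged"]) for i in incidents) / len(incidents)
    mttr = sum(minutes(i["triggered"], i["resolved"]) for i in incidents) / len(incidents)
    return round(mtta, 1), round(mttr, 1)
```

Feeding it two incidents acknowledged in 4 and 2 minutes and resolved in 30 and 20 gives MTTA 3.0 and MTTR 25.0 minutes, both comfortably inside the targets above.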
