SiamCafe.net Blog

Opsgenie Alert Best Practices ที่ต้องรู้

2025-07-08 · A. Bom — SiamCafe.net · 10,378 words


Tags: Opsgenie Alert Best Practices, Incident Management, On-call Schedule, Escalation Policy, Integration, Noise Reduction, Deduplication, Correlation, Priority Routing, Production

| Feature          | Opsgenie        | PagerDuty   | VictorOps  | Grafana OnCall |
|------------------|-----------------|-------------|------------|----------------|
| Pricing          | $9-29/user      | $21-41/user | $9-49/user | Free OSS       |
| Integrations     | 200+            | 700+        | 100+       | 50+            |
| Mobile App       | Excellent       | Excellent   | Good       | Fair           |
| Jira Integration | Native          | Plugin      | Plugin     | Plugin         |
| On-call          | Excellent       | Excellent   | Good       | Good           |
| Best for         | Atlassian Stack | Enterprise  | DevOps     | Budget/OSS     |

Alert Configuration

# === Opsgenie Alert Best Practices ===

# API — Create Alert
# curl -X POST https://api.opsgenie.com/v2/alerts \
#   -H "Content-Type: application/json" \
#   -H "Authorization: GenieKey YOUR_API_KEY" \
#   -d '{
#     "message": "CPU usage critical on web-server-01",
#     "alias": "cpu-critical-web-01",
#     "description": "CPU usage exceeded 95% for 5 minutes",
#     "responders": [{"type": "team", "name": "Platform Team"}],
#     "tags": ["critical", "infrastructure", "cpu"],
#     "priority": "P1",
#     "entity": "web-server-01",
#     "source": "Prometheus",
#     "details": {
#       "cpu_usage": "97%",
#       "duration": "5 minutes",
#       "host": "web-server-01",
#       "region": "ap-southeast-1"
#     }
#   }'
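For teams scripting this from Python, the same request body can be assembled before posting. A minimal sketch of the payload in the curl call above; the API key, host, and team name are placeholders, and `build_alert_payload` is a hypothetical helper, not part of any Opsgenie SDK:

```python
import json

# Alert API endpoint from the curl example above.
OPSGENIE_ALERTS_URL = "https://api.opsgenie.com/v2/alerts"

def build_alert_payload(host: str, cpu_usage: str, team: str = "Platform Team") -> dict:
    """Assemble the JSON body used in the curl call above."""
    return {
        "message": f"CPU usage critical on {host}",
        "alias": f"cpu-critical-{host}",  # alias doubles as the deduplication key
        "description": "CPU usage exceeded 95% for 5 minutes",
        "responders": [{"type": "team", "name": team}],
        "tags": ["critical", "infrastructure", "cpu"],
        "priority": "P1",
        "entity": host,
        "source": "Prometheus",
        "details": {"cpu_usage": cpu_usage, "host": host},
    }

payload = build_alert_payload("web-server-01", "97%")
print(json.dumps(payload, indent=2))
# Actually sending it would look like (requests not imported here):
# requests.post(OPSGENIE_ALERTS_URL, json=payload,
#               headers={"Authorization": "GenieKey YOUR_API_KEY"})
```

Building the payload separately from the send makes it easy to unit-test alert content without hitting the API.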

# Prometheus AlertManager Integration
# alertmanager.yml:
# receivers:
#   - name: opsgenie-critical
#     opsgenie_configs:
#       - api_key: YOUR_API_KEY
#         message: '{{ .CommonLabels.alertname }}'
#         priority: P1
#         tags: 'critical,{{ .CommonLabels.severity }}'
#         responders:
#           - type: team
#             name: Platform Team
#
#   - name: opsgenie-warning
#     opsgenie_configs:
#       - api_key: YOUR_API_KEY
#         priority: P3
#         responders:
#           - type: team
#             name: Platform Team

from dataclasses import dataclass

@dataclass
class AlertRule:
    name: str
    source: str
    condition: str
    priority: str
    team: str
    dedup_key: str

rules = [
    AlertRule("CPU Critical", "Prometheus", "cpu_usage > 95% for 5m", "P1", "Platform", "cpu-{host}"),
    AlertRule("Memory Critical", "Prometheus", "memory_usage > 90% for 5m", "P1", "Platform", "mem-{host}"),
    AlertRule("Disk Warning", "Prometheus", "disk_usage > 80%", "P3", "Platform", "disk-{host}-{mount}"),
    AlertRule("API Error Rate", "Datadog", "error_rate > 5% for 3m", "P2", "Backend", "api-error-{service}"),
    AlertRule("Service Down", "Uptime Kuma", "HTTP check failed 3x", "P1", "On-call", "down-{service}"),
    AlertRule("SSL Expiry", "Custom", "cert_days_left < 14", "P3", "Platform", "ssl-{domain}"),
    AlertRule("DB Replication", "CloudWatch", "replica_lag > 30s", "P2", "DBA", "db-lag-{instance}"),
]

print("=== Alert Rules ===")
for r in rules:
    print(f"  [{r.priority}] {r.name} | Source: {r.source}")
    print(f"    Condition: {r.condition}")
    print(f"    Team: {r.team} | Dedup: {r.dedup_key}")
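The `dedup_key` values above are plain format-string templates; a small helper can render them from event fields (the field names here are illustrative):

```python
def render_dedup_key(template: str, event: dict) -> str:
    """Fill a dedup-key template such as 'disk-{host}-{mount}' from event fields."""
    return template.format(**event)

# Two occurrences of the same condition on the same host yield the same alias,
# so Opsgenie collapses them into a single alert.
alias = render_dedup_key("disk-{host}-{mount}", {"host": "web-01", "mount": "/var"})
print(alias)  # disk-web-01-/var
```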

On-call and Escalation

# === On-call Schedule and Escalation ===

# Schedule: Platform Team — Weekly Rotation
# Week 1: Alice (Primary) + Bob (Secondary)
# Week 2: Bob (Primary) + Charlie (Secondary)
# Week 3: Charlie (Primary) + Alice (Secondary)
# Rotation: Every Monday 09:00 ICT

# Escalation Policy: Critical (P1)
# Step 1 (0 min): Notify On-call Primary → Push + SMS
# Step 2 (5 min): Notify On-call Secondary → Push + SMS + Call
# Step 3 (15 min): Notify Team Lead → Push + SMS + Call + Email
# Step 4 (30 min): Notify Engineering Manager → All channels
# Repeat: 3 times

# Escalation Policy: Warning (P3)
# Step 1 (0 min): Notify On-call Primary → Push only
# Step 2 (30 min): Notify On-call Secondary → Push + Email
# No further escalation — review in morning standup

@dataclass
class EscalationStep:
    step: int
    wait_min: int
    notify: str
    channels: str
    condition: str

p1_escalation = [
    EscalationStep(1, 0, "On-call Primary", "Push + SMS", "Alert created"),
    EscalationStep(2, 5, "On-call Secondary", "Push + SMS + Call", "Not acknowledged"),
    EscalationStep(3, 15, "Team Lead", "Push + SMS + Call + Email", "Not acknowledged"),
    EscalationStep(4, 30, "Engineering Manager", "All channels", "Not acknowledged"),
]

print("=== P1 Escalation Policy ===")
for s in p1_escalation:
    print(f"  Step {s.step} (+{s.wait_min}min): {s.notify}")
    print(f"    Channels: {s.channels} | When: {s.condition}")
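One way to sanity-check the policy is to simulate which steps have fired at a given moment. A minimal sketch assuming the P1 wait times listed above; `notified_so_far` is an illustrative helper, not an Opsgenie API:

```python
# Mirrors the P1 policy above as (wait minutes, responder) pairs.
P1_STEPS = [
    (0, "On-call Primary"),
    (5, "On-call Secondary"),
    (15, "Team Lead"),
    (30, "Engineering Manager"),
]

def notified_so_far(steps, minutes_elapsed, acknowledged_at=None):
    """Responders paged by `minutes_elapsed`; escalation stops at acknowledgement."""
    cutoff = minutes_elapsed if acknowledged_at is None else min(minutes_elapsed, acknowledged_at)
    return [who for wait, who in steps if wait <= cutoff]

print(notified_so_far(P1_STEPS, 20))                     # primary, secondary, team lead
print(notified_so_far(P1_STEPS, 20, acknowledged_at=4))  # only the primary
```

A fast acknowledgement cuts the escalation short, which is exactly why MTTA is worth tracking.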

# Noise Reduction Config
noise_reduction = {
    "Deduplication": "alias field as the key to merge duplicate alerts — 40% reduction",
    "Correlation": "group alerts from the same host within 5 minutes",
    "Threshold Tuning": "raise thresholds from 80% to 90% to cut false positives",
    "Maintenance Window": "suppress alerts for 30 minutes during deployments",
    "Auto-close": "auto-close transient alerts that recover within 2 minutes",
    "Priority Filtering": "P4-P5 go to email only — no push, no calls",
}

print("\n\nNoise Reduction Strategies:")
for k, v in noise_reduction.items():
    print(f"  [{k}]: {v}")
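The payoff of alias-based deduplication can be estimated by comparing the raw alert count to the number of unique aliases. A toy sketch with made-up aliases:

```python
def dedup_ratio(aliases):
    """Fraction of raw alerts suppressed by alias-based deduplication."""
    if not aliases:
        return 0.0
    return 1 - len(set(aliases)) / len(aliases)

# Made-up sample: five CPU repeats from one host plus two distinct alerts.
raw = ["cpu-web-01"] * 5 + ["mem-web-01", "disk-db-01"]
print(f"{dedup_ratio(raw):.0%} of raw alerts were duplicates")  # 57%
```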

Metrics and Improvement

# === Incident Metrics ===

@dataclass
class IncidentMetric:
    metric: str
    current: str
    target: str
    trend: str

metrics = [
    IncidentMetric("MTTA (Mean Time to Acknowledge)", "3.2 min", "<5 min", "Good"),
    IncidentMetric("MTTR (Mean Time to Resolve)", "28 min", "<30 min", "Borderline"),
    IncidentMetric("P1 Incidents/month", "4", "<5", "Good"),
    IncidentMetric("P2 Incidents/month", "12", "<15", "Good"),
    IncidentMetric("Alert Noise Ratio", "15%", "<10%", "Needs work"),
    IncidentMetric("Escalation Rate", "8%", "<10%", "Good"),
    IncidentMetric("On-call Burnout Score", "3.5/10", "<4/10", "Acceptable"),
    IncidentMetric("Postmortem Completion", "92%", "100%", "Needs work"),
]

print("Incident Metrics:")
for m in metrics:
    print(f"  [{m.trend}] {m.metric}: {m.current} (Target: {m.target})")
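A quick way to flag metrics missing their targets is to parse the `current` and `target` strings. An illustrative parser that assumes the formats used in the table above (a bare number is treated as a floor, e.g. the 100% postmortem target):

```python
import re

def meets_target(current, target):
    """Compare a metric string like '3.2 min' against a '<5 min' style target."""
    cur = float(re.search(r"[\d.]+", current).group())
    tgt = float(re.search(r"[\d.]+", target).group())
    return cur < tgt if target.startswith("<") else cur >= tgt

print(meets_target("3.2 min", "<5 min"))  # True  -> MTTA within target
print(meets_target("15%", "<10%"))        # False -> noise ratio needs work
```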

improvement_actions = {
    "Week 1": "Tune CPU alert threshold 80→90% — reduce noise 25%",
    "Week 2": "Add dedup key to all Prometheus alerts — reduce duplicates 40%",
    "Week 3": "Create maintenance windows for deployments — reduce false alerts",
    "Week 4": "Review and close 15 stale alert rules — cleaner config",
    "Month 2": "Implement correlation rules for multi-service alerts",
    "Month 3": "Add auto-remediation for disk full (auto cleanup)",
}

print("\n\nImprovement Roadmap:")
for k, v in improvement_actions.items():
    print(f"  [{k}]: {v}")

Tips

Applying this knowledge in practice

Recommended learning resources include the official documentation, which is always kept up to date; online courses from Coursera, Udemy, and edX; quality YouTube channels in both Thai and English; and communities such as Discord, Reddit, and Stack Overflow, where you can exchange experience with developers worldwide.

What is Opsgenie?

Opsgenie is Atlassian's incident management platform. It routes alerts to the right responders, manages on-call schedules and escalation policies, notifies over multiple channels (push, SMS, phone call, email, Slack), offers 200+ integrations, and ships a full-featured mobile app.

How do you set up an on-call schedule?

Create a rotation (weekly, daily, or custom), add overrides for holidays, use restrictions for out-of-hours coverage, assign primary and secondary responders, attach an escalation policy, and review the schedule in the calendar view on web or mobile.
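The weekly rotation described earlier (Alice → Bob → Charlie, switching every Monday) can be computed for any date. A sketch that assumes 2025-06-30 as the Monday starting week 1 and ignores the 09:00 ICT handoff time:

```python
from datetime import date

ROTATION = ["Alice", "Bob", "Charlie"]  # order from the weekly schedule above

def on_call(day, anchor=date(2025, 6, 30)):
    """Return the (primary, secondary) pair for `day`.

    `anchor` is an assumed week-1 Monday; the secondary is simply
    the next person in the rotation after the primary.
    """
    weeks = (day - anchor).days // 7
    return ROTATION[weeks % 3], ROTATION[(weeks + 1) % 3]

print(on_call(date(2025, 7, 8)))  # ('Bob', 'Charlie') -> week 2
```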

How do you reduce alert noise?

Deduplicate repeats with the alias field, correlate related alerts, tune thresholds, route by priority, schedule maintenance windows around deployments, auto-close transient alerts, and review alert rules monthly.

How do you configure an escalation policy?

Escalate in steps: first the on-call primary (push + SMS), then the secondary (adding a phone call), then the team lead, then the manager, with waits of 5-30 minutes between steps. Repeat the cycle if needed, and stop escalating once the alert is acknowledged or closed.

Summary

Effective Opsgenie alerting combines well-designed on-call schedules and escalation policies with deduplication, correlation, and priority-based routing to cut noise, then tracks MTTA, MTTR, and postmortem completion to keep production incident management improving.
