SiamCafe.net Blog

Opsgenie Alert Best Practices ที่ต้องรู้

2025-07-08 · A. Bom — SiamCafe.net · 10,378 words


Tags: Opsgenie Alert Best Practices, Incident Management, On-call Schedule, Escalation Policy, Integration, Noise Reduction, Deduplication, Correlation, Priority Routing, Production

| Feature          | Opsgenie        | PagerDuty   | VictorOps  | Grafana OnCall |
|------------------|-----------------|-------------|------------|----------------|
| Pricing          | $9-29/user      | $21-41/user | $9-49/user | Free OSS       |
| Integrations     | 200+            | 700+        | 100+       | 50+            |
| Mobile App       | Excellent       | Excellent   | Good       | Fair           |
| Jira Integration | Native          | Plugin      | Plugin     | Plugin         |
| On-call          | Excellent       | Excellent   | Good       | Good           |
| Best for         | Atlassian Stack | Enterprise  | DevOps     | Budget/OSS     |

Alert Configuration

# === Opsgenie Alert Best Practices ===

# API — Create Alert
# curl -X POST https://api.opsgenie.com/v2/alerts \
#   -H "Content-Type: application/json" \
#   -H "Authorization: GenieKey YOUR_API_KEY" \
#   -d '{
#     "message": "CPU usage critical on web-server-01",
#     "alias": "cpu-critical-web-01",
#     "description": "CPU usage exceeded 95% for 5 minutes",
#     "responders": [{"type": "team", "name": "Platform Team"}],
#     "tags": ["critical", "infrastructure", "cpu"],
#     "priority": "P1",
#     "entity": "web-server-01",
#     "source": "Prometheus",
#     "details": {
#       "cpu_usage": "97%",
#       "duration": "5 minutes",
#       "host": "web-server-01",
#       "region": "ap-southeast-1"
#     }
#   }'
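For teams scripting this from Python, the same request body can be assembled before posting. A minimal sketch of the payload in the curl call above; the API key, host, and team name are placeholders, and `build_alert_payload` is a hypothetical helper, not part of any Opsgenie SDK:

```python
import json

# Alert API endpoint from the curl example above.
OPSGENIE_ALERTS_URL = "https://api.opsgenie.com/v2/alerts"

def build_alert_payload(host: str, cpu_usage: str, team: str = "Platform Team") -> dict:
    """Assemble the JSON body used in the curl call above."""
    return {
        "message": f"CPU usage critical on {host}",
        "alias": f"cpu-critical-{host}",  # alias doubles as the deduplication key
        "description": "CPU usage exceeded 95% for 5 minutes",
        "responders": [{"type": "team", "name": team}],
        "tags": ["critical", "infrastructure", "cpu"],
        "priority": "P1",
        "entity": host,
        "source": "Prometheus",
        "details": {"cpu_usage": cpu_usage, "host": host},
    }

payload = build_alert_payload("web-server-01", "97%")
print(json.dumps(payload, indent=2))
# Actually sending it would look like (requests not imported here):
# requests.post(OPSGENIE_ALERTS_URL, json=payload,
#               headers={"Authorization": "GenieKey YOUR_API_KEY"})
```

Building the payload separately from the send makes it easy to unit-test alert content without hitting the API.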

# Prometheus AlertManager Integration
# alertmanager.yml:
# receivers:
#   - name: opsgenie-critical
#     opsgenie_configs:
#       - api_key: YOUR_API_KEY
#         message: '{{ .CommonLabels.alertname }}'
#         priority: P1
#         tags: 'critical,{{ .CommonLabels.severity }}'
#         responders:
#           - type: team
#             name: Platform Team
#
#   - name: opsgenie-warning
#     opsgenie_configs:
#       - api_key: YOUR_API_KEY
#         priority: P3
#         responders:
#           - type: team
#             name: Platform Team

from dataclasses import dataclass

@dataclass
class AlertRule:
    name: str
    source: str
    condition: str
    priority: str
    team: str
    dedup_key: str

rules = [
    AlertRule("CPU Critical", "Prometheus", "cpu_usage > 95% for 5m", "P1", "Platform", "cpu-{host}"),
    AlertRule("Memory Critical", "Prometheus", "memory_usage > 90% for 5m", "P1", "Platform", "mem-{host}"),
    AlertRule("Disk Warning", "Prometheus", "disk_usage > 80%", "P3", "Platform", "disk-{host}-{mount}"),
    AlertRule("API Error Rate", "Datadog", "error_rate > 5% for 3m", "P2", "Backend", "api-error-{service}"),
    AlertRule("Service Down", "Uptime Kuma", "HTTP check failed 3x", "P1", "On-call", "down-{service}"),
    AlertRule("SSL Expiry", "Custom", "cert_days_left < 14", "P3", "Platform", "ssl-{domain}"),
    AlertRule("DB Replication", "CloudWatch", "replica_lag > 30s", "P2", "DBA", "db-lag-{instance}"),
]

print("=== Alert Rules ===")
for r in rules:
    print(f"  [{r.priority}] {r.name} | Source: {r.source}")
    print(f"    Condition: {r.condition}")
    print(f"    Team: {r.team} | Dedup: {r.dedup_key}")
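The `dedup_key` values above are plain format-string templates; a small helper can render them from event fields (the field names here are illustrative):

```python
def render_dedup_key(template: str, event: dict) -> str:
    """Fill a dedup-key template such as 'disk-{host}-{mount}' from event fields."""
    return template.format(**event)

# Two occurrences of the same condition on the same host yield the same alias,
# so Opsgenie collapses them into a single alert.
alias = render_dedup_key("disk-{host}-{mount}", {"host": "web-01", "mount": "/var"})
print(alias)  # disk-web-01-/var
```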

On-call and Escalation

# === On-call Schedule and Escalation ===

# Schedule: Platform Team — Weekly Rotation
# Week 1: Alice (Primary) + Bob (Secondary)
# Week 2: Bob (Primary) + Charlie (Secondary)
# Week 3: Charlie (Primary) + Alice (Secondary)
# Rotation: Every Monday 09:00 ICT

# Escalation Policy: Critical (P1)
# Step 1 (0 min): Notify On-call Primary → Push + SMS
# Step 2 (5 min): Notify On-call Secondary → Push + SMS + Call
# Step 3 (15 min): Notify Team Lead → Push + SMS + Call + Email
# Step 4 (30 min): Notify Engineering Manager → All channels
# Repeat: 3 times

# Escalation Policy: Warning (P3)
# Step 1 (0 min): Notify On-call Primary → Push only
# Step 2 (30 min): Notify On-call Secondary → Push + Email
# No further escalation — review in morning standup

@dataclass
class EscalationStep:
    step: int
    wait_min: int
    notify: str
    channels: str
    condition: str

p1_escalation = [
    EscalationStep(1, 0, "On-call Primary", "Push + SMS", "Alert created"),
    EscalationStep(2, 5, "On-call Secondary", "Push + SMS + Call", "Not acknowledged"),
    EscalationStep(3, 15, "Team Lead", "Push + SMS + Call + Email", "Not acknowledged"),
    EscalationStep(4, 30, "Engineering Manager", "All channels", "Not acknowledged"),
]

print("=== P1 Escalation Policy ===")
for s in p1_escalation:
    print(f"  Step {s.step} (+{s.wait_min}min): {s.notify}")
    print(f"    Channels: {s.channels} | When: {s.condition}")
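One way to sanity-check the policy is to simulate which steps have fired at a given moment. A minimal sketch assuming the P1 wait times listed above; `notified_so_far` is an illustrative helper, not an Opsgenie API:

```python
# Mirrors the P1 policy above as (wait minutes, responder) pairs.
P1_STEPS = [
    (0, "On-call Primary"),
    (5, "On-call Secondary"),
    (15, "Team Lead"),
    (30, "Engineering Manager"),
]

def notified_so_far(steps, minutes_elapsed, acknowledged_at=None):
    """Responders paged by `minutes_elapsed`; escalation stops at acknowledgement."""
    cutoff = minutes_elapsed if acknowledged_at is None else min(minutes_elapsed, acknowledged_at)
    return [who for wait, who in steps if wait <= cutoff]

print(notified_so_far(P1_STEPS, 20))                     # primary, secondary, team lead
print(notified_so_far(P1_STEPS, 20, acknowledged_at=4))  # only the primary
```

A fast acknowledgement cuts the escalation short, which is exactly why MTTA is worth tracking.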

# Noise Reduction Config
noise_reduction = {
    "Deduplication": "alias field as the key to merge duplicate alerts — 40% reduction",
    "Correlation": "group alerts from the same host within 5 minutes",
    "Threshold Tuning": "raise thresholds from 80% to 90% to cut false positives",
    "Maintenance Window": "suppress alerts for 30 minutes during deployments",
    "Auto-close": "auto-close transient alerts that recover within 2 minutes",
    "Priority Filtering": "P4-P5 go to email only — no push, no calls",
}

print("\n\nNoise Reduction Strategies:")
for k, v in noise_reduction.items():
    print(f"  [{k}]: {v}")
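The payoff of alias-based deduplication can be estimated by comparing the raw alert count to the number of unique aliases. A toy sketch with made-up aliases:

```python
def dedup_ratio(aliases):
    """Fraction of raw alerts suppressed by alias-based deduplication."""
    if not aliases:
        return 0.0
    return 1 - len(set(aliases)) / len(aliases)

# Made-up sample: five CPU repeats from one host plus two distinct alerts.
raw = ["cpu-web-01"] * 5 + ["mem-web-01", "disk-db-01"]
print(f"{dedup_ratio(raw):.0%} of raw alerts were duplicates")  # 57%
```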

Metrics and Improvement

# === Incident Metrics ===

@dataclass
class IncidentMetric:
    metric: str
    current: str
    target: str
    trend: str

metrics = [
    IncidentMetric("MTTA (Mean Time to Acknowledge)", "3.2 min", "<5 min", "Good"),
    IncidentMetric("MTTR (Mean Time to Resolve)", "28 min", "<30 min", "Borderline"),
    IncidentMetric("P1 Incidents/month", "4", "<5", "Good"),
    IncidentMetric("P2 Incidents/month", "12", "<15", "Good"),
    IncidentMetric("Alert Noise Ratio", "15%", "<10%", "Needs work"),
    IncidentMetric("Escalation Rate", "8%", "<10%", "Good"),
    IncidentMetric("On-call Burnout Score", "3.5/10", "<4/10", "Acceptable"),
    IncidentMetric("Postmortem Completion", "92%", "100%", "Needs work"),
]

print("Incident Metrics:")
for m in metrics:
    print(f"  [{m.trend}] {m.metric}: {m.current} (Target: {m.target})")
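A quick way to flag metrics missing their targets is to parse the `current` and `target` strings. An illustrative parser that assumes the formats used in the table above (a bare number is treated as a floor, e.g. the 100% postmortem target):

```python
import re

def meets_target(current, target):
    """Compare a metric string like '3.2 min' against a '<5 min' style target."""
    cur = float(re.search(r"[\d.]+", current).group())
    tgt = float(re.search(r"[\d.]+", target).group())
    return cur < tgt if target.startswith("<") else cur >= tgt

print(meets_target("3.2 min", "<5 min"))  # True  -> MTTA within target
print(meets_target("15%", "<10%"))        # False -> noise ratio needs work
```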

improvement_actions = {
    "Week 1": "Tune CPU alert threshold 80→90% — reduce noise 25%",
    "Week 2": "Add dedup key to all Prometheus alerts — reduce duplicates 40%",
    "Week 3": "Create maintenance windows for deployments — reduce false alerts",
    "Week 4": "Review and close 15 stale alert rules — cleaner config",
    "Month 2": "Implement correlation rules for multi-service alerts",
    "Month 3": "Add auto-remediation for disk full (auto cleanup)",
}

print("\n\nImprovement Roadmap:")
for k, v in improvement_actions.items():
    print(f"  [{k}]: {v}")

Tips

Applying this knowledge in practice

Recommended learning resources include the official documentation, which is always kept up to date; online courses from Coursera, Udemy, and edX; quality YouTube channels in both Thai and English; and communities such as Discord, Reddit, and Stack Overflow, where you can exchange experience with developers worldwide.

What is Opsgenie?

Opsgenie is Atlassian's incident management platform. It routes alerts to the right responders, manages on-call schedules and escalation policies, notifies over multiple channels (push, SMS, phone call, email, Slack), offers 200+ integrations, and ships a full-featured mobile app.

How do you set up an on-call schedule?

Create a rotation (weekly, daily, or custom), add overrides for holidays, use restrictions for out-of-hours coverage, assign primary and secondary responders, attach an escalation policy, and review the schedule in the calendar view on web or mobile.
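The weekly rotation described earlier (Alice → Bob → Charlie, switching every Monday) can be computed for any date. A sketch that assumes 2025-06-30 as the Monday starting week 1 and ignores the 09:00 ICT handoff time:

```python
from datetime import date

ROTATION = ["Alice", "Bob", "Charlie"]  # order from the weekly schedule above

def on_call(day, anchor=date(2025, 6, 30)):
    """Return the (primary, secondary) pair for `day`.

    `anchor` is an assumed week-1 Monday; the secondary is simply
    the next person in the rotation after the primary.
    """
    weeks = (day - anchor).days // 7
    return ROTATION[weeks % 3], ROTATION[(weeks + 1) % 3]

print(on_call(date(2025, 7, 8)))  # ('Bob', 'Charlie') -> week 2
```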

How do you reduce alert noise?

Deduplicate repeats with the alias field, correlate related alerts, tune thresholds, route by priority, schedule maintenance windows around deployments, auto-close transient alerts, and review alert rules monthly.

How do you configure an escalation policy?

Escalate in steps: first the on-call primary (push + SMS), then the secondary (adding a phone call), then the team lead, then the manager, with waits of 5-30 minutes between steps. Repeat the cycle if needed, and stop escalating once the alert is acknowledged or closed.

Summary

Effective Opsgenie alerting combines well-designed on-call schedules and escalation policies with deduplication, correlation, and priority-based routing to cut noise, then tracks MTTA, MTTR, and postmortem completion to keep production incident management improving.
