Opsgenie Alert Best Practices You Need to Know
Opsgenie Alert

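This article covers Opsgenie alert best practices for production incident management: on-call schedules, escalation policies, integrations, and noise reduction through deduplication, correlation, and priority-based routing. The table below compares Opsgenie with similar tools.
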
| Feature | Opsgenie | PagerDuty | VictorOps | Grafana OnCall |
|---|---|---|---|---|
| Pricing | $9-29/user | $21-41/user | $9-49/user | Free OSS |
| Integrations | 200+ | 700+ | 100+ | 50+ |
| Mobile App | Very good | Very good | Good | Average |
| Jira Integration | Native | Plugin | Plugin | Plugin |
| On-call | Very good | Very good | Good | Good |
| Best for | Atlassian Stack | Enterprise | DevOps | Budget/OSS |
Alert Configuration
# === Opsgenie Alert Best Practices ===
# API: Create Alert
# curl -X POST https://api.opsgenie.com/v2/alerts \
#   -H "Content-Type: application/json" \
#   -H "Authorization: GenieKey YOUR_API_KEY" \
#   -d '{
#     "message": "CPU usage critical on web-server-01",
#     "alias": "cpu-critical-web-01",
#     "description": "CPU usage exceeded 95% for 5 minutes",
#     "responders": [{"type": "team", "name": "Platform Team"}],
#     "tags": ["critical", "infrastructure", "cpu"],
#     "priority": "P1",
#     "entity": "web-server-01",
#     "source": "Prometheus",
#     "details": {
#       "cpu_usage": "97%",
#       "duration": "5 minutes",
#       "host": "web-server-01",
#       "region": "ap-southeast-1"
#     }
#   }'
# Prometheus Alertmanager integration
# alertmanager.yml:
# receivers:
#   - name: opsgenie-critical
#     opsgenie_configs:
#       - api_key: YOUR_API_KEY
#         message: '{{ .CommonLabels.alertname }}'
#         priority: P1
#         tags: 'critical,{{ .CommonLabels.severity }}'
#         responders:
#           - type: team
#             name: Platform Team
#
#   - name: opsgenie-warning
#     opsgenie_configs:
#       - api_key: YOUR_API_KEY
#         priority: P3
#         responders:
#           - type: team
#             name: Platform Team
from dataclasses import dataclass


@dataclass
class AlertRule:
    """One alert rule: where it comes from, when it fires, and who receives it."""
    name: str
    source: str
    condition: str
    priority: str
    team: str
    dedup_key: str  # template for the Opsgenie alias, e.g. "cpu-{host}"


rules = [
    AlertRule("CPU Critical", "Prometheus", "cpu_usage > 95% for 5m", "P1", "Platform", "cpu-{host}"),
    AlertRule("Memory Critical", "Prometheus", "memory_usage > 90% for 5m", "P1", "Platform", "mem-{host}"),
    AlertRule("Disk Warning", "Prometheus", "disk_usage > 80%", "P3", "Platform", "disk-{host}-{mount}"),
    AlertRule("API Error Rate", "Datadog", "error_rate > 5% for 3m", "P2", "Backend", "api-error-{service}"),
    AlertRule("Service Down", "Uptime Kuma", "HTTP check failed 3x", "P1", "On-call", "down-{service}"),
    AlertRule("SSL Expiry", "Custom", "cert_days_left < 14", "P3", "Platform", "ssl-{domain}"),
    AlertRule("DB Replication", "CloudWatch", "replica_lag > 30s", "P2", "DBA", "db-lag-{instance}"),
]

# Print every rule with its condition, owning team, and dedup key template.
print("=== Alert Rules ===")
for r in rules:
    print(f"  [{r.priority}] {r.name} | Source: {r.source}")
    print(f"    Condition: {r.condition}")
    print(f"    Team: {r.team} | Dedup: {r.dedup_key}")
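The curl call and the dedup key templates above can be combined in a few lines of Python. The following is a minimal sketch, not an official client: `build_alert_request` and its parameters are hypothetical names, and it only builds the HTTP request without sending it, so it runs without a real API key. It assumes the dedup key template is filled from the alert's labels and used as the `alias`.

```python
import json
import urllib.request

OPSGENIE_ALERTS_URL = "https://api.opsgenie.com/v2/alerts"


def build_alert_request(api_key, message, dedup_template, labels, priority, team):
    """Build (but do not send) a Create Alert request.

    Hypothetical helper: dedup_template is a string such as "cpu-{host}",
    filled from labels to produce the alias.
    """
    payload = {
        "message": message,
        "alias": dedup_template.format(**labels),
        "priority": priority,
        "responders": [{"type": "team", "name": team}],
        "details": {key: str(value) for key, value in labels.items()},
    }
    return urllib.request.Request(
        OPSGENIE_ALERTS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"GenieKey {api_key}",
        },
        method="POST",
    )


req = build_alert_request(
    api_key="YOUR_API_KEY",
    message="CPU usage critical on web-server-01",
    dedup_template="cpu-{host}",
    labels={"host": "web-server-01"},
    priority="P1",
    team="Platform Team",
)
body = json.loads(req.data)
print(body["alias"])  # cpu-web-server-01
```

To actually send the alert, pass `req` to `urllib.request.urlopen` with a real key. Building the alias from a template keeps repeated firings of the same condition on the same host under one key.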
On-call and Escalation
# === On-call Schedule and Escalation ===
# Schedule: Platform Team, weekly rotation
#   Week 1: Alice (Primary) + Bob (Secondary)
#   Week 2: Bob (Primary) + Charlie (Secondary)
#   Week 3: Charlie (Primary) + Alice (Secondary)
#   Rotation: Every Monday 09:00 ICT
# Escalation Policy: Critical (P1)
#   Step 1 (0 min): Notify On-call Primary → Push + SMS
#   Step 2 (5 min): Notify On-call Secondary → Push + SMS + Call
#   Step 3 (15 min): Notify Team Lead → Push + SMS + Call + Email
#   Step 4 (30 min): Notify Engineering Manager → All channels
#   Repeat: 3 times
# Escalation Policy: Warning (P3)
#   Step 1 (0 min): Notify On-call Primary → Push only
#   Step 2 (30 min): Notify On-call Secondary → Push + Email
#   No further escalation; review in morning standup
from dataclasses import dataclass


@dataclass
class EscalationStep:
    """One step of an escalation policy."""
    step: int
    wait_min: int   # minutes after alert creation
    notify: str
    channels: str
    condition: str


p1_escalation = [
    EscalationStep(1, 0, "On-call Primary", "Push + SMS", "Alert created"),
    EscalationStep(2, 5, "On-call Secondary", "Push + SMS + Call", "Not acknowledged"),
    EscalationStep(3, 15, "Team Lead", "Push + SMS + Call + Email", "Not acknowledged"),
    EscalationStep(4, 30, "Engineering Manager", "All channels", "Not acknowledged"),
]

print("=== P1 Escalation Policy ===")
for s in p1_escalation:
    print(f"  Step {s.step} (+{s.wait_min}min): {s.notify}")
    print(f"    Channels: {s.channels} | When: {s.condition}")
# Noise Reduction Config
noise_reduction = {
    "Deduplication": "Use the alias field as the key to merge duplicate alerts; reduces duplicates by 40%",
    "Correlation": "Group alerts from the same host within 5 minutes",
    "Threshold Tuning": "Raise the threshold from 80% to 90% to reduce false positives",
    "Maintenance Window": "Suppress alerts for 30 minutes during deployments",
    "Auto-close": "Close transient alerts automatically if they recover within 2 minutes",
    "Priority Filtering": "P4-P5 go to email only, with no push and no phone call",
}

print("\n\nNoise Reduction Strategies:")
for k, v in noise_reduction.items():
    print(f"  [{k}]: {v}")
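The deduplication strategy above relies on Opsgenie counting a new event with the alias of an open alert against that alert instead of opening a new one. The sketch below imitates that behavior locally to show why a stable alias matters. The `deduplicate` function and the sample events are hypothetical and do not call Opsgenie.

```python
def deduplicate(events):
    """Collapse events that share an alias into one alert with a count.

    Local illustration only; the alias plays the role of the dedup key.
    """
    open_alerts = {}
    for event in events:
        alias = event["alias"]
        if alias in open_alerts:
            # Same alias as an open alert: bump the count, no new page.
            open_alerts[alias]["count"] += 1
        else:
            open_alerts[alias] = {"message": event["message"], "count": 1}
    return open_alerts


# Hypothetical sample: the same CPU condition fires three times.
events = [
    {"alias": "cpu-web-01", "message": "CPU usage critical on web-01"},
    {"alias": "cpu-web-01", "message": "CPU usage critical on web-01"},
    {"alias": "disk-web-01-/var", "message": "Disk usage high on web-01:/var"},
    {"alias": "cpu-web-01", "message": "CPU usage critical on web-01"},
]

alerts = deduplicate(events)
for alias, alert in alerts.items():
    print(f"{alias}: {alert['count']} occurrence(s)")
```

Four incoming events become two alerts here. If the alias included something that changes on every firing, such as a timestamp, every event would open its own alert and deduplication would have no effect.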
Metrics and Improvement
# === Incident Metrics ===
from dataclasses import dataclass


@dataclass
class IncidentMetric:
    """A tracked incident metric with its current value and target."""
    metric: str
    current: str
    target: str
    trend: str


metrics = [
    IncidentMetric("MTTA (Mean Time to Acknowledge)", "3.2 min", "<5 min", "Good"),
    IncidentMetric("MTTR (Mean Time to Resolve)", "28 min", "<30 min", "Borderline"),
    IncidentMetric("P1 Incidents/month", "4", "<5", "Good"),
    IncidentMetric("P2 Incidents/month", "12", "<15", "Good"),
    IncidentMetric("Alert Noise Ratio", "15%", "<10%", "Needs work"),
    IncidentMetric("Escalation Rate", "8%", "<10%", "Good"),
    IncidentMetric("On-call Burnout Score", "3.5/10", "<4/10", "Acceptable"),
    IncidentMetric("Postmortem Completion", "92%", "100%", "Needs work"),
]

print("Incident Metrics:")
for m in metrics:
    print(f"  [{m.trend}] {m.metric}: {m.current} (Target: {m.target})")

improvement_actions = {
    "Week 1": "Tune CPU alert threshold 80→90% (cuts noise by 25%)",
    "Week 2": "Add dedup key to all Prometheus alerts (cuts duplicates by 40%)",
    "Week 3": "Create maintenance windows for deployments (fewer false alerts)",
    "Week 4": "Review and close 15 stale alert rules (cleaner config)",
    "Month 2": "Implement correlation rules for multi-service alerts",
    "Month 3": "Add auto-remediation for disk full (auto cleanup)",
}

print("\n\nImprovement Roadmap:")
for k, v in improvement_actions.items():
    print(f"  [{k}]: {v}")
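MTTA and MTTR in the metrics above are averages over incident timestamps. A minimal sketch of the arithmetic follows, assuming MTTA is measured from creation to acknowledgement and MTTR from creation to resolution. The incident records are invented sample data, not figures from this article.

```python
from datetime import datetime
from statistics import mean


def minutes_between(start, end):
    """Elapsed time between two datetimes, in minutes."""
    return (end - start).total_seconds() / 60


# Hypothetical incident records (created, acknowledged, resolved).
incidents = [
    {
        "created": datetime(2024, 1, 1, 10, 0),
        "acked": datetime(2024, 1, 1, 10, 2),
        "resolved": datetime(2024, 1, 1, 10, 30),
    },
    {
        "created": datetime(2024, 1, 2, 14, 0),
        "acked": datetime(2024, 1, 2, 14, 4),
        "resolved": datetime(2024, 1, 2, 14, 20),
    },
]

mtta = mean(minutes_between(i["created"], i["acked"]) for i in incidents)
mttr = mean(minutes_between(i["created"], i["resolved"]) for i in incidents)

print(f"MTTA: {mtta:.1f} min, MTTR: {mttr:.1f} min")  # MTTA: 3.0 min, MTTR: 25.0 min
```

Teams define MTTR differently (some start the clock at acknowledgement), so the definition should be fixed before comparing against a target.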
Tips
- Dedup: Use the alias field as the dedup key on every alert; this cuts duplicates by 40%.
- Priority: Define priorities P1-P5 clearly. P1 triggers a phone call, P5 only an email.
- Rotation: Rotate on-call duty every week to reduce burnout.
- Postmortem: Write a postmortem for every P1 and P2 incident.
- Review: Review alert rules every month and delete the ones you no longer need.
Applying This in Practice

Recommended learning resources include the official documentation, which is always the most up to date; online courses from Coursera, Udemy, and edX; good YouTube channels in both Thai and English; and communities such as Discord, Reddit, and Stack Overflow, where you can trade experience with developers around the world.
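What is Opsgenie?
Opsgenie is Atlassian's incident management service. It handles alert routing, on-call schedules, and escalation, notifies responders over multiple channels (push, SMS, voice call, email, Slack), offers 200+ integrations, and has a mobile app.
How do you set up an on-call schedule?
Create a schedule with a weekly, daily, or custom rotation. Use overrides for holidays and restrictions for off-hours coverage, assign primary and secondary responders, link the schedule to an escalation policy, and manage the calendar from the web or mobile app.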
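The weekly rotation described earlier (Alice, Bob, and Charlie, handing over every Monday at 09:00 ICT) reduces to integer arithmetic on elapsed weeks. This is a sketch under assumptions: the start date is an arbitrary Monday chosen for illustration, `on_call` is a hypothetical helper, and ICT is modeled as a fixed UTC+7 offset. Opsgenie computes this for you; the code only shows the logic.

```python
from datetime import datetime, timedelta, timezone

ICT = timezone(timedelta(hours=7))  # Indochina Time, fixed UTC+7
ROTATION_START = datetime(2024, 1, 1, 9, 0, tzinfo=ICT)  # assumed: a Monday, 09:00
PARTICIPANTS = ["Alice", "Bob", "Charlie"]


def on_call(at):
    """Return (primary, secondary) for a timezone-aware datetime."""
    weeks_elapsed = (at - ROTATION_START) // timedelta(weeks=1)
    primary = PARTICIPANTS[weeks_elapsed % len(PARTICIPANTS)]
    # The secondary is next week's primary.
    secondary = PARTICIPANTS[(weeks_elapsed + 1) % len(PARTICIPANTS)]
    return primary, secondary


print(on_call(datetime(2024, 1, 3, 12, 0, tzinfo=ICT)))   # ('Alice', 'Bob')
print(on_call(datetime(2024, 1, 8, 9, 0, tzinfo=ICT)))    # ('Bob', 'Charlie')
print(on_call(datetime(2024, 1, 17, 12, 0, tzinfo=ICT)))  # ('Charlie', 'Alice')
```

Making the secondary next week's primary means each person shadows a week before leading it, which matches the Week 1 to Week 3 layout in the schedule.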
How do you reduce alert noise?
Deduplicate repeated alerts, correlate related ones, tune thresholds, separate alerts by priority, use maintenance windows, auto-close transient alerts, and review your rules every month.
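Correlation can be pictured as grouping alerts from the same host that arrive close together, using the 5-minute window mentioned earlier. The sketch below is one simple way to do that grouping locally; `correlate`, the window anchored at the first alert of a group, and the sample alerts are all assumptions for illustration, not Opsgenie's own algorithm.

```python
from datetime import datetime, timedelta


def correlate(alerts, window=timedelta(minutes=5)):
    """Group alerts per host when they fall within `window` of the group's first alert."""
    groups = []
    open_group = {}  # host -> most recent group for that host
    for alert in sorted(alerts, key=lambda a: a["time"]):
        group = open_group.get(alert["host"])
        if group is not None and alert["time"] - group["start"] <= window:
            group["alerts"].append(alert["name"])
        else:
            group = {"host": alert["host"], "start": alert["time"], "alerts": [alert["name"]]}
            groups.append(group)
            open_group[alert["host"]] = group
    return groups


# Hypothetical sample alerts.
sample = [
    {"host": "web-01", "name": "CPU Critical", "time": datetime(2024, 1, 1, 10, 0)},
    {"host": "web-01", "name": "Memory Critical", "time": datetime(2024, 1, 1, 10, 3)},
    {"host": "db-01", "name": "DB Replication", "time": datetime(2024, 1, 1, 10, 4)},
    {"host": "web-01", "name": "Disk Warning", "time": datetime(2024, 1, 1, 10, 7)},
]

groups = correlate(sample)
for g in groups:
    print(f"{g['host']} @ {g['start']:%H:%M}: {g['alerts']}")
```

The CPU and memory alerts on web-01 land in one group, while the disk warning at 10:07 is outside the window and starts a new one. A sliding window (measured from the latest alert rather than the first) would merge longer bursts at the cost of groups that can stay open indefinitely.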
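How do you set up an escalation policy?
Step 1 notifies the primary on-call by push and SMS, step 2 calls the secondary, step 3 reaches the team lead, and step 4 reaches the manager. Set a wait of 5 to 30 minutes between steps, configure repeats, and stop escalating once the alert is acknowledged or closed.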
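To see how the wait times interact with acknowledgement, the sketch below lists who would have been notified by the time someone acknowledges a P1 alert. It uses the 0, 5, 15, and 30 minute steps from the P1 policy in this article, measures every wait from alert creation, and ignores the repeat setting. `Step` and `notified_before_ack` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    wait_min: int  # minutes after alert creation
    notify: str


P1_STEPS = [
    Step(0, "On-call Primary"),
    Step(5, "On-call Secondary"),
    Step(15, "Team Lead"),
    Step(30, "Engineering Manager"),
]


def notified_before_ack(steps: List[Step], acked_after_min: Optional[float]) -> List[str]:
    """Return who is notified before the alert is acknowledged.

    acked_after_min=None means the alert is never acknowledged.
    The first step (wait 0) always fires because it runs at creation.
    """
    return [
        s.notify
        for s in steps
        if acked_after_min is None or s.wait_min == 0 or s.wait_min < acked_after_min
    ]


print(notified_before_ack(P1_STEPS, 7))     # ['On-call Primary', 'On-call Secondary']
print(notified_before_ack(P1_STEPS, None))  # all four steps
```

Running this against your real acknowledgement times is a quick check on whether the waits are sensible: if most P1 alerts are acknowledged in about 3 minutes, a 5-minute wait before step 2 rarely pages the secondary.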
Summary
Good Opsgenie alerting in production combines a fair on-call schedule, a clear escalation policy, deduplication and correlation, explicit priorities, and ongoing noise reduction. Track MTTA and MTTR, and run postmortems so each incident improves the setup.





