Opsgenie Alert
Best practices for Opsgenie alerting in production: incident management, on-call schedules, escalation policies, integrations, noise reduction, deduplication, correlation, and priority-based routing.
| Feature | Opsgenie | PagerDuty | VictorOps | Grafana OnCall |
|---|---|---|---|---|
| Pricing | $9-29/user | $21-41/user | $9-49/user | Free OSS |
| Integrations | 200+ | 700+ | 100+ | 50+ |
| Mobile App | Excellent | Excellent | Good | Fair |
| Jira Integration | Native | Plugin | Plugin | Plugin |
| On-call | Excellent | Excellent | Good | Good |
| Best for | Atlassian Stack | Enterprise | DevOps | Budget/OSS |
Alert Configuration
# === Opsgenie Alert Best Practices ===
# API — Create Alert
# curl -X POST https://api.opsgenie.com/v2/alerts \
# -H "Content-Type: application/json" \
# -H "Authorization: GenieKey YOUR_API_KEY" \
# -d '{
# "message": "CPU usage critical on web-server-01",
# "alias": "cpu-critical-web-01",
# "description": "CPU usage exceeded 95% for 5 minutes",
# "responders": [{"type": "team", "name": "Platform Team"}],
# "tags": ["critical", "infrastructure", "cpu"],
# "priority": "P1",
# "entity": "web-server-01",
# "source": "Prometheus",
# "details": {
# "cpu_usage": "97%",
# "duration": "5 minutes",
# "host": "web-server-01",
# "region": "ap-southeast-1"
# }
# }'
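The same create-alert call can be made from Python. The endpoint, headers, and field names follow the Opsgenie Alert API v2 request shown in the curl example above; `build_alert_payload` is a hypothetical helper, and the HTTP request itself is left commented out because it needs a real GenieKey:

```python
import json

# Opsgenie Alert API v2 create-alert endpoint (from the curl example)
OPSGENIE_URL = "https://api.opsgenie.com/v2/alerts"

def build_alert_payload(message, alias, priority="P3", team=None,
                        tags=None, entity="", source="", details=None):
    """Assemble the JSON body for a create-alert request."""
    payload = {
        "message": message,    # short, human-readable summary
        "alias": alias,        # dedup key: same alias -> same open alert
        "priority": priority,  # P1 (critical) .. P5 (informational)
        "entity": entity,
        "source": source,
        "tags": tags or [],
        "details": details or {},
    }
    if team:
        payload["responders"] = [{"type": "team", "name": team}]
    return payload

payload = build_alert_payload(
    message="CPU usage critical on web-server-01",
    alias="cpu-critical-web-01",
    priority="P1",
    team="Platform Team",
    tags=["critical", "infrastructure", "cpu"],
    entity="web-server-01",
    source="Prometheus",
    details={"cpu_usage": "97%", "region": "ap-southeast-1"},
)
print(json.dumps(payload, indent=2))

# To actually send the alert (requires a valid GenieKey):
# import urllib.request
# req = urllib.request.Request(
#     OPSGENIE_URL, data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json",
#              "Authorization": "GenieKey YOUR_API_KEY"})
# urllib.request.urlopen(req)
```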
# Prometheus AlertManager Integration
# alertmanager.yml:
# receivers:
# - name: opsgenie-critical
# opsgenie_configs:
# - api_key: YOUR_API_KEY
# message: '{{ .CommonLabels.alertname }}'
# priority: P1
# tags: 'critical,{{ .CommonLabels.severity }}'
# responders:
# - type: team
# name: Platform Team
#
# - name: opsgenie-warning
# opsgenie_configs:
# - api_key: YOUR_API_KEY
# priority: P3
# responders:
# - type: team
# name: Platform Team
from dataclasses import dataclass

@dataclass
class AlertRule:
    name: str
    source: str
    condition: str
    priority: str
    team: str
    dedup_key: str

rules = [
    AlertRule("CPU Critical", "Prometheus", "cpu_usage > 95% for 5m", "P1", "Platform", "cpu-{host}"),
    AlertRule("Memory Critical", "Prometheus", "memory_usage > 90% for 5m", "P1", "Platform", "mem-{host}"),
    AlertRule("Disk Warning", "Prometheus", "disk_usage > 80%", "P3", "Platform", "disk-{host}-{mount}"),
    AlertRule("API Error Rate", "Datadog", "error_rate > 5% for 3m", "P2", "Backend", "api-error-{service}"),
    AlertRule("Service Down", "Uptime Kuma", "HTTP check failed 3x", "P1", "On-call", "down-{service}"),
    AlertRule("SSL Expiry", "Custom", "cert_days_left < 14", "P3", "Platform", "ssl-{domain}"),
    AlertRule("DB Replication", "CloudWatch", "replica_lag > 30s", "P2", "DBA", "db-lag-{instance}"),
]

print("=== Alert Rules ===")
for r in rules:
    print(f"  [{r.priority}] {r.name} | Source: {r.source}")
    print(f"      Condition: {r.condition}")
    print(f"      Team: {r.team} | Dedup: {r.dedup_key}")
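The `dedup_key` column above is a plain `str.format` template. A minimal sketch of rendering one into a concrete Opsgenie alias (the label values here are made up; real ones would come from the alert payload):

```python
def render_alias(template: str, **labels: str) -> str:
    """Fill a dedup-key template with alert labels to build the alias."""
    return template.format(**labels)

# Example labels, purely illustrative
print(render_alias("cpu-{host}", host="web-server-01"))
print(render_alias("disk-{host}-{mount}", host="db-01", mount="/var"))
```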
On-call and Escalation
# === On-call Schedule and Escalation ===
# Schedule: Platform Team — Weekly Rotation
# Week 1: Alice (Primary) + Bob (Secondary)
# Week 2: Bob (Primary) + Charlie (Secondary)
# Week 3: Charlie (Primary) + Alice (Secondary)
# Rotation: Every Monday 09:00 ICT
# Escalation Policy: Critical (P1)
# Step 1 (0 min): Notify On-call Primary → Push + SMS
# Step 2 (5 min): Notify On-call Secondary → Push + SMS + Call
# Step 3 (15 min): Notify Team Lead → Push + SMS + Call + Email
# Step 4 (30 min): Notify Engineering Manager → All channels
# Repeat: 3 times
# Escalation Policy: Warning (P3)
# Step 1 (0 min): Notify On-call Primary → Push only
# Step 2 (30 min): Notify On-call Secondary → Push + Email
# No further escalation — review in morning standup
@dataclass
class EscalationStep:
    step: int
    wait_min: int
    notify: str
    channels: str
    condition: str

p1_escalation = [
    EscalationStep(1, 0, "On-call Primary", "Push + SMS", "Alert created"),
    EscalationStep(2, 5, "On-call Secondary", "Push + SMS + Call", "Not acknowledged"),
    EscalationStep(3, 15, "Team Lead", "Push + SMS + Call + Email", "Not acknowledged"),
    EscalationStep(4, 30, "Engineering Manager", "All channels", "Not acknowledged"),
]

print("=== P1 Escalation Policy ===")
for s in p1_escalation:
    print(f"  Step {s.step} (+{s.wait_min}min): {s.notify}")
    print(f"      Channels: {s.channels} | When: {s.condition}")
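The policy above can be sanity-checked with a small simulation. This sketch assumes a step fires once its wait time passes without an acknowledgement; `Step` and `fired_steps` are illustrative names, not Opsgenie API objects:

```python
from dataclasses import dataclass

@dataclass
class Step:
    wait_min: int
    notify: str

# P1 policy from above, restated so the sketch is self-contained
p1 = [
    Step(0, "On-call Primary"),
    Step(5, "On-call Secondary"),
    Step(15, "Team Lead"),
    Step(30, "Engineering Manager"),
]

def fired_steps(policy, ack_after_min):
    """Responders notified before the alert was acknowledged
    (a step fires once its wait time elapses unacknowledged)."""
    return [s.notify for s in policy if s.wait_min < ack_after_min]

print(fired_steps(p1, 3.2))   # ack within MTTA -> only the primary pages
print(fired_steps(p1, 20))    # slow ack -> three levels get notified
```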
# Noise Reduction Config
noise_reduction = {
    "Deduplication": "Use the alias field as the key to merge duplicate alerts; cuts volume ~40%",
    "Correlation": "Group alerts from the same host within a 5-minute window",
    "Threshold Tuning": "Raise thresholds from 80% to 90% to cut false positives",
    "Maintenance Window": "Suppress alerts during 30-minute deployment windows",
    "Auto-close": "Close transient alerts automatically if they recover within 2 minutes",
    "Priority Filtering": "P4-P5 go to email only; no push, no phone calls",
}

print("\nNoise Reduction Strategies:")
for k, v in noise_reduction.items():
    print(f"  [{k}]: {v}")
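To see where the deduplication figure comes from, here is a toy illustration: repeated notifications with the same alias collapse into a single open alert whose count increments. The aliases below are invented for the example:

```python
from collections import Counter

# Six raw notifications arriving from monitoring
incoming = [
    "cpu-web-01", "cpu-web-01", "mem-web-01",
    "cpu-web-01", "disk-db-01", "mem-web-01",
]

# alias -> occurrence count, mimicking Opsgenie's alias-based dedup
open_alerts = Counter(incoming)
print(f"{len(incoming)} notifications -> {len(open_alerts)} open alerts")
for alias, count in open_alerts.items():
    print(f"  {alias}: count={count}")
```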
Metrics and Improvement
# === Incident Metrics ===
@dataclass
class IncidentMetric:
    metric: str
    current: str
    target: str
    trend: str

metrics = [
    IncidentMetric("MTTA (Mean Time to Acknowledge)", "3.2 min", "<5 min", "Good"),
    IncidentMetric("MTTR (Mean Time to Resolve)", "28 min", "<30 min", "Borderline"),
    IncidentMetric("P1 Incidents/month", "4", "<5", "Good"),
    IncidentMetric("P2 Incidents/month", "12", "<15", "Good"),
    IncidentMetric("Alert Noise Ratio", "15%", "<10%", "Needs work"),
    IncidentMetric("Escalation Rate", "8%", "<10%", "Good"),
    IncidentMetric("On-call Burnout Score", "3.5/10", "<4/10", "Acceptable"),
    IncidentMetric("Postmortem Completion", "92%", "100%", "Needs work"),
]

print("Incident Metrics:")
for m in metrics:
    print(f"  [{m.trend}] {m.metric}: {m.current} (Target: {m.target})")
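MTTA and MTTR in the table are averages over incident timestamps. A minimal sketch of computing them from created/acknowledged/resolved times (the incident data here is invented):

```python
from datetime import datetime

# (created, acknowledged, resolved) timestamps per incident
incidents = [
    ("2024-05-01 10:00", "2024-05-01 10:03", "2024-05-01 10:25"),
    ("2024-05-02 14:00", "2024-05-02 14:04", "2024-05-02 14:35"),
]

def minutes_between(a: str, b: str) -> float:
    """Elapsed minutes from timestamp a to timestamp b."""
    fmt = "%Y-%m-%d %H:%M"
    return (datetime.strptime(b, fmt) - datetime.strptime(a, fmt)).total_seconds() / 60

# MTTA: mean created->acknowledged; MTTR: mean created->resolved
mtta = sum(minutes_between(c, a) for c, a, _ in incidents) / len(incidents)
mttr = sum(minutes_between(c, r) for c, _, r in incidents) / len(incidents)
print(f"MTTA: {mtta:.1f} min, MTTR: {mttr:.1f} min")
```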
improvement_actions = {
    "Week 1": "Tune CPU alert threshold 80→90%; cuts noise 25%",
    "Week 2": "Add dedup key to all Prometheus alerts; cuts duplicates 40%",
    "Week 3": "Create maintenance windows for deployments; cuts false alerts",
    "Week 4": "Review and close 15 stale alert rules; cleaner config",
    "Month 2": "Implement correlation rules for multi-service alerts",
    "Month 3": "Add auto-remediation for disk full (auto cleanup)",
}

print("\nImprovement Roadmap:")
for k, v in improvement_actions.items():
    print(f"  [{k}]: {v}")
Tips
- Dedup: use the alias field as the dedup key on every alert; cuts duplicates ~40%
- Priority: define P1-P5 clearly; P1 gets a phone call, P5 email only
- Rotation: rotate on-call weekly to reduce burnout
- Postmortem: run a postmortem for every P1 and P2 incident
- Review: review alert rules monthly and delete rules that are no longer needed
Putting This into Practice
Recommended learning resources include the official documentation (always the most up to date), online courses from Coursera, Udemy, and edX, quality YouTube channels in both Thai and English, and communities such as Discord, Reddit, and Stack Overflow for exchanging experience with developers worldwide.
What is Opsgenie?
Opsgenie is Atlassian's incident management platform. It routes alerts to on-call schedules and escalation policies, notifies responders over multiple channels (push, SMS, phone call, email, Slack), offers 200+ integrations, and ships a mobile app.
How do I set up an on-call schedule?
Create a schedule with a weekly, daily, or custom rotation; add overrides for holidays, restrictions for off-hours, and primary/secondary responders tied to an escalation policy. Schedules appear as a calendar on both web and mobile.
How do I reduce alert noise?
Deduplicate repeated alerts, correlate related ones, tune thresholds, separate by priority, use maintenance windows, auto-close transient alerts, and review rules monthly.
How do I configure an escalation policy?
Step 1 notifies the on-call primary via push and SMS; step 2 adds a phone call to the secondary; step 3 escalates to the team lead; step 4 to the manager. Use wait times of 5-30 minutes between steps and repeat until the alert is acknowledged or closed.
Summary
Opsgenie alert best practices cover on-call schedules, escalation policies, deduplication, correlation, priority routing, noise reduction, and tracking MTTA/MTTR with postmortems for production incident management.
