Opsgenie Productivity
Tags: Opsgenie, Alert, Team Productivity, Alert Routing, On-call Optimization, Noise Reduction, MTTA, MTTR, Escalation, Deduplication, Jira, Slack Integration, Production
| Metric | Before Opsgenie | After Opsgenie | Improvement | Target |
|---|---|---|---|---|
| MTTA | 15 min | 3 min | -80% | < 5 min |
| MTTR | 45 min | 22 min | -51% | < 30 min |
| Alert Volume/week | 500 | 180 | -64% | < 200 |
| Escalation Rate | 25% | 8% | -68% | < 10% |
| False Positive Rate | 35% | 12% | -66% | < 10% |
| Burnout Score | 7/10 | 3.5/10 | -50% | < 4/10 |
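The Improvement column follows directly from the Before/After values (negative means a reduction); a quick sanity check:

```python
# Verify the improvement percentages in the metrics table above.
# Improvement = (after - before) / before * 100; negative = reduction.

def improvement_pct(before: float, after: float) -> int:
    """Percent change from before to after, rounded to the nearest integer."""
    return round((after - before) / before * 100)

metrics = {
    "MTTA (min)":         (15, 3),
    "MTTR (min)":         (45, 22),
    "Alerts/week":        (500, 180),
    "Escalation (%)":     (25, 8),
    "False positive (%)": (35, 12),
    "Burnout (/10)":      (7, 3.5),
}

for name, (before, after) in metrics.items():
    print(f"{name}: {improvement_pct(before, after)}%")
```

Every row reproduces the table: MTTA -80%, MTTR -51%, alert volume -64%, escalation -68%, false positives -66%, burnout -50%.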
Alert Optimization
# === Opsgenie Alert Optimization ===
# Alert Policy — Deduplication
# Opsgenie → Settings → Alert Policies
# Policy: Deduplicate by alias
# Condition: When alias matches existing alert
# Action: Add count, update message, don't create new
# Result: 500 alerts/week → 180 alerts/week
# Alert Policy — Auto-close Transient
# Condition: Alert not acknowledged within 5 min AND source = "monitoring"
# AND priority = P4 or P5
# Action: Close alert automatically
# Result: Reduces noise for low-priority transient issues
# Notification Policy — Priority-based
# P1 Critical: Push + SMS + Phone Call (immediate)
# P2 High: Push + SMS (immediate)
# P3 Medium: Push only (immediate)
# P4 Low: Email only (batch every 30 min)
# P5 Info: Email digest (daily)
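The priority-based notification policy above can be sketched as a lookup table; the channel names and batching intervals mirror the policy comments, and the helper function is illustrative, not an Opsgenie API:

```python
# Priority -> notification channel mapping from the policy above.
# Channel names/batching are taken from the policy comments; this is a
# local model of the policy, not an Opsgenie API call.

NOTIFICATION_POLICY = {
    "P1": {"channels": ["push", "sms", "phone"], "delay": "immediate"},
    "P2": {"channels": ["push", "sms"],          "delay": "immediate"},
    "P3": {"channels": ["push"],                 "delay": "immediate"},
    "P4": {"channels": ["email"],                "delay": "batch every 30 min"},
    "P5": {"channels": ["email digest"],         "delay": "daily"},
}

def channels_for(priority: str) -> list[str]:
    """Return the notification channels configured for a priority."""
    return NOTIFICATION_POLICY[priority]["channels"]

print(channels_for("P1"))  # ['push', 'sms', 'phone']
```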
from dataclasses import dataclass

@dataclass
class AlertPolicy:
    name: str
    condition: str
    action: str
    impact: str

policies = [
    AlertPolicy("Dedup by Alias", "alias matches existing open alert",
                "Increment count, update description", "Alert volume -60%"),
    AlertPolicy("Auto-close Transient", "P4/P5 not ack'd in 5min + recovery signal",
                "Close alert automatically", "Noise -25%"),
    AlertPolicy("Correlation", "Same host within 5 min window",
                "Group into parent alert", "Related alerts grouped"),
    AlertPolicy("Priority Override", "Source=Prometheus AND severity=warning",
                "Set priority to P3", "Consistent priority mapping"),
    AlertPolicy("Maintenance Suppress", "During maintenance window",
                "Suppress alert, log only", "Zero false alerts during deploy"),
    AlertPolicy("Enrich with Runbook", "All P1/P2 alerts",
                "Attach runbook URL from tag", "Faster resolution"),
]

print("=== Alert Policies ===")
for p in policies:
    print(f"  [{p.name}]")
    print(f"    When: {p.condition}")
    print(f"    Do: {p.action}")
    print(f"    Impact: {p.impact}")
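The "Dedup by Alias" policy is the biggest volume reducer; a minimal simulation of its behaviour (this models the policy locally, it is not the Opsgenie API, and the aliases are invented):

```python
# Simulate alias-based deduplication: an incoming alert whose alias matches
# an open alert increments its count instead of creating a new alert.

incoming = ["db-disk-full", "api-5xx", "db-disk-full", "db-disk-full", "api-5xx"]

open_alerts: dict[str, int] = {}
for alias in incoming:
    # First occurrence opens an alert; repeats only bump the count.
    open_alerts[alias] = open_alerts.get(alias, 0) + 1

print(f"Raw alerts: {len(incoming)}, open after dedup: {len(open_alerts)}")
for alias, count in open_alerts.items():
    print(f"  {alias}: count={count}")
```

Five raw alerts collapse to two open alerts, which is the same mechanism that took the weekly volume from 500 to 180.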
Team Metrics Dashboard
# === Team Performance Metrics ===
@dataclass
class TeamMetric:
    team: str
    mtta: str
    mttr: str
    alerts_week: int
    escalation_rate: str
    postmortem_rate: str
    burnout_score: str

teams = [
    TeamMetric("Platform", "2.5 min", "18 min", 45, "5%", "100%", "3.2/10"),
    TeamMetric("Backend", "3.8 min", "25 min", 62, "8%", "90%", "4.1/10"),
    TeamMetric("Frontend", "4.2 min", "15 min", 28, "3%", "100%", "2.5/10"),
    TeamMetric("Data", "5.1 min", "35 min", 38, "12%", "85%", "4.5/10"),
    TeamMetric("Security", "1.8 min", "22 min", 15, "2%", "100%", "2.8/10"),
]

print("=== Team Performance ===")
for t in teams:
    print(f"  [{t.team}] MTTA: {t.mtta} | MTTR: {t.mttr}")
    print(f"    Alerts/week: {t.alerts_week} | Escalation: {t.escalation_rate}")
    print(f"    Postmortem: {t.postmortem_rate} | Burnout: {t.burnout_score}")
# Improvement Actions per Team
improvements = {
    "Platform": "MTTA excellent, continue current approach",
    "Backend": "MTTR borderline — add more runbooks, reduce alert volume",
    "Frontend": "Good overall — low alerts, fast resolution",
    "Data": "MTTR too high — investigate slow queries, add auto-remediation",
    "Security": "MTTA best in class — maintain 100% postmortem rate",
}

print("\nImprovement Actions:")
for k, v in improvements.items():
    print(f"  [{k}]: {v}")
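The MTTA/MTTR figures in the dashboard are means over alert lifecycle timestamps: acknowledged minus created, and closed minus created. A small sketch with invented timestamps:

```python
# Derive MTTA/MTTR from alert timestamps: MTTA = mean(acknowledged - created),
# MTTR = mean(closed - created). The three alerts below are hypothetical.

from datetime import datetime
from statistics import mean

# (created, acknowledged, closed) per alert
alerts = [
    (datetime(2024, 1, 1, 9, 0),  datetime(2024, 1, 1, 9, 2),  datetime(2024, 1, 1, 9, 20)),
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 4), datetime(2024, 1, 1, 10, 30)),
    (datetime(2024, 1, 1, 11, 0), datetime(2024, 1, 1, 11, 3), datetime(2024, 1, 1, 11, 16)),
]

mtta = mean((ack - created).total_seconds() / 60 for created, ack, _ in alerts)
mttr = mean((closed - created).total_seconds() / 60 for created, _, closed in alerts)
print(f"MTTA: {mtta:.1f} min, MTTR: {mttr:.1f} min")  # MTTA: 3.0 min, MTTR: 22.0 min
```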
Integration and Automation
# === Opsgenie Integrations ===
# Terraform — Opsgenie Config as Code
# resource "opsgenie_team" "platform" {
#   name = "Platform Team"
#   member {
#     id   = opsgenie_user.alice.id
#     role = "admin"
#   }
# }
#
# resource "opsgenie_schedule" "platform_oncall" {
#   name          = "Platform On-call"
#   owner_team_id = opsgenie_team.platform.id
#   timezone      = "Asia/Bangkok"
# }
#
# resource "opsgenie_escalation" "platform_p1" {
#   name          = "Platform P1 Escalation"
#   owner_team_id = opsgenie_team.platform.id
#   rules {
#     condition   = "if-not-acked"
#     notify_type = "default"
#     delay       = 5
#     recipient {
#       type = "schedule"
#       id   = opsgenie_schedule.platform_oncall.id
#     }
#   }
# }
# Jira Integration
# Trigger: P1 or P2 alert created
# Action: Create Jira Issue
# Project: OPS
# Type: Incident
# Summary: {{alert.message}}
# Priority: {{alert.priority}}
# Labels: [incident, {{alert.source}}]
# Assignee: On-call person
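The Jira action above fills `{{alert.field}}` placeholders from alert data. A tiny rendering sketch of that substitution; the field names follow the template, while the alert values are invented:

```python
# Render {{alert.field}} placeholders, as used in the Jira issue template.

import re

alert = {"message": "DB connection pool exhausted", "priority": "P1", "source": "prometheus"}

def render(template: str, alert: dict) -> str:
    """Replace each {{alert.field}} with the matching alert value."""
    return re.sub(r"\{\{alert\.(\w+)\}\}", lambda m: alert[m.group(1)], template)

summary = render("{{alert.message}}", alert)
labels = ["incident", render("{{alert.source}}", alert)]
print(summary, labels)
```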
@dataclass
class Integration:
    tool: str
    direction: str
    trigger: str
    action: str
    benefit: str

integrations = [
    Integration("Prometheus", "Inbound", "AlertManager fires", "Create Opsgenie alert", "Unified alert management"),
    Integration("Datadog", "Inbound", "Monitor triggers", "Create alert with context", "APM + incident in one"),
    Integration("Jira", "Outbound", "P1/P2 alert created", "Create Jira incident issue", "Track resolution in Jira"),
    Integration("Slack", "Bidirectional", "Alert created/updated", "Post to channel + ack from Slack", "Team visibility"),
    Integration("Confluence", "Outbound", "P1 resolved", "Create postmortem template", "Consistent postmortems"),
    Integration("Terraform", "Config", "Git push", "Update Opsgenie config", "Config as Code"),
    Integration("GitHub Actions", "Outbound", "Alert with auto-remediate tag", "Trigger remediation workflow", "Auto-healing"),
]

print("Integrations:")
for i in integrations:
    print(f"  [{i.tool}] {i.direction}")
    print(f"    Trigger: {i.trigger}")
    print(f"    Action: {i.action}")
    print(f"    Benefit: {i.benefit}")
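Inbound integrations ultimately hit the Opsgenie Alert API (`POST /v2/alerts` with `GenieKey` authorization). A stdlib sketch that builds such a request without sending it; the API key and alert details are made up:

```python
# Build (but don't send) a request for the Opsgenie Alert API v2.
# Endpoint, GenieKey header, and message/alias/priority fields are from the
# public Alert API; the key and alert content here are hypothetical.

import json
import urllib.request

API_KEY = "xxxxxxxx-your-genie-key"  # hypothetical integration API key

payload = {
    "message": "High CPU on web-01",
    "alias": "web-01-high-cpu",  # the alias is what deduplication keys on
    "priority": "P2",
    "tags": ["cpu", "auto-routed"],
}

req = urllib.request.Request(
    "https://api.opsgenie.com/v2/alerts",
    data=json.dumps(payload).encode(),
    headers={
        "Content-Type": "application/json",
        "Authorization": f"GenieKey {API_KEY}",
    },
    method="POST",
)
print(req.get_method(), req.full_url)
# urllib.request.urlopen(req) would actually create the alert
```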
Tips
- Dedup: enable deduplication on every alert source to cut volume by about 60%
- Priority: use P1-P5 and map notification channels to each priority
- Runbook: attach a runbook URL to every alert to reduce MTTR
- Review: review alert rules monthly and delete the ones no longer needed
- Terraform: manage Opsgenie configuration as code with Terraform
How does Opsgenie improve team productivity?
Automatic alert routing sends each alert to the right team, on-call schedules make rotations unambiguous, escalation policies hand unacknowledged alerts onward, noise reduction removes duplicates, and Jira/Slack integration plus attached runbooks speed up resolution.
How do you reduce alert fatigue?
Deduplication merges 40-60% of duplicate alerts, correlation groups related ones, thresholds get tuned, P1-P5 priorities gate notification channels, maintenance windows suppress expected alerts, transient alerts auto-close, heartbeats catch silent failures, and rules are reviewed monthly.
How do you measure team effectiveness?
Track MTTA (target under 5 minutes), MTTR (under 30 minutes), declining alert volume, an escalation rate under 10%, a per-team burnout score, and a 100% postmortem rate for P1 incidents.
How does it integrate with other tools?
Automatic Jira issues, Slack channels, Prometheus AlertManager, Datadog APM, GitHub Actions workflows, Confluence postmortem pages, and Terraform for config as code.
Summary
Opsgenie raises team productivity for production incidents through alert routing, on-call schedules, noise reduction (deduplication and priorities), MTTA/MTTR tracking, and Jira, Slack, and Terraform integrations.
