Opsgenie MLOps Alert
Alerting for ML pipelines with Opsgenie: on-call escalation, Prometheus/Grafana integration, and runbooks that cut MTTR for training, serving, drift, and latency incidents in production.
| Alert Type | Source | Priority | Team | Action |
|---|---|---|---|---|
| Training Job Failed | Airflow / Databricks | P2 | ML Platform | Check logs, retry job |
| Model Quality Drop | Custom Monitor | P2 | Data Science | Check data drift, retrain |
| Data Drift Detected | Prometheus / Custom | P3 | Data Science | Check feature distribution |
| Serving Latency High | Prometheus | P1 | ML Infra | Scale up, optimize model |
| GPU Node Down | Kubernetes / CloudWatch | P1 | ML Infra | Fail over, replace node |
| Feature Store Stale | Custom Heartbeat | P3 | ML Platform | Check pipeline, retry |
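The routing matrix above can also be kept in version control as a small lookup table. The sketch below mirrors the table; the dict and function names are illustrative, not an Opsgenie API:

```python
# Map each alert type to its Opsgenie routing: (priority, owning team).
# Mirrors the routing table above; names here are our own convention.
ALERT_ROUTING = {
    "training_job_failed":  ("P2", "ML Platform"),
    "model_quality_drop":   ("P2", "Data Science"),
    "data_drift_detected":  ("P3", "Data Science"),
    "serving_latency_high": ("P1", "ML Infra"),
    "gpu_node_down":        ("P1", "ML Infra"),
    "feature_store_stale":  ("P3", "ML Platform"),
}

def route(alert_type: str) -> tuple[str, str]:
    """Return (priority, team) for an alert type; default to P3 / ML Platform."""
    return ALERT_ROUTING.get(alert_type, ("P3", "ML Platform"))

print(route("serving_latency_high"))  # ('P1', 'ML Infra')
```

Keeping routing in code makes it reviewable in pull requests and easy to sync into Opsgenie via its API.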
Alert Configuration
# === Opsgenie Alert Setup ===
# Prometheus Alertmanager → Opsgenie integration
# alertmanager.yml:
#   receivers:
#     - name: 'opsgenie-ml'
#       opsgenie_configs:
#         - api_key: 'YOUR_OPSGENIE_API_KEY'
#           message: '{{ .GroupLabels.alertname }}'
#           priority: '{{ if eq .GroupLabels.severity "critical" }}P1{{ else }}P3{{ end }}'
#           tags: 'ml,{{ .GroupLabels.team }}'
#           description: '{{ .CommonAnnotations.description }}'
#
# Prometheus alert rules:
#   groups:
#     - name: ml-alerts
#       rules:
#         - alert: ModelServingLatencyHigh
#           expr: histogram_quantile(0.99, rate(model_serving_duration_seconds_bucket[5m])) > 0.5
#           for: 5m
#           labels:
#             severity: critical
#             team: ml-infra
#           annotations:
#             description: "Model serving P99 latency > 500ms"
#
#         - alert: ModelAccuracyDrop
#           expr: model_accuracy < 0.85
#           for: 15m
#           labels:
#             severity: warning
#             team: data-science
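The latency rule above fires when the P99 computed from histogram buckets exceeds 0.5s. As a simplified sketch of what `histogram_quantile()` does under the hood (synthetic bucket data; Prometheus's real implementation handles more edge cases):

```python
# Simplified version of Prometheus's histogram_quantile(): find the bucket
# containing the target rank and linearly interpolate within it.
def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """buckets: sorted (upper_bound_seconds, cumulative_count) pairs, last is +Inf."""
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if upper_bound == float("inf"):
                return lower_bound  # fall back to the last finite bound
            # Linear interpolation inside the bucket.
            return lower_bound + (upper_bound - lower_bound) * (
                (rank - lower_count) / (count - lower_count)
            )
        lower_bound, lower_count = upper_bound, count
    return lower_bound

# Synthetic example: 1000 requests; 980 finished under 0.4s, 995 under 0.8s.
buckets = [(0.1, 700.0), (0.4, 980.0), (0.8, 995.0), (float("inf"), 1000.0)]
p99 = histogram_quantile(0.99, buckets)
print(f"P99 = {p99:.3f}s, alert fires: {p99 > 0.5}")  # P99 ≈ 0.667s → fires
```

This is why bucket boundaries matter: the quantile is interpolated between bucket edges, so a threshold near a bucket boundary gives more accurate alerting.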
from dataclasses import dataclass

@dataclass
class AlertRule:
    alert: str
    metric: str
    threshold: str
    priority: str
    team: str
    runbook: str

rules = [
    AlertRule("ModelServingLatencyHigh",
              "histogram_quantile(0.99, model_serving_duration)",
              "> 500ms for 5m",
              "P1 Critical",
              "ML Infra",
              "1. Check pod CPU/memory 2. Scale HPA 3. Check model size"),
    AlertRule("ModelAccuracyDrop",
              "model_accuracy (custom metric)",
              "< 0.85 for 15m",
              "P2 High",
              "Data Science",
              "1. Check data drift 2. Compare feature distributions 3. Retrain"),
    AlertRule("TrainingJobFailed",
              "airflow_task_status == 'failed'",
              "Any failure",
              "P2 High",
              "ML Platform",
              "1. Check logs 2. Check GPU memory 3. Retry job"),
    AlertRule("FeatureStoreStale",
              "time() - feature_last_updated_timestamp",
              "> 2 hours",
              "P3 Medium",
              "ML Platform",
              "1. Check source data pipeline 2. Retry feature job"),
    AlertRule("GPUNodeNotReady",
              "kube_node_status_condition{condition='Ready'}",
              "!= 1 for 5m",
              "P1 Critical",
              "ML Infra",
              "1. Drain node 2. Check GPU driver 3. Replace node"),
]

print("=== ML Alert Rules ===")
for r in rules:
    print(f" [{r.alert}] Priority: {r.priority}")
    print(f"   Metric: {r.metric}")
    print(f"   Threshold: {r.threshold}")
    print(f"   Team: {r.team}")
    print(f"   Runbook: {r.runbook}")
On-call & Escalation
# === On-call Schedule ===
from dataclasses import dataclass

@dataclass
class OnCallSchedule:
    team: str
    rotation: str
    primary: str
    secondary: str
    escalation: str
    quiet_hours: str

schedules = [
    OnCallSchedule("ML Infrastructure",
                   "Weekly rotation (Mon 09:00)",
                   "ML Infra Engineer (5 min to acknowledge)",
                   "Senior ML Infra Engineer (10 min)",
                   "ML Engineering Manager (15 min)",
                   "P3-P5 email only 22:00-08:00"),
    OnCallSchedule("ML Platform",
                   "Weekly rotation (Mon 09:00)",
                   "ML Platform Engineer (5 min)",
                   "Senior Platform Engineer (10 min)",
                   "Platform Lead (15 min)",
                   "P3-P5 email only 22:00-08:00"),
    OnCallSchedule("Data Science",
                   "Daily rotation",
                   "Data Scientist (10 min to acknowledge)",
                   "Senior Data Scientist (15 min)",
                   "DS Manager (20 min)",
                   "P4-P5 suppressed 20:00-09:00"),
]

print("=== On-call Schedules ===")
for s in schedules:
    print(f"\n [{s.team}] Rotation: {s.rotation}")
    print(f"   Primary: {s.primary}")
    print(f"   Secondary: {s.secondary}")
    print(f"   Escalation: {s.escalation}")
    print(f"   Quiet Hours: {s.quiet_hours}")
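The escalation timings can be sketched as a simple chain walk. This assumes one interpretation of the ML Infra row: the secondary is paged once the primary's 5-minute acknowledge window expires, and the manager once the 15-minute window expires (the step names and timings below are taken from the schedule above, the function is illustrative):

```python
# Escalation chain for an unacknowledged P1 alert: (page_at_minute, target).
ESCALATION_CHAIN = [
    (0,  "ML Infra Engineer (primary)"),
    (5,  "Senior ML Infra Engineer (secondary)"),
    (15, "ML Engineering Manager"),
]

def current_target(minutes_unacked: float) -> str:
    """Return the most recently paged target given minutes without an ack."""
    target = ESCALATION_CHAIN[0][1]
    for page_at, who in ESCALATION_CHAIN:
        if minutes_unacked >= page_at:
            target = who
    return target

print(current_target(3))   # still within the primary's window
print(current_target(12))  # primary missed the 5-min ack → secondary paged
```

Opsgenie escalation policies implement exactly this kind of timed chain; encoding it in code is mainly useful for testing or documenting the policy.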
Automation & Metrics
# === Alert Automation ===
# Opsgenie API - create alert
# import requests
# url = "https://api.opsgenie.com/v2/alerts"
# headers = {"Authorization": "GenieKey YOUR_API_KEY"}
# payload = {
#     "message": "Model Accuracy Drop - churn_model",
#     "priority": "P2",
#     "tags": ["ml", "model-quality", "data-science"],
#     "details": {"model": "churn_model", "accuracy": "0.82", "baseline": "0.90"},
#     "entity": "churn_model",
#     "alias": "model-accuracy-churn",
#     "description": "Accuracy dropped from 0.90 to 0.82. Check data drift.",
#     "actions": ["Retrain", "Rollback", "Investigate"]
# }
# response = requests.post(url, json=payload, headers=headers)
from dataclasses import dataclass

@dataclass
class AlertMetric:
    metric: str
    target: str
    current: str
    improvement: str

metrics = [
    AlertMetric("MTTA (Mean Time to Acknowledge)",
                "< 5 minutes",
                "Check Opsgenie Analytics",
                "Tune notification channels: phone > push > email"),
    AlertMetric("MTTR (Mean Time to Resolve)",
                "< 30 minutes (P1), < 2 hours (P2)",
                "Check Opsgenie Reports",
                "Add runbooks and auto-remediation"),
    AlertMetric("False Positive Rate",
                "< 5%",
                "Check alerts closed as false positives",
                "Tune thresholds monthly"),
    AlertMetric("Alert Volume",
                "< 20 alerts/week/team",
                "Check Opsgenie Dashboard",
                "Dedup, correlate, drop non-actionable alerts"),
    AlertMetric("Escalation Rate",
                "< 10%",
                "Check escalation count",
                "Train primaries on runbooks so they can resolve alone"),
]

print("=== Alert Metrics ===")
for m in metrics:
    print(f" [{m.metric}] Target: {m.target}")
    print(f"   Current: {m.current}")
    print(f"   Improve: {m.improvement}")
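MTTA and MTTR are simple averages over alert lifecycle timestamps. A minimal sketch with synthetic data (in practice Opsgenie Analytics computes these for you):

```python
# Compute MTTA/MTTR in minutes from (created, acknowledged, resolved) triples.
from datetime import datetime
from statistics import mean

# Synthetic alert lifecycle timestamps for illustration.
alerts = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 3), datetime(2024, 1, 1, 10, 25)),
    (datetime(2024, 1, 1, 14, 0), datetime(2024, 1, 1, 14, 7), datetime(2024, 1, 1, 14, 40)),
]

mtta = mean((ack - created).total_seconds() / 60 for created, ack, _ in alerts)
mttr = mean((resolved - created).total_seconds() / 60 for created, _, resolved in alerts)

print(f"MTTA: {mtta:.1f} min (target < 5)")   # (3 + 7) / 2 = 5.0
print(f"MTTR: {mttr:.1f} min (target < 30)")  # (25 + 40) / 2 = 32.5
```

In this example MTTA just meets the 5-minute target while MTTR misses the P1 target, which is exactly the case where runbooks and auto-remediation pay off.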
Tips
- Actionable: every alert must have a clear action; never alert for information only
- Runbook: attach a runbook to every alert to cut MTTR
- Dedup: use an alert key to prevent duplicate alerts
- Tune: adjust thresholds monthly to reduce false positives
- Automate: add auto-remediation for alerts that can be fixed automatically
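The dedup tip above maps directly onto Opsgenie's `alias` field: alerts sharing an alias are deduplicated into one open alert. A sketch of deriving a stable alias (the alias field is real Opsgenie behavior; the key format here is our own choice):

```python
# Derive a stable dedup key so repeated firings of the same underlying
# problem collapse into one open Opsgenie alert via the `alias` field.
def dedup_alias(alert_name: str, entity: str) -> str:
    return f"{alert_name}:{entity}".lower().replace(" ", "-")

# Both firings produce the same alias, so Opsgenie counts them as one alert.
a1 = dedup_alias("ModelAccuracyDrop", "churn_model")
a2 = dedup_alias("ModelAccuracyDrop", "churn_model")
print(a1, a1 == a2)  # modelaccuracydrop:churn_model True
```

The key design choice is to include only the identity of the problem (alert name, entity) in the alias, never volatile values like the current metric reading, or dedup will silently stop working.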
What is Opsgenie?
Opsgenie is Atlassian's incident-management service covering alerting, on-call scheduling, and escalation. It offers 200+ integrations (Prometheus, Grafana, Slack, and more), notifications via SMS and phone, heartbeat monitoring, and a REST API, with a free plan and an Essentials tier.
How do you alert on MLOps?
Cover training job failures, model quality drops, data drift, serving latency, stale features, and GPU node outages, wiring sources such as Prometheus Alertmanager, MLflow, Airflow, and CloudWatch into Opsgenie.
How do you organize on-call?
Use weekly rotations with a primary (5 min to acknowledge), a secondary (10 min), and manager escalation for each team (ML Infra, ML Platform, Data Science), plus quiet hours, follow-the-sun coverage where needed, and runbooks for every alert.
What are alert best practices?
Make every alert actionable, assign priorities, deduplicate and correlate, tune thresholds, attach runbooks, automate remediation, review monthly, and track MTTA, MTTR, false-positive rate (< 5%), and alert volume.
Summary
Opsgenie ties MLOps alerting together: Prometheus-driven alerts across training, serving, and drift, per-team on-call escalation, runbooks that cut MTTA and MTTR, and auto-remediation for production ML pipelines.
