Opsgenie MLOps Alert
Alerting for ML pipelines with Opsgenie: on-call escalation, Prometheus/Grafana integration, and runbooks that cut MTTR for training, serving, drift, and latency incidents in production.
| Alert Type | Source | Priority | Team | Action |
|---|---|---|---|---|
| Training Job Failed | Airflow / Databricks | P2 | ML Platform | Check logs, retry job |
| Model Quality Drop | Custom Monitor | P2 | Data Science | Check data drift, retrain |
| Data Drift Detected | Prometheus / Custom | P3 | Data Science | Check feature distribution |
| Serving Latency High | Prometheus | P1 | ML Infra | Scale up, optimize model |
| GPU Node Down | Kubernetes / CloudWatch | P1 | ML Infra | Fail over, replace node |
| Feature Store Stale | Custom Heartbeat | P3 | ML Platform | Check pipeline, retry |
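The routing matrix above can also be kept in version control as a small lookup table. The sketch below mirrors the table; the dict and function names are illustrative, not an Opsgenie API:

```python
# Map each alert type to its Opsgenie routing: (priority, owning team).
# Mirrors the routing table above; names here are our own convention.
ALERT_ROUTING = {
    "training_job_failed":  ("P2", "ML Platform"),
    "model_quality_drop":   ("P2", "Data Science"),
    "data_drift_detected":  ("P3", "Data Science"),
    "serving_latency_high": ("P1", "ML Infra"),
    "gpu_node_down":        ("P1", "ML Infra"),
    "feature_store_stale":  ("P3", "ML Platform"),
}

def route(alert_type: str) -> tuple[str, str]:
    """Return (priority, team) for an alert type; default to P3 / ML Platform."""
    return ALERT_ROUTING.get(alert_type, ("P3", "ML Platform"))

print(route("serving_latency_high"))  # ('P1', 'ML Infra')
```

Keeping routing in code makes it reviewable in pull requests and easy to sync into Opsgenie via its API.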
Alert Configuration
# === Opsgenie Alert Setup ===
# Prometheus Alertmanager → Opsgenie integration
# alertmanager.yml:
#   receivers:
#     - name: 'opsgenie-ml'
#       opsgenie_configs:
#         - api_key: 'YOUR_OPSGENIE_API_KEY'
#           message: '{{ .GroupLabels.alertname }}'
#           priority: '{{ if eq .GroupLabels.severity "critical" }}P1{{ else }}P3{{ end }}'
#           tags: 'ml,{{ .GroupLabels.team }}'
#           description: '{{ .CommonAnnotations.description }}'
#
# Prometheus alert rules:
#   groups:
#     - name: ml-alerts
#       rules:
#         - alert: ModelServingLatencyHigh
#           expr: histogram_quantile(0.99, rate(model_serving_duration_seconds_bucket[5m])) > 0.5
#           for: 5m
#           labels:
#             severity: critical
#             team: ml-infra
#           annotations:
#             description: "Model serving P99 latency > 500ms"
#
#         - alert: ModelAccuracyDrop
#           expr: model_accuracy < 0.85
#           for: 15m
#           labels:
#             severity: warning
#             team: data-science
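The latency rule above fires when the P99 computed from histogram buckets exceeds 0.5s. As a simplified sketch of what `histogram_quantile()` does under the hood (synthetic bucket data; Prometheus's real implementation handles more edge cases):

```python
# Simplified version of Prometheus's histogram_quantile(): find the bucket
# containing the target rank and linearly interpolate within it.
def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """buckets: sorted (upper_bound_seconds, cumulative_count) pairs, last is +Inf."""
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if upper_bound == float("inf"):
                return lower_bound  # fall back to the last finite bound
            # Linear interpolation inside the bucket.
            return lower_bound + (upper_bound - lower_bound) * (
                (rank - lower_count) / (count - lower_count)
            )
        lower_bound, lower_count = upper_bound, count
    return lower_bound

# Synthetic example: 1000 requests; 980 finished under 0.4s, 995 under 0.8s.
buckets = [(0.1, 700.0), (0.4, 980.0), (0.8, 995.0), (float("inf"), 1000.0)]
p99 = histogram_quantile(0.99, buckets)
print(f"P99 = {p99:.3f}s, alert fires: {p99 > 0.5}")  # P99 ≈ 0.667s → fires
```

This is why bucket boundaries matter: the quantile is interpolated between bucket edges, so a threshold near a bucket boundary gives more accurate alerting.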
from dataclasses import dataclass

@dataclass
class AlertRule:
    alert: str
    metric: str
    threshold: str
    priority: str
    team: str
    runbook: str

rules = [
    AlertRule("ModelServingLatencyHigh",
              "histogram_quantile(0.99, model_serving_duration)",
              "> 500ms for 5m",
              "P1 Critical",
              "ML Infra",
              "1. Check pod CPU/memory 2. Scale HPA 3. Check model size"),
    AlertRule("ModelAccuracyDrop",
              "model_accuracy (custom metric)",
              "< 0.85 for 15m",
              "P2 High",
              "Data Science",
              "1. Check data drift 2. Compare feature distributions 3. Retrain"),
    AlertRule("TrainingJobFailed",
              "airflow_task_status == 'failed'",
              "Any failure",
              "P2 High",
              "ML Platform",
              "1. Check logs 2. Check GPU memory 3. Retry job"),
    AlertRule("FeatureStoreStale",
              "time() - feature_last_updated_timestamp",
              "> 2 hours",
              "P3 Medium",
              "ML Platform",
              "1. Check source data pipeline 2. Retry feature job"),
    AlertRule("GPUNodeNotReady",
              "kube_node_status_condition{condition='Ready'}",
              "!= 1 for 5m",
              "P1 Critical",
              "ML Infra",
              "1. Drain node 2. Check GPU driver 3. Replace node"),
]

print("=== ML Alert Rules ===")
for r in rules:
    print(f" [{r.alert}] Priority: {r.priority}")
    print(f"   Metric: {r.metric}")
    print(f"   Threshold: {r.threshold}")
    print(f"   Team: {r.team}")
    print(f"   Runbook: {r.runbook}")
On-call & Escalation
# === On-call Schedule ===
from dataclasses import dataclass

@dataclass
class OnCallSchedule:
    team: str
    rotation: str
    primary: str
    secondary: str
    escalation: str
    quiet_hours: str

schedules = [
    OnCallSchedule("ML Infrastructure",
                   "Weekly rotation (Mon 09:00)",
                   "ML Infra Engineer (5 min to acknowledge)",
                   "Senior ML Infra Engineer (10 min)",
                   "ML Engineering Manager (15 min)",
                   "P3-P5 email only 22:00-08:00"),
    OnCallSchedule("ML Platform",
                   "Weekly rotation (Mon 09:00)",
                   "ML Platform Engineer (5 min)",
                   "Senior Platform Engineer (10 min)",
                   "Platform Lead (15 min)",
                   "P3-P5 email only 22:00-08:00"),
    OnCallSchedule("Data Science",
                   "Daily rotation",
                   "Data Scientist (10 min to acknowledge)",
                   "Senior Data Scientist (15 min)",
                   "DS Manager (20 min)",
                   "P4-P5 suppressed 20:00-09:00"),
]

print("=== On-call Schedules ===")
for s in schedules:
    print(f"\n [{s.team}] Rotation: {s.rotation}")
    print(f"   Primary: {s.primary}")
    print(f"   Secondary: {s.secondary}")
    print(f"   Escalation: {s.escalation}")
    print(f"   Quiet Hours: {s.quiet_hours}")
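The escalation timings can be sketched as a simple chain walk. This assumes one interpretation of the ML Infra row: the secondary is paged once the primary's 5-minute acknowledge window expires, and the manager once the 15-minute window expires (the step names and timings below are taken from the schedule above, the function is illustrative):

```python
# Escalation chain for an unacknowledged P1 alert: (page_at_minute, target).
ESCALATION_CHAIN = [
    (0,  "ML Infra Engineer (primary)"),
    (5,  "Senior ML Infra Engineer (secondary)"),
    (15, "ML Engineering Manager"),
]

def current_target(minutes_unacked: float) -> str:
    """Return the most recently paged target given minutes without an ack."""
    target = ESCALATION_CHAIN[0][1]
    for page_at, who in ESCALATION_CHAIN:
        if minutes_unacked >= page_at:
            target = who
    return target

print(current_target(3))   # still within the primary's window
print(current_target(12))  # primary missed the 5-min ack → secondary paged
```

Opsgenie escalation policies implement exactly this kind of timed chain; encoding it in code is mainly useful for testing or documenting the policy.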
Automation & Metrics
# === Alert Automation ===
# Opsgenie API - create alert
# import requests
# url = "https://api.opsgenie.com/v2/alerts"
# headers = {"Authorization": "GenieKey YOUR_API_KEY"}
# payload = {
#     "message": "Model Accuracy Drop - churn_model",
#     "priority": "P2",
#     "tags": ["ml", "model-quality", "data-science"],
#     "details": {"model": "churn_model", "accuracy": "0.82", "baseline": "0.90"},
#     "entity": "churn_model",
#     "alias": "model-accuracy-churn",
#     "description": "Accuracy dropped from 0.90 to 0.82. Check data drift.",
#     "actions": ["Retrain", "Rollback", "Investigate"]
# }
# response = requests.post(url, json=payload, headers=headers)
from dataclasses import dataclass

@dataclass
class AlertMetric:
    metric: str
    target: str
    current: str
    improvement: str

metrics = [
    AlertMetric("MTTA (Mean Time to Acknowledge)",
                "< 5 minutes",
                "Check Opsgenie Analytics",
                "Tune notification channels: phone > push > email"),
    AlertMetric("MTTR (Mean Time to Resolve)",
                "< 30 minutes (P1), < 2 hours (P2)",
                "Check Opsgenie Reports",
                "Add runbooks and auto-remediation"),
    AlertMetric("False Positive Rate",
                "< 5%",
                "Check alerts closed as false positives",
                "Tune thresholds monthly"),
    AlertMetric("Alert Volume",
                "< 20 alerts/week/team",
                "Check Opsgenie Dashboard",
                "Dedup, correlate, drop non-actionable alerts"),
    AlertMetric("Escalation Rate",
                "< 10%",
                "Check escalation count",
                "Train primaries on runbooks so they can resolve alone"),
]

print("=== Alert Metrics ===")
for m in metrics:
    print(f" [{m.metric}] Target: {m.target}")
    print(f"   Current: {m.current}")
    print(f"   Improve: {m.improvement}")
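MTTA and MTTR are simple averages over alert lifecycle timestamps. A minimal sketch with synthetic data (in practice Opsgenie Analytics computes these for you):

```python
# Compute MTTA/MTTR in minutes from (created, acknowledged, resolved) triples.
from datetime import datetime
from statistics import mean

# Synthetic alert lifecycle timestamps for illustration.
alerts = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 3), datetime(2024, 1, 1, 10, 25)),
    (datetime(2024, 1, 1, 14, 0), datetime(2024, 1, 1, 14, 7), datetime(2024, 1, 1, 14, 40)),
]

mtta = mean((ack - created).total_seconds() / 60 for created, ack, _ in alerts)
mttr = mean((resolved - created).total_seconds() / 60 for created, _, resolved in alerts)

print(f"MTTA: {mtta:.1f} min (target < 5)")   # (3 + 7) / 2 = 5.0
print(f"MTTR: {mttr:.1f} min (target < 30)")  # (25 + 40) / 2 = 32.5
```

In this example MTTA just meets the 5-minute target while MTTR misses the P1 target, which is exactly the case where runbooks and auto-remediation pay off.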
Tips
- Actionable: every alert must have a clear action; never alert for information only
- Runbook: attach a runbook to every alert to cut MTTR
- Dedup: use an alert key to prevent duplicate alerts
- Tune: adjust thresholds monthly to reduce false positives
- Automate: add auto-remediation for alerts that can be fixed automatically
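The dedup tip above maps directly onto Opsgenie's `alias` field: alerts sharing an alias are deduplicated into one open alert. A sketch of deriving a stable alias (the alias field is real Opsgenie behavior; the key format here is our own choice):

```python
# Derive a stable dedup key so repeated firings of the same underlying
# problem collapse into one open Opsgenie alert via the `alias` field.
def dedup_alias(alert_name: str, entity: str) -> str:
    return f"{alert_name}:{entity}".lower().replace(" ", "-")

# Both firings produce the same alias, so Opsgenie counts them as one alert.
a1 = dedup_alias("ModelAccuracyDrop", "churn_model")
a2 = dedup_alias("ModelAccuracyDrop", "churn_model")
print(a1, a1 == a2)  # modelaccuracydrop:churn_model True
```

The key design choice is to include only the identity of the problem (alert name, entity) in the alias, never volatile values like the current metric reading, or dedup will silently stop working.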
What is Opsgenie?
Opsgenie is Atlassian's incident-management service covering alerting, on-call scheduling, and escalation. It offers 200+ integrations (Prometheus, Grafana, Slack, and more), notifications via SMS and phone, heartbeat monitoring, and a REST API, with a free plan and an Essentials tier.
How do you alert on MLOps?
Cover training job failures, model quality drops, data drift, serving latency, stale features, and GPU node outages, wiring sources such as Prometheus Alertmanager, MLflow, Airflow, and CloudWatch into Opsgenie.
How do you organize on-call?
Use weekly rotations with a primary (5 min to acknowledge), a secondary (10 min), and manager escalation for each team (ML Infra, ML Platform, Data Science), plus quiet hours, follow-the-sun coverage where needed, and runbooks for every alert.
What are alert best practices?
Make every alert actionable, assign priorities, deduplicate and correlate, tune thresholds, attach runbooks, automate remediation, review monthly, and track MTTA, MTTR, false-positive rate (< 5%), and alert volume.
Summary
Opsgenie ties MLOps alerting together: Prometheus-driven alerts across training, serving, and drift, per-team on-call escalation, runbooks that cut MTTA and MTTR, and auto-remediation for production ML pipelines.
