
Opsgenie Alert MLOps Workflow

2025-08-15 · อ. บอม — SiamCafe.net · 10,007 words


A practical workflow for MLOps alerting with Opsgenie: routing Prometheus/Grafana alerts from the ML pipeline (training, serving, data drift, latency) to on-call teams with escalation policies and runbooks, to cut MTTR in production.

| Alert Type | Source | Priority | Team | Action |
|---|---|---|---|---|
| Training Job Failed | Airflow / Databricks | P2 | ML Platform | Check logs, retry job |
| Model Quality Drop | Custom Monitor | P2 | Data Science | Check data drift, retrain |
| Data Drift Detected | Prometheus / Custom | P3 | Data Science | Check feature distribution |
| Serving Latency High | Prometheus | P1 | ML Infra | Scale up, optimize model |
| GPU Node Down | Kubernetes / CloudWatch | P1 | ML Infra | Failover, replace node |
| Feature Store Stale | Custom Heartbeat | P3 | ML Platform | Check pipeline, retry |
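The routing table above can be sketched as a small lookup. The alert names and the `route_alert` helper are illustrative, not part of any Opsgenie API; the priority/team pairs follow the table:

```python
# Map each alert type to its Opsgenie priority and owning team,
# following the routing table above. Alert names are assumed.
ROUTES = {
    "TrainingJobFailed":  ("P2", "ML Platform"),
    "ModelQualityDrop":   ("P2", "Data Science"),
    "DataDriftDetected":  ("P3", "Data Science"),
    "ServingLatencyHigh": ("P1", "ML Infra"),
    "GPUNodeDown":        ("P1", "ML Infra"),
    "FeatureStoreStale":  ("P3", "ML Platform"),
}

def route_alert(name: str) -> tuple[str, str]:
    """Return (priority, team); unknown alerts default to P3 / ML Platform."""
    return ROUTES.get(name, ("P3", "ML Platform"))

print(route_alert("ServingLatencyHigh"))  # ('P1', 'ML Infra')
```

Defaulting unknown alerts to a low priority keeps a misnamed alert from paging someone at 03:00.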

Alert Configuration

# === Opsgenie Alert Setup ===

# Prometheus AlertManager → Opsgenie Integration
# alertmanager.yml:
# receivers:
#   - name: 'opsgenie-ml'
#     opsgenie_configs:
#       - api_key: 'YOUR_OPSGENIE_API_KEY'
#         message: '{{ .GroupLabels.alertname }}'
#         priority: '{{ if eq .GroupLabels.severity "critical" }}P1{{ else }}P3{{ end }}'
#         tags: 'ml,{{ .GroupLabels.team }}'
#         description: '{{ .CommonAnnotations.description }}'
#
# Prometheus Alert Rules:
# groups:
#   - name: ml-alerts
#     rules:
#       - alert: ModelServingLatencyHigh
#         expr: histogram_quantile(0.99, rate(model_serving_duration_seconds_bucket[5m])) > 0.5
#         for: 5m
#         labels:
#           severity: critical
#           team: ml-infra
#         annotations:
#           description: "Model serving P99 latency > 500ms"
#
#       - alert: ModelAccuracyDrop
#         expr: model_accuracy < 0.85
#         for: 15m
#         labels:
#           severity: warning
#           team: data-science
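To see what the `ModelServingLatencyHigh` expression actually evaluates, here is a minimal re-implementation of Prometheus's `histogram_quantile` over cumulative buckets. The bucket data is made up for illustration:

```python
def histogram_quantile(q, buckets):
    """Approximate Prometheus histogram_quantile.
    buckets: sorted list of (upper_bound_seconds, cumulative_count).
    Linearly interpolates within the bucket that contains the quantile,
    as Prometheus does."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            return prev_bound + (bound - prev_bound) * \
                (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# Illustrative data: 100 requests, cumulative counts per latency bound.
buckets = [(0.1, 50), (0.25, 80), (0.5, 95), (1.0, 100)]
p99 = histogram_quantile(0.99, buckets)
print(f"P99 = {p99:.2f}s")  # P99 = 0.90s -> above the 0.5s threshold, alert fires
```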

from dataclasses import dataclass

@dataclass
class AlertRule:
    alert: str
    metric: str
    threshold: str
    priority: str
    team: str
    runbook: str

rules = [
    AlertRule("ModelServingLatencyHigh",
        "histogram_quantile(0.99, model_serving_duration)",
        "> 500ms for 5m",
        "P1 Critical",
        "ML Infra",
        "1. Check pod CPU/memory 2. Scale HPA 3. Check model size"),
    AlertRule("ModelAccuracyDrop",
        "model_accuracy (custom metric)",
        "< 0.85 for 15m",
        "P2 High",
        "Data Science",
        "1. Check data drift 2. Compare feature distributions 3. Retrain"),
    AlertRule("TrainingJobFailed",
        "airflow_task_status == 'failed'",
        "Any failure",
        "P2 High",
        "ML Platform",
        "1. Check logs 2. Check GPU memory 3. Retry job"),
    AlertRule("FeatureStoreStale",
        "time() - feature_last_updated_timestamp",
        "> 2 hours",
        "P3 Medium",
        "ML Platform",
        "1. Check source data pipeline 2. Retry feature job"),
    AlertRule("GPUNodeNotReady",
        "kube_node_status_condition{condition='Ready'}",
        "!= 1 for 5m",
        "P1 Critical",
        "ML Infra",
        "1. Drain node 2. Check GPU driver 3. Replace node"),
]

print("=== ML Alert Rules ===")
for r in rules:
    print(f"  [{r.alert}] Priority: {r.priority}")
    print(f"    Metric: {r.metric}")
    print(f"    Threshold: {r.threshold}")
    print(f"    Team: {r.team}")
    print(f"    Runbook: {r.runbook}")

On-call & Escalation

# === On-call Schedule ===

@dataclass
class OnCallSchedule:
    team: str
    rotation: str
    primary: str
    secondary: str
    escalation: str
    quiet_hours: str

schedules = [
    OnCallSchedule("ML Infrastructure",
        "Weekly Rotation (Mon 09:00)",
        "ML Infra Engineer (5 min acknowledge)",
        "Senior ML Infra Engineer (10 min)",
        "ML Engineering Manager (15 min)",
        "P3-P5 Email Only 22:00-08:00"),
    OnCallSchedule("ML Platform",
        "Weekly Rotation (Mon 09:00)",
        "ML Platform Engineer (5 min)",
        "Senior Platform Engineer (10 min)",
        "Platform Lead (15 min)",
        "P3-P5 Email Only 22:00-08:00"),
    OnCallSchedule("Data Science",
        "Daily Rotation",
        "Data Scientist (10 min acknowledge)",
        "Senior Data Scientist (15 min)",
        "DS Manager (20 min)",
        "P4-P5 Suppress 20:00-09:00"),
]

print("=== On-call Schedules ===")
for s in schedules:
    print(f"\n  [{s.team}] Rotation: {s.rotation}")
    print(f"    Primary: {s.primary}")
    print(f"    Secondary: {s.secondary}")
    print(f"    Escalation: {s.escalation}")
    print(f"    Quiet Hours: {s.quiet_hours}")
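The escalation timings above can be sketched as a small timeline calculation: given each tier's acknowledge deadline, compute when each contact would be paged if nobody acknowledges. The deadlines are taken from the schedules above; the helper itself is illustrative:

```python
def escalation_timeline(deadlines_min):
    """deadlines_min: acknowledge deadline per tier, e.g. [5, 10, 15]
    for primary/secondary/manager. Returns minutes after the alert at
    which each tier gets notified when no one acknowledges."""
    times, t = [], 0
    for deadline in deadlines_min:
        times.append(t)
        t += deadline  # escalate once this tier's ack window lapses
    return times

# ML Infra chain: primary at t=0, secondary at t=5, manager at t=15.
print(escalation_timeline([5, 10, 15]))  # [0, 5, 15]
```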

Automation & Metrics

# === Alert Automation ===

# Opsgenie API - Create Alert
# import requests
# url = "https://api.opsgenie.com/v2/alerts"
# headers = {"Authorization": "GenieKey YOUR_API_KEY"}
# payload = {
#     "message": "Model Accuracy Drop - churn_model",
#     "priority": "P2",
#     "tags": ["ml", "model-quality", "data-science"],
#     "details": {"model": "churn_model", "accuracy": "0.82", "baseline": "0.90"},
#     "entity": "churn_model",
#     "alias": "model-accuracy-churn",
#     "description": "Accuracy dropped from 0.90 to 0.82. Check data drift.",
#     "actions": ["Retrain", "Rollback", "Investigate"]
# }
# response = requests.post(url, json=payload, headers=headers)
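The commented request above can be wrapped so the payload is built separately from the network call, which also makes it testable without a real GenieKey. The `build_alert_payload` helper and its field values are assumptions modeled on the example above; the `alias` field is what Opsgenie uses for deduplication:

```python
def build_alert_payload(model, accuracy, baseline, priority="P2"):
    """Assemble an Opsgenie v2 create-alert body for a model-quality drop.
    Sending it still requires a real GenieKey (see the request above)."""
    return {
        "message": f"Model Accuracy Drop - {model}",
        "priority": priority,
        "tags": ["ml", "model-quality", "data-science"],
        "details": {"model": model, "accuracy": str(accuracy),
                    "baseline": str(baseline)},
        "entity": model,
        # Dedup key: repeated posts update the open alert instead of re-paging.
        "alias": f"model-accuracy-{model}",
        "description": (f"Accuracy dropped from {baseline} to {accuracy}. "
                        "Check data drift."),
    }

# payload = build_alert_payload("churn_model", 0.82, 0.90)
# requests.post("https://api.opsgenie.com/v2/alerts", json=payload,
#               headers={"Authorization": "GenieKey YOUR_API_KEY"})
```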

@dataclass
class AlertMetric:
    metric: str
    target: str
    current: str
    improvement: str

metrics = [
    AlertMetric("MTTA (Mean Time to Acknowledge)",
        "< 5 minutes",
        "Measure via Opsgenie Analytics",
        "Tune notification channels: phone > push > email"),
    AlertMetric("MTTR (Mean Time to Resolve)",
        "< 30 minutes (P1), < 2 hours (P2)",
        "Measure via Opsgenie Reports",
        "Add runbooks and auto-remediation"),
    AlertMetric("False Positive Rate",
        "< 5%",
        "Count alerts closed as false positives",
        "Tune thresholds monthly"),
    AlertMetric("Alert Volume",
        "< 20 alerts/week/team",
        "Measure via the Opsgenie dashboard",
        "Deduplicate, correlate, drop non-actionable alerts"),
    AlertMetric("Escalation Rate",
        "< 10%",
        "Measure from escalation counts",
        "Runbook training so the primary can resolve alone"),
]

print("=== Alert Metrics ===")
for m in metrics:
    print(f"  [{m.metric}] Target: {m.target}")
    print(f"    Current: {m.current}")
    print(f"    Improve: {m.improvement}")
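MTTA and MTTR come straight out of Opsgenie's reports, but they are easy to compute yourself from alert timestamps. A minimal sketch, assuming each alert records created/acknowledged/closed epoch seconds (the sample data is made up):

```python
def mean_minutes(pairs):
    """Average (end - start) in minutes over (start, end) pairs."""
    return sum(end - start for start, end in pairs) / len(pairs) / 60

alerts = [  # (created, acknowledged, closed) epoch seconds - illustrative
    (0, 120, 900),
    (0, 300, 1800),
]
mtta = mean_minutes([(c, a) for c, a, _ in alerts])
mttr = mean_minutes([(c, x) for c, _, x in alerts])
print(f"MTTA={mtta:.1f} min, MTTR={mttr:.1f} min")  # MTTA=3.5 min, MTTR=22.5 min
```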


What is Opsgenie?

Opsgenie is Atlassian's incident management platform covering alerting, on-call scheduling, and escalation. It ships 200+ integrations (Prometheus, Grafana, Slack), notifies via push, SMS, and phone, supports heartbeat monitoring and a REST API, and offers free and Essentials plans.

How do MLOps alerts flow?

Typical sources are failed training jobs, model quality drops, data drift, high serving latency, stale feature stores, and GPU nodes going down, routed into Opsgenie from Prometheus AlertManager, MLflow, Airflow, or CloudWatch.

How should on-call be organized?

Weekly rotations per team (ML Infra, ML Platform, Data Science), with a primary expected to acknowledge within 5 minutes, a secondary at 10 minutes, and manager escalation at 15 minutes. Add quiet hours for low priorities, follow-the-sun coverage for distributed teams, and a runbook per alert.

What are alert best practices?

Keep every alert actionable with the right priority, deduplicate and correlate related alerts, tune thresholds and review alerts monthly, attach runbooks, automate remediation where possible, and track MTTA, MTTR, alert volume, and a false positive rate below 5%.

Summary

Opsgenie ties the MLOps alerting loop together: Prometheus rules watch the ML pipeline (training, serving, drift), alerts route to the right on-call team with escalation, and runbooks plus auto-remediation drive MTTA and MTTR down in production.
