Better Uptime MLOps Workflow

What Better Uptime Is and How to Use It to Monitor an MLOps Pipeline

Better Uptime is an incident management platform that combines uptime monitoring, status pages, and on-call scheduling in one place. Applied to an MLOps workflow, it helps detect problems ranging from a model serving endpoint going down and inference latency spikes to stalled data pipelines, all of which are common in production ML systems.

MLOps differs from general-purpose monitoring in that it has to cover both the infrastructure layer (server, GPU, memory) and ML-specific metrics (model accuracy drift, prediction latency, feature store freshness). Better Uptime handles the infrastructure side; for ML metrics, pair it with Prometheus/Grafana.

Setting Up Better Uptime Monitors for ML Endpoints

Create monitors for the model serving API, covering both the health check and the inference endpoint.


# Better Uptime API: create monitors through the API
# Install httpie for convenient API calls
sudo apt install httpie

# Create a keyword monitor for the model health endpoint
# (an incident opens if the word "healthy" is missing from the response)
http POST https://betteruptime.com/api/v2/monitors \
  Authorization:"Bearer YOUR_API_TOKEN" \
  monitor_type="keyword" \
  url="https://ml-api.example.com/health" \
  required_keyword="healthy" \
  check_frequency:=30 \
  request_timeout:=15 \
  regions:='["us","eu","as"]' \
  confirmation_period:=0 \
  pronounceable_name="ML Model Health Check"



# Create a monitor for the inference endpoint
# (checks the status code and fails if the response exceeds request_timeout)
http POST https://betteruptime.com/api/v2/monitors \
  Authorization:"Bearer YOUR_API_TOKEN" \
  monitor_type="expected_status_code" \
  url="https://ml-api.example.com/predict" \
  expected_status_codes:='[200]' \
  check_frequency:=60 \
  request_timeout:=30 \
  http_method="POST" \
  request_headers:='[{"name":"Content-Type","value":"application/json"}]' \
  request_body='{"features":[1.0,2.0,3.0]}' \
  pronounceable_name="ML Inference Endpoint"
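
Monitor definitions can also be kept in version control and sent from Python rather than httpie. The following is a minimal sketch using only the standard library. The helper name `build_monitor_request` is an assumption for illustration, and the payload simply repeats the fields of the inference-endpoint monitor in this section. The code only constructs the request; sending it is left to the caller.

```python
import json
import urllib.request

API_URL = "https://betteruptime.com/api/v2/monitors"  # endpoint used in this article


def build_monitor_request(token: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request that creates a monitor.

    Hypothetical helper, not part of any Better Uptime SDK.
    """
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Same fields as the inference-endpoint monitor in this section
inference_monitor = {
    "monitor_type": "expected_status_code",
    "url": "https://ml-api.example.com/predict",
    "expected_status_codes": [200],
    "check_frequency": 60,
    "request_timeout": 30,
    "http_method": "POST",
    "request_headers": [{"name": "Content-Type", "value": "application/json"}],
    "request_body": json.dumps({"features": [1.0, 2.0, 3.0]}),
    "pronounceable_name": "ML Inference Endpoint",
}

request = build_monitor_request("YOUR_API_TOKEN", inference_monitor)
# To actually create the monitor: urllib.request.urlopen(request, timeout=15)
```

Building the payload as a dictionary keeps numbers and arrays as real JSON types, which avoids the quoting pitfalls of passing them on a command line.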

Building an MLOps Pipeline with Health Checks

Design the ML pipeline with a health check at every stage so that Better Uptime can monitor it end to end.

# ml_pipeline/serve.py: FastAPI model serving with health endpoints
import os
import time

import joblib
import numpy as np
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI(title="ML Model Service")

# Model state, populated at startup
MODEL_PATH = os.getenv("MODEL_PATH", "./models/production/model.joblib")
model = None
model_loaded_at = None
# Counters live in process memory, so each uvicorn worker keeps its own
prediction_count = 0
error_count = 0


@app.on_event("startup")
async def load_model():
    global model, model_loaded_at
    model = joblib.load(MODEL_PATH)
    model_loaded_at = time.time()
    print(f"Model loaded from {MODEL_PATH}")


class PredictRequest(BaseModel):
    features: list[float]


class HealthResponse(BaseModel):
    status: str
    model_loaded: bool
    uptime_seconds: float
    prediction_count: int
    error_rate: float


@app.get("/health", response_model=HealthResponse)
async def health_check():
    """Better Uptime calls this endpoint every 30 seconds."""
    uptime = time.time() - model_loaded_at if model_loaded_at else 0
    err_rate = error_count / max(prediction_count, 1)

    # Return 503 so the monitor sees a failure when the service is degraded
    if model is None:
        raise HTTPException(status_code=503, detail="Model not loaded")
    if err_rate > 0.1:
        raise HTTPException(status_code=503, detail=f"High error rate: {err_rate:.2%}")

    return HealthResponse(
        status="healthy",
        model_loaded=True,
        uptime_seconds=uptime,
        prediction_count=prediction_count,
        error_rate=err_rate,
    )


@app.post("/predict")
async def predict(req: PredictRequest):
    global prediction_count, error_count
    prediction_count += 1
    try:
        # Reshape the flat feature list into a single-row 2D array
        features = np.array(req.features).reshape(1, -1)
        start = time.perf_counter()
        result = model.predict(features)
        latency = time.perf_counter() - start

        return {
            "prediction": result.tolist(),
            "latency_ms": round(latency * 1000, 2),
            "model_version": os.getenv("MODEL_VERSION", "unknown"),
        }
    except Exception as e:
        error_count += 1
        raise HTTPException(status_code=500, detail=str(e))
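
One limitation of the `/health` logic above is that `error_rate` is cumulative since startup: a burst of early errors can keep the endpoint unhealthy long after the service has recovered, and a long healthy history can hide a fresh failure. A sliding window is one way to address this. The sketch below is an illustration under stated assumptions, not part of the original service: the class name `SlidingErrorRate` and the 60-second window are hypothetical.

```python
import time
from collections import deque
from typing import Optional


class SlidingErrorRate:
    """Track the error rate over the last `window_seconds` (hypothetical helper)."""

    def __init__(self, window_seconds: float = 60.0):
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, is_error) pairs, oldest first

    def record(self, is_error: bool, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        self._events.append((now, is_error))
        self._evict(now)

    def rate(self, now: Optional[float] = None) -> float:
        now = time.monotonic() if now is None else now
        self._evict(now)
        if not self._events:
            return 0.0
        errors = sum(1 for _, is_error in self._events if is_error)
        return errors / len(self._events)

    def _evict(self, now: float) -> None:
        # Drop events that have fallen out of the window
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()


# Explicit timestamps make the behavior easy to follow
tracker = SlidingErrorRate(window_seconds=60)
tracker.record(is_error=True, now=0.0)
tracker.record(is_error=False, now=30.0)
rate_inside_window = tracker.rate(now=45.0)  # both events still counted
rate_after_expiry = tracker.rate(now=70.0)   # the error at t=0 has expired
```

In the service, `record` would be called once per request in `/predict`, and `/health` would compare `rate()` against the 0.1 threshold instead of the cumulative ratio.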

# Dockerfile for model serving
FROM python:3.11-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY ml_pipeline/ ./ml_pipeline/
COPY models/ ./models/

ENV MODEL_PATH=/app/models/production/model.joblib
ENV MODEL_VERSION=v1.2.0

EXPOSE 8000
CMD ["uvicorn", "ml_pipeline.serve:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "4"]

Setting Up an On-Call Schedule for the MLOps Team

# On-call setup through the Better Uptime API
# The MLOps team rotates on-call duty weekly

# Create an escalation policy
http POST https://betteruptime.com/api/v2/policies \
  Authorization:"Bearer YOUR_API_TOKEN" \
  name="MLOps Escalation" \
  repeat_count:=3 \
  repeat_delay:=300 \
  steps:='[
    {
      "type": "escalation_step",
      "wait_before": 0,
      "urgency_id": null,
      "step_members": [
        {"type": "current_on_call", "on_call_calendar_id": "CALENDAR_ID"}
      ]
    },
    {
      "type": "escalation_step",
      "wait_before": 300,
      "step_members": [
        {"type": "all_slack_integrations"}
      ]
    },
    {
      "type": "escalation_step",
      "wait_before": 600,
      "step_members": [
        {"type": "entire_team"}
      ]
    }
  ]'

Data Pipeline Monitoring with a Custom Heartbeat

For a batch processing pipeline that has no HTTP endpoint, use a heartbeat monitor instead. If the pipeline does not send a heartbeat within the configured window, an alert fires immediately.
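
To make the timing concrete, the sketch below models how a heartbeat deadline could be evaluated. It assumes a heartbeat counts as missed once more than `period + grace` seconds have passed since the last ping, which is a simplification and not a description of Better Uptime's internals. The function name `is_overdue` is hypothetical; the values 86400 and 3600 match the daily retrain heartbeat configured in this section.

```python
def is_overdue(last_ping: float, now: float, period: int, grace: int) -> bool:
    """Return True when the assumed deadline (period + grace) has passed."""
    return (now - last_ping) > (period + grace)


PERIOD = 86400  # expect one heartbeat per day
GRACE = 3600    # tolerate up to one hour of delay

last_ping = 0.0
# 24h30m after the last ping: late, but still inside the grace window
overdue_at_24h30m = is_overdue(last_ping, now=86400 + 1800, period=PERIOD, grace=GRACE)
# 25h and 1s after the last ping: the grace window has run out
overdue_at_25h01s = is_overdue(last_ping, now=86400 + 3601, period=PERIOD, grace=GRACE)
```

A retrain job whose duration varies by tens of minutes would therefore not trigger an alert, while a run that never finishes would.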


# Create a heartbeat monitor
http POST https://betteruptime.com/api/v2/heartbeats \
  Authorization:"Bearer YOUR_API_TOKEN" \
  name="Daily Model Retrain Pipeline" \
  period:=86400 \
  grace:=3600 \
  pronounceable_name="Model Retrain Heartbeat"

# The response contains the heartbeat URL, for example:
# https://betteruptime.com/api/v1/heartbeat/xxxxx

# ml_pipeline/train.py: training pipeline that sends a heartbeat
import os
import sys

import joblib
import pandas as pd
import requests
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

HEARTBEAT_URL = os.getenv("HEARTBEAT_URL", "https://betteruptime.com/api/v1/heartbeat/xxxxx")
ALERT_WEBHOOK = os.getenv("ALERT_WEBHOOK")


def send_heartbeat():
    """Send a heartbeat to tell Better Uptime the pipeline is still running."""
    try:
        response = requests.get(HEARTBEAT_URL, timeout=10)
        response.raise_for_status()  # treat 4xx/5xx as a failed heartbeat
        print("Heartbeat sent successfully")
    except Exception as e:
        print(f"Failed to send heartbeat: {e}")


def alert_failure(message):
    """Report an incident through the Better Uptime API when the pipeline fails."""
    if ALERT_WEBHOOK:
        requests.post(
            ALERT_WEBHOOK,
            json={
                "requester_email": "mlops@company.com",
                "name": "Model Retrain Failed",
                "summary": message,
                "description": f"Pipeline error: {message}",
            },
            timeout=10,  # do not let a hung alert call block the job
        )


def train_pipeline():
    print("=== Starting Model Retrain Pipeline ===")

    # 1. Load data
    print("[1/5] Loading training data...")
    df = pd.read_parquet("/data/features/latest/")
    print(f"  Loaded {len(df)} rows, {len(df.columns)} features")

    # 2. Split data
    print("[2/5] Splitting data...")
    X = df.drop(columns=["target"])
    y = df["target"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    # 3. Train model
    print("[3/5] Training model...")
    model = GradientBoostingClassifier(
        n_estimators=200,
        max_depth=6,
        learning_rate=0.1,
        subsample=0.8,
        random_state=42,
    )
    model.fit(X_train, y_train)

    # 4. Evaluate
    print("[4/5] Evaluating model...")
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average="weighted")
    print(f"  Accuracy: {accuracy:.4f}")
    print(f"  F1 Score: {f1:.4f}")

    # Check that the model is good enough to deploy
    MIN_ACCURACY = 0.85
    if accuracy < MIN_ACCURACY:
        msg = f"Model accuracy {accuracy:.4f} below threshold {MIN_ACCURACY}"
        alert_failure(msg)
        sys.exit(1)  # no heartbeat is sent, so the heartbeat monitor will also alert

    # 5. Save model
    print("[5/5] Saving model...")
    version = os.getenv("MODEL_VERSION", "dev")
    output_path = f"/models/{version}/model.joblib"
    os.makedirs(os.path.dirname(output_path), exist_ok=True)
    joblib.dump(model, output_path)
    print(f"  Saved to {output_path}")

    # Send the heartbeat only after a successful run
    send_heartbeat()
    print("=== Pipeline completed successfully ===")


if __name__ == "__main__":
    try:
        train_pipeline()
    except Exception as e:
        alert_failure(str(e))
        raise

Creating a Status Page for ML Services

# Create a status page showing the state of all ML services
http POST https://betteruptime.com/api/v2/status-pages \
  Authorization:"Bearer YOUR_API_TOKEN" \
  company_name="ML Platform" \
  company_url="https://ml-platform.example.com" \
  subdomain="ml-status" \
  timezone="Asia/Bangkok" \
  subscribable:=true

# Add a monitor as a resource on the status page
http POST https://betteruptime.com/api/v2/status-pages/STATUS_PAGE_ID/resources \
  Authorization:"Bearer YOUR_API_TOKEN" \
  resource_id="MONITOR_ID_1" \
  resource_type="Monitor" \
  public_name="Model Inference API" \
  widget_type="history"

# Add the heartbeat
http POST https://betteruptime.com/api/v2/status-pages/STATUS_PAGE_ID/resources \
  Authorization:"Bearer YOUR_API_TOKEN" \
  resource_id="HEARTBEAT_ID_1" \
  resource_type="Heartbeat" \
  public_name="Daily Model Retrain" \
  widget_type="plain"

Docker Compose for the Full MLOps Stack

# docker-compose.prod.yml
version: '3.8'
services:
  ml-api:
    build: .
    ports:
      - "8000:8000"
    environment:
      - MODEL_PATH=/app/models/production/model.joblib
      - MODEL_VERSION=v1.2.0
    volumes:
      - model_store:/app/models
    deploy:
      # Three replicas cannot all bind host port 8000 under plain
      # "docker compose up"; run under Swarm or drop the fixed host port
      replicas: 3
      resources:
        limits:
          cpus: '2'
          memory: 4G
    healthcheck:
      # python:3.11-slim does not ship curl, so probe with Python instead
      test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8000/health', timeout=5)"]
      interval: 30s
      timeout: 10s
      retries: 3

  prometheus:
    image: prom/prometheus:v2.50.0
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'

  grafana:
    image: grafana/grafana:10.3.0
    ports:
      - "3000:3000"
    environment:
      # Example value only; supply the real password from an env file or secret
      - GF_SECURITY_ADMIN_PASSWORD=SecureGrafanaPass!
    volumes:
      - grafana_data:/var/lib/grafana

volumes:
  model_store:
  grafana_data:

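# prometheus.yml: scrape ML metrics
global:
  scrape_interval: 15s

scrape_configs:
  # Requires the serving app to expose a /metrics endpoint
  - job_name: 'ml-api'
    static_configs:
      - targets: ['ml-api:8000']
    metrics_path: '/metrics'

  # Assumes a node-exporter container reachable under this name;
  # it is not defined in docker-compose.prod.yml
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']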
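
The `ml-api` scrape job expects the serving app to answer on `/metrics`, which the `serve.py` shown earlier does not define. In practice the `prometheus_client` library is the usual choice. The standard-library sketch below only illustrates what such an endpoint would need to return, using the Prometheus text exposition format. The metric names `ml_predictions_total` and `ml_prediction_errors_total` and the function `render_metrics` are hypothetical.

```python
def render_metrics(prediction_count: int, error_count: int) -> str:
    """Render two counters in the Prometheus text exposition format."""
    lines = [
        "# HELP ml_predictions_total Total number of prediction requests.",
        "# TYPE ml_predictions_total counter",
        f"ml_predictions_total {prediction_count}",
        "# HELP ml_prediction_errors_total Total number of failed predictions.",
        "# TYPE ml_prediction_errors_total counter",
        f"ml_prediction_errors_total {error_count}",
    ]
    # The format is line-oriented and the final line must end with a newline
    return "\n".join(lines) + "\n"


body = render_metrics(prediction_count=120, error_count=3)
```

In the FastAPI app, this string could be returned from a `GET /metrics` route as a plain-text response, fed by the same counters that `/health` already reads.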

FAQ: Frequently Asked Questions

Q: How does Better Uptime differ from PagerDuty?


A: Better Uptime bundles uptime monitoring, status pages, and on-call scheduling in a single product and is priced well below PagerDuty, which makes it a good fit for small to mid-sized teams. PagerDuty suits large organizations that need complex workflows and a large number of integrations.

Q: What needs to be monitored in MLOps?

A: At a minimum, monitor four levels: (1) Infrastructure: CPU, GPU, memory, disk; (2) Application: API latency, error rate, throughput; (3) Data: freshness, schema drift, missing values; (4) Model: prediction drift, accuracy degradation, feature importance shift.
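
As an illustration of the model level, prediction drift can be quantified by comparing the distribution of recent model scores with a baseline. The sketch below uses the Population Stability Index (PSI), which is one common choice and not something the article itself prescribes. The function name `psi`, the 10-bin default, and the sample data are assumptions for illustration.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index between two samples of model scores."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 when all values are equal

    def fractions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = min(int((v - lo) / width), bins - 1)  # clamp the top edge
            counts[idx] += 1
        # Small floor so empty bins do not produce log(0)
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline = [i / 100 for i in range(100)]       # scores spread evenly over [0, 1)
shifted = [0.5 + i / 200 for i in range(100)]  # scores squeezed into [0.5, 1)

psi_same = psi(baseline, baseline)
psi_shifted = psi(baseline, shifted)
```

A frequently quoted rule of thumb treats a PSI above roughly 0.25 as a significant shift, but the threshold should be tuned per model. A scheduled job could compute this value and export it to Prometheus alongside the infrastructure metrics.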


Q: What is the difference between a heartbeat monitor and an HTTP monitor?

A: With an HTTP monitor, Better Uptime is the side that calls in to check, so it is used for web services that expose an endpoint. With a heartbeat monitor, the application is the side that sends a signal out, so it is used for batch jobs or cron tasks that have no HTTP endpoint to call.

Q: Is the Better Uptime free plan enough for MLOps?


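A: The free plan includes 10 monitors with a 3-minute check interval, which is enough if you only have 2-3 ML endpoints. Heartbeat monitors, status pages, and on-call scheduling require a paid plan, starting at $20/month. Note that the 30-second and 60-second check frequencies used in the examples above are shorter than the free plan's 3-minute interval.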