SiamCafe.net Blog

Model Registry Chaos Engineering — Testing the Resilience of ML Systems

2026-03-14 · by Bom — SiamCafe.net · 1,839 words

What Is a Model Registry?

A Model Registry is a centralized system for managing machine learning models. It stores model versions, metadata, artifacts, lineage, and deployment status, letting data science and ML engineering teams manage the ML lifecycle systematically, from training all the way to production deployment.
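To make that concrete, here is a minimal in-memory sketch of what a registry tracks per model version: a version number, metrics, an artifact pointer, and a lifecycle stage. The class and field names are illustrative, not MLflow's actual schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ModelVersion:
    # One registry entry: version number, stage, metrics, artifact pointer
    name: str
    version: int
    stage: str = "None"          # None -> Staging -> Production -> Archived
    metrics: dict = field(default_factory=dict)
    artifact_uri: str = ""
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

class InMemoryRegistry:
    def __init__(self):
        self._versions = {}      # model name -> list of ModelVersion

    def register(self, name, metrics, artifact_uri):
        versions = self._versions.setdefault(name, [])
        mv = ModelVersion(name, len(versions) + 1,
                          metrics=metrics, artifact_uri=artifact_uri)
        versions.append(mv)
        return mv

    def promote(self, name, version, stage):
        for mv in self._versions[name]:
            # Only one version may hold the Production stage at a time
            if stage == "Production" and mv.stage == "Production":
                mv.stage = "Archived"
        self._versions[name][version - 1].stage = stage

    def production_version(self, name):
        return next(mv for mv in self._versions[name]
                    if mv.stage == "Production")

reg = InMemoryRegistry()
reg.register("fraud-detection-model", {"accuracy": 0.95}, "s3://bucket/v1")
reg.register("fraud-detection-model", {"accuracy": 0.97}, "s3://bucket/v2")
reg.promote("fraud-detection-model", 1, "Production")
reg.promote("fraud-detection-model", 2, "Production")
prod = reg.production_version("fraud-detection-model")
print(prod.version, prod.stage)  # prints: 2 Production
```

Promoting version 2 automatically archives version 1, which mirrors the stage-transition behavior real registries implement for the Production stage.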

Popular model registry tools include MLflow Model Registry (open source, the most widely used), Weights and Biases (W&B) Registry, Amazon SageMaker Model Registry, Google Vertex AI Model Registry, and Azure ML Model Registry. Each has different strengths; MLflow is a good starting point because it is open source and flexible.

Chaos engineering matters for ML systems because ML pipelines have many potential points of failure, from data ingestion, feature extraction, model serving, and prediction caching down to the fallback mechanisms themselves. Running chaos experiments uncovers these weaknesses before they cause real incidents.

Installing a Model Registry

Setup MLflow Model Registry

# === MLflow Model Registry Setup ===

# 1. Install MLflow
pip install mlflow boto3 psycopg2-binary scikit-learn

# 2. Start MLflow Server with PostgreSQL backend
cat > docker-compose.yml << 'EOF'
version: '3.8'
services:
  mlflow:
    image: ghcr.io/mlflow/mlflow:v2.16.0
    ports:
      - "5000:5000"
    environment:
      - MLFLOW_BACKEND_STORE_URI=postgresql://mlflow:mlflow_pass@postgres:5432/mlflow
      - MLFLOW_DEFAULT_ARTIFACT_ROOT=s3://mlflow-artifacts/
      - AWS_ACCESS_KEY_ID=
      - AWS_SECRET_ACCESS_KEY=
    command: >
      mlflow server
      --host 0.0.0.0
      --port 5000
      --backend-store-uri postgresql://mlflow:mlflow_pass@postgres:5432/mlflow
      --default-artifact-root s3://mlflow-artifacts/
    depends_on:
      - postgres

  postgres:
    image: postgres:16
    environment:
      POSTGRES_DB: mlflow
      POSTGRES_USER: mlflow
      POSTGRES_PASSWORD: mlflow_pass
    volumes:
      - pg_data:/var/lib/postgresql/data

volumes:
  pg_data:
EOF

docker-compose up -d

# 3. Register a Model
cat > register_model.py << 'PYEOF'
import mlflow
from mlflow.tracking import MlflowClient

mlflow.set_tracking_uri("http://localhost:5000")
client = MlflowClient()

# Train a small example model so the logged artifact is real
from sklearn.linear_model import LogisticRegression
import numpy as np

X = np.random.rand(200, 4)
y = (X[:, 0] > 0.5).astype(int)

with mlflow.start_run(run_name="training-v1") as run:
    model = LogisticRegression().fit(X, y)
    mlflow.log_param("learning_rate", 0.001)
    mlflow.log_param("epochs", 100)
    mlflow.log_metric("accuracy", 0.95)
    mlflow.log_metric("f1_score", 0.93)

    # Register the model in the registry under a fixed name
    mlflow.sklearn.log_model(
        sk_model=model,
        artifact_path="model",
        registered_model_name="fraud-detection-model"
    )

# Transition model stage
client.transition_model_version_stage(
    name="fraud-detection-model",
    version=1,
    stage="Production"
)

print("Model registered and promoted to Production")
PYEOF

# 4. Serve Model
mlflow models serve -m "models:/fraud-detection-model/Production" -p 8001

echo "MLflow Model Registry configured"

Chaos Engineering for ML Systems

Chaos engineering concepts for ML

# === Chaos Engineering for ML ===

# 1. ML System Failure Points
# ===================================
# Data Layer:
#   - Data source unavailable
#   - Data schema changed unexpectedly
#   - Data quality degradation (drift)
#   - Feature store latency spike
#
# Model Layer:
#   - Model registry unreachable
#   - Model artifact corrupted
#   - Model version mismatch
#   - OOM during inference (large batch)
#
# Serving Layer:
#   - Model server crash
#   - GPU failure
#   - High latency under load
#   - Prediction cache miss storm
#
# Pipeline Layer:
#   - Training pipeline failure
#   - Feature pipeline delay
#   - Orchestrator (Airflow) down
#   - Storage full

# 2. Chaos Experiment Types
# ===================================
# Infrastructure chaos:
#   - Kill model server pod
#   - Network partition between services
#   - CPU/memory stress on inference nodes
#   - Disk I/O latency injection
#
# Application chaos:
#   - Inject invalid model version
#   - Corrupt feature values
#   - Simulate model registry timeout
#   - Send malformed prediction requests
#
# Data chaos:
#   - Inject data drift
#   - Remove feature columns
#   - Delay data pipeline
#   - Corrupt training data

# 3. Steady State Hypothesis
# ===================================
# Before chaos: define what "normal" looks like
# Metrics to monitor:
#   - Prediction latency p99 < 100ms
#   - Error rate < 0.1%
#   - Model serving throughput > 1000 RPS
#   - Fallback activation rate < 5%
#   - Feature freshness < 5 minutes

# 4. Tools
# ===================================
# Chaos Mesh (Kubernetes): pod kill, network chaos, IO chaos
# Litmus Chaos: workflow-based chaos experiments
# Gremlin: commercial, easy to use
# Toxiproxy: network-level chaos (latency, packet loss)
# Custom scripts: application-level chaos

echo "Chaos engineering concepts"
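The steady-state hypothesis above can be checked mechanically before, during, and after a fault is injected. A minimal Python sketch, reusing the metric names and thresholds from the list (the threshold table itself is an assumption for illustration):

```python
# Steady-state thresholds, taken from the hypothesis above:
# ("max", x) means the metric must stay at or below x; ("min", x) at or above.
THRESHOLDS = {
    "prediction_latency_p99_ms": ("max", 100),
    "error_rate":                ("max", 0.001),
    "throughput_rps":            ("min", 1000),
    "fallback_rate":             ("max", 0.05),
    "feature_freshness_min":     ("max", 5),
}

def check_steady_state(observed: dict) -> dict:
    """Compare observed metrics against thresholds; report every violation."""
    violations = {}
    for metric, (kind, limit) in THRESHOLDS.items():
        value = observed.get(metric)
        if value is None:
            violations[metric] = "missing"   # a missing metric is itself a finding
        elif kind == "max" and value > limit:
            violations[metric] = f"{value} > {limit}"
        elif kind == "min" and value < limit:
            violations[metric] = f"{value} < {limit}"
    return {"steady": not violations, "violations": violations}

ok = check_steady_state({
    "prediction_latency_p99_ms": 85, "error_rate": 0.0005,
    "throughput_rps": 1200, "fallback_rate": 0.01,
    "feature_freshness_min": 2})
bad = check_steady_state({
    "prediction_latency_p99_ms": 450, "error_rate": 0.002,
    "throughput_rps": 1200, "fallback_rate": 0.01,
    "feature_freshness_min": 2})
print(ok["steady"], bad["steady"])  # prints: True False
```

In a real setup the `observed` dict would be populated from Prometheus queries rather than hard-coded values.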

Writing Chaos Experiments

Implement chaos experiments

#!/usr/bin/env python3
# chaos_experiments.py — ML Chaos Experiments
import json
import random
import logging
from datetime import datetime

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("chaos")

class MLChaosExperiment:
    def __init__(self, target_service):
        self.target = target_service
        self.results = []
    
    def define_experiment(self, name, hypothesis, method, rollback):
        return {
            "name": name,
            "target": self.target,
            "steady_state_hypothesis": hypothesis,
            "method": method,
            "rollback": rollback,
            "created_at": datetime.utcnow().isoformat(),
        }
    
    def model_server_crash(self):
        """Experiment: Kill model serving pod"""
        return self.define_experiment(
            name="model-server-crash",
            hypothesis={
                "title": "System handles model server failure gracefully",
                "probes": [
                    {"type": "http", "url": "/api/predict", "expect_status": 200, "timeout_ms": 200},
                    {"type": "metric", "query": "error_rate", "expect_below": 0.01},
                ],
            },
            method={
                "type": "pod_kill",
                "target": "deployment/model-server",
                "namespace": "ml-serving",
                "action": "kill 1 random pod",
                "duration_seconds": 60,
            },
            rollback={
                "action": "Kubernetes auto-restarts pod",
                "verify": "Check pod count restored and predictions working",
            },
        )
    
    def model_registry_timeout(self):
        """Experiment: Model registry becomes slow"""
        return self.define_experiment(
            name="registry-timeout",
            hypothesis={
                "title": "System uses cached model when registry is slow",
                "probes": [
                    {"type": "http", "url": "/api/predict", "expect_status": 200},
                    {"type": "metric", "query": "fallback_rate", "expect_below": 0.05},
                ],
            },
            method={
                "type": "network_delay",
                "target": "service/mlflow",
                "delay_ms": 5000,
                "duration_seconds": 120,
            },
            rollback={"action": "Remove network delay rule"},
        )
    
    def data_drift_injection(self):
        """Experiment: Inject data drift into feature pipeline"""
        return self.define_experiment(
            name="data-drift-injection",
            hypothesis={
                "title": "System detects and handles data drift",
                "probes": [
                    {"type": "metric", "query": "drift_detected", "expect": True},
                    {"type": "metric", "query": "prediction_quality", "expect_above": 0.80},
                    {"type": "alert", "name": "DataDriftAlert", "expect_fired": True},
                ],
            },
            method={
                "type": "data_mutation",
                "target": "feature-pipeline",
                "mutation": "Shift numerical features by 2 standard deviations",
                "affected_features": ["amount", "frequency", "recency"],
                "duration_seconds": 300,
            },
            rollback={"action": "Restore original feature pipeline"},
        )
    
    def run_experiment(self, experiment):
        """Simulate running a chaos experiment"""
        success = random.random() > 0.3  # 70% chance system handles it
        
        result = {
            "experiment": experiment["name"],
            "started_at": datetime.utcnow().isoformat(),
            "status": "passed" if success else "failed",
            "findings": [],
        }
        
        if not success:
            result["findings"] = [
                "System did not fallback gracefully",
                f"Error rate exceeded threshold during {experiment['method']['type']}",
                "Recommendation: Implement circuit breaker for model registry",
            ]
        
        self.results.append(result)
        return result

chaos = MLChaosExperiment("ml-serving")

exp1 = chaos.model_server_crash()
print("Experiment 1:", json.dumps(exp1["method"], indent=2))

exp2 = chaos.model_registry_timeout()
result = chaos.run_experiment(exp2)
print("\nResult:", json.dumps(result, indent=2))

exp3 = chaos.data_drift_injection()
print("\nDrift Experiment:", json.dumps(exp3["method"], indent=2))
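A recurring finding from the registry-timeout experiment is the lack of a circuit breaker in front of the registry client. A minimal, library-agnostic sketch of one (all names here are illustrative, not any particular library's API):

```python
import time

class CircuitBreaker:
    """Open after `max_failures` consecutive errors; retry after `reset_after` s."""
    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        # While open, short-circuit to the fallback until the cooldown expires
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback()
            self.opened_at = None  # half-open: allow one trial call through
            self.failures = 0
        try:
            result = fn()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            return fallback()

def flaky_registry_lookup():
    # Simulates the registry-timeout chaos experiment
    raise TimeoutError("registry unreachable")

breaker = CircuitBreaker(max_failures=2, reset_after=60)
cached = lambda: "cached-model-v3"
results = [breaker.call(flaky_registry_lookup, cached) for _ in range(5)]
print(results, breaker.opened_at is not None)
```

After two failures the breaker opens and the remaining calls never touch the registry, which is exactly the behavior the "System uses cached model when registry is slow" hypothesis expects.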

Resilience Testing Pipeline

Building a pipeline for resilience testing

# === Resilience Testing Pipeline ===

# 1. Chaos Mesh Experiment (Kubernetes)
cat > chaos/pod-kill.yaml << 'EOF'
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: model-server-pod-kill
  namespace: ml-serving
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces:
      - ml-serving
    labelSelectors:
      app: model-server
  duration: "60s"
  scheduler:
    cron: "@every 24h"
EOF

# 2. Network Chaos
cat > chaos/network-delay.yaml << 'EOF'
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: registry-network-delay
  namespace: ml-serving
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - ml-serving
    labelSelectors:
      app: mlflow-registry
  delay:
    latency: "3000ms"
    jitter: "1000ms"
  duration: "120s"
EOF

# 3. IO Chaos (Disk latency)
cat > chaos/io-stress.yaml << 'EOF'
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: model-storage-io-delay
  namespace: ml-serving
spec:
  action: latency
  mode: one
  selector:
    labelSelectors:
      app: model-server
  volumePath: /models
  delay: "500ms"
  duration: "120s"
EOF

# 4. Apply experiments
kubectl apply -f chaos/pod-kill.yaml
kubectl apply -f chaos/network-delay.yaml

# 5. Monitor during chaos
# Watch metrics:
# - kubectl top pods -n ml-serving
# - linkerd viz stat deploy -n ml-serving
# - curl http://model-server/health
# - curl http://model-server/metrics

# 6. CI/CD Integration
# Run chaos tests as part of staging deployment:
# deploy to staging → run chaos experiments → verify steady state → promote to production

# 7. Game Day Checklist
# ===================================
# [ ] Notify team about game day
# [ ] Verify monitoring dashboards ready
# [ ] Confirm rollback procedures documented
# [ ] Run experiments in staging first
# [ ] Start with smallest blast radius
# [ ] Gradually increase scope
# [ ] Document all findings
# [ ] Create action items for failures
# [ ] Schedule follow-up to verify fixes

echo "Resilience testing pipeline configured"
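The CI/CD flow in step 6 (deploy to staging → run chaos experiments → verify steady state → promote) can be sketched as a simple gate function. The experiments and probe below are simulated for illustration; in CI they would call Chaos Mesh and your metrics backend.

```python
def chaos_gate(experiments, probe):
    """Gate a staging deploy: run each chaos experiment, probe steady state,
    and only allow promotion to production if every probe passes."""
    findings = []
    for name, inject in experiments:
        inject()                     # apply the fault (e.g. kill a pod)
        state = probe()              # measure steady state under the fault
        if not state["steady"]:
            findings.append({"experiment": name,
                             "violations": state["violations"]})
    return {"promote": not findings, "findings": findings}

# Simulated experiments: the second one trips the error-rate threshold.
errors = {"rate": 0.0}
def kill_pod(): errors["rate"] = 0.0005       # system absorbs the pod kill
def delay_registry(): errors["rate"] = 0.02   # fallback path misbehaves

def probe():
    steady = errors["rate"] < 0.01
    return {"steady": steady,
            "violations": {} if steady else {"error_rate": errors["rate"]}}

report = chaos_gate([("pod-kill", kill_pod),
                     ("registry-delay", delay_registry)], probe)
print(report["promote"], [f["experiment"] for f in report["findings"]])
```

A failed gate blocks promotion and hands the findings straight to the game-day action-item list.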

Monitoring and Recovery

Monitor ML systems during chaos

#!/usr/bin/env python3
# chaos_monitor.py — Chaos Experiment Monitoring
import json
import logging
from datetime import datetime

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("monitor")

class ChaosMonitor:
    def __init__(self):
        self.experiments = []
    
    def steady_state_check(self):
        return {
            "timestamp": datetime.utcnow().isoformat(),
            "metrics": {
                "prediction_latency_p99_ms": 85,
                "prediction_error_rate": 0.002,
                "model_server_rps": 1200,
                "fallback_rate": 0.01,
                "model_version": "v3.2.1",
                "feature_freshness_min": 2,
                "gpu_utilization_pct": 65,
            },
            "thresholds": {
                "prediction_latency_p99_ms": {"max": 200, "status": "ok"},
                "prediction_error_rate": {"max": 0.01, "status": "ok"},
                "model_server_rps": {"min": 500, "status": "ok"},
                "fallback_rate": {"max": 0.05, "status": "ok"},
            },
            "overall": "healthy",
        }
    
    def recovery_playbook(self):
        return {
            "model_server_down": {
                "auto_recovery": [
                    "Kubernetes restarts pod automatically",
                    "Load balancer routes to healthy pods",
                    "Prediction cache serves recent results",
                ],
                "manual_steps": [
                    "Check pod logs: kubectl logs -n ml-serving deploy/model-server",
                    "Check events: kubectl get events -n ml-serving",
                    "Scale up if needed: kubectl scale deploy/model-server --replicas=5",
                    "Verify predictions: curl http://model-server/api/predict",
                ],
                "escalation": "Page ML on-call if not recovered in 5 minutes",
            },
            "model_registry_down": {
                "auto_recovery": [
                    "Model server uses locally cached model",
                    "Circuit breaker prevents cascade failure",
                ],
                "manual_steps": [
                    "Check MLflow status: curl http://mlflow:5000/health",
                    "Check PostgreSQL: kubectl exec -it postgres -- pg_isready",
                    "Restart if needed: kubectl rollout restart deploy/mlflow",
                ],
            },
            "data_drift_detected": {
                "auto_recovery": [
                    "Alert sent to data team",
                    "Feature pipeline paused",
                    "Serving continues with last known good model",
                ],
                "manual_steps": [
                    "Investigate drift source",
                    "Validate data quality",
                    "Retrain model if needed",
                    "A/B test new model before full deployment",
                ],
            },
        }

monitor = ChaosMonitor()
steady = monitor.steady_state_check()
print("Steady State:", json.dumps(steady["metrics"], indent=2))

playbook = monitor.recovery_playbook()
print("\nRecovery:", json.dumps(playbook["model_server_down"]["auto_recovery"], indent=2))

Frequently Asked Questions (FAQ)

Q: Is chaos engineering safe?

A: It is safe when done correctly. Always start in a staging environment, use the smallest possible blast radius (e.g. kill one pod, not the whole deployment), have a rollback plan ready, monitor the impact in real time, run during business hours when the team is available, and widen the scope only as confidence grows. Do not run in production until every experiment has passed in staging. Netflix, Google, and Amazon run chaos engineering in production every day, but they have mature observability and rollback systems.

Q: What features should a model registry have?

A: Must-have: model versioning (keep every version), stage management (staging, production, archived), metadata tracking (metrics, parameters, training data), artifact storage (model files, configs), and an API for programmatic access. Nice-to-have: model lineage (data → training → model), A/B testing integration, approval workflows, automated deployment triggers, and a model comparison dashboard. To get started, MLflow is free and covers the essentials; enterprises may prefer a managed service such as SageMaker or Vertex AI.

Q: How often should chaos experiments run?

A: It depends on the maturity of the system. Starting out: run quarterly (every 3 months) as a team game-day event. Intermediate: run monthly as automated experiments in staging. Advanced: run weekly, or continuously in production (automated). Every new deployment should pass chaos experiments in staging before being promoted, and for ML systems that retrain frequently, run a chaos test every time a new model version is deployed.

Q: How do you build a fallback strategy for ML serving?

A: Use multiple levels. Level 1: serve cached predictions for requests that have been predicted before (e.g. a Redis cache). Level 2: fall back to a simpler model with faster inference (e.g. logistic regression instead of deep learning). Level 3: use a rule-based system (business rules that need no ML). Level 4: return default/safe values that cause no harm. Put a circuit breaker at the model serving layer so that a high error rate switches to the fallback automatically, and monitor the fallback rate; a higher-than-normal rate signals a problem.
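The four fallback levels can be chained into a single serving path. A minimal sketch (the function names and the fraud rule are hypothetical):

```python
def predict_with_fallback(request, cache, fast_model, primary_model):
    """Try each serving level in order; return (prediction, level used)."""
    # Level 0: the primary model
    try:
        return primary_model(request), "primary"
    except Exception:
        pass
    # Level 1: cached prediction for a previously seen request
    key = request["id"]
    if key in cache:
        return cache[key], "cache"
    # Level 2: simpler, faster model
    try:
        return fast_model(request), "simple-model"
    except Exception:
        pass
    # Level 3: rule-based decision (hypothetical business rule)
    if request.get("amount", 0) > 10000:
        return {"fraud": True}, "rule"
    # Level 4: safe default that causes no harm
    return {"fraud": False}, "default"

def broken_model(req):
    # Simulates a model server outage at every model level
    raise RuntimeError("model server down")

cache = {"req-1": {"fraud": False}}
out1 = predict_with_fallback({"id": "req-1"}, cache, broken_model, broken_model)
out2 = predict_with_fallback({"id": "req-2", "amount": 50000},
                             cache, broken_model, broken_model)
print(out1[1], out2[1])  # prints: cache rule
```

Logging which level answered each request is what makes the fallback-rate metric from the steady-state hypothesis observable.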
