What is a Model Registry?
A Model Registry is a centralized system for managing Machine Learning models. It stores model versions, metadata, artifacts, lineage, and deployment status, letting Data Science teams and ML Engineers manage the ML lifecycle systematically, from training all the way to production deployment.
Popular Model Registry tools include MLflow Model Registry (open source, the most popular), Weights & Biases (W&B) Registry, Amazon SageMaker Model Registry, Google Vertex AI Model Registry, and Azure ML Model Registry. Each has its own strengths; MLflow is a good starting point because it is open source and flexible.
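Conceptually, each registry entry ties a model version to its stage, metrics, and artifact location. The sketch below is purely illustrative (it is not MLflow's or any other tool's actual schema; all names are hypothetical):

```python
from dataclasses import dataclass, field


@dataclass
class ModelVersion:
    """Illustrative registry entry: what a registry tracks per model version."""
    name: str                  # registered model name
    version: int               # monotonically increasing version number
    stage: str = "None"        # e.g. None, Staging, Production, Archived
    metrics: dict = field(default_factory=dict)  # evaluation metrics
    params: dict = field(default_factory=dict)   # training hyperparameters
    artifact_uri: str = ""     # where the serialized model lives

    def promote(self, stage: str) -> None:
        # Stage transitions are how teams gate deployment
        allowed = {"None", "Staging", "Production", "Archived"}
        if stage not in allowed:
            raise ValueError(f"unknown stage: {stage}")
        self.stage = stage


mv = ModelVersion(name="fraud-detection-model", version=1,
                  metrics={"accuracy": 0.95},
                  artifact_uri="s3://mlflow-artifacts/model")
mv.promote("Production")
print(mv.stage)  # Production
```

The point is the shape of the data: version, stage, metrics, and artifact URI travel together, which is what makes rollbacks and audits possible.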
Chaos Engineering matters a great deal for ML systems because ML pipelines have many potential failure points, from data ingestion, feature extraction, model serving, and prediction caching down to fallback mechanisms. Running chaos experiments uncovers weaknesses before they cause real incidents.
Installing a Model Registry
Setup MLflow Model Registry
# === MLflow Model Registry Setup ===
# 1. Install MLflow
pip install mlflow boto3 psycopg2-binary
# 2. Start MLflow Server with PostgreSQL backend
cat > docker-compose.yml << 'EOF'
version: '3.8'
services:
  mlflow:
    image: ghcr.io/mlflow/mlflow:v2.16.0
    ports:
      - "5000:5000"
    environment:
      - MLFLOW_BACKEND_STORE_URI=postgresql://mlflow:mlflow_pass@postgres:5432/mlflow
      - MLFLOW_DEFAULT_ARTIFACT_ROOT=s3://mlflow-artifacts/
      - AWS_ACCESS_KEY_ID=
      - AWS_SECRET_ACCESS_KEY=
    command: >
      mlflow server
      --host 0.0.0.0
      --port 5000
      --backend-store-uri postgresql://mlflow:mlflow_pass@postgres:5432/mlflow
      --default-artifact-root s3://mlflow-artifacts/
    depends_on:
      - postgres
  postgres:
    image: postgres:16
    environment:
      POSTGRES_DB: mlflow
      POSTGRES_USER: mlflow
      POSTGRES_PASSWORD: mlflow_pass
    volumes:
      - pg_data:/var/lib/postgresql/data
volumes:
  pg_data:
EOF
docker-compose up -d
# 3. Register a Model
cat > register_model.py << 'PYEOF'
import mlflow
import mlflow.sklearn
from mlflow.tracking import MlflowClient
from sklearn.linear_model import LogisticRegression

mlflow.set_tracking_uri("http://localhost:5000")
client = MlflowClient()

# Log model during training
with mlflow.start_run(run_name="training-v1") as run:
    # Train a model (minimal stand-in so log_model has an artifact to serialize)
    model = LogisticRegression().fit([[0], [1], [2], [3]], [0, 0, 1, 1])
    mlflow.log_param("learning_rate", 0.001)
    mlflow.log_param("epochs", 100)
    mlflow.log_metric("accuracy", 0.95)
    mlflow.log_metric("f1_score", 0.93)
    # Register model
    mlflow.sklearn.log_model(
        sk_model=model,
        artifact_path="model",
        registered_model_name="fraud-detection-model",
    )

# Transition model stage
client.transition_model_version_stage(
    name="fraud-detection-model",
    version=1,
    stage="Production",
)
print("Model registered and promoted to Production")
PYEOF
# 4. Serve Model
mlflow models serve -m "models:/fraud-detection-model/Production" -p 8001
echo "MLflow Model Registry configured"
Chaos Engineering for ML Systems
Chaos Engineering concepts for ML
# === Chaos Engineering for ML ===
# 1. ML System Failure Points
# ===================================
# Data Layer:
# - Data source unavailable
# - Data schema changed unexpectedly
# - Data quality degradation (drift)
# - Feature store latency spike
#
# Model Layer:
# - Model registry unreachable
# - Model artifact corrupted
# - Model version mismatch
# - OOM during inference (large batch)
#
# Serving Layer:
# - Model server crash
# - GPU failure
# - High latency under load
# - Prediction cache miss storm
#
# Pipeline Layer:
# - Training pipeline failure
# - Feature pipeline delay
# - Orchestrator (Airflow) down
# - Storage full
# 2. Chaos Experiment Types
# ===================================
# Infrastructure chaos:
# - Kill model server pod
# - Network partition between services
# - CPU/memory stress on inference nodes
# - Disk I/O latency injection
#
# Application chaos:
# - Inject invalid model version
# - Corrupt feature values
# - Simulate model registry timeout
# - Send malformed prediction requests
#
# Data chaos:
# - Inject data drift
# - Remove feature columns
# - Delay data pipeline
# - Corrupt training data
# 3. Steady State Hypothesis
# ===================================
# Before chaos: define what "normal" looks like
# Metrics to monitor:
# - Prediction latency p99 < 100ms
# - Error rate < 0.1%
# - Model serving throughput > 1000 RPS
# - Fallback activation rate < 5%
# - Feature freshness < 5 minutes
# 4. Tools
# ===================================
# Chaos Mesh (Kubernetes): pod kill, network chaos, IO chaos
# Litmus Chaos: workflow-based chaos experiments
# Gremlin: commercial, easy to use
# Toxiproxy: network-level chaos (latency, packet loss)
# Custom scripts: application-level chaos
echo "Chaos engineering concepts"
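The steady-state hypothesis above can be checked mechanically: compare live metrics against thresholds before and during an experiment. A minimal sketch (the metric names and limits mirror the list above; the function and its thresholds are illustrative, not a real monitoring API):

```python
def check_steady_state(metrics: dict) -> dict:
    """Return per-metric status plus an overall verdict."""
    # ("max", limit) means the metric must stay below the limit;
    # ("min", limit) means it must stay above it.
    thresholds = {
        "prediction_latency_p99_ms": ("max", 100),
        "error_rate": ("max", 0.001),
        "throughput_rps": ("min", 1000),
        "fallback_rate": ("max", 0.05),
        "feature_freshness_min": ("max", 5),
    }
    statuses = {}
    for name, (kind, limit) in thresholds.items():
        value = metrics.get(name)
        if value is None:
            statuses[name] = "missing"
        elif kind == "max":
            statuses[name] = "ok" if value < limit else "violated"
        else:
            statuses[name] = "ok" if value > limit else "violated"
    overall = "healthy" if all(s == "ok" for s in statuses.values()) else "degraded"
    return {"statuses": statuses, "overall": overall}


result = check_steady_state({
    "prediction_latency_p99_ms": 85, "error_rate": 0.0005,
    "throughput_rps": 1200, "fallback_rate": 0.01, "feature_freshness_min": 2,
})
print(result["overall"])  # healthy
```

Running this check continuously during an experiment, not just at the end, is what lets you abort early when the blast radius turns out larger than expected.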
Writing Chaos Experiments
Implement chaos experiments
#!/usr/bin/env python3
# chaos_experiments.py — ML Chaos Experiments
import json
import random
import logging
from datetime import datetime
from typing import Dict, List
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("chaos")
class MLChaosExperiment:
    def __init__(self, target_service):
        self.target = target_service
        self.results = []

    def define_experiment(self, name, hypothesis, method, rollback):
        return {
            "name": name,
            "target": self.target,
            "steady_state_hypothesis": hypothesis,
            "method": method,
            "rollback": rollback,
            "created_at": datetime.utcnow().isoformat(),
        }

    def model_server_crash(self):
        """Experiment: Kill model serving pod"""
        return self.define_experiment(
            name="model-server-crash",
            hypothesis={
                "title": "System handles model server failure gracefully",
                "probes": [
                    {"type": "http", "url": "/api/predict", "expect_status": 200, "timeout_ms": 200},
                    {"type": "metric", "query": "error_rate", "expect_below": 0.01},
                ],
            },
            method={
                "type": "pod_kill",
                "target": "deployment/model-server",
                "namespace": "ml-serving",
                "action": "kill 1 random pod",
                "duration_seconds": 60,
            },
            rollback={
                "action": "Kubernetes auto-restarts pod",
                "verify": "Check pod count restored and predictions working",
            },
        )

    def model_registry_timeout(self):
        """Experiment: Model registry becomes slow"""
        return self.define_experiment(
            name="registry-timeout",
            hypothesis={
                "title": "System uses cached model when registry is slow",
                "probes": [
                    {"type": "http", "url": "/api/predict", "expect_status": 200},
                    {"type": "metric", "query": "fallback_rate", "expect_below": 0.05},
                ],
            },
            method={
                "type": "network_delay",
                "target": "service/mlflow",
                "delay_ms": 5000,
                "duration_seconds": 120,
            },
            rollback={"action": "Remove network delay rule"},
        )

    def data_drift_injection(self):
        """Experiment: Inject data drift into feature pipeline"""
        return self.define_experiment(
            name="data-drift-injection",
            hypothesis={
                "title": "System detects and handles data drift",
                "probes": [
                    {"type": "metric", "query": "drift_detected", "expect": True},
                    {"type": "metric", "query": "prediction_quality", "expect_above": 0.80},
                    {"type": "alert", "name": "DataDriftAlert", "expect_fired": True},
                ],
            },
            method={
                "type": "data_mutation",
                "target": "feature-pipeline",
                "mutation": "Shift numerical features by 2 standard deviations",
                "affected_features": ["amount", "frequency", "recency"],
                "duration_seconds": 300,
            },
            rollback={"action": "Restore original feature pipeline"},
        )

    def run_experiment(self, experiment):
        """Simulate running a chaos experiment"""
        success = random.random() > 0.3  # 70% chance the system handles it
        result = {
            "experiment": experiment["name"],
            "started_at": datetime.utcnow().isoformat(),
            "status": "passed" if success else "failed",
            "findings": [],
        }
        if not success:
            result["findings"] = [
                "System did not fall back gracefully",
                f"Error rate exceeded threshold during {experiment['method']['type']}",
                "Recommendation: Implement circuit breaker for model registry",
            ]
        self.results.append(result)
        return result
chaos = MLChaosExperiment("ml-serving")
exp1 = chaos.model_server_crash()
print("Experiment 1:", json.dumps(exp1["method"], indent=2))
exp2 = chaos.model_registry_timeout()
result = chaos.run_experiment(exp2)
print("\nResult:", json.dumps(result, indent=2))
exp3 = chaos.data_drift_injection()
print("\nDrift Experiment:", json.dumps(exp3["method"], indent=2))
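One safety mechanism worth wiring next to run_experiment is an abort condition: stop an experiment the moment impact exceeds a hard cap instead of waiting out the full duration. A hypothetical sketch (the function name, metric names, and caps are all assumptions, not part of any chaos framework):

```python
def should_abort(live_metrics: dict, caps: dict) -> bool:
    """Abort the experiment if any live metric exceeds its hard cap."""
    return any(live_metrics.get(name, 0) > cap for name, cap in caps.items())


# Hard abort limits, deliberately looser than the SLO targets
caps = {"error_rate": 0.05, "fallback_rate": 0.50}

print(should_abort({"error_rate": 0.002, "fallback_rate": 0.10}, caps))  # False
print(should_abort({"error_rate": 0.09, "fallback_rate": 0.10}, caps))   # True
```

Note that abort caps are intentionally looser than the steady-state thresholds: a violated SLO means the experiment found something, while a breached cap means the experiment itself is doing harm.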
Resilience Testing Pipeline
Build a pipeline for resilience testing
# === Resilience Testing Pipeline ===
# 1. Chaos Mesh Experiment (Kubernetes)
cat > chaos/pod-kill.yaml << 'EOF'
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: model-server-pod-kill
  namespace: ml-serving
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces:
      - ml-serving
    labelSelectors:
      app: model-server
  duration: "60s"
  scheduler:
    cron: "@every 24h"
EOF
# 2. Network Chaos
cat > chaos/network-delay.yaml << 'EOF'
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: registry-network-delay
  namespace: ml-serving
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - ml-serving
    labelSelectors:
      app: mlflow-registry
  delay:
    latency: "3000ms"
    jitter: "1000ms"
  duration: "120s"
EOF
# 3. IO Chaos (Disk latency)
cat > chaos/io-stress.yaml << 'EOF'
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: model-storage-io-delay
  namespace: ml-serving
spec:
  action: latency
  mode: one
  selector:
    labelSelectors:
      app: model-server
  volumePath: /models
  delay: "500ms"
  duration: "120s"
EOF
# 4. Apply experiments
kubectl apply -f chaos/pod-kill.yaml
kubectl apply -f chaos/network-delay.yaml
# 5. Monitor during chaos
# Watch metrics:
# - kubectl top pods -n ml-serving
# - linkerd viz stat deploy -n ml-serving
# - curl http://model-server/health
# - curl http://model-server/metrics
# 6. CI/CD Integration
# Run chaos tests as part of staging deployment:
# deploy to staging → run chaos experiments → verify steady state → promote to production
# 7. Game Day Checklist
# ===================================
# [ ] Notify team about game day
# [ ] Verify monitoring dashboards ready
# [ ] Confirm rollback procedures documented
# [ ] Run experiments in staging first
# [ ] Start with smallest blast radius
# [ ] Gradually increase scope
# [ ] Document all findings
# [ ] Create action items for failures
# [ ] Schedule follow-up to verify fixes
echo "Resilience testing pipeline configured"
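The staging gate in step 6 ("deploy to staging → run chaos experiments → verify steady state → promote to production") can be expressed as a simple decision function over experiment results. A sketch, assuming results shaped like those emitted by chaos_experiments.py above (the function name is hypothetical):

```python
def promotion_decision(results: list) -> str:
    """Gate production promotion on chaos experiment outcomes."""
    failed = [r["experiment"] for r in results if r.get("status") != "passed"]
    if failed:
        return "block: failed experiments -> " + ", ".join(failed)
    return "promote"


results = [
    {"experiment": "model-server-crash", "status": "passed"},
    {"experiment": "registry-timeout", "status": "failed"},
]
print(promotion_decision(results))  # block: failed experiments -> registry-timeout
```

In a CI/CD pipeline this would run after the chaos stage, failing the job (and thus blocking promotion) whenever any experiment did not pass.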
Monitoring and Recovery
Monitor ML systems during chaos
#!/usr/bin/env python3
# chaos_monitor.py — Chaos Experiment Monitoring
import json
import logging
from datetime import datetime
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("monitor")
class ChaosMonitor:
    def __init__(self):
        self.experiments = []

    def steady_state_check(self):
        return {
            "timestamp": datetime.utcnow().isoformat(),
            "metrics": {
                "prediction_latency_p99_ms": 85,
                "prediction_error_rate": 0.002,
                "model_server_rps": 1200,
                "fallback_rate": 0.01,
                "model_version": "v3.2.1",
                "feature_freshness_min": 2,
                "gpu_utilization_pct": 65,
            },
            "thresholds": {
                "prediction_latency_p99_ms": {"max": 200, "status": "ok"},
                "prediction_error_rate": {"max": 0.01, "status": "ok"},
                "model_server_rps": {"min": 500, "status": "ok"},
                "fallback_rate": {"max": 0.05, "status": "ok"},
            },
            "overall": "healthy",
        }

    def recovery_playbook(self):
        return {
            "model_server_down": {
                "auto_recovery": [
                    "Kubernetes restarts pod automatically",
                    "Load balancer routes to healthy pods",
                    "Prediction cache serves recent results",
                ],
                "manual_steps": [
                    "Check pod logs: kubectl logs -n ml-serving deploy/model-server",
                    "Check events: kubectl get events -n ml-serving",
                    "Scale up if needed: kubectl scale deploy/model-server --replicas=5",
                    "Verify predictions: curl http://model-server/api/predict",
                ],
                "escalation": "Page ML on-call if not recovered in 5 minutes",
            },
            "model_registry_down": {
                "auto_recovery": [
                    "Model server uses locally cached model",
                    "Circuit breaker prevents cascade failure",
                ],
                "manual_steps": [
                    "Check MLflow status: curl http://mlflow:5000/health",
                    "Check PostgreSQL: kubectl exec -it postgres -- pg_isready",
                    "Restart if needed: kubectl rollout restart deploy/mlflow",
                ],
            },
            "data_drift_detected": {
                "auto_recovery": [
                    "Alert sent to data team",
                    "Feature pipeline paused",
                    "Serving continues with last known good model",
                ],
                "manual_steps": [
                    "Investigate drift source",
                    "Validate data quality",
                    "Retrain model if needed",
                    "A/B test new model before full deployment",
                ],
            },
        }
monitor = ChaosMonitor()
steady = monitor.steady_state_check()
print("Steady State:", json.dumps(steady["metrics"], indent=2))
playbook = monitor.recovery_playbook()
print("\nRecovery:", json.dumps(playbook["model_server_down"]["auto_recovery"], indent=2))
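The playbook above assumes a circuit breaker in front of the model registry ("Circuit breaker prevents cascade failure"). A minimal count-based sketch, with illustrative thresholds (this is a toy, not a production breaker library):

```python
import time


class CircuitBreaker:
    """Open after N consecutive failures; allow a probe after a cooldown."""

    def __init__(self, failure_threshold=3, reset_timeout_s=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True  # closed: traffic flows normally
        # Half-open: let a probe through once the cooldown has elapsed
        return (time.monotonic() - self.opened_at) >= self.reset_timeout_s

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()


cb = CircuitBreaker(failure_threshold=2)
cb.record_failure()
print(cb.allow_request())  # True — one failure, breaker still closed
cb.record_failure()
print(cb.allow_request())  # False — open: fall back to the cached model
```

While the breaker is open, the model server skips registry calls entirely and serves from its local model cache, which is exactly the behavior the registry-timeout chaos experiment is designed to verify.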
FAQ: Frequently Asked Questions
Q: Is Chaos Engineering safe?
A: It is safe when done properly. Always start in a staging environment. Use the smallest possible blast radius (e.g. kill 1 pod, not the whole deployment). Have a rollback plan ready and monitoring in place to watch the impact in real time. Run experiments during business hours when the team is available, and widen the scope gradually as confidence grows. Do not run in production until every experiment has passed in staging. Netflix, Google, and Amazon run chaos engineering in production every day, but they have mature observability and rollback systems.
Q: What features should a Model Registry have?
A: Must-have: model versioning (keep every version), stage management (staging, production, archived), metadata tracking (metrics, parameters, training data), artifact storage (model files, configs), and an API for programmatic access. Nice-to-have: model lineage (data → training → model), A/B testing integration, approval workflows, automated deployment triggers, and a model comparison dashboard. To get started, MLflow is free and covers the essentials; for enterprise use you may want a managed service such as SageMaker or Vertex AI.
Q: How often should chaos experiments run?
A: It depends on the maturity of the system. Getting started: run quarterly (every 3 months) as a team game day event. Intermediate: run monthly as automated experiments in staging. Advanced: run weekly, or continuously in production (automated). Every new deployment should pass chaos experiments in staging before promotion, and for ML systems that retrain frequently, run a chaos test every time a new model version is deployed.
Q: How do you build a fallback strategy for ML serving?
A: Use multiple tiers. Level 1: serve cached predictions for requests that have been predicted before (Redis cache). Level 2: use a simpler model with faster inference (e.g. logistic regression instead of deep learning). Level 3: use a rule-based system (business rules that need no ML). Level 4: return default/safe values that cause no harm. Put a circuit breaker at the model serving layer so that when the error rate spikes, traffic switches to the fallback automatically, and monitor the fallback rate: if it is higher than usual, something is wrong.
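The four fallback levels can be sketched as an ordered chain: try each predictor in turn and record which tier served the request. Everything below (function names, the stand-in predictors, the tier labels) is hypothetical illustration:

```python
def predict_with_fallback(request, cache, primary_model, simple_model, rules,
                          default=0.0):
    """Try each tier in order; return (prediction, tier_used)."""
    key = tuple(sorted(request.items()))  # hashable cache key for the request
    try:
        return primary_model(request), "primary"
    except Exception:
        pass
    if key in cache:                       # Level 1: cached prediction
        return cache[key], "cache"
    try:
        return simple_model(request), "simple_model"  # Level 2: simpler model
    except Exception:
        pass
    for condition, value in rules:         # Level 3: business rules
        if condition(request):
            return value, "rules"
    return default, "default"              # Level 4: safe default value


def primary(req):
    raise RuntimeError("model server down")  # simulate the primary failing


def simple(req):
    return 0.3  # stand-in for a fast logistic-regression score


pred, tier = predict_with_fallback({"amount": 120}, cache={},
                                   primary_model=primary,
                                   simple_model=simple, rules=[])
print(tier)  # simple_model
```

Emitting the tier label alongside the prediction is what makes the fallback rate observable: a metrics counter keyed on the tier is exactly the "fallback_rate" probed by the chaos experiments above.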
