What Is a Model Registry and Why Use GitOps?
A Model Registry is a central repository for managing ML models. It stores model versions, metadata, metrics, artifacts, and deployment status, letting ML teams track, compare, and manage models systematically.
GitOps is a practice that uses Git as the single source of truth for infrastructure and application deployments. Every change goes through Git (via Pull Request), which gives you an audit trail, a review process, and easy rollbacks.
Combining a Model Registry with GitOps makes ML model deployments reliable: every model deployment goes through PR review, automated tests run before deploy, rollback is as simple as reverting a Git commit, and you keep a complete history of every deployment.
The tools used in this stack are MLflow Model Registry for model versioning, GitHub/GitLab for the GitOps workflow, ArgoCD or Flux for Kubernetes GitOps, GitHub Actions for CI/CD, and Kubernetes for the model-serving infrastructure.
Setting Up MLflow Model Registry
Install and configure MLflow for production:
# Install MLflow
pip install mlflow[extras] psycopg2-binary boto3

# Run MLflow server (production setup)
mlflow server \
  --backend-store-uri postgresql://mlflow:password@db:5432/mlflow \
  --default-artifact-root s3://mlflow-artifacts/ \
  --host 0.0.0.0 \
  --port 5000
# Docker Compose for MLflow
# docker-compose.yml
# services:
#   mlflow:
#     image: ghcr.io/mlflow/mlflow:latest
#     ports: ["5000:5000"]
#     environment:
#       MLFLOW_TRACKING_URI: postgresql://mlflow:pass@db:5432/mlflow
#       AWS_ACCESS_KEY_ID:
#       AWS_SECRET_ACCESS_KEY:
#     command: >
#       mlflow server
#       --backend-store-uri postgresql://mlflow:pass@db:5432/mlflow
#       --default-artifact-root s3://mlflow-artifacts/
#       --host 0.0.0.0
#   db:
#     image: postgres:15
#     environment:
#       POSTGRES_USER: mlflow
#       POSTGRES_PASSWORD: pass
#       POSTGRES_DB: mlflow
#     volumes:
#       - pgdata:/var/lib/postgresql/data
# volumes:
#   pgdata:
# === Register Model ===
import mlflow
from mlflow.tracking import MlflowClient

mlflow.set_tracking_uri("http://mlflow:5000")
client = MlflowClient()

# Train and log model
with mlflow.start_run(run_name="xgboost_v2") as run:
    # ... training code ...
    mlflow.xgboost.log_model(model, "model")
    mlflow.log_metrics({"accuracy": 0.95, "f1": 0.92, "auc": 0.97})
    mlflow.log_params({"max_depth": 6, "n_estimators": 200})

# Register model
model_uri = f"runs:/{run.info.run_id}/model"
result = mlflow.register_model(model_uri, "fraud_detector")

# Transition to staging
client.transition_model_version_stage(
    name="fraud_detector",
    version=result.version,
    stage="Staging",
)

# Add description
client.update_model_version(
    name="fraud_detector",
    version=result.version,
    description="XGBoost v2: improved feature engineering, F1=0.92",
)

# List all versions
for mv in client.search_model_versions("name='fraud_detector'"):
    print(f"Version {mv.version}: stage={mv.current_stage}, run_id={mv.run_id}")
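Once a version is registered and staged, downstream consumers resolve it back from the registry by URI. The sketch below is illustrative: the pure `model_uri_for` helper (a name introduced here, not an MLflow API) just formats the `models:/` URI scheme MLflow uses, while the guarded block shows how a serving process might load the model — that part needs a reachable tracking server, so it is not run at import time.

```python
def model_uri_for(name, version=None, stage=None):
    """Build an MLflow registry URI: by explicit version, or by stage alias."""
    if version is not None:
        return f"models:/{name}/{version}"
    return f"models:/{name}/{stage}"


if __name__ == "__main__":
    # Requires a live MLflow tracking server (assumed at http://mlflow:5000).
    import mlflow

    mlflow.set_tracking_uri("http://mlflow:5000")
    model = mlflow.pyfunc.load_model(model_uri_for("fraud_detector", stage="Staging"))
    # predictions = model.predict(batch_df)
```

Referencing by stage (`models:/fraud_detector/Staging`) always resolves to the current version in that stage, which is what the GitOps configs below pin down explicitly by version number instead.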
Building a GitOps Workflow for Model Deployment
Git repository structure for model deployments:
# GitOps Repository Structure
# ml-deployments/
# ├── models/
# │ ├── fraud-detector/
# │ │ ├── staging/
# │ │ │ ├── kustomization.yaml
# │ │ │ ├── deployment.yaml
# │ │ │ └── service.yaml
# │ │ ├── production/
# │ │ │ ├── kustomization.yaml
# │ │ │ ├── deployment.yaml
# │ │ │ ├── service.yaml
# │ │ │ └── hpa.yaml
# │ │ └── model-config.yaml # model version & metadata
# │ └── recommender/
# │ ├── staging/
# │ └── production/
# ├── base/
# │ ├── deployment.yaml
# │ ├── service.yaml
# │ └── kustomization.yaml
# ├── scripts/
# │ ├── promote_model.py
# │ ├── validate_model.py
# │ └── rollback_model.py
# └── .github/
# └── workflows/
# ├── deploy-staging.yml
# ├── promote-production.yml
# └── validate-model.yml
# model-config.yaml — Model version config (source of truth)
# apiVersion: v1
# kind: ConfigMap
# metadata:
#   name: fraud-detector-config
# data:
#   model_name: fraud_detector
#   model_version: "5"
#   model_stage: Production
#   mlflow_tracking_uri: http://mlflow:5000
#   model_uri: models:/fraud_detector/5
#   min_replicas: "3"
#   max_replicas: "10"
#   cpu_request: "500m"
#   memory_request: "1Gi"
# production/deployment.yaml
# apiVersion: apps/v1
# kind: Deployment
# metadata:
#   name: fraud-detector
#   labels:
#     app: fraud-detector
#     model-version: "5"
# spec:
#   replicas: 3
#   selector:
#     matchLabels:
#       app: fraud-detector
#   template:
#     metadata:
#       labels:
#         app: fraud-detector
#         model-version: "5"
#     spec:
#       containers:
#         - name: model-server
#           image: ml-serving:latest
#           env:
#             - name: MODEL_URI
#               valueFrom:
#                 configMapKeyRef:
#                   name: fraud-detector-config
#                   key: model_uri
#             - name: MLFLOW_TRACKING_URI
#               valueFrom:
#                 configMapKeyRef:
#                   name: fraud-detector-config
#                   key: mlflow_tracking_uri
#           ports:
#             - containerPort: 8080
#           resources:
#             requests:
#               cpu: 500m
#               memory: 1Gi
#             limits:
#               cpu: 1000m
#               memory: 2Gi
#           readinessProbe:
#             httpGet:
#               path: /health
#               port: 8080
#             initialDelaySeconds: 30
#           livenessProbe:
#             httpGet:
#               path: /health
#               port: 8080
#             initialDelaySeconds: 60
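The `ml-serving` image itself is not shown in this article. A minimal stdlib-only sketch of what that container might do — read `MODEL_URI` and `MLFLOW_TRACKING_URI` from the environment (injected via `configMapKeyRef` above) and expose the `/health` endpoint the probes expect. The function and class names here are illustrative, not part of any library:

```python
import json
import os
from http.server import BaseHTTPRequestHandler, HTTPServer


def load_serving_config(env=os.environ):
    """Read serving settings injected by the ConfigMap, with safe defaults."""
    return {
        "model_uri": env.get("MODEL_URI", "models:/fraud_detector/Production"),
        "tracking_uri": env.get("MLFLOW_TRACKING_URI", "http://mlflow:5000"),
        "port": int(env.get("PORT", "8080")),
    }


class HealthHandler(BaseHTTPRequestHandler):
    """Answers the readiness/liveness probes defined in deployment.yaml."""

    def do_GET(self):
        if self.path == "/health":
            body = json.dumps({"status": "ok"}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()


if __name__ == "__main__":
    cfg = load_serving_config()
    # A real server would load the model here, e.g. mlflow.pyfunc.load_model(cfg["model_uri"]),
    # and add a /predict route; this sketch only serves the health probe.
    HTTPServer(("0.0.0.0", cfg["port"]), HealthHandler).serve_forever()
```

Because the model URI comes from the ConfigMap rather than being baked into the image, promoting a new version is purely a Git change — no image rebuild needed.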
CI/CD Pipeline for ML Models
GitHub Actions workflows for model deployment:
# .github/workflows/deploy-staging.yml
name: Deploy Model to Staging
on:
  push:
    paths:
      - 'models/*/staging/**'
      - 'models/*/model-config.yaml'
    branches: [main]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: pip install mlflow requests pyyaml
      - name: Validate model config
        run: python scripts/validate_model.py --env staging
      - name: Run model tests
        env:
          MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
        run: |
          python scripts/test_model.py \
            --model-name fraud_detector \
            --min-accuracy 0.90 \
            --min-f1 0.85
  deploy-staging:
    needs: validate
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup kubectl
        uses: azure/setup-kubectl@v3
      - name: Configure kubeconfig
        run: |
          mkdir -p ~/.kube
          echo "${{ secrets.KUBECONFIG }}" | base64 -d > ~/.kube/config
      - name: Deploy to staging
        run: |
          kubectl apply -k models/fraud-detector/staging/
          kubectl rollout status deployment/fraud-detector -n staging --timeout=300s
      - name: Run integration tests
        run: |
          STAGING_URL=$(kubectl get svc fraud-detector -n staging -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
          python scripts/integration_test.py --url "http://$STAGING_URL:8080"
      - name: Notify
        if: always()
        uses: slackapi/slack-github-action@v1
        with:
          payload: |
            {"text": "Staging deploy ${{ job.status }}: fraud-detector"}
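The workflow above calls `scripts/test_model.py`, which is not shown in this article. A hedged sketch of what such a gate might look like: fetch the latest Staging version's metrics from MLflow and fail the job when they fall below the thresholds passed on the command line. The pure threshold check is separated out so it can run without a tracking server:

```python
import argparse
import os


def check_thresholds(metrics, thresholds):
    """Return human-readable failures; an empty list means the gate passes."""
    failures = []
    for name, minimum in thresholds.items():
        actual = metrics.get(name, 0.0)
        if actual < minimum:
            failures.append(f"{name}: {actual:.4f} < {minimum}")
    return failures


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-name", required=True)
    parser.add_argument("--min-accuracy", type=float, default=0.0)
    parser.add_argument("--min-f1", type=float, default=0.0)
    args = parser.parse_args()

    # Look up the metrics logged with the latest Staging version's run.
    from mlflow.tracking import MlflowClient

    client = MlflowClient(tracking_uri=os.environ["MLFLOW_TRACKING_URI"])
    mv = client.get_latest_versions(args.model_name, stages=["Staging"])[0]
    metrics = client.get_run(mv.run_id).data.metrics

    failures = check_thresholds(
        metrics, {"accuracy": args.min_accuracy, "f1": args.min_f1}
    )
    if failures:
        raise SystemExit("Model gate failed: " + "; ".join(failures))
    print("Model gate passed")


if __name__ == "__main__":
    main()
```

A non-zero exit (`SystemExit` with a message) is enough to fail the `Run model tests` step and block the staging deploy.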
# .github/workflows/promote-production.yml
name: Promote Model to Production
on:
  pull_request:
    types: [closed]
    paths:
      - 'models/*/production/**'
    branches: [main]
jobs:
  promote:
    if: github.event.pull_request.merged == true
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4
      - name: Validate production config
        run: python scripts/validate_model.py --env production
      - name: Blue-Green Deploy
        run: |
          # Deploy new version alongside old
          kubectl apply -k models/fraud-detector/production/
          kubectl rollout status deployment/fraud-detector -n production --timeout=600s
          # Run canary tests
          python scripts/canary_test.py --duration 300 --error-threshold 0.01
      - name: Update MLflow Stage
        env:
          MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
        run: |
          python - <<'EOF'
          from mlflow.tracking import MlflowClient
          import yaml

          with open('models/fraud-detector/model-config.yaml') as f:
              config = yaml.safe_load(f)
          version = config['data']['model_version']
          client = MlflowClient()
          client.transition_model_version_stage(
              name='fraud_detector', version=version, stage='Production'
          )
          print(f'Model v{version} promoted to Production')
          EOF
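Both workflows call `scripts/validate_model.py`, which is also not shown. A sketch of the checks it might run against the `model-config.yaml` ConfigMap — internal consistency of the `models:/` URI, a numeric version string, and stage matching the target environment. The function name and the specific checks are assumptions for illustration:

```python
import argparse


def validate_config(config, env):
    """Return a list of problems with a model-config.yaml ConfigMap dict."""
    errors = []
    data = config.get("data", {})
    for key in ("model_name", "model_version", "model_uri", "model_stage"):
        if key not in data:
            errors.append(f"missing data key: {key}")
    if errors:
        return errors
    expected_uri = f"models:/{data['model_name']}/{data['model_version']}"
    if data["model_uri"] != expected_uri:
        errors.append(f"model_uri {data['model_uri']!r} does not match {expected_uri!r}")
    if not str(data["model_version"]).isdigit():
        errors.append("model_version must be a numeric string")
    if env == "production" and data["model_stage"] != "Production":
        errors.append("production config must set model_stage: Production")
    return errors


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--env", required=True, choices=["staging", "production"])
    parser.add_argument("--config", default="models/fraud-detector/model-config.yaml")
    args = parser.parse_args()

    import yaml  # third-party; only needed when reading the file

    with open(args.config) as f:
        config = yaml.safe_load(f)
    errors = validate_config(config, args.env)
    if errors:
        raise SystemExit("Config validation failed:\n" + "\n".join(f"  - {e}" for e in errors))
    print("Config validation passed")


if __name__ == "__main__":
    main()
```

Failing this step stops the promotion before anything touches the cluster, which is the point of keeping the ConfigMap as the source of truth.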
Model Promotion and Approval Process
Script for promoting a model via GitOps:
#!/usr/bin/env python3
# scripts/promote_model.py — Model Promotion via GitOps
import argparse
import json
import os
import subprocess

import yaml
from mlflow.tracking import MlflowClient


class ModelPromoter:
    def __init__(self, mlflow_uri, repo_path="."):
        self.client = MlflowClient(tracking_uri=mlflow_uri)
        self.repo_path = repo_path

    def get_model_info(self, model_name, version):
        mv = self.client.get_model_version(model_name, version)
        run = self.client.get_run(mv.run_id)
        return {
            "name": model_name,
            "version": version,
            "stage": mv.current_stage,
            "run_id": mv.run_id,
            "metrics": run.data.metrics,
            "params": run.data.params,
            "created": mv.creation_timestamp,
        }

    def validate_for_promotion(self, model_name, version, target_env):
        info = self.get_model_info(model_name, version)
        thresholds = {
            "staging": {"accuracy": 0.85, "f1": 0.80},
            "production": {"accuracy": 0.90, "f1": 0.85},
        }
        required = thresholds.get(target_env, {})
        failures = []
        for metric, threshold in required.items():
            actual = info["metrics"].get(metric, 0)
            if actual < threshold:
                failures.append(f"{metric}: {actual:.4f} < {threshold}")
        if failures:
            print(f"Validation FAILED for {model_name} v{version}:")
            for f in failures:
                print(f"  - {f}")
            return False
        print(f"Validation PASSED for {model_name} v{version}")
        return True

    def create_promotion_pr(self, model_name, version, target_env):
        if not self.validate_for_promotion(model_name, version, target_env):
            raise ValueError("Model validation failed")
        info = self.get_model_info(model_name, version)
        slug = model_name.replace("_", "-")
        branch = f"promote/{slug}-v{version}-{target_env}"
        subprocess.run(["git", "checkout", "-b", branch], cwd=self.repo_path)
        config_path = f"models/{slug}/model-config.yaml"
        config = {
            "apiVersion": "v1",
            "kind": "ConfigMap",
            "metadata": {"name": f"{slug}-config"},
            "data": {
                "model_name": model_name,
                "model_version": str(version),
                "model_stage": target_env.capitalize(),
                "mlflow_tracking_uri": os.environ.get("MLFLOW_TRACKING_URI", "http://mlflow:5000"),
                "model_uri": f"models:/{model_name}/{version}",
            },
        }
        with open(os.path.join(self.repo_path, config_path), "w") as f:
            yaml.dump(config, f, default_flow_style=False)
        deploy_path = f"models/{slug}/{target_env}/deployment.yaml"
        self._update_deployment_version(
            os.path.join(self.repo_path, deploy_path), version
        )
        subprocess.run(["git", "add", "."], cwd=self.repo_path)
        subprocess.run([
            "git", "commit", "-m",
            f"promote: {model_name} v{version} -> {target_env}\n\n"
            f"Metrics: {json.dumps(info['metrics'], indent=2)}"
        ], cwd=self.repo_path)
        subprocess.run(["git", "push", "origin", branch], cwd=self.repo_path)
        print(f"Branch '{branch}' pushed. Create PR to merge.")
        return branch

    def _update_deployment_version(self, path, version):
        try:
            with open(path) as f:
                deploy = yaml.safe_load(f)
            deploy["metadata"]["labels"]["model-version"] = str(version)
            deploy["spec"]["template"]["metadata"]["labels"]["model-version"] = str(version)
            with open(path, "w") as f:
                yaml.dump(deploy, f, default_flow_style=False)
        except FileNotFoundError:
            print(f"Deployment file not found: {path}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", required=True)
    parser.add_argument("--version", required=True, type=int)
    parser.add_argument("--env", required=True, choices=["staging", "production"])
    args = parser.parse_args()
    promoter = ModelPromoter(os.environ["MLFLOW_TRACKING_URI"])
    promoter.create_promotion_pr(args.model, args.version, args.env)
Monitoring and Rollback Strategy
Monitoring system with automatic rollback:
#!/usr/bin/env python3
# scripts/monitor_and_rollback.py — Model Monitoring with Auto-rollback
import json
import subprocess
import time
from datetime import datetime

import requests


class ModelMonitorGitOps:
    def __init__(self, serving_url, repo_path=".", check_interval=60):
        self.serving_url = serving_url
        self.repo_path = repo_path
        self.check_interval = check_interval
        self.baseline_metrics = {}

    def set_baseline(self, metrics):
        self.baseline_metrics = metrics
        print(f"Baseline set: {metrics}")

    def get_current_metrics(self):
        try:
            resp = requests.get(f"{self.serving_url}/metrics", timeout=10)
            return resp.json()
        except Exception as e:
            return {"error": str(e), "healthy": False}

    def check_health(self):
        try:
            resp = requests.get(f"{self.serving_url}/health", timeout=5)
            return resp.status_code == 200
        except Exception:
            return False

    def should_rollback(self, current_metrics):
        if not current_metrics.get("healthy", True):
            return True, "Service unhealthy"
        if "error_rate" in current_metrics:
            if current_metrics["error_rate"] > 0.05:
                return True, f"Error rate {current_metrics['error_rate']:.2%} > 5%"
        if "latency_p99" in current_metrics:
            if current_metrics["latency_p99"] > 2000:
                return True, f"P99 latency {current_metrics['latency_p99']}ms > 2000ms"
        if "accuracy" in current_metrics and "accuracy" in self.baseline_metrics:
            drop = self.baseline_metrics["accuracy"] - current_metrics["accuracy"]
            if drop > 0.05:
                return True, f"Accuracy dropped {drop:.2%}"
        return False, "OK"

    def rollback(self, model_name, reason):
        print(f"ROLLBACK triggered: {reason}")
        slug = model_name.replace("_", "-")
        result = subprocess.run(
            ["git", "log", "--oneline", "-5", f"models/{slug}/"],
            capture_output=True, text=True, cwd=self.repo_path
        )
        commits = result.stdout.strip().split("\n")
        if len(commits) < 2:
            print("No previous version to rollback to")
            return False
        previous_commit = commits[1].split()[0]
        subprocess.run([
            "git", "checkout", previous_commit, "--",
            f"models/{slug}/production/",
            f"models/{slug}/model-config.yaml",
        ], cwd=self.repo_path)
        subprocess.run(["git", "add", "."], cwd=self.repo_path)
        subprocess.run([
            "git", "commit", "-m",
            f"rollback: {model_name} — {reason}\n\nReverted to {previous_commit}"
        ], cwd=self.repo_path)
        subprocess.run(["git", "push", "origin", "main"], cwd=self.repo_path)
        subprocess.run([
            "kubectl", "apply", "-k", f"models/{slug}/production/"
        ], cwd=self.repo_path)
        print(f"Rollback complete: reverted to {previous_commit}")
        return True

    def monitor_loop(self, model_name, duration=3600):
        start = time.time()
        print(f"Monitoring {model_name} for {duration}s...")
        while time.time() - start < duration:
            if not self.check_health():
                self.rollback(model_name, "Health check failed")
                return False
            metrics = self.get_current_metrics()
            should_rb, reason = self.should_rollback(metrics)
            if should_rb:
                self.rollback(model_name, reason)
                return False
            print(f"[{datetime.now().strftime('%H:%M:%S')}] OK — {json.dumps(metrics)}")
            time.sleep(self.check_interval)
        print(f"Monitoring complete: {model_name} is stable")
        return True


if __name__ == "__main__":
    monitor = ModelMonitorGitOps("http://fraud-detector:8080")
    monitor.set_baseline({"accuracy": 0.95, "f1": 0.92})
    monitor.monitor_loop("fraud_detector", duration=1800)
FAQ: Frequently Asked Questions
Q: How does GitOps for ML models differ from GitOps for applications?
A: ML GitOps adds several layers of complexity: model artifacts are large and should not be stored in Git directly (use a Model Registry instead); models need validation before deploy (metrics, bias, and drift checks); a rollback may have to revert both the model version and the feature pipeline; and monitoring must cover both system metrics and model performance.
Q: ArgoCD or Flux — which should you choose for ML GitOps?
A: ArgoCD has a polished web UI and complete application management, a good fit for teams that want visibility. Flux is more lightweight and integrates more tightly with Git, a good fit for teams that prefer the CLI. Both work well for ML GitOps, but ArgoCD's rollback UI is more convenient for non-technical stakeholders.
Q: How should model versions be managed?
A: Use semantic versioning for models (major.minor.patch): bump major when the architecture changes, minor when retraining on new data or tuning hyperparameters, and patch for bug fixes. Store versions in the MLflow Model Registry and reference them from the GitOps config files.
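A minimal sketch of that versioning rule — the `bump` helper is illustrative, not a library function, and the guarded block shows one way to attach the semver string to an MLflow model version as a tag (an assumption about how you might store it; MLflow's own version numbers stay plain integers):

```python
def bump(version, part):
    """Return `version` ('major.minor.patch') with the given part incremented."""
    major, minor, patch = (int(x) for x in version.split("."))
    if part == "major":      # architecture change
        return f"{major + 1}.0.0"
    if part == "minor":      # retrain on new data / hyperparameter tuning
        return f"{major}.{minor + 1}.0"
    if part == "patch":      # bug fix
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown part: {part}")


if __name__ == "__main__":
    # Attach the semver to a registry version as a tag (needs a live server).
    from mlflow.tracking import MlflowClient

    client = MlflowClient(tracking_uri="http://mlflow:5000")
    client.set_model_version_tag("fraud_detector", "5", "semver", bump("2.0.0", "minor"))
```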
Q: How do you run a canary deployment for ML models?
A: Use a Kubernetes service mesh such as Istio or Linkerd to split traffic between the old and new model versions. Start by sending 5% of traffic to the new model, monitor metrics such as accuracy, latency, and error rate, and if it passes, step up to 25%, 50%, then 100%; if it fails, roll back automatically. All of it is managed through GitOps configs.
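The stepping logic described above can be sketched as a pure decision function. This is illustrative only — in a real GitOps flow the returned weight would be written into the mesh's traffic-split config (e.g. an Istio VirtualService) and committed, not applied directly; names, steps, and thresholds here are assumptions:

```python
CANARY_STEPS = [5, 25, 50, 100]  # percent of traffic routed to the new model


def next_canary_weight(current, metrics, max_error_rate=0.01, max_latency_p99_ms=2000):
    """Return the next traffic weight for the new model, or None to roll back.

    `current` is assumed to be one of CANARY_STEPS.
    """
    if metrics.get("error_rate", 0.0) > max_error_rate:
        return None  # unhealthy: route all traffic back to the old version
    if metrics.get("latency_p99", 0) > max_latency_p99_ms:
        return None
    if current >= CANARY_STEPS[-1]:
        return CANARY_STEPS[-1]  # already at full traffic; hold
    return CANARY_STEPS[CANARY_STEPS.index(current) + 1]
```

Each promotion step is then itself a Git commit, so the whole canary history stays reviewable and revertible like everything else in the repo.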
