SiamCafe · Blog
Model Registry GitOps Workflow: Managing ML Models with GitOps and CI/CD

Published 28 May 2026 (B.E. 2569)

What Is a Model Registry, and Why Use GitOps?

A Model Registry is a central repository for managing ML models. It stores model versions, metadata, metrics, artifacts, and deployment status, so an ML team can track, compare, and manage models systematically.

GitOps is a practice that uses Git as the single source of truth for infrastructure and application deployments. Every change goes through Git as a pull request, which provides an audit trail, a review process, and easy rollback.

Combining a Model Registry with GitOps makes ML model deployment more reliable: every model deployment goes through PR review, automated tests run before the deploy, a rollback is a revert of a Git commit, and every deployment leaves a complete history.

The tools used in this setup are MLflow Model Registry for model versioning, GitHub or GitLab for the GitOps workflow, ArgoCD or Flux for Kubernetes GitOps, GitHub Actions for CI/CD, and Kubernetes as the model serving infrastructure.
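
To make the "Git as the single source of truth" idea concrete, the following is a minimal sketch of the comparison a GitOps controller performs between the state declared in Git and the state observed in the cluster. It is an illustration only: the `find_drift` helper and the flat dictionaries are assumptions made for this example, not the way ArgoCD or Flux represent state internally.

```python
def find_drift(desired: dict, live: dict) -> dict:
    """Return {key: (desired_value, live_value)} for every key that differs.

    `desired` stands for values committed to Git, `live` for values observed
    in the cluster. A key missing on one side is reported as None there.
    """
    drift = {}
    for key in sorted(set(desired) | set(live)):
        want, have = desired.get(key), live.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical example: Git declares version 5, the cluster still runs version 4.
desired = {"model_version": "5", "replicas": 3}
live = {"model_version": "4", "replicas": 3}
print(find_drift(desired, live))  # {'model_version': ('5', '4')}
```

An empty result means the cluster matches Git; anything else is what the controller would reconcile.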

Setting Up MLflow Model Registry

Install and configure MLflow for production

# Install MLflow (the quotes keep shells such as zsh from expanding the brackets)
pip install "mlflow[extras]" psycopg2-binary boto3

# Run the MLflow server (production setup)
mlflow server \
  --backend-store-uri postgresql://mlflow:password@db:5432/mlflow \
  --default-artifact-root s3://mlflow-artifacts/ \
  --host 0.0.0.0 \
  --port 5000

# Docker Compose for MLflow
# docker-compose.yml
# services:
#   mlflow:
#     image: ghcr.io/mlflow/mlflow:latest
#     ports: ["5000:5000"]
#     environment:
#       MLFLOW_TRACKING_URI: postgresql://mlflow:pass@db:5432/mlflow
#       # Keys with an empty value are resolved from the host environment
#       AWS_ACCESS_KEY_ID:
#       AWS_SECRET_ACCESS_KEY:
#     command: >
#       mlflow server
#       --backend-store-uri postgresql://mlflow:pass@db:5432/mlflow
#       --default-artifact-root s3://mlflow-artifacts/
#       --host 0.0.0.0
#   db:
#     image: postgres:15
#     environment:
#       POSTGRES_USER: mlflow
#       POSTGRES_PASSWORD: pass
#       POSTGRES_DB: mlflow
#     volumes:
#       - pgdata:/var/lib/postgresql/data
# volumes:
#   pgdata:

# === Register Model ===
import mlflow
from mlflow.tracking import MlflowClient

mlflow.set_tracking_uri("http://mlflow:5000")
client = MlflowClient()

# Train and log model
with mlflow.start_run(run_name="xgboost_v2") as run:
    # ... training code ... (must define `model`, the trained XGBoost model)
    mlflow.xgboost.log_model(model, "model")
    mlflow.log_metrics({"accuracy": 0.95, "f1": 0.92, "auc": 0.97})
    mlflow.log_params({"max_depth": 6, "n_estimators": 200})

# Register model
model_uri = f"runs:/{run.info.run_id}/model"
result = mlflow.register_model(model_uri, "fraud_detector")

# Transition to staging
# Note: model stages are deprecated since MLflow 2.9 in favor of
# model version aliases (client.set_registered_model_alias).
client.transition_model_version_stage(
    name="fraud_detector",
    version=result.version,
    stage="Staging",
)

# Add description
client.update_model_version(
    name="fraud_detector",
    version=result.version,
    description="XGBoost v2: improved feature engineering, F1=0.92",
)

# List all versions
for mv in client.search_model_versions("name='fraud_detector'"):
    print(f"Version {mv.version}: stage={mv.current_stage}, run_id={mv.run_id}")

Building a GitOps Workflow for Model Deployment

Git repository structure for model deployments

# GitOps Repository Structure
# ml-deployments/
# ├── models/
# │   ├── fraud-detector/
# │   │   ├── staging/
# │   │   │   ├── kustomization.yaml
# │   │   │   ├── deployment.yaml
# │   │   │   └── service.yaml
# │   │   ├── production/
# │   │   │   ├── kustomization.yaml
# │   │   │   ├── deployment.yaml
# │   │   │   ├── service.yaml
# │   │   │   └── hpa.yaml
# │   │   └── model-config.yaml   # model version & metadata
# │   └── recommender/
# │       ├── staging/
# │       └── production/
# ├── base/
# │   ├── deployment.yaml
# │   ├── service.yaml
# │   └── kustomization.yaml
# ├── scripts/
# │   ├── promote_model.py
# │   ├── validate_model.py
# │   └── rollback_model.py
# └── .github/
#     └── workflows/
#         ├── deploy-staging.yml
#         ├── promote-production.yml
#         └── validate-model.yml

# model-config.yaml: Model Version Config (source of truth)
# apiVersion: v1
# kind: ConfigMap
# metadata:
#   name: fraud-detector-config
# data:
#   model_name: fraud_detector
#   model_version: "5"
#   model_stage: Production
#   mlflow_tracking_uri: http://mlflow:5000
#   model_uri: models:/fraud_detector/5
#   min_replicas: "3"
#   max_replicas: "10"
#   cpu_request: "500m"
#   memory_request: "1Gi"

# production/deployment.yaml
# apiVersion: apps/v1
# kind: Deployment
# metadata:
#   name: fraud-detector
#   labels:
#     app: fraud-detector
#     model-version: "5"
# spec:
#   replicas: 3
#   selector:
#     matchLabels:
#       app: fraud-detector
#   template:
#     metadata:
#       labels:
#         app: fraud-detector
#         model-version: "5"
#     spec:
#       containers:
#         - name: model-server
#           image: ml-serving:latest
#           env:
#             - name: MODEL_URI
#               valueFrom:
#                 configMapKeyRef:
#                   name: fraud-detector-config
#                   key: model_uri
#             - name: MLFLOW_TRACKING_URI
#               valueFrom:
#                 configMapKeyRef:
#                   name: fraud-detector-config
#                   key: mlflow_tracking_uri
#           ports:
#             - containerPort: 8080
#           resources:
#             requests:
#               cpu: 500m
#               memory: 1Gi
#             limits:
#               cpu: 1000m
#               memory: 2Gi
#           readinessProbe:
#             httpGet:
#               path: /health
#               port: 8080
#             initialDelaySeconds: 30
#           livenessProbe:
#             httpGet:
#               path: /health
#               port: 8080
#             initialDelaySeconds: 60
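
Because model-config.yaml acts as the source of truth, it is worth checking it before anything is deployed. The repository layout lists scripts/validate_model.py without showing it, so the following is only a sketch of the kind of consistency checks such a script might run on the ConfigMap's `data` section. The function name `validate_model_config` and the specific rules are assumptions for illustration, and the dictionary is passed in directly so the example needs no YAML library.

```python
REQUIRED_KEYS = ("model_name", "model_version", "model_uri", "mlflow_tracking_uri")


def validate_model_config(data: dict) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in data]
    if errors:
        return errors  # the remaining checks need every required key

    # ConfigMap values are strings, so the version must be a quoted integer.
    version = data["model_version"]
    if not (isinstance(version, str) and version.isdigit()):
        errors.append(f"model_version must be a quoted integer, got {version!r}")

    # model_uri must point at the same name and version as the other fields.
    expected_uri = f"models:/{data['model_name']}/{version}"
    if data["model_uri"] != expected_uri:
        errors.append(f"model_uri is {data['model_uri']!r}, expected {expected_uri!r}")

    if not str(data["mlflow_tracking_uri"]).startswith(("http://", "https://")):
        errors.append("mlflow_tracking_uri must be an http(s) URL")

    # Replica bounds are optional, but when both are present they must be ordered.
    if "min_replicas" in data and "max_replicas" in data:
        try:
            if int(data["min_replicas"]) > int(data["max_replicas"]):
                errors.append("min_replicas is greater than max_replicas")
        except (TypeError, ValueError):
            errors.append("min_replicas and max_replicas must be integers")
    return errors


# Values copied from the model-config.yaml example above.
config_data = {
    "model_name": "fraud_detector",
    "model_version": "5",
    "model_uri": "models:/fraud_detector/5",
    "mlflow_tracking_uri": "http://mlflow:5000",
    "min_replicas": "3",
    "max_replicas": "10",
}
print(validate_model_config(config_data))  # []
```

Returning a list of problems rather than raising on the first one lets a CI job report every issue in a single run.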

CI/CD Pipeline for ML Models

GitHub Actions workflows for model deployment

# .github/workflows/deploy-staging.yml
name: Deploy Model to Staging

on:
  push:
    branches: [main]
    paths:
      - 'models/*/staging/**'
      - 'models/*/model-config.yaml'

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: pip install mlflow requests pyyaml

      - name: Validate model config
        run: python scripts/validate_model.py --env staging

      - name: Run model tests
        env:
          # Secret name is an assumption; set it to match your repository secrets
          MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
        run: |
          python scripts/test_model.py \
            --model-name fraud_detector \
            --min-accuracy 0.90 \
            --min-f1 0.85

  deploy-staging:
    needs: validate
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup kubectl
        uses: azure/setup-kubectl@v3

      - name: Configure kubeconfig
        # Secret name is an assumption; it must hold a base64-encoded kubeconfig
        run: |
          mkdir -p ~/.kube
          echo "${{ secrets.KUBECONFIG }}" | base64 -d > ~/.kube/config

      - name: Deploy to staging
        run: |
          kubectl apply -k models/fraud-detector/staging/
          kubectl rollout status deployment/fraud-detector -n staging --timeout=300s

      - name: Run integration tests
        run: |
          STAGING_URL=$(kubectl get svc fraud-detector -n staging -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
          python scripts/integration_test.py --url "http://$STAGING_URL:8080"

      - name: Notify
        if: always()
        # Also requires Slack credentials (for example SLACK_WEBHOOK_URL) supplied via env
        uses: slackapi/slack-github-action@v1
        with:
          # The job.status expression is an assumption
          payload: |
            {"text": "Staging deploy ${{ job.status }}: fraud-detector"}

# .github/workflows/promote-production.yml
name: Promote Model to Production

on:
  pull_request:
    types: [closed]
    branches: [main]
    paths:
      - 'models/*/production/**'

jobs:
  promote:
    if: github.event.pull_request.merged == true
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4

      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: pip install mlflow requests pyyaml

      - name: Setup kubectl
        uses: azure/setup-kubectl@v3

      - name: Configure kubeconfig
        # Secret name is an assumption; it must hold a base64-encoded kubeconfig
        run: |
          mkdir -p ~/.kube
          echo "${{ secrets.KUBECONFIG }}" | base64 -d > ~/.kube/config

      - name: Validate production config
        run: python scripts/validate_model.py --env production

      - name: Blue-Green Deploy
        run: |
          # Deploy new version alongside old
          kubectl apply -k models/fraud-detector/production/
          kubectl rollout status deployment/fraud-detector -n production --timeout=600s

          # Run canary tests
          python scripts/canary_test.py --duration 300 --error-threshold 0.01

      - name: Update MLflow Stage
        env:
          # Secret name is an assumption; set it to match your repository secrets
          MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
        run: |
          python -c "
          from mlflow.tracking import MlflowClient
          import yaml

          with open('models/fraud-detector/model-config.yaml') as f:
              config = yaml.safe_load(f)

          version = config['data']['model_version']
          client = MlflowClient()
          client.transition_model_version_stage(
              name='fraud_detector', version=version, stage='Production'
          )
          print(f'Model v{version} promoted to Production')
          "

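The production workflow calls scripts/canary_test.py with `--duration 300 --error-threshold 0.01`, but the script itself is not shown. As a rough sketch of what its core loop could look like, the function below probes the service for a fixed duration and compares the observed error rate with the threshold. The `run_canary` name, the `probe` callable, and the injectable `clock` and `sleep` parameters are all assumptions made for this example, chosen so the logic can be exercised without a live service.

```python
import time
from typing import Callable


def run_canary(probe: Callable[[], bool], duration: float, error_threshold: float,
               interval: float = 1.0,
               clock: Callable[[], float] = time.monotonic,
               sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call `probe` repeatedly for `duration` seconds.

    `probe` returns True for a successful request and False for a failed one.
    Returns True when the observed error rate stays at or below
    `error_threshold`, and False otherwise (including when no probe ran).
    """
    total = failures = 0
    deadline = clock() + duration
    while clock() < deadline:
        total += 1
        if not probe():
            failures += 1
        sleep(interval)
    if total == 0:
        return False
    return failures / total <= error_threshold


# Hypothetical usage inside a CI step (send_test_request is not defined here):
#   ok = run_canary(send_test_request, duration=300, error_threshold=0.01)
#   raise SystemExit(0 if ok else 1)
```

Exiting with a non-zero status on failure is what lets the workflow step stop the promotion.
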
Model Promotion and Approval Process

A script for promoting a model through GitOps

#!/usr/bin/env python3
# scripts/promote_model.py: Model Promotion via GitOps
import argparse
import json
import os
import subprocess

import yaml
from mlflow.tracking import MlflowClient


class ModelPromoter:
    def __init__(self, mlflow_uri, repo_path="."):
        self.client = MlflowClient(tracking_uri=mlflow_uri)
        self.repo_path = repo_path

    def _git(self, *args):
        # check=True stops the promotion as soon as a git command fails
        subprocess.run(["git", *args], cwd=self.repo_path, check=True)

    def get_model_info(self, model_name, version):
        mv = self.client.get_model_version(model_name, str(version))
        run = self.client.get_run(mv.run_id)

        return {
            "name": model_name,
            "version": version,
            "stage": mv.current_stage,
            "run_id": mv.run_id,
            "metrics": run.data.metrics,
            "params": run.data.params,
            "created": mv.creation_timestamp,
        }

    def validate_for_promotion(self, model_name, version, target_env):
        info = self.get_model_info(model_name, version)

        # Minimum metrics required per environment
        thresholds = {
            "staging": {"accuracy": 0.85, "f1": 0.80},
            "production": {"accuracy": 0.90, "f1": 0.85},
        }

        required = thresholds.get(target_env, {})
        failures = []

        for metric, threshold in required.items():
            # A metric that was never logged counts as 0 and fails the check
            actual = info["metrics"].get(metric, 0)
            if actual < threshold:
                failures.append(f"{metric}: {actual:.4f} < {threshold}")

        if failures:
            print(f"Validation FAILED for {model_name} v{version}:")
            for failure in failures:
                print(f"  - {failure}")
            return False

        print(f"Validation PASSED for {model_name} v{version}")
        return True

    def create_promotion_pr(self, model_name, version, target_env):
        if not self.validate_for_promotion(model_name, version, target_env):
            raise ValueError("Model validation failed")

        info = self.get_model_info(model_name, version)
        slug = model_name.replace("_", "-")
        branch = f"promote/{slug}-v{version}-{target_env}"

        self._git("checkout", "-b", branch)

        # Start from the existing ConfigMap, if any, so unrelated keys
        # (replica counts, resource requests) are preserved.
        config_path = os.path.join(self.repo_path, f"models/{slug}/model-config.yaml")
        config = {
            "apiVersion": "v1",
            "kind": "ConfigMap",
            "metadata": {"name": f"{slug}-config"},
            "data": {},
        }
        if os.path.exists(config_path):
            with open(config_path) as f:
                config = yaml.safe_load(f) or config

        config.setdefault("data", {}).update({
            "model_name": model_name,
            "model_version": str(version),
            "model_stage": target_env.capitalize(),
            "mlflow_tracking_uri": os.environ.get("MLFLOW_TRACKING_URI", "http://mlflow:5000"),
            "model_uri": f"models:/{model_name}/{version}",
        })

        with open(config_path, "w") as f:
            yaml.dump(config, f, default_flow_style=False, sort_keys=False)

        deploy_path = f"models/{slug}/{target_env}/deployment.yaml"
        self._update_deployment_version(
            os.path.join(self.repo_path, deploy_path), version
        )

        self._git("add", ".")
        self._git(
            "commit", "-m",
            f"promote: {model_name} v{version} -> {target_env}\n\n"
            f"Metrics: {json.dumps(info['metrics'], indent=2)}",
        )
        self._git("push", "origin", branch)

        print(f"Branch '{branch}' pushed. Create PR to merge.")
        return branch

    def _update_deployment_version(self, path, version):
        try:
            with open(path) as f:
                deploy = yaml.safe_load(f)

            # Keep the Deployment label and the pod template label in sync
            deploy["metadata"]["labels"]["model-version"] = str(version)
            deploy["spec"]["template"]["metadata"]["labels"]["model-version"] = str(version)

            with open(path, "w") as f:
                yaml.dump(deploy, f, default_flow_style=False, sort_keys=False)
        except FileNotFoundError:
            print(f"Deployment file not found: {path}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", required=True)
    parser.add_argument("--version", required=True, type=int)
    parser.add_argument("--env", required=True, choices=["staging", "production"])
    args = parser.parse_args()

    promoter = ModelPromoter(os.environ["MLFLOW_TRACKING_URI"])
    promoter.create_promotion_pr(args.model, args.version, args.env)
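
The script above stops after pushing the branch and asks for the pull request to be created by hand. One way to automate that last step is the GitHub REST API's "create a pull request" endpoint (`POST /repos/{owner}/{repo}/pulls`). The sketch below only builds the request with the standard library and leaves sending it to the caller; the `build_pr_request` helper and the owner, repository, and token values are placeholders invented for this example.

```python
import json
import urllib.request


def build_pr_request(owner: str, repo: str, token: str, branch: str,
                     title: str, body: str = "",
                     base: str = "main") -> urllib.request.Request:
    """Build (but do not send) a 'create pull request' call for the GitHub REST API."""
    payload = {"title": title, "head": branch, "base": base, "body": body}
    return urllib.request.Request(
        url=f"https://api.github.com/repos/{owner}/{repo}/pulls",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Hypothetical values. To actually open the PR, pass the request to
# urllib.request.urlopen(req) with a real token.
req = build_pr_request(
    owner="example-org",
    repo="ml-deployments",
    token="<token>",
    branch="promote/fraud-detector-v5-production",
    title="promote: fraud_detector v5 -> production",
)
print(req.get_method(), req.full_url)
```

Keeping request construction separate from sending makes the payload easy to inspect in tests before any network call happens.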

Monitoring and Rollback Strategy

A monitoring system with automatic rollback

#!/usr/bin/env python3
# scripts/monitor_and_rollback.py: Model Monitoring with Auto-rollback
import json
import subprocess
import time
from datetime import datetime

import requests


class ModelMonitorGitOps:
    def __init__(self, serving_url, repo_path=".", check_interval=60):
        self.serving_url = serving_url
        self.repo_path = repo_path
        self.check_interval = check_interval
        self.baseline_metrics = {}

    def set_baseline(self, metrics):
        self.baseline_metrics = metrics
        print(f"Baseline set: {metrics}")

    def get_current_metrics(self):
        try:
            resp = requests.get(f"{self.serving_url}/metrics", timeout=10)
            resp.raise_for_status()
            return resp.json()
        except Exception as e:
            # Any fetch or parse failure is reported as unhealthy
            return {"error": str(e), "healthy": False}

    def check_health(self):
        try:
            resp = requests.get(f"{self.serving_url}/health", timeout=5)
            return resp.status_code == 200
        except Exception:
            return False

    def should_rollback(self, current_metrics):
        if not current_metrics.get("healthy", True):
            return True, "Service unhealthy"

        if "error_rate" in current_metrics:
            if current_metrics["error_rate"] > 0.05:
                return True, f"Error rate {current_metrics['error_rate']:.2%} > 5%"

        if "latency_p99" in current_metrics:
            if current_metrics["latency_p99"] > 2000:
                return True, f"P99 latency {current_metrics['latency_p99']}ms > 2000ms"

        if "accuracy" in current_metrics and "accuracy" in self.baseline_metrics:
            drop = self.baseline_metrics["accuracy"] - current_metrics["accuracy"]
            if drop > 0.05:
                return True, f"Accuracy dropped {drop:.2%}"

        return False, "OK"

    def rollback(self, model_name, reason):
        print(f"ROLLBACK triggered: {reason}")
        slug = model_name.replace("_", "-")

        # Last five commits that touched this model's directory, newest first
        result = subprocess.run(
            ["git", "log", "--oneline", "-5", "--", f"models/{slug}/"],
            capture_output=True, text=True, cwd=self.repo_path, check=True,
        )
        commits = [line for line in result.stdout.splitlines() if line.strip()]

        if len(commits) < 2:
            print("No previous version to rollback to")
            return False

        # commits[0] is the current deployment, commits[1] the one before it
        previous_commit = commits[1].split()[0]

        # Restore the production manifests and config from the previous commit
        subprocess.run([
            "git", "checkout", previous_commit, "--",
            f"models/{slug}/production/",
            f"models/{slug}/model-config.yaml",
        ], cwd=self.repo_path, check=True)

        subprocess.run(["git", "add", "."], cwd=self.repo_path, check=True)
        subprocess.run([
            "git", "commit", "-m",
            f"rollback: {model_name} - {reason}\n\nReverted to {previous_commit}",
        ], cwd=self.repo_path, check=True)
        subprocess.run(["git", "push", "origin", "main"], cwd=self.repo_path, check=True)

        subprocess.run([
            "kubectl", "apply", "-k", f"models/{slug}/production/",
        ], cwd=self.repo_path, check=True)

        print(f"Rollback complete: reverted to {previous_commit}")
        return True

    def monitor_loop(self, model_name, duration=3600):
        start = time.time()
        print(f"Monitoring {model_name} for {duration}s...")

        while time.time() - start < duration:
            if not self.check_health():
                self.rollback(model_name, "Health check failed")
                return False

            metrics = self.get_current_metrics()
            should_rb, reason = self.should_rollback(metrics)

            if should_rb:
                self.rollback(model_name, reason)
                return False

            print(f"[{datetime.now().strftime('%H:%M:%S')}] OK - {json.dumps(metrics)}")
            time.sleep(self.check_interval)

        print(f"Monitoring complete: {model_name} is stable")
        return True


if __name__ == "__main__":
    monitor = ModelMonitorGitOps("http://fraud-detector:8080")
    monitor.set_baseline({"accuracy": 0.95, "f1": 0.92})
    monitor.monitor_loop("fraud_detector", duration=1800)
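
The rollback step depends on picking the right commit out of `git log --oneline` output: the newest line is the deployment being rolled back, so the line after it is the last known-good state. The following sketch isolates that selection as a pure function so it can be checked without a Git repository. The `pick_previous_commit` name and the sample hashes are made up for illustration.

```python
from typing import Optional


def pick_previous_commit(log_output: str) -> Optional[str]:
    """Return the hash of the second-newest commit in `git log --oneline` output.

    Returns None when the log holds fewer than two commits, which means
    there is nothing earlier to roll back to.
    """
    lines = [line for line in log_output.splitlines() if line.strip()]
    if len(lines) < 2:
        return None
    return lines[1].split()[0]


sample = (
    "a1b2c3d promote: fraud_detector v5 -> production\n"
    "9f8e7d6 promote: fraud_detector v4 -> production\n"
)
print(pick_previous_commit(sample))  # 9f8e7d6
```

Filtering out blank lines matters here: an empty log would otherwise split into a single empty string and could be mistaken for a commit.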

FAQ: Frequently Asked Questions

Q: How does GitOps for ML models differ from GitOps for applications?

A: ML GitOps adds several complications. Model artifacts are large and should not be stored in Git directly (use a Model Registry instead). Models need validation before deployment (metrics, bias, drift). A rollback may have to revert both the model version and the feature pipeline. Monitoring has to cover both system metrics and model performance.

Q: ArgoCD or Flux: which one should I choose for ML GitOps?

A: ArgoCD has a good web UI and full-featured application management, which suits teams that want visibility. Flux is more lightweight, integrates more closely with Git, and suits teams that prefer the CLI. Both work well for ML GitOps, but ArgoCD's rollback UI is more convenient for non-technical stakeholders.

Q: How should model versions be managed?

A: Use semantic versioning for models, in the form major.minor.patch. Bump major when the architecture changes, minor when the model is retrained on new data or its hyperparameters are tuned, and patch for bug fixes. Store the version in MLflow Model Registry and reference it from the GitOps config files.
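
The bump rules in this answer can be sketched as a small helper. Note that MLflow Model Registry assigns its own integer version numbers, so a semantic version like this would have to be stored alongside them, for example as a tag on the model version; that pairing, the `bump_version` function, and the change-type names are assumptions made for this example.

```python
# Which part of major.minor.patch each kind of change bumps.
BUMP_FOR_CHANGE = {
    "architecture": "major",  # model architecture changed
    "retrain": "minor",       # retrained on new data
    "tuning": "minor",        # hyperparameter tuning
    "bugfix": "patch",        # bug fix only
}


def bump_version(version: str, change: str) -> str:
    """Return the next semantic version for the given kind of change."""
    major, minor, patch = (int(part) for part in version.split("."))
    level = BUMP_FOR_CHANGE[change]
    if level == "major":
        return f"{major + 1}.0.0"
    if level == "minor":
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"


print(bump_version("2.3.1", "retrain"))  # 2.4.0
```

Lower-order parts reset to zero on a higher-order bump, which is the usual semantic versioning convention.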

Q: How do I run a canary deployment for ML models?

A: Use a Kubernetes service mesh such as Istio or Linkerd to split traffic between the old and new model versions. Start with 5% of traffic on the new model and monitor metrics such as accuracy, latency, and error rate. If they hold, increase to 25%, 50%, then 100%; if not, roll back automatically. All of this is managed through GitOps configs.
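
The stepwise schedule described in this answer can be sketched as a small control loop. This is only an outline of the decision logic: the `progressive_rollout` function and its `check` callback are hypothetical, and in a real setup each step would be a committed change to the service mesh's traffic-split config rather than a function call.

```python
from typing import Callable, Sequence


def progressive_rollout(check: Callable[[int], bool],
                        steps: Sequence[int] = (5, 25, 50, 100)) -> int:
    """Walk through traffic percentages; return the final weight for the new model.

    `check(weight)` should return True when metrics look healthy at that
    traffic share. On the first failed check the rollout stops and returns 0,
    meaning all traffic goes back to the old version.
    """
    for weight in steps:
        if not check(weight):
            return 0
    return steps[-1]


# Hypothetical checks: one that fails above 25% of traffic, one that always passes.
print(progressive_rollout(lambda weight: weight <= 25))  # 0
print(progressive_rollout(lambda weight: True))          # 100
```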