SiamCafe.net Blog
Technology

Model Registry GitOps Workflow — Managing ML Models with GitOps and CI/CD

2026-01-08 · อ. บอม — SiamCafe.net · 1,564 words

What Is a Model Registry, and Why Use GitOps

A Model Registry is a central repository for managing ML models. It stores model versions, metadata, metrics, artifacts, and deployment status, letting ML teams track, compare, and manage models systematically.

GitOps is a practice that uses Git as the single source of truth for infrastructure and application deployments. Every change goes through Git (via a Pull Request), which gives you an audit trail, a review process, and easy rollbacks.

Combining a Model Registry with GitOps makes ML model deployments reliable: every model deployment goes through PR review, automated tests run before deploy, rollback is as simple as reverting a Git commit, and you keep a complete history of every deployment.

The tools used in this setup are MLflow Model Registry for model versioning, GitHub/GitLab for the GitOps workflow, ArgoCD or Flux for Kubernetes GitOps, GitHub Actions for CI/CD, and Kubernetes for the model-serving infrastructure.

Setting Up MLflow Model Registry

Install and configure MLflow for production

# Install MLflow
pip install mlflow[extras] psycopg2-binary boto3

# Run the MLflow server (production setup)
mlflow server \
  --backend-store-uri postgresql://mlflow:password@db:5432/mlflow \
  --default-artifact-root s3://mlflow-artifacts/ \
  --host 0.0.0.0 \
  --port 5000

# Docker Compose for MLflow
# docker-compose.yml
# services:
#   mlflow:
#     image: ghcr.io/mlflow/mlflow:latest
#     ports: ["5000:5000"]
#     environment:
#       MLFLOW_TRACKING_URI: postgresql://mlflow:pass@db:5432/mlflow
#       AWS_ACCESS_KEY_ID: 
#       AWS_SECRET_ACCESS_KEY: 
#     command: >
#       mlflow server
#       --backend-store-uri postgresql://mlflow:pass@db:5432/mlflow
#       --default-artifact-root s3://mlflow-artifacts/
#       --host 0.0.0.0
#   db:
#     image: postgres:15
#     environment:
#       POSTGRES_USER: mlflow
#       POSTGRES_PASSWORD: pass
#       POSTGRES_DB: mlflow
#     volumes:
#       - pgdata:/var/lib/postgresql/data
# volumes:
#   pgdata:

# === Register Model ===
import mlflow
from mlflow.tracking import MlflowClient

mlflow.set_tracking_uri("http://mlflow:5000")
client = MlflowClient()

# Train and log model
with mlflow.start_run(run_name="xgboost_v2") as run:
    # ... training code ...
    mlflow.xgboost.log_model(model, "model")
    mlflow.log_metrics({"accuracy": 0.95, "f1": 0.92, "auc": 0.97})
    mlflow.log_params({"max_depth": 6, "n_estimators": 200})

# Register model
model_uri = f"runs:/{run.info.run_id}/model"
result = mlflow.register_model(model_uri, "fraud_detector")

# Transition to staging
client.transition_model_version_stage(
    name="fraud_detector",
    version=result.version,
    stage="Staging"
)

# Add description
client.update_model_version(
    name="fraud_detector",
    version=result.version,
    description="XGBoost v2: improved feature engineering, F1=0.92"
)

# List all versions
for mv in client.search_model_versions("name='fraud_detector'"):
    print(f"Version {mv.version}: stage={mv.current_stage}, run_id={mv.run_id}")
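
On the consumer side, a registered model can be pulled back out of the registry by name plus a version number or stage. A minimal sketch, assuming the fraud_detector model above and a reachable tracking server:

```python
# Build a registry URI from a model name plus a version number or stage.
def model_uri_for(name: str, ref: str) -> str:
    return f"models:/{name}/{ref}"

if __name__ == "__main__":
    # Requires mlflow installed and a reachable tracking server.
    import mlflow

    mlflow.set_tracking_uri("http://mlflow:5000")
    model = mlflow.pyfunc.load_model(model_uri_for("fraud_detector", "Staging"))
    print(model.metadata)
```

A `models:/<name>/<stage>` URI resolves to the latest version in that stage, while `models:/<name>/<version>` pins an exact version — the pinned form is what the GitOps configs in this article reference.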

Building a GitOps Workflow for Model Deployment

Git repository structure for model deployments

# GitOps Repository Structure
# ml-deployments/
# ├── models/
# │   ├── fraud-detector/
# │   │   ├── staging/
# │   │   │   ├── kustomization.yaml
# │   │   │   ├── deployment.yaml
# │   │   │   └── service.yaml
# │   │   ├── production/
# │   │   │   ├── kustomization.yaml
# │   │   │   ├── deployment.yaml
# │   │   │   ├── service.yaml
# │   │   │   └── hpa.yaml
# │   │   └── model-config.yaml    # model version & metadata
# │   └── recommender/
# │       ├── staging/
# │       └── production/
# ├── base/
# │   ├── deployment.yaml
# │   ├── service.yaml
# │   └── kustomization.yaml
# ├── scripts/
# │   ├── promote_model.py
# │   ├── validate_model.py
# │   └── rollback_model.py
# └── .github/
#     └── workflows/
#         ├── deploy-staging.yml
#         ├── promote-production.yml
#         └── validate-model.yml
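
The tree references kustomization.yaml files that this article does not show. A minimal sketch of the production overlay might look like this (file names follow the tree above; everything else is illustrative):

```yaml
# models/fraud-detector/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: production
resources:
- deployment.yaml
- service.yaml
- hpa.yaml
```

Note that model-config.yaml lives one level up; in practice it could be listed as a resource in each overlay or applied separately alongside `kubectl apply -k`.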

# model-config.yaml — Model Version Config (source of truth)
# apiVersion: v1
# kind: ConfigMap
# metadata:
#   name: fraud-detector-config
# data:
#   model_name: fraud_detector
#   model_version: "5"
#   model_stage: Production
#   mlflow_tracking_uri: http://mlflow:5000
#   model_uri: models:/fraud_detector/5
#   min_replicas: "3"
#   max_replicas: "10"
#   cpu_request: "500m"
#   memory_request: "1Gi"

# production/deployment.yaml
# apiVersion: apps/v1
# kind: Deployment
# metadata:
#   name: fraud-detector
#   labels:
#     app: fraud-detector
#     model-version: "5"
# spec:
#   replicas: 3
#   selector:
#     matchLabels:
#       app: fraud-detector
#   template:
#     metadata:
#       labels:
#         app: fraud-detector
#         model-version: "5"
#     spec:
#       containers:
#       - name: model-server
#         image: ml-serving:latest
#         env:
#         - name: MODEL_URI
#           valueFrom:
#             configMapKeyRef:
#               name: fraud-detector-config
#               key: model_uri
#         - name: MLFLOW_TRACKING_URI
#           valueFrom:
#             configMapKeyRef:
#               name: fraud-detector-config
#               key: mlflow_tracking_uri
#         ports:
#         - containerPort: 8080
#         resources:
#           requests:
#             cpu: 500m
#             memory: 1Gi
#           limits:
#             cpu: 1000m
#             memory: 2Gi
#         readinessProbe:
#           httpGet:
#             path: /health
#             port: 8080
#           initialDelaySeconds: 30
#         livenessProbe:
#           httpGet:
#             path: /health
#             port: 8080
#           initialDelaySeconds: 60
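
The Deployment assumes an ml-serving image that exposes /health on port 8080 and reads MODEL_URI from the ConfigMap. A stdlib-only sketch of such a server is below; a real image would also load the model via MLflow, and all names here are illustrative:

```python
import json
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

# The Deployment injects MODEL_URI from the fraud-detector-config ConfigMap.
MODEL_URI = os.environ.get("MODEL_URI", "models:/fraud_detector/5")

def handle(path: str):
    """Route a GET path to a (status, body) pair; pure, so easy to test."""
    if path == "/health":
        return 200, {"status": "ok", "model_uri": MODEL_URI}
    if path == "/metrics":
        return 200, {"healthy": True, "error_rate": 0.0}
    return 404, {"error": "not found"}

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = handle(self.path)
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(json.dumps(body).encode())

if __name__ == "__main__":
    # Matches the containerPort and probe paths in the Deployment above.
    HTTPServer(("0.0.0.0", 8080), Handler).serve_forever()
```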

CI/CD Pipeline for ML Models

GitHub Actions workflows for model deployment

# .github/workflows/deploy-staging.yml
name: Deploy Model to Staging

on:
  push:
    paths:
      - 'models/*/staging/**'
      - 'models/*/model-config.yaml'
    branches: [main]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      
      - name: Install dependencies
        run: pip install mlflow requests pyyaml
      
      - name: Validate model config
        run: python scripts/validate_model.py --env staging
      
      - name: Run model tests
        env:
          MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
        run: |
          python scripts/test_model.py \
            --model-name fraud_detector \
            --min-accuracy 0.90 \
            --min-f1 0.85

  deploy-staging:
    needs: validate
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Setup kubectl
        uses: azure/setup-kubectl@v3
      
      - name: Configure kubeconfig
        run: |
          mkdir -p ~/.kube
          echo "${{ secrets.KUBECONFIG }}" | base64 -d > ~/.kube/config
      
      - name: Deploy to staging
        run: |
          kubectl apply -k models/fraud-detector/staging/
          kubectl rollout status deployment/fraud-detector -n staging --timeout=300s
      
      - name: Run integration tests
        run: |
          STAGING_URL=$(kubectl get svc fraud-detector -n staging -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
          python scripts/integration_test.py --url "http://$STAGING_URL:8080"
      
      - name: Notify
        if: always()
        uses: slackapi/slack-github-action@v1
        with:
          payload: |
            {"text": "Staging deploy ${{ job.status }}: fraud-detector"}
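
The validate job calls scripts/test_model.py, which the article does not include. A hypothetical sketch under the same flags — the threshold check is kept pure so it can be unit-tested without a tracking server:

```python
# scripts/test_model.py — hypothetical sketch; flag names match the workflow.
def check_thresholds(metrics, thresholds):
    """Return failure messages; an empty list means the model passes."""
    return [
        f"{name}: {metrics.get(name, 0.0):.4f} < {minimum}"
        for name, minimum in thresholds.items()
        if metrics.get(name, 0.0) < minimum
    ]

if __name__ == "__main__":
    import argparse
    import sys
    from mlflow.tracking import MlflowClient

    parser = argparse.ArgumentParser()
    parser.add_argument("--model-name", required=True)
    parser.add_argument("--min-accuracy", type=float, default=0.90)
    parser.add_argument("--min-f1", type=float, default=0.85)
    args = parser.parse_args()

    client = MlflowClient()  # uses MLFLOW_TRACKING_URI from the environment
    version = client.get_latest_versions(args.model_name, stages=["Staging"])[0]
    metrics = client.get_run(version.run_id).data.metrics
    failures = check_thresholds(
        metrics, {"accuracy": args.min_accuracy, "f1": args.min_f1}
    )
    for failure in failures:
        print(f"FAIL {failure}")
    sys.exit(1 if failures else 0)
```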

# .github/workflows/promote-production.yml
name: Promote Model to Production

on:
  pull_request:
    types: [closed]
    paths:
      - 'models/*/production/**'
    branches: [main]

jobs:
  promote:
    if: github.event.pull_request.merged == true
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4
      
      - name: Setup kubectl
        uses: azure/setup-kubectl@v3
      
      - name: Configure kubeconfig
        run: |
          mkdir -p ~/.kube
          echo "${{ secrets.KUBECONFIG }}" | base64 -d > ~/.kube/config
      
      - name: Validate production config
        run: python scripts/validate_model.py --env production
      
      - name: Blue-Green Deploy
        run: |
          # Deploy new version alongside old
          kubectl apply -k models/fraud-detector/production/
          kubectl rollout status deployment/fraud-detector -n production --timeout=600s
          
          # Run canary tests
          python scripts/canary_test.py --duration 300 --error-threshold 0.01
      
      - name: Update MLflow Stage
        env:
          MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
        run: |
          pip install mlflow pyyaml
          python -c "
          from mlflow.tracking import MlflowClient
          import yaml
          
          with open('models/fraud-detector/model-config.yaml') as f:
              config = yaml.safe_load(f)
          
          version = config['data']['model_version']
          client = MlflowClient()
          client.transition_model_version_stage(
              name='fraud_detector', version=version, stage='Production'
          )
          print(f'Model v{version} promoted to Production')
          "
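
scripts/canary_test.py is referenced in the blue-green step but not shown. A stdlib-only sketch, assuming the service exposes /health; the error-rate computation is pure for easy testing:

```python
# scripts/canary_test.py — hypothetical sketch matching the workflow's flags.
import time
import urllib.error
import urllib.request

def error_rate(results):
    """Fraction of failed probes; 0.0 when no samples were collected."""
    return 0.0 if not results else results.count(False) / len(results)

def run_canary(url, duration, threshold):
    """Probe {url}/health once a second for `duration` seconds."""
    results = []
    deadline = time.time() + duration
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(f"{url}/health", timeout=5) as resp:
                results.append(resp.status == 200)
        except (urllib.error.URLError, OSError):
            results.append(False)
        time.sleep(1)
    return error_rate(results) <= threshold

if __name__ == "__main__":
    import sys
    ok = run_canary("http://fraud-detector:8080", duration=300, threshold=0.01)
    sys.exit(0 if ok else 1)
```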

Model Promotion and Approval Process

A script for promoting models via GitOps

#!/usr/bin/env python3
# scripts/promote_model.py — Model Promotion via GitOps
import yaml
import subprocess
import json
from mlflow.tracking import MlflowClient
from datetime import datetime
import argparse
import os

class ModelPromoter:
    def __init__(self, mlflow_uri, repo_path="."):
        self.client = MlflowClient(tracking_uri=mlflow_uri)
        self.repo_path = repo_path
    
    def get_model_info(self, model_name, version):
        mv = self.client.get_model_version(model_name, version)
        run = self.client.get_run(mv.run_id)
        
        return {
            "name": model_name,
            "version": version,
            "stage": mv.current_stage,
            "run_id": mv.run_id,
            "metrics": run.data.metrics,
            "params": run.data.params,
            "created": mv.creation_timestamp,
        }
    
    def validate_for_promotion(self, model_name, version, target_env):
        info = self.get_model_info(model_name, version)
        
        thresholds = {
            "staging": {"accuracy": 0.85, "f1": 0.80},
            "production": {"accuracy": 0.90, "f1": 0.85},
        }
        
        required = thresholds.get(target_env, {})
        failures = []
        
        for metric, threshold in required.items():
            actual = info["metrics"].get(metric, 0)
            if actual < threshold:
                failures.append(f"{metric}: {actual:.4f} < {threshold}")
        
        if failures:
            print(f"Validation FAILED for {model_name} v{version}:")
            for f in failures:
                print(f"  - {f}")
            return False
        
        print(f"Validation PASSED for {model_name} v{version}")
        return True
    
    def create_promotion_pr(self, model_name, version, target_env):
        if not self.validate_for_promotion(model_name, version, target_env):
            raise ValueError("Model validation failed")
        
        info = self.get_model_info(model_name, version)
        slug = model_name.replace("_", "-")
        branch = f"promote/{slug}-v{version}-{target_env}"
        
        subprocess.run(["git", "checkout", "-b", branch], cwd=self.repo_path, check=True)
        
        config_path = f"models/{slug}/model-config.yaml"
        config = {
            "apiVersion": "v1",
            "kind": "ConfigMap",
            "metadata": {"name": f"{slug}-config"},
            "data": {
                "model_name": model_name,
                "model_version": str(version),
                "model_stage": target_env.capitalize(),
                "mlflow_tracking_uri": os.environ.get("MLFLOW_TRACKING_URI", "http://mlflow:5000"),
                "model_uri": f"models:/{model_name}/{version}",
            },
        }
        
        with open(os.path.join(self.repo_path, config_path), "w") as f:
            yaml.dump(config, f, default_flow_style=False)
        
        deploy_path = f"models/{slug}/{target_env}/deployment.yaml"
        self._update_deployment_version(
            os.path.join(self.repo_path, deploy_path), version
        )
        
        subprocess.run(["git", "add", "."], cwd=self.repo_path, check=True)
        subprocess.run([
            "git", "commit", "-m",
            f"promote: {model_name} v{version} -> {target_env}\n\n"
            f"Metrics: {json.dumps(info['metrics'], indent=2)}"
        ], cwd=self.repo_path, check=True)
        subprocess.run(["git", "push", "origin", branch], cwd=self.repo_path, check=True)
        
        print(f"Branch '{branch}' pushed. Create PR to merge.")
        return branch
    
    def _update_deployment_version(self, path, version):
        try:
            with open(path) as f:
                deploy = yaml.safe_load(f)
            
            deploy["metadata"]["labels"]["model-version"] = str(version)
            deploy["spec"]["template"]["metadata"]["labels"]["model-version"] = str(version)
            
            with open(path, "w") as f:
                yaml.dump(deploy, f, default_flow_style=False)
        except FileNotFoundError:
            print(f"Deployment file not found: {path}")

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", required=True)
    parser.add_argument("--version", required=True, type=int)
    parser.add_argument("--env", required=True, choices=["staging", "production"])
    args = parser.parse_args()
    
    promoter = ModelPromoter(os.environ["MLFLOW_TRACKING_URI"])
    promoter.create_promotion_pr(args.model, args.version, args.env)

Monitoring and Rollback Strategy

Monitoring with automatic rollback

#!/usr/bin/env python3
# scripts/monitor_and_rollback.py — Model Monitoring with Auto-rollback
import requests
import time
import subprocess
import yaml
import json
from datetime import datetime

class ModelMonitorGitOps:
    def __init__(self, serving_url, repo_path=".", check_interval=60):
        self.serving_url = serving_url
        self.repo_path = repo_path
        self.check_interval = check_interval
        self.baseline_metrics = {}
    
    def set_baseline(self, metrics):
        self.baseline_metrics = metrics
        print(f"Baseline set: {metrics}")
    
    def get_current_metrics(self):
        try:
            resp = requests.get(f"{self.serving_url}/metrics", timeout=10)
            return resp.json()
        except Exception as e:
            return {"error": str(e), "healthy": False}
    
    def check_health(self):
        try:
            resp = requests.get(f"{self.serving_url}/health", timeout=5)
            return resp.status_code == 200
        except Exception:
            return False
    
    def should_rollback(self, current_metrics):
        if not current_metrics.get("healthy", True):
            return True, "Service unhealthy"
        
        if "error_rate" in current_metrics:
            if current_metrics["error_rate"] > 0.05:
                return True, f"Error rate {current_metrics['error_rate']:.2%} > 5%"
        
        if "latency_p99" in current_metrics:
            if current_metrics["latency_p99"] > 2000:
                return True, f"P99 latency {current_metrics['latency_p99']}ms > 2000ms"
        
        if "accuracy" in current_metrics and "accuracy" in self.baseline_metrics:
            drop = self.baseline_metrics["accuracy"] - current_metrics["accuracy"]
            if drop > 0.05:
                return True, f"Accuracy dropped {drop:.2%}"
        
        return False, "OK"
    
    def rollback(self, model_name, reason):
        print(f"ROLLBACK triggered: {reason}")
        slug = model_name.replace("_", "-")
        
        result = subprocess.run(
            ["git", "log", "--oneline", "-5", f"models/{slug}/"],
            capture_output=True, text=True, cwd=self.repo_path
        )
        commits = result.stdout.strip().split("\n")
        
        if len(commits) < 2:
            print("No previous version to rollback to")
            return False
        
        previous_commit = commits[1].split()[0]
        
        subprocess.run([
            "git", "checkout", previous_commit, "--",
            f"models/{slug}/production/",
            f"models/{slug}/model-config.yaml",
        ], cwd=self.repo_path)
        
        subprocess.run(["git", "add", "."], cwd=self.repo_path)
        subprocess.run([
            "git", "commit", "-m",
            f"rollback: {model_name} — {reason}\n\nReverted to {previous_commit}"
        ], cwd=self.repo_path)
        subprocess.run(["git", "push", "origin", "main"], cwd=self.repo_path)
        
        subprocess.run([
            "kubectl", "apply", "-k", f"models/{slug}/production/"
        ], cwd=self.repo_path)
        
        print(f"Rollback complete: reverted to {previous_commit}")
        return True
    
    def monitor_loop(self, model_name, duration=3600):
        start = time.time()
        print(f"Monitoring {model_name} for {duration}s...")
        
        while time.time() - start < duration:
            if not self.check_health():
                self.rollback(model_name, "Health check failed")
                return False
            
            metrics = self.get_current_metrics()
            should_rb, reason = self.should_rollback(metrics)
            
            if should_rb:
                self.rollback(model_name, reason)
                return False
            
            print(f"[{datetime.now().strftime('%H:%M:%S')}] OK — {json.dumps(metrics)}")
            time.sleep(self.check_interval)
        
        print(f"Monitoring complete: {model_name} is stable")
        return True

if __name__ == "__main__":
    monitor = ModelMonitorGitOps("http://fraud-detector:8080")
    monitor.set_baseline({"accuracy": 0.95, "f1": 0.92})
    monitor.monitor_loop("fraud_detector", duration=1800)

FAQ — Frequently Asked Questions

Q: How does GitOps for ML models differ from GitOps for applications?

A: ML GitOps adds extra complexity: model artifacts are large and should not be stored in Git directly (use a Model Registry instead); models need validation before deploy (metrics, bias, and drift checks); a rollback may need to revert both the model version and the feature pipeline; and monitoring must cover both system metrics and model performance.

Q: ArgoCD or Flux — which should you choose for ML GitOps?

A: ArgoCD has a polished web UI and full application management, a good fit for teams that want visibility. Flux is more lightweight and integrates tightly with Git, a good fit for teams that prefer the CLI. Both work well for ML GitOps, but ArgoCD's rollback UI is more convenient for non-technical stakeholders.

Q: How should model versions be managed?

A: Use semantic versioning for models (major.minor.patch): bump major when the architecture changes, minor when retraining on new data or tuning hyperparameters, and patch for bug fixes. Store versions in the MLflow Model Registry and reference them from the GitOps config files.
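
That versioning policy can be sketched as a small helper (the function name and change labels are illustrative, not part of any library):

```python
# Bump a model's semantic version according to the kind of change.
def bump(version: str, change: str) -> str:
    major, minor, patch = map(int, version.split("."))
    if change == "architecture":  # new model architecture
        return f"{major + 1}.0.0"
    if change == "retrain":       # new data or hyperparameter tuning
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"  # bug fix
```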

Q: How do you run canary deployments for ML models?

A: Use a Kubernetes service mesh such as Istio or Linkerd to split traffic between the old and new model versions. Start by sending 5% of traffic to the new model, monitor metrics such as accuracy, latency, and error rate, then ramp up to 25%, 50%, and 100% if it passes; if it fails, roll back automatically. All of this is managed through GitOps configs.
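
With Istio, that traffic split might be declared like this (VirtualService only; the v5/v6 subsets would be defined in a matching DestinationRule, and all names are illustrative):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: fraud-detector
spec:
  hosts:
  - fraud-detector
  http:
  - route:
    - destination:
        host: fraud-detector
        subset: v5        # current production model
      weight: 95
    - destination:
        host: fraud-detector
        subset: v6        # canary model version
      weight: 5
```

Promoting the canary is then just a PR that edits the weights (5, 25, 50, 100), keeping the whole rollout in Git history.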
