
Linkerd Service Mesh MLOps Workflow: Canary Deployment for ML Models

2025-07-05 · อ. บอม — SiamCafe.net · 1,316 words

Linkerd Service Mesh and MLOps: An Overview

Linkerd is an ultralight service mesh for Kubernetes. Its data plane is written in Rust, which makes it fast and light on resources, and it provides automatic mTLS, built-in observability, and traffic management without any changes to application code.

MLOps is a set of practices that combines Machine Learning and DevOps, covering model training, versioning, deployment, monitoring, and retraining. Linkerd supports MLOps with canary deployments for ML models (shifting a fraction of traffic to a new model version), mTLS encryption between ML services, observability into the latency and error rate of model inference, traffic splitting to A/B test models in production, and retry/timeout policies for model serving.

Why choose Linkerd over Istio: it is lightweight (uses roughly 10x less memory), simple to install (only 2 commands), provides zero-config mTLS by default, is a CNCF graduated project, and needs no complex sidecar injection config.

Installing Linkerd on Kubernetes

Setting up Linkerd for an MLOps cluster

# === Linkerd Installation ===

# 1. Install Linkerd CLI
curl --proto '=https' --tlsv1.2 -sSfL https://run.linkerd.io/install | sh
export PATH=$HOME/.linkerd2/bin:$PATH

# 2. Validate cluster
linkerd check --pre

# 3. Install Linkerd CRDs
linkerd install --crds | kubectl apply -f -

# 4. Install Linkerd control plane
linkerd install | kubectl apply -f -

# 5. Verify installation
linkerd check

# 6. Install Linkerd Viz (dashboard)
linkerd viz install | kubectl apply -f -
linkerd viz check

# 7. Create MLOps namespace with Linkerd injection
kubectl create namespace mlops
kubectl annotate namespace mlops linkerd.io/inject=enabled

# 8. Deploy ML Model Serving Infrastructure
cat > ml-serving.yaml << 'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-server-v1
  namespace: mlops
  labels:
    app: model-server
    version: v1
spec:
  replicas: 3
  selector:
    matchLabels:
      app: model-server
      version: v1
  template:
    metadata:
      labels:
        app: model-server
        version: v1
    spec:
      containers:
        - name: model-server
          image: myregistry/model-server:v1.0
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 500m
              memory: 1Gi
            limits:
              cpu: 2
              memory: 4Gi
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          env:
            - name: MODEL_PATH
              value: "/models/production/v1"
            - name: MAX_BATCH_SIZE
              value: "32"
---
apiVersion: v1
kind: Service
metadata:
  name: model-server
  namespace: mlops
spec:
  selector:
    app: model-server
  ports:
    - port: 8080
      targetPort: 8080
---
# Feature Store Service
apiVersion: apps/v1
kind: Deployment
metadata:
  name: feature-store
  namespace: mlops
spec:
  replicas: 2
  selector:
    matchLabels:
      app: feature-store
  template:
    metadata:
      labels:
        app: feature-store
    spec:
      containers:
        - name: feature-store
          image: myregistry/feature-store:latest
          ports:
            - containerPort: 8081
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
EOF

kubectl apply -f ml-serving.yaml

# 9. Verify Linkerd proxy injected
kubectl get pods -n mlops -o jsonpath='{.items[*].spec.containers[*].name}' | tr ' ' '\n' | sort -u
# Should show: linkerd-proxy, model-server, feature-store

echo "Linkerd + MLOps infrastructure ready"

MLOps Pipeline with Linkerd

Building an MLOps pipeline on top of Linkerd features

#!/usr/bin/env python3
# mlops_pipeline.py - MLOps Pipeline with Linkerd
import json
import logging
from typing import Dict, List

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("mlops")

class MLOpsPipeline:
    """MLOps pipeline integrated with Linkerd service mesh"""
    
    def __init__(self):
        self.models = {}
        self.deployments = []
    
    def pipeline_stages(self):
        """Define MLOps pipeline stages"""
        return {
            "1_data_ingestion": {
                "description": "Ingest and validate training data",
                "services": ["data-collector", "data-validator", "feature-store"],
                "linkerd_features": ["mTLS (encrypt data in transit)", "Retry policy (handle transient failures)"],
            },
            "2_training": {
                "description": "Train ML model",
                "services": ["training-job", "experiment-tracker", "model-registry"],
                "linkerd_features": ["Timeout policy (prevent stuck training jobs)", "Observability (track training service health)"],
            },
            "3_validation": {
                "description": "Validate model quality",
                "services": ["model-validator", "test-data-service", "metric-collector"],
                "linkerd_features": ["Traffic splitting (shadow testing)", "mTLS"],
            },
            "4_deployment": {
                "description": "Deploy model to production",
                "services": ["model-server-v1", "model-server-v2", "api-gateway"],
                "linkerd_features": ["Canary deployment", "Traffic splitting", "Automatic rollback"],
            },
            "5_monitoring": {
                "description": "Monitor model performance",
                "services": ["model-monitor", "drift-detector", "alerting"],
                "linkerd_features": ["Golden metrics (latency, success rate)", "Per-route metrics"],
            },
            "6_retraining": {
                "description": "Trigger retraining when drift detected",
                "services": ["retrain-trigger", "training-job", "model-registry"],
                "linkerd_features": ["Circuit breaking (protect during retraining)", "Load balancing"],
            },
        }
    
    def canary_deployment(self, model_name, new_version, canary_weight=10):
        """Configure canary deployment for ML model"""
        return {
            "model": model_name,
            "current_version": "v1",
            "canary_version": new_version,
            "traffic_split": {
                "stable": 100 - canary_weight,
                "canary": canary_weight,
            },
            "success_criteria": {
                "latency_p99_ms": 200,
                "error_rate_max": 0.01,
                "accuracy_min": 0.95,
            },
            "rollout_steps": [
                {"weight": 10, "duration": "5m", "check": "metrics"},
                {"weight": 25, "duration": "10m", "check": "metrics"},
                {"weight": 50, "duration": "15m", "check": "metrics"},
                {"weight": 75, "duration": "10m", "check": "metrics"},
                {"weight": 100, "duration": "5m", "check": "final"},
            ],
            "rollback_on": "error_rate > 1% OR latency_p99 > 500ms OR accuracy < 90%",
        }

pipeline = MLOpsPipeline()
stages = pipeline.pipeline_stages()
print("MLOps Pipeline Stages:")
for stage, info in stages.items():
    print(f"\n  {stage}: {info['description']}")
    print(f"    Services: {', '.join(info['services'])}")
    print(f"    Linkerd: {info['linkerd_features'][0]}")

canary = pipeline.canary_deployment("recommendation-model", "v2", 10)
print(f"\nCanary Deployment:")
print(f"  Model: {canary['model']} ({canary['current_version']} → {canary['canary_version']})")
print(f"  Traffic: Stable {canary['traffic_split']['stable']}% / Canary {canary['traffic_split']['canary']}%")
print(f"  Rollback: {canary['rollback_on']}")
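The success_criteria and rollback_on rule above can be enforced mechanically. A minimal sketch of such a gate (evaluate_canary is our illustrative helper, not a Linkerd or pipeline API; the thresholds mirror the dict above):

```python
def evaluate_canary(metrics: dict, criteria: dict) -> str:
    """Return 'promote' when every observed canary metric satisfies its
    criterion, otherwise 'rollback' (mirrors success_criteria above)."""
    ok = (
        metrics["latency_p99_ms"] <= criteria["latency_p99_ms"]
        and metrics["error_rate"] <= criteria["error_rate_max"]
        and metrics["accuracy"] >= criteria["accuracy_min"]
    )
    return "promote" if ok else "rollback"

criteria = {"latency_p99_ms": 200, "error_rate_max": 0.01, "accuracy_min": 0.95}
healthy = {"latency_p99_ms": 140, "error_rate": 0.002, "accuracy": 0.961}
slow = {"latency_p99_ms": 520, "error_rate": 0.002, "accuracy": 0.961}
print(evaluate_canary(healthy, criteria))  # promote
print(evaluate_canary(slow, criteria))     # rollback
```

A gate like this would run between rollout steps, feeding the decision back into the TrafficSplit weights.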

Traffic Management for ML Models

Managing traffic for ML model serving

# === Traffic Management ===

# 1. Traffic Split for A/B Testing Models
cat > traffic-split.yaml << 'EOF'
apiVersion: split.smi-spec.io/v1alpha2
kind: TrafficSplit
metadata:
  name: model-server-split
  namespace: mlops
spec:
  service: model-server
  backends:
    - service: model-server-v1
      weight: 900    # 90% traffic
    - service: model-server-v2
      weight: 100    # 10% traffic (canary)
---
# Service for v1
apiVersion: v1
kind: Service
metadata:
  name: model-server-v1
  namespace: mlops
spec:
  selector:
    app: model-server
    version: v1
  ports:
    - port: 8080
---
# Service for v2 (canary)
apiVersion: v1
kind: Service
metadata:
  name: model-server-v2
  namespace: mlops
spec:
  selector:
    app: model-server
    version: v2
  ports:
    - port: 8080
EOF

kubectl apply -f traffic-split.yaml

# 2. Service Profiles (retry, timeout, routes)
cat > service-profile.yaml << 'EOF'
apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
  name: model-server.mlops.svc.cluster.local
  namespace: mlops
spec:
  routes:
    - name: predict
      condition:
        method: POST
        pathRegex: /v1/predict
      timeout: 5s
      isRetryable: false    # Don't retry predictions (they may not be idempotent)
      
    - name: health
      condition:
        method: GET
        pathRegex: /health
      timeout: 2s
      isRetryable: true
      
    - name: batch-predict
      condition:
        method: POST
        pathRegex: /v1/batch-predict
      timeout: 30s
      isRetryable: false
      
    - name: model-metadata
      condition:
        method: GET
        pathRegex: /v1/models/.*
      timeout: 3s
      isRetryable: true
  
  retryBudget:
    retryRatio: 0.2        # Max 20% of requests are retries
    minRetriesPerSecond: 10
    ttl: 10s
EOF

kubectl apply -f service-profile.yaml

# 3. Gradual canary rollout script
cat > canary_rollout.sh << 'BASH'
#!/bin/bash
# Gradual canary rollout for ML model
MODEL_SERVICE="model-server"
NAMESPACE="mlops"
WEIGHTS=(10 25 50 75 100)
WAIT_MINUTES=(5 10 15 10 5)

for i in "${!WEIGHTS[@]}"; do
  CANARY_WEIGHT=${WEIGHTS[$i]}
  STABLE_WEIGHT=$((1000 - CANARY_WEIGHT * 10))
  
  echo "Setting canary weight to ${CANARY_WEIGHT}%..."
  
  kubectl -n $NAMESPACE patch trafficsplit $MODEL_SERVICE-split --type=json -p="[
    {\"op\": \"replace\", \"path\": \"/spec/backends/0/weight\", \"value\": $STABLE_WEIGHT},
    {\"op\": \"replace\", \"path\": \"/spec/backends/1/weight\", \"value\": $((CANARY_WEIGHT * 10))}
  ]"
  
  echo "Waiting ${WAIT_MINUTES[$i]} minutes..."
  sleep $((WAIT_MINUTES[$i] * 60))
  
  # Check metrics
  ERROR_RATE=$(linkerd viz stat deploy/model-server-v2 -n $NAMESPACE --to deploy/model-server -o json | \
    python3 -c "import json,sys; d=json.load(sys.stdin); print(d.get('error_rate','0'))")
  
  if (( $(echo "$ERROR_RATE > 0.01" | bc -l) )); then
    echo "ERROR: High error rate ($ERROR_RATE). Rolling back!"
    kubectl -n $NAMESPACE patch trafficsplit $MODEL_SERVICE-split --type=json -p="[
      {\"op\": \"replace\", \"path\": \"/spec/backends/0/weight\", \"value\": 1000},
      {\"op\": \"replace\", \"path\": \"/spec/backends/1/weight\", \"value\": 0}
    ]"
    exit 1
  fi
  
  echo "Canary at ${CANARY_WEIGHT}%: OK (error_rate=$ERROR_RATE)"
done

echo "Canary rollout complete! Model v2 is now serving 100% traffic."
BASH

chmod +x canary_rollout.sh
echo "Traffic management configured"
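The TrafficSplit manifest uses weights on a 0-1000 scale, while the rollout script thinks in percent. The conversion the bash script performs can be sketched in Python (smi_weights is our name for illustration, not an SMI or Linkerd API):

```python
def smi_weights(canary_percent: int, scale: int = 1000) -> tuple:
    """Convert a canary percentage into (stable, canary) TrafficSplit weights."""
    canary = canary_percent * scale // 100
    return scale - canary, canary

for pct in (10, 25, 50, 75, 100):
    stable, canary = smi_weights(pct)
    print(f"canary {pct:3d}% -> stable={stable}, canary={canary}")
```

For example, the 10% canary step above yields weights 900/100, exactly the values in traffic-split.yaml.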

Observability and Model Monitoring

Monitoring ML model performance with Linkerd

#!/usr/bin/env python3
# model_monitor.py - ML Model Monitoring with Linkerd Metrics
import json
import logging
from typing import Dict, List

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("monitor")

class ModelMonitor:
    """Monitor ML model performance using Linkerd metrics"""
    
    def __init__(self):
        pass
    
    def dashboard(self):
        return {
            "model_serving": {
                "model-server-v1": {
                    "rps": 150,
                    "success_rate": "99.5%",
                    "latency_p50": "25ms",
                    "latency_p95": "85ms",
                    "latency_p99": "150ms",
                    "tcp_connections": 45,
                },
                "model-server-v2": {
                    "rps": 15,
                    "success_rate": "99.2%",
                    "latency_p50": "22ms",
                    "latency_p95": "78ms",
                    "latency_p99": "140ms",
                    "tcp_connections": 8,
                },
            },
            "traffic_split": {
                "stable_v1": "90%",
                "canary_v2": "10%",
                "status": "Canary in progress (step 1/5)",
            },
            "model_metrics": {
                "v1": {"accuracy": 0.952, "precision": 0.948, "recall": 0.955, "f1": 0.951},
                "v2": {"accuracy": 0.961, "precision": 0.958, "recall": 0.963, "f1": 0.960},
            },
            "data_drift": {
                "feature_drift_score": 0.12,
                "prediction_drift_score": 0.08,
                "threshold": 0.15,
                "status": "Normal (no significant drift)",
            },
            "infrastructure": {
                "mtls_enabled": True,
                "proxy_cpu_usage": "50m per pod",
                "proxy_memory_usage": "20Mi per pod",
                "certificates_valid": True,
                "cert_expiry": "23 days",
            },
            "alerts": [
                {"severity": "INFO", "message": "Canary v2 performing 1% better accuracy than v1"},
                {"severity": "INFO", "message": "mTLS certificates auto-rotated successfully"},
            ],
        }

monitor = ModelMonitor()
dash = monitor.dashboard()

print("ML Model Monitoring (Linkerd):")
for model, metrics in dash["model_serving"].items():
    print(f"\n  {model}:")
    print(f"    RPS: {metrics['rps']}, Success: {metrics['success_rate']}")
    print(f"    Latency: P50={metrics['latency_p50']}, P95={metrics['latency_p95']}, P99={metrics['latency_p99']}")

split = dash["traffic_split"]
print(f"\nTraffic Split: Stable {split['stable_v1']} / Canary {split['canary_v2']}")
print(f"  Status: {split['status']}")

for ver, m in dash["model_metrics"].items():
    print(f"\n  Model {ver}: Accuracy={m['accuracy']}, F1={m['f1']}")

drift = dash["data_drift"]
print(f"\nData Drift: {drift['status']} (score={drift['feature_drift_score']}, threshold={drift['threshold']})")

for a in dash["alerts"]:
    print(f"\n[{a['severity']}] {a['message']}")
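Linkerd only reports traffic-level golden metrics; the feature_drift_score in the dashboard above has to come from a separate drift detector. One common score is the Population Stability Index, sketched here with made-up binned distributions (the 0.15 threshold matches the dashboard; PSI itself is standard, but the data is illustrative):

```python
import math

def psi(expected, actual):
    """Population Stability Index between two binned distributions
    (each a list of bin fractions summing to 1). Rule of thumb:
    < 0.1 stable, 0.1-0.25 moderate drift, > 0.25 significant drift."""
    return sum(
        (a - e) * math.log(a / e)
        for e, a in zip(expected, actual)
        if e > 0 and a > 0
    )

baseline = [0.25, 0.25, 0.25, 0.25]  # feature distribution at training time
live = [0.30, 0.24, 0.26, 0.20]      # distribution seen in production traffic
score = psi(baseline, live)
print(f"PSI={score:.4f} -> {'drift' if score > 0.15 else 'normal'}")
```

A drift-detector service would compute scores like this per feature and feed them into the dashboard alongside Linkerd's latency and success-rate metrics.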

Security and mTLS

Encrypting and restricting communication between ML services

# === Security with Linkerd ===

# 1. Verify mTLS
linkerd viz edges deployment -n mlops
# Shows all connections with mTLS status

# 2. Authorization Policy (restrict access)
cat > auth-policy.yaml << 'EOF'
# Only API gateway can access model-server
apiVersion: policy.linkerd.io/v1beta2
kind: Server
metadata:
  name: model-server
  namespace: mlops
spec:
  podSelector:
    matchLabels:
      app: model-server
  port: 8080
  proxyProtocol: HTTP/2
---
apiVersion: policy.linkerd.io/v1alpha1
kind: AuthorizationPolicy
metadata:
  name: model-server-auth
  namespace: mlops
spec:
  targetRef:
    group: policy.linkerd.io
    kind: Server
    name: model-server
  requiredAuthenticationRefs:
    - name: mlops-mtls
      kind: MeshTLSAuthentication
      group: policy.linkerd.io
---
apiVersion: policy.linkerd.io/v1alpha1
kind: MeshTLSAuthentication
metadata:
  name: mlops-mtls
  namespace: mlops
spec:
  identities:
    - "*.mlops.serviceaccount.identity.linkerd.cluster.local"
---
# Network Authentication (restrict to specific service accounts)
apiVersion: policy.linkerd.io/v1alpha1
kind: AuthorizationPolicy
metadata:
  name: model-server-gateway-only
  namespace: mlops
spec:
  targetRef:
    group: policy.linkerd.io
    kind: Server
    name: model-server
  requiredAuthenticationRefs:
    - name: gateway-identity
      kind: MeshTLSAuthentication
      group: policy.linkerd.io
---
apiVersion: policy.linkerd.io/v1alpha1
kind: MeshTLSAuthentication
metadata:
  name: gateway-identity
  namespace: mlops
spec:
  identities:
    - "api-gateway.mlops.serviceaccount.identity.linkerd.cluster.local"
EOF

kubectl apply -f auth-policy.yaml

# 3. Check mTLS status
linkerd viz stat deploy -n mlops
# Shows secured/unsecured connections

# 4. Certificate rotation check
linkerd identity
# Shows certificate details and expiry

echo "Security policies configured"

FAQ: Frequently Asked Questions

Q: Should I choose Linkerd or Istio for MLOps?

A: Linkerd is smaller and simpler: the proxy uses roughly 20MB of memory versus 100MB+ for Istio, installation takes 2 commands, mTLS works with zero config, and it is a CNCF graduated project. The trade-off is fewer features than Istio (no Wasm extensibility, more limited traffic management). Istio offers more features (Wasm plugins, complex routing rules, Envoy extensibility) and a larger ecosystem, but it is more complex, consumes more resources, and has a steeper learning curve. For MLOps, pick Linkerd if you value simplicity, have a small ops team, run a small-to-medium cluster, and mainly need mTLS plus basic traffic splitting. Pick Istio if you need advanced traffic management (such as header-based routing), Wasm plugins, or have a dedicated platform team.

Q: How do I run a canary deployment for an ML model?

A: Use a Linkerd TrafficSplit (SMI spec): create 2 deployments (v1 stable, v2 canary), create a Service for each version, create a TrafficSplit resource with weights (for example v1=90%, v2=10%), monitor metrics (latency, error rate, model accuracy), gradually raise the canary weight as metrics pass (10→25→50→75→100), and roll back if metrics fail (set the v2 weight to 0). Important: do not monitor only infrastructure metrics (latency, errors); also monitor model-specific metrics such as the accuracy, precision, and recall of the new model, the prediction distribution to catch a misbehaving model, and data drift using tools such as Evidently AI or Seldon Alibi Detect.

Q: How much latency does the Linkerd proxy add?

A: The Linkerd proxy (linkerd2-proxy) is written in Rust and is very fast: typically under 1ms added latency at P50 and under 5ms at P99. Since ML inference usually takes 20-200ms, the overhead is negligible (under 5%). Memory overhead is about 20MB per proxy and CPU about 50m per proxy. For comparison, Istio/Envoy is roughly P50 ~3ms, P99 ~10ms, and 100MB+ of memory. If latency is truly critical (inference under 5ms), consider skipping the proxy for internal high-frequency calls and meshing only ingress/egress traffic.
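The under-5%-overhead claim is simple arithmetic. A quick sketch using the illustrative figures from the answer above:

```python
def proxy_overhead_pct(proxy_ms: float, inference_ms: float) -> float:
    """Proxy-added latency as a percentage of the total request time."""
    return proxy_ms / (inference_ms + proxy_ms) * 100

print(f"{proxy_overhead_pct(1, 20):.1f}%")   # 1ms proxy on a 20ms inference
print(f"{proxy_overhead_pct(1, 200):.2f}%")  # 1ms proxy on a 200ms inference
```

At 20ms inference the proxy adds about 4.8%; at 200ms it is about 0.5%, which is why the overhead rarely matters for model serving.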

Q: How should the Feature Store in an MLOps mesh be configured?

A: The Feature Store is a critical service in the MLOps pipeline. Configure it with: a retry policy that retries only GET requests (feature lookups) and never retries write operations; short timeouts (2-5s) for real-time feature serving and long timeouts (30s+) for batch feature retrieval; Linkerd load balancing (EWMA algorithm) to spread requests across replicas; mTLS to encrypt feature data in transit (especially PII); and an authorization policy so that only model-server and training-job can access feature-store. Use a Linkerd ServiceProfile to define separate routes for the real-time and batch endpoints, each with its own timeout/retry settings.
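The per-route guidance above can be captured as a small policy table before translating it into a ServiceProfile. A sketch with illustrative route names and values (none of these come from Linkerd itself; adjust to your own endpoints):

```python
def feature_store_route_policy(route: str) -> dict:
    """Timeout/retry policy per feature-store route, following the guidance:
    retry only reads, short timeouts for real-time, long for batch,
    never retry writes."""
    policies = {
        "get-features":   {"method": "GET",  "timeout": "3s",  "isRetryable": True},
        "batch-features": {"method": "POST", "timeout": "30s", "isRetryable": False},
        "write-features": {"method": "PUT",  "timeout": "5s",  "isRetryable": False},
    }
    return policies[route]

for route in ("get-features", "batch-features", "write-features"):
    print(route, feature_store_route_policy(route))
```

Each entry maps directly onto a route stanza in a ServiceProfile, like the model-server profile shown earlier.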
