Linkerd Service Mesh for MLOps Pipelines
Linkerd is an ultralight service mesh for Kubernetes. Its data plane proxy is written in Rust, so it consumes very few resources, and it provides automatic mTLS, observability, and traffic management without requiring any changes to application code.
MLOps is a set of practices that combines Machine Learning with DevOps, covering model training, versioning, deployment, monitoring, and retraining. Linkerd supports MLOps with canary deployments for ML models (gradually shifting traffic to a new model), mTLS encryption between ML services, observability of model-inference latency and error rates, traffic splitting for A/B testing models, and retry/timeout policies for model serving.
Reasons to choose Linkerd: it is far lighter than Istio (roughly 10x less memory), installs quickly (about 2 commands), enables mTLS automatically with zero configuration, is a CNCF graduated project, and injects sidecars automatically via namespace configuration.
Installing Linkerd on Kubernetes
Setting up Linkerd for an MLOps cluster
# === Linkerd Installation ===
# 1. Install Linkerd CLI
curl --proto '=https' --tlsv1.2 -sSfL https://run.linkerd.io/install | sh
export PATH=$HOME/.linkerd2/bin:$PATH
# 2. Validate cluster
linkerd check --pre
# 3. Install Linkerd CRDs
linkerd install --crds | kubectl apply -f -
# 4. Install Linkerd control plane
linkerd install | kubectl apply -f -
# 5. Verify installation
linkerd check
# 6. Install Linkerd Viz (dashboard)
linkerd viz install | kubectl apply -f -
linkerd viz check
# 7. Create MLOps namespace with Linkerd injection
kubectl create namespace mlops
kubectl annotate namespace mlops linkerd.io/inject=enabled
# 8. Deploy ML Model Serving Infrastructure
cat > ml-serving.yaml << 'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-server-v1
  namespace: mlops
  labels:
    app: model-server
    version: v1
spec:
  replicas: 3
  selector:
    matchLabels:
      app: model-server
      version: v1
  template:
    metadata:
      labels:
        app: model-server
        version: v1
    spec:
      containers:
      - name: model-server
        image: myregistry/model-server:v1.0
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: 500m
            memory: 1Gi
          limits:
            cpu: 2
            memory: 4Gi
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        env:
        - name: MODEL_PATH
          value: "/models/production/v1"
        - name: MAX_BATCH_SIZE
          value: "32"
---
apiVersion: v1
kind: Service
metadata:
  name: model-server
  namespace: mlops
spec:
  selector:
    app: model-server
  ports:
  - port: 8080
    targetPort: 8080
---
# Feature Store Service
apiVersion: apps/v1
kind: Deployment
metadata:
  name: feature-store
  namespace: mlops
spec:
  replicas: 2
  selector:
    matchLabels:
      app: feature-store
  template:
    metadata:
      labels:
        app: feature-store
    spec:
      containers:
      - name: feature-store
        image: myregistry/feature-store:latest
        ports:
        - containerPort: 8081
        resources:
          requests:
            cpu: 250m
            memory: 512Mi
EOF
kubectl apply -f ml-serving.yaml
# 9. Verify Linkerd proxy injected
kubectl get pods -n mlops -o jsonpath='{.items[*].spec.containers[*].name}' | tr ' ' '\n' | sort -u
# Should show: linkerd-proxy, model-server, feature-store
echo "Linkerd + MLOps infrastructure ready"
MLOps Pipeline with Linkerd
Building an MLOps pipeline that uses Linkerd features
#!/usr/bin/env python3
# mlops_pipeline.py - MLOps Pipeline with Linkerd
import json
import logging
from typing import Dict, List

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("mlops")


class MLOpsPipeline:
    """MLOps pipeline integrated with Linkerd service mesh"""

    def __init__(self):
        self.models = {}
        self.deployments = []

    def pipeline_stages(self):
        """Define MLOps pipeline stages"""
        return {
            "1_data_ingestion": {
                "description": "Ingest and validate training data",
                "services": ["data-collector", "data-validator", "feature-store"],
                "linkerd_features": ["mTLS (encrypt data in transit)", "Retry policy (handle transient failures)"],
            },
            "2_training": {
                "description": "Train ML model",
                "services": ["training-job", "experiment-tracker", "model-registry"],
                "linkerd_features": ["Timeout policy (prevent stuck training jobs)", "Observability (track training service health)"],
            },
            "3_validation": {
                "description": "Validate model quality",
                "services": ["model-validator", "test-data-service", "metric-collector"],
                "linkerd_features": ["Traffic splitting (shadow testing)", "mTLS"],
            },
            "4_deployment": {
                "description": "Deploy model to production",
                "services": ["model-server-v1", "model-server-v2", "api-gateway"],
                "linkerd_features": ["Canary deployment", "Traffic splitting", "Automatic rollback"],
            },
            "5_monitoring": {
                "description": "Monitor model performance",
                "services": ["model-monitor", "drift-detector", "alerting"],
                "linkerd_features": ["Golden metrics (latency, success rate)", "Per-route metrics"],
            },
            "6_retraining": {
                "description": "Trigger retraining when drift detected",
                "services": ["retrain-trigger", "training-job", "model-registry"],
                "linkerd_features": ["Circuit breaking (protect during retraining)", "Load balancing"],
            },
        }

    def canary_deployment(self, model_name, new_version, canary_weight=10):
        """Configure canary deployment for ML model"""
        return {
            "model": model_name,
            "current_version": "v1",
            "canary_version": new_version,
            "traffic_split": {
                "stable": 100 - canary_weight,
                "canary": canary_weight,
            },
            "success_criteria": {
                "latency_p99_ms": 200,
                "error_rate_max": 0.01,
                "accuracy_min": 0.95,
            },
            "rollout_steps": [
                {"weight": 10, "duration": "5m", "check": "metrics"},
                {"weight": 25, "duration": "10m", "check": "metrics"},
                {"weight": 50, "duration": "15m", "check": "metrics"},
                {"weight": 75, "duration": "10m", "check": "metrics"},
                {"weight": 100, "duration": "5m", "check": "final"},
            ],
            "rollback_on": "error_rate > 1% OR latency_p99 > 500ms OR accuracy < 90%",
        }


pipeline = MLOpsPipeline()
stages = pipeline.pipeline_stages()
print("MLOps Pipeline Stages:")
for stage, info in stages.items():
    print(f"\n  {stage}: {info['description']}")
    print(f"    Services: {', '.join(info['services'])}")
    print(f"    Linkerd: {info['linkerd_features'][0]}")

canary = pipeline.canary_deployment("recommendation-model", "v2", 10)
print(f"\nCanary Deployment:")
print(f"  Model: {canary['model']} ({canary['current_version']} -> {canary['canary_version']})")
print(f"  Traffic: Stable {canary['traffic_split']['stable']}% / Canary {canary['traffic_split']['canary']}%")
print(f"  Rollback: {canary['rollback_on']}")
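The rollout plan returned by canary_deployment maps directly onto an SMI TrafficSplit. The sketch below renders one step as a manifest dict; the trafficsplit_manifest helper is illustrative (not a Linkerd API), and weights use the 0-1000 scale common in TrafficSplit resources.

```python
# Sketch: render one canary step as an SMI TrafficSplit manifest.
# The trafficsplit_manifest helper is hypothetical, not a Linkerd API.

def trafficsplit_manifest(apex: str, namespace: str, canary_weight: int) -> dict:
    """Build a TrafficSplit dict; canary_weight is a percentage (0-100)."""
    return {
        "apiVersion": "split.smi-spec.io/v1alpha2",
        "kind": "TrafficSplit",
        "metadata": {"name": f"{apex}-split", "namespace": namespace},
        "spec": {
            "service": apex,  # the apex Service that clients call
            "backends": [
                {"service": f"{apex}-v1", "weight": (100 - canary_weight) * 10},
                {"service": f"{apex}-v2", "weight": canary_weight * 10},
            ],
        },
    }

manifest = trafficsplit_manifest("model-server", "mlops", 10)
print(manifest["spec"]["backends"])
# -> [{'service': 'model-server-v1', 'weight': 900}, {'service': 'model-server-v2', 'weight': 100}]
```

Serializing this dict to YAML and applying it with kubectl reproduces the TrafficSplit used in the traffic management section of this article.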
Traffic Management for ML Models
Controlling traffic for ML model serving
# === Traffic Management ===
# 1. Traffic Split for A/B Testing Models
cat > traffic-split.yaml << 'EOF'
apiVersion: split.smi-spec.io/v1alpha2
kind: TrafficSplit
metadata:
  name: model-server-split
  namespace: mlops
spec:
  service: model-server
  backends:
  - service: model-server-v1
    weight: 900   # 90% traffic
  - service: model-server-v2
    weight: 100   # 10% traffic (canary)
---
# Service for v1
apiVersion: v1
kind: Service
metadata:
  name: model-server-v1
  namespace: mlops
spec:
  selector:
    app: model-server
    version: v1
  ports:
  - port: 8080
---
# Service for v2 (canary)
apiVersion: v1
kind: Service
metadata:
  name: model-server-v2
  namespace: mlops
spec:
  selector:
    app: model-server
    version: v2
  ports:
  - port: 8080
EOF
kubectl apply -f traffic-split.yaml
# 2. Service Profiles (retry, timeout, routes)
cat > service-profile.yaml << 'EOF'
apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
  name: model-server.mlops.svc.cluster.local
  namespace: mlops
spec:
  routes:
  - name: predict
    condition:
      method: POST
      pathRegex: /v1/predict
    timeout: 5s
    isRetryable: false   # don't retry predictions (not idempotent)
  - name: health
    condition:
      method: GET
      pathRegex: /health
    timeout: 2s
    isRetryable: true
  - name: batch-predict
    condition:
      method: POST
      pathRegex: /v1/batch-predict
    timeout: 30s
    isRetryable: false
  - name: model-metadata
    condition:
      method: GET
      pathRegex: /v1/models/.*
    timeout: 3s
    isRetryable: true
  retryBudget:
    retryRatio: 0.2        # max 20% of requests may be retries
    minRetriesPerSecond: 10
    ttl: 10s
EOF
kubectl apply -f service-profile.yaml
# 3. Gradual canary rollout script
cat > canary_rollout.sh << 'BASH'
#!/bin/bash
# Gradual canary rollout for ML model
MODEL_SERVICE="model-server"
NAMESPACE="mlops"
WEIGHTS=(10 25 50 75 100)
WAIT_MINUTES=(5 10 15 10 5)
for i in "${!WEIGHTS[@]}"; do
  CANARY_WEIGHT=${WEIGHTS[$i]}
  STABLE_WEIGHT=$((1000 - CANARY_WEIGHT * 10))
  echo "Setting canary weight to ${CANARY_WEIGHT}%..."
  kubectl -n $NAMESPACE patch trafficsplit $MODEL_SERVICE-split --type=json -p="[
    {\"op\": \"replace\", \"path\": \"/spec/backends/0/weight\", \"value\": $STABLE_WEIGHT},
    {\"op\": \"replace\", \"path\": \"/spec/backends/1/weight\", \"value\": $((CANARY_WEIGHT * 10))}
  ]"
  echo "Waiting ${WAIT_MINUTES[$i]} minutes..."
  sleep $((WAIT_MINUTES[$i] * 60))
  # Check metrics
  ERROR_RATE=$(linkerd viz stat deploy/model-server-v2 -n $NAMESPACE --to deploy/model-server -o json | \
    python3 -c "import json,sys; d=json.load(sys.stdin); print(d.get('error_rate','0'))")
  if (( $(echo "$ERROR_RATE > 0.01" | bc -l) )); then
    echo "ERROR: High error rate ($ERROR_RATE). Rolling back!"
    kubectl -n $NAMESPACE patch trafficsplit $MODEL_SERVICE-split --type=json -p="[
      {\"op\": \"replace\", \"path\": \"/spec/backends/0/weight\", \"value\": 1000},
      {\"op\": \"replace\", \"path\": \"/spec/backends/1/weight\", \"value\": 0}
    ]"
    exit 1
  fi
  echo "Canary at ${CANARY_WEIGHT}%: OK (error_rate=$ERROR_RATE)"
done
echo "Canary rollout complete! Model v2 is now serving 100% traffic."
BASH
chmod +x canary_rollout.sh
echo "Traffic management configured"
Observability and Model Monitoring
Monitoring ML model performance with Linkerd
#!/usr/bin/env python3
# model_monitor.py - ML Model Monitoring with Linkerd Metrics
import json
import logging
from typing import Dict, List

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("monitor")


class ModelMonitor:
    """Monitor ML model performance using Linkerd metrics"""

    def __init__(self):
        pass

    def dashboard(self):
        return {
            "model_serving": {
                "model-server-v1": {
                    "rps": 150,
                    "success_rate": "99.5%",
                    "latency_p50": "25ms",
                    "latency_p95": "85ms",
                    "latency_p99": "150ms",
                    "tcp_connections": 45,
                },
                "model-server-v2": {
                    "rps": 15,
                    "success_rate": "99.2%",
                    "latency_p50": "22ms",
                    "latency_p95": "78ms",
                    "latency_p99": "140ms",
                    "tcp_connections": 8,
                },
            },
            "traffic_split": {
                "stable_v1": "90%",
                "canary_v2": "10%",
                "status": "Canary in progress (step 1/5)",
            },
            "model_metrics": {
                "v1": {"accuracy": 0.952, "precision": 0.948, "recall": 0.955, "f1": 0.951},
                "v2": {"accuracy": 0.961, "precision": 0.958, "recall": 0.963, "f1": 0.960},
            },
            "data_drift": {
                "feature_drift_score": 0.12,
                "prediction_drift_score": 0.08,
                "threshold": 0.15,
                "status": "Normal (no significant drift)",
            },
            "infrastructure": {
                "mtls_enabled": True,
                "proxy_cpu_usage": "50m per pod",
                "proxy_memory_usage": "20Mi per pod",
                "certificates_valid": True,
                "cert_expiry": "23 days",
            },
            "alerts": [
                {"severity": "INFO", "message": "Canary v2 performing 1% better accuracy than v1"},
                {"severity": "INFO", "message": "mTLS certificates auto-rotated successfully"},
            ],
        }


monitor = ModelMonitor()
dash = monitor.dashboard()
print("ML Model Monitoring (Linkerd):")
for model, metrics in dash["model_serving"].items():
    print(f"\n  {model}:")
    print(f"    RPS: {metrics['rps']}, Success: {metrics['success_rate']}")
    print(f"    Latency: P50={metrics['latency_p50']}, P95={metrics['latency_p95']}, P99={metrics['latency_p99']}")

split = dash["traffic_split"]
print(f"\nTraffic Split: Stable {split['stable_v1']} / Canary {split['canary_v2']}")
print(f"  Status: {split['status']}")

for ver, m in dash["model_metrics"].items():
    print(f"\n  Model {ver}: Accuracy={m['accuracy']}, F1={m['f1']}")

drift = dash["data_drift"]
print(f"\nData Drift: {drift['status']} (score={drift['feature_drift_score']}, threshold={drift['threshold']})")

for a in dash["alerts"]:
    print(f"\n[{a['severity']}] {a['message']}")
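The feature_drift_score on the dashboard above could come from any drift statistic; Linkerd itself does not compute drift. A minimal sketch using the Population Stability Index (PSI) follows, with the same 0.15 threshold as the dashboard; the equal-width binning scheme and sample data are illustrative assumptions.

```python
# Sketch: feature drift via Population Stability Index (PSI).
# Equal-width binning is an assumption; production systems often use
# quantile bins and dedicated tools (Evidently AI, Alibi Detect).
import math

def psi(expected: list, actual: list, bins: int = 10) -> float:
    """PSI between a baseline sample and a live sample of one feature."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0
    def hist(xs):
        counts = [0] * bins
        for x in xs:
            counts[min(int((x - lo) / width), bins - 1)] += 1
        return [max(c / len(xs), 1e-6) for c in counts]  # floor avoids log(0)
    return sum((a - e) * math.log(a / e)
               for e, a in zip(hist(expected), hist(actual)))

baseline = [i / 100 for i in range(100)]    # training-time distribution
live = [0.5 + i / 100 for i in range(100)]  # shifted live distribution
score = psi(baseline, live)
print("drift detected" if score > 0.15 else "normal")  # -> drift detected
```

A drift-detector service computing a score like this is what would feed the retrain-trigger stage of the pipeline.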
Security with mTLS
Encrypting and authorizing communication between ML services
# === Security with Linkerd ===
# 1. Verify mTLS
linkerd viz edges deployment -n mlops
# Shows all connections with mTLS status
# 2. Authorization Policy (restrict access)
cat > auth-policy.yaml << 'EOF'
# Only meshed workloads can access model-server
apiVersion: policy.linkerd.io/v1beta2
kind: Server
metadata:
  name: model-server
  namespace: mlops
spec:
  podSelector:
    matchLabels:
      app: model-server
  port: 8080
  proxyProtocol: HTTP/2
---
apiVersion: policy.linkerd.io/v1alpha1
kind: AuthorizationPolicy
metadata:
  name: model-server-auth
  namespace: mlops
spec:
  targetRef:
    group: policy.linkerd.io
    kind: Server
    name: model-server
  requiredAuthenticationRefs:
  - name: mlops-mtls
    kind: MeshTLSAuthentication
    group: policy.linkerd.io
---
apiVersion: policy.linkerd.io/v1alpha1
kind: MeshTLSAuthentication
metadata:
  name: mlops-mtls
  namespace: mlops
spec:
  identities:
  - "*.mlops.serviceaccount.identity.linkerd.cluster.local"
---
# Restrict model-server to a specific service account (the API gateway)
apiVersion: policy.linkerd.io/v1alpha1
kind: AuthorizationPolicy
metadata:
  name: model-server-gateway-only
  namespace: mlops
spec:
  targetRef:
    group: policy.linkerd.io
    kind: Server
    name: model-server
  requiredAuthenticationRefs:
  - name: gateway-identity
    kind: MeshTLSAuthentication
    group: policy.linkerd.io
---
apiVersion: policy.linkerd.io/v1alpha1
kind: MeshTLSAuthentication
metadata:
  name: gateway-identity
  namespace: mlops
spec:
  identities:
  - "api-gateway.mlops.serviceaccount.identity.linkerd.cluster.local"
EOF
kubectl apply -f auth-policy.yaml
# 3. Check mTLS status
linkerd viz stat deploy -n mlops
# Shows secured/unsecured connections
# 4. Certificate rotation check
linkerd identity
# Shows certificate details and expiry
echo "Security policies configured"
FAQ: Frequently Asked Questions
Q: Why choose Linkerd over Istio for MLOps?
A: Linkerd's strengths are that it is lightweight (the proxy uses ~20MB of memory vs ~100MB+ for Istio), easy to install (2 commands), provides zero-config mTLS, and is a CNCF graduated project. Its trade-off is fewer features than Istio (no Wasm extensibility, more limited traffic management). Istio's strengths are its richer feature set (Wasm plugins, complex routing, Envoy extensibility) and larger ecosystem; its trade-offs are that it is heavier, consumes more resources, and has a steeper learning curve. For MLOps, choose Linkerd if you value simplicity, have a limited ops team, run a small-to-medium cluster, and mainly need mTLS plus basic traffic splitting. Choose Istio if you need advanced traffic management (e.g. header-based routing), Wasm plugins, and have a dedicated platform team.
Q: How do I run a canary deployment for an ML model?
A: Use Linkerd TrafficSplit (SMI spec): create two deployments (v1 stable, v2 canary), create a Service for each version, create a TrafficSplit resource with weights (e.g. v1=90%, v2=10%), monitor metrics (latency, error rate, model accuracy), gradually increase the canary weight as metrics pass (10 → 25 → 50 → 75 → 100), and roll back if metrics fail (set the v2 weight to 0). Importantly, do not monitor only infrastructure metrics (latency, errors); also track model-specific metrics such as per-model accuracy, precision, and recall, the prediction distribution to catch shifts in model behavior, and data drift using tools such as Evidently AI or Seldon Alibi Detect.
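The promote/rollback loop just described can be sketched as a small gate function. The thresholds mirror the rollback criteria used elsewhere in this article; the function names and metric values are hypothetical.

```python
# Sketch: metric-gated canary promotion. Thresholds are illustrative;
# real values would come from `linkerd viz stat` plus model-quality metrics.
STEPS = [10, 25, 50, 75, 100]

def gate(error_rate: float, latency_p99_ms: float, accuracy: float) -> str:
    """Return 'promote' when all gates pass, otherwise 'rollback'."""
    if error_rate > 0.01 or latency_p99_ms > 500 or accuracy < 0.90:
        return "rollback"
    return "promote"

def next_weight(current: int) -> int:
    """Advance the canary to the next traffic step (100 is terminal)."""
    higher = [w for w in STEPS if w > current]
    return higher[0] if higher else 100

print(gate(0.004, 140, 0.961))  # healthy canary -> promote
print(next_weight(50))          # -> 75
```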
Q: How much latency does the Linkerd proxy add?
A: The Linkerd proxy (linkerd2-proxy) is written in Rust and adds very little latency: typically P50 < 1ms and P99 < 5ms. Since ML inference commonly takes 20-200ms, the overhead is negligible (< 5%). Memory overhead is ~20MB per proxy and CPU overhead ~50m per proxy. For comparison, Istio/Envoy is roughly P50 ~3ms, P99 ~10ms, and ~100MB+ of memory. If latency is truly critical (< 5ms inference), you can skip the proxy for internal high-frequency calls and use the proxy only at ingress/egress.
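A quick back-of-envelope check of those numbers, assuming a ~1ms proxy hop per request (figures are illustrative approximations, not measurements):

```python
# Sketch: proxy latency as a fraction of end-to-end request time.
def overhead_pct(proxy_ms: float, inference_ms: float) -> float:
    """Percentage of total request time spent in the proxy hop."""
    return 100.0 * proxy_ms / (proxy_ms + inference_ms)

for infer_ms in (20, 50, 200):
    print(f"{infer_ms}ms inference: {overhead_pct(1.0, infer_ms):.1f}% proxy overhead")
# -> 4.8%, 2.0%, 0.5%
```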
Q: How should a feature store be configured in an MLOps mesh?
A: The feature store is a critical service in the MLOps pipeline. Configure: a retry policy that retries only GET requests (feature lookups) and never retries write operations; short timeouts (2-5s) for real-time feature serving and longer timeouts (30s+) for batch feature retrieval; Linkerd's load balancing (EWMA algorithm) to spread requests across replicas; mTLS to encrypt feature data in transit (important for PII); and an authorization policy that allows only model-server and training-job to access feature-store. Use a Linkerd ServiceProfile to define separate routes for real-time vs batch endpoints, each with its own timeout/retry settings.
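The route split just described can be captured in a ServiceProfile for the feature store. The sketch below renders one as JSON (which kubectl accepts); the endpoint paths /v1/features/* and /v1/batch-features are hypothetical placeholders for your feature store's real routes.

```python
# Sketch: ServiceProfile for the feature store, splitting real-time lookups
# (short timeout, retryable GET) from batch retrieval (long timeout, no retry).
import json

profile = {
    "apiVersion": "linkerd.io/v1alpha2",
    "kind": "ServiceProfile",
    "metadata": {"name": "feature-store.mlops.svc.cluster.local",
                 "namespace": "mlops"},
    "spec": {
        "routes": [
            {   # real-time feature lookup: safe to retry
                "name": "get-features",
                "condition": {"method": "GET", "pathRegex": "/v1/features/.*"},
                "timeout": "2s",
                "isRetryable": True,
            },
            {   # batch retrieval: long-running, never retried
                "name": "batch-features",
                "condition": {"method": "POST", "pathRegex": "/v1/batch-features"},
                "timeout": "30s",
                "isRetryable": False,
            },
        ],
        "retryBudget": {"retryRatio": 0.2, "minRetriesPerSecond": 10, "ttl": "10s"},
    },
}

print(json.dumps(profile, indent=2))  # pipe into `kubectl apply -f -`
```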
