What Is A/B Testing for ML Models?
A/B testing for machine learning models is a method for comparing the performance of two (or more) versions of an ML model in production: a portion of the traffic is routed to the new model (the challenger) and the rest to the current model (the champion), and the outcome is measured against predefined metrics.
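In practice, users are usually assigned to a variant deterministically, so the same user always sees the same model across requests. A minimal sketch of hash-based bucketing (the experiment name, 80/20 split, and function name here are illustrative, not part of a specific framework):

```python
import hashlib

def assign_variant(user_id: str, experiment: str, treatment_pct: int = 20) -> str:
    """Deterministically map a user to a variant via a stable hash."""
    key = f"{experiment}:{user_id}".encode()
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100  # bucket in [0, 100)
    return "treatment" if bucket < treatment_pct else "control"

# The same user always lands in the same bucket.
assert assign_variant("user-42", "rec_v2") == assign_variant("user-42", "rec_v2")

# Over many users the split approaches the configured percentage.
n = 10_000
hits = sum(assign_variant(f"user-{i}", "rec_v2") == "treatment" for i in range(n))
print(f"treatment share: {hits / n:.3f}")  # roughly 0.20
```

Hashing on `experiment:user_id` (rather than `user_id` alone) keeps bucket assignments independent across concurrent experiments.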
Progressive delivery is a deployment strategy that increases traffic to the new model gradually, from 1% up to 100%, with automated monitoring and rollback if problems are detected. Unlike traditional deployment, which switches 100% of traffic at once, progressive delivery reduces the risk of a new model that performs poorly in production.
The main use cases are:
- Model Version Comparison: compare model v1 against v2
- Feature Importance Testing: test whether new features actually improve predictions
- Algorithm Comparison: compare different algorithms (e.g. XGBoost vs LightGBM)
- Hyperparameter Tuning: test different hyperparameters in production
- Business Metric Validation: verify that model improvements translate into real business value
Setting Up the A/B Testing Framework
Build the A/B testing infrastructure for ML:
# === A/B Testing Framework Setup ===
# 1. Project Structure
mkdir -p ml-ab-testing/{models,router,analytics,config,tests,k8s}
# 2. Traffic Router Configuration
cat > config/experiment.yaml << 'EOF'
experiments:
  - name: "recommendation_model_v2"
    description: "Test new recommendation model with transformer architecture"
    status: "running"
    start_date: "2025-01-15"
    variants:
      - name: "control"
        model_id: "rec_model_v1"
        endpoint: "http://model-v1:8080/predict"
        traffic_percentage: 80
      - name: "treatment"
        model_id: "rec_model_v2"
        endpoint: "http://model-v2:8080/predict"
        traffic_percentage: 20
    metrics:
      primary: "click_through_rate"
      secondary:
        - "conversion_rate"
        - "revenue_per_user"
        - "latency_p99"
      guardrail:
        - name: "latency_p99"
          threshold_ms: 200
        - name: "error_rate"
          threshold_pct: 1.0
    progressive_rollout:
      phases:
        - traffic_pct: 5
          duration_hours: 24
          auto_advance: true
        - traffic_pct: 20
          duration_hours: 48
          auto_advance: true
        - traffic_pct: 50
          duration_hours: 72
          auto_advance: false  # Manual approval
        - traffic_pct: 100
          duration_hours: 0
          auto_advance: false
      rollback_conditions:
        - metric: "error_rate"
          operator: ">"
          value: 2.0
        - metric: "latency_p99"
          operator: ">"
          value: 300
EOF
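Before an experiment goes live, it is worth sanity-checking the config: variant weights must sum to 100 and rollout phases should increase monotonically toward 100%. A small validator sketch, shown against an inline dict mirroring experiment.yaml above (loading the real file with a YAML parser works the same way):

```python
def validate_experiment(exp: dict) -> list:
    """Return a list of human-readable config problems (empty list = valid)."""
    problems = []
    total = sum(v["traffic_percentage"] for v in exp["variants"])
    if total != 100:
        problems.append(f"variant traffic sums to {total}, expected 100")
    pcts = [p["traffic_pct"] for p in exp["progressive_rollout"]["phases"]]
    if pcts != sorted(pcts):
        problems.append(f"rollout phases not monotonically increasing: {pcts}")
    if pcts and pcts[-1] != 100:
        problems.append("final rollout phase must be 100% traffic")
    return problems

experiment = {  # mirrors config/experiment.yaml
    "variants": [
        {"name": "control", "traffic_percentage": 80},
        {"name": "treatment", "traffic_percentage": 20},
    ],
    "progressive_rollout": {
        "phases": [{"traffic_pct": p} for p in (5, 20, 50, 100)],
    },
}
print(validate_experiment(experiment))  # [] (config is valid)
```

Running this as a pre-deploy check catches the most common experiment misconfiguration: traffic weights that no longer sum to 100 after an edit.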
# 3. Kubernetes Deployment with Istio Traffic Splitting
cat > k8s/virtual-service.yaml << 'EOF'
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: ml-model-routing
spec:
  hosts:
    - ml-model.default.svc.cluster.local
  http:
    - route:
        - destination:
            host: ml-model-v1
            port:
              number: 8080
          weight: 80
        - destination:
            host: ml-model-v2
            port:
              number: 8080
          weight: 20
      headers:
        response:
          add:
            x-model-version: "%UPSTREAM_METADATA([\"model_version\"])%"
EOF
echo "A/B testing framework configured"
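The Istio weights above can be simulated to confirm the expected split before deploying. A quick sketch using weighted random choice (the service names come from the VirtualService; the simulation itself is illustrative):

```python
import random

def route(weights: dict, rng: random.Random) -> str:
    """Pick a destination proportionally to its weight, like a weighted traffic split."""
    hosts, w = zip(*weights.items())
    return rng.choices(hosts, weights=w, k=1)[0]

weights = {"ml-model-v1": 80, "ml-model-v2": 20}
rng = random.Random(0)  # seeded for reproducibility
counts = {"ml-model-v1": 0, "ml-model-v2": 0}
for _ in range(10_000):
    counts[route(weights, rng)] += 1
print(counts)  # close to 8000 / 2000
```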
Progressive Delivery for ML Models
Implement the progressive delivery pipeline:
#!/usr/bin/env python3
# progressive_delivery.py — ML Model Progressive Delivery
import json
import logging
from datetime import datetime, timedelta
from typing import Dict, List, Optional
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("progressive")
class MLProgressiveDelivery:
    def __init__(self, experiment_name):
        self.experiment = experiment_name
        self.phases = [
            {"pct": 5, "duration_hours": 24, "auto": True},
            {"pct": 20, "duration_hours": 48, "auto": True},
            {"pct": 50, "duration_hours": 72, "auto": False},
            {"pct": 100, "duration_hours": 0, "auto": False},
        ]
        self.current_phase = 0
        self.started_at = datetime.utcnow()
        self.phase_started_at = datetime.utcnow()

    def get_current_state(self):
        phase = self.phases[self.current_phase]
        elapsed = (datetime.utcnow() - self.phase_started_at).total_seconds() / 3600
        return {
            "experiment": self.experiment,
            "current_phase": self.current_phase + 1,
            "total_phases": len(self.phases),
            "traffic_pct": phase["pct"],
            "elapsed_hours": round(elapsed, 1),
            "required_hours": phase["duration_hours"],
            "auto_advance": phase["auto"],
            "ready_to_advance": elapsed >= phase["duration_hours"],
        }

    def check_guardrails(self, metrics):
        """Check if guardrail metrics are within thresholds"""
        guardrails = {
            "error_rate_pct": {"threshold": 1.0, "operator": "<="},
            "latency_p99_ms": {"threshold": 200, "operator": "<="},
            "model_accuracy": {"threshold": 0.85, "operator": ">="},
        }
        violations = []
        for metric_name, rule in guardrails.items():
            value = metrics.get(metric_name)
            if value is None:
                continue
            if rule["operator"] == "<=" and value > rule["threshold"]:
                violations.append({
                    "metric": metric_name,
                    "value": value,
                    "threshold": rule["threshold"],
                    "operator": rule["operator"],
                })
            elif rule["operator"] == ">=" and value < rule["threshold"]:
                violations.append({
                    "metric": metric_name,
                    "value": value,
                    "threshold": rule["threshold"],
                    "operator": rule["operator"],
                })
        return {
            "passed": len(violations) == 0,
            "violations": violations,
            "action": "continue" if not violations else "rollback",
        }

    def advance_phase(self, metrics):
        """Attempt to advance to next phase"""
        guardrail_check = self.check_guardrails(metrics)
        if not guardrail_check["passed"]:
            return {
                "action": "rollback",
                "reason": "Guardrail violations detected",
                "violations": guardrail_check["violations"],
            }
        state = self.get_current_state()
        if not state["ready_to_advance"]:
            return {
                "action": "wait",
                "remaining_hours": state["required_hours"] - state["elapsed_hours"],
            }
        if self.current_phase < len(self.phases) - 1:
            self.current_phase += 1
            self.phase_started_at = datetime.utcnow()
            new_phase = self.phases[self.current_phase]
            return {
                "action": "advanced",
                "new_phase": self.current_phase + 1,
                "new_traffic_pct": new_phase["pct"],
                "requires_approval": not new_phase["auto"],
            }
        return {"action": "complete", "message": "Full rollout achieved"}

    def rollback(self):
        """Rollback to champion model"""
        return {
            "action": "rollback",
            "from_phase": self.current_phase + 1,
            "traffic_reverted_to": "100% champion (v1)",
            "timestamp": datetime.utcnow().isoformat(),
        }
pd = MLProgressiveDelivery("recommendation_model_v2")
print("State:", json.dumps(pd.get_current_state(), indent=2))
good_metrics = {"error_rate_pct": 0.5, "latency_p99_ms": 150, "model_accuracy": 0.92}
guardrails = pd.check_guardrails(good_metrics)
print("Guardrails:", json.dumps(guardrails, indent=2))
bad_metrics = {"error_rate_pct": 3.0, "latency_p99_ms": 350, "model_accuracy": 0.80}
bad_check = pd.check_guardrails(bad_metrics)
print("Bad Check:", json.dumps(bad_check, indent=2))
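The rollback_conditions in experiment.yaml use generic comparison operators; rather than hard-coding each guardrail, a small evaluator can apply any such list to live metrics. A sketch (the conditions below mirror the YAML; `should_rollback` is a hypothetical helper, not part of the class above):

```python
import operator

OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge, "<=": operator.le}

def should_rollback(conditions: list, metrics: dict) -> list:
    """Return the list of rollback conditions triggered by the given metrics."""
    triggered = []
    for cond in conditions:
        value = metrics.get(cond["metric"])
        if value is not None and OPS[cond["operator"]](value, cond["value"]):
            triggered.append(cond)
    return triggered

conditions = [  # mirrors rollback_conditions in experiment.yaml
    {"metric": "error_rate", "operator": ">", "value": 2.0},
    {"metric": "latency_p99", "operator": ">", "value": 300},
]
print(should_rollback(conditions, {"error_rate": 0.5, "latency_p99": 150}))  # []
print(should_rollback(conditions, {"error_rate": 3.0, "latency_p99": 150}))  # error_rate condition triggers
```

Driving the guardrails from config keeps the rollback policy reviewable in one place instead of scattered across code.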
Statistical Analysis for A/B Tests
Analyze A/B test results statistically:
#!/usr/bin/env python3
# ab_statistics.py — A/B Test Statistical Analysis
import json
import math
import logging
from typing import Dict, Tuple
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("stats")
class ABTestAnalyzer:
    def __init__(self, confidence_level=0.95):
        self.confidence = confidence_level
        self.z_score = 1.96  # for 95% confidence

    def calculate_sample_size(self, baseline_rate, mde, power=0.80):
        """Calculate required sample size per variant"""
        z_alpha = self.z_score
        z_beta = 0.84  # for 80% power
        p1 = baseline_rate
        p2 = baseline_rate + mde
        p_avg = (p1 + p2) / 2
        numerator = (z_alpha * math.sqrt(2 * p_avg * (1 - p_avg)) +
                     z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
        denominator = (p2 - p1) ** 2
        n = math.ceil(numerator / denominator)
        return {
            "baseline_rate": baseline_rate,
            "minimum_detectable_effect": mde,
            "confidence_level": self.confidence,
            "power": power,
            "sample_size_per_variant": n,
            "total_sample_size": n * 2,
        }

    def analyze_proportions(self, control_conversions, control_total,
                            treatment_conversions, treatment_total):
        """Two-proportion z-test for conversion rates"""
        p_control = control_conversions / control_total
        p_treatment = treatment_conversions / treatment_total
        p_pooled = (control_conversions + treatment_conversions) / (control_total + treatment_total)
        se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / control_total + 1 / treatment_total))
        if se == 0:
            z_stat = 0
        else:
            z_stat = (p_treatment - p_control) / se
        # Two-tailed p-value approximation
        p_value = 2 * (1 - self._normal_cdf(abs(z_stat)))
        # Confidence interval for difference
        se_diff = math.sqrt(
            p_control * (1 - p_control) / control_total +
            p_treatment * (1 - p_treatment) / treatment_total
        )
        diff = p_treatment - p_control
        ci_lower = diff - self.z_score * se_diff
        ci_upper = diff + self.z_score * se_diff
        relative_lift = (p_treatment - p_control) / p_control * 100 if p_control > 0 else 0
        return {
            "control": {
                "conversions": control_conversions,
                "total": control_total,
                "rate": round(p_control, 4),
            },
            "treatment": {
                "conversions": treatment_conversions,
                "total": treatment_total,
                "rate": round(p_treatment, 4),
            },
            "difference": round(diff, 4),
            "relative_lift_pct": round(relative_lift, 2),
            "z_statistic": round(z_stat, 4),
            "p_value": round(p_value, 4),
            "confidence_interval": [round(ci_lower, 4), round(ci_upper, 4)],
            "statistically_significant": p_value < (1 - self.confidence),
            "recommendation": "deploy_treatment" if (p_value < 0.05 and diff > 0) else
                              "keep_control" if (p_value < 0.05 and diff < 0) else "continue_test",
        }

    def _normal_cdf(self, x):
        """Approximation of normal CDF"""
        return 0.5 * (1 + math.erf(x / math.sqrt(2)))

    def sequential_test(self, data_points, spending_function="obrien_fleming"):
        """Sequential testing with alpha spending"""
        n = len(data_points)
        max_n = 10000
        info_fraction = n / max_n
        if spending_function == "obrien_fleming":
            alpha_spent = 2 * (1 - self._normal_cdf(self.z_score / math.sqrt(info_fraction)))
        else:
            alpha_spent = (1 - self.confidence) * info_fraction
        return {
            "current_samples": n,
            "max_samples": max_n,
            "info_fraction": round(info_fraction, 4),
            "alpha_spent": round(alpha_spent, 6),
            "can_stop_early": alpha_spent > 0.01,
        }
analyzer = ABTestAnalyzer(confidence_level=0.95)
sample = analyzer.calculate_sample_size(baseline_rate=0.05, mde=0.005)
print("Sample Size:", json.dumps(sample, indent=2))
result = analyzer.analyze_proportions(
control_conversions=520, control_total=10000,
treatment_conversions=580, treatment_total=10000
)
print("Analysis:", json.dumps(result, indent=2))
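The hand-rolled `_normal_cdf` can be cross-checked against the standard library's `statistics.NormalDist`. Recomputing the example above (520/10,000 vs 580/10,000) independently shows this particular dataset is not yet significant at the 5% level, matching the analyzer's `continue_test` recommendation:

```python
import math
from statistics import NormalDist

def two_prop_z(c_conv, c_n, t_conv, t_n):
    """Two-proportion z-test using the stdlib normal distribution."""
    p_c, p_t = c_conv / c_n, t_conv / t_n
    pooled = (c_conv + t_conv) / (c_n + t_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / c_n + 1 / t_n))
    z = (p_t - p_c) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

z, p = two_prop_z(520, 10_000, 580, 10_000)
print(f"z = {z:.3f}, p = {p:.4f}")  # z is about 1.861, p about 0.063: not yet significant at alpha 0.05
```

`NormalDist` (Python 3.8+) avoids depending on SciPy while being exact to floating-point precision.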
CI/CD Pipeline for ML A/B Testing
Automate ML A/B testing in CI/CD:
# === CI/CD Pipeline for ML A/B Testing ===
# 1. GitHub Actions — ML Model A/B Deploy
cat > .github/workflows/ml-ab-deploy.yml << 'EOF'
name: ML Model A/B Deploy
on:
  workflow_dispatch:
    inputs:
      model_version:
        description: 'New model version to test'
        required: true
      initial_traffic_pct:
        description: 'Initial traffic percentage (1-100)'
        required: true
        default: '5'
jobs:
  validate-model:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Download Model Artifact
        run: |
          aws s3 cp s3://models/${{ github.event.inputs.model_version }}/model.tar.gz .
          tar xzf model.tar.gz
      - name: Run Offline Validation
        run: |
          python3 scripts/validate_model.py \
            --model-path ./model \
            --test-data s3://data/test_set.parquet \
            --min-accuracy 0.85 \
            --max-latency-ms 100
  deploy-canary:
    needs: validate-model
    runs-on: ubuntu-latest
    steps:
      - name: Deploy Model as Canary
        run: |
          kubectl set image deployment/ml-model-v2 \
            model=registry/ml-model:${{ github.event.inputs.model_version }}
          kubectl rollout status deployment/ml-model-v2 --timeout=300s
      - name: Set Traffic Split
        run: |
          kubectl apply -f k8s/argo-rollout.yaml
EOF
# 2. Argo Rollouts (Progressive Traffic Splitting)
cat > k8s/argo-rollout.yaml << 'EOF'
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: ml-model-rollout
spec:
  replicas: 4
  selector:
    matchLabels:
      app: ml-model
  strategy:
    canary:
      canaryService: ml-model-canary
      stableService: ml-model-stable
      trafficRouting:
        istio:
          virtualService:
            name: ml-model-vs
      steps:
        - setWeight: 5
        - pause: {duration: 24h}
        - analysis:
            templates:
              - templateName: ml-model-analysis
        - setWeight: 20
        - pause: {duration: 48h}
        - analysis:
            templates:
              - templateName: ml-model-analysis
        - setWeight: 50
        - pause: {}  # Manual approval
        - setWeight: 100
  template:
    metadata:
      labels:
        app: ml-model
    spec:
      containers:
        - name: model
          image: registry/ml-model:latest
          ports:
            - containerPort: 8080
EOF
echo "CI/CD pipeline configured"
Monitoring and Decision Making
Monitor the A/B test and make rollout decisions:
#!/usr/bin/env python3
# ab_monitor.py — A/B Test Monitoring and Decision
import json
import logging
from datetime import datetime
from typing import Dict
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("monitor")
class ABTestMonitor:
    def __init__(self):
        self.experiments = {}

    def collect_metrics(self, experiment_name):
        """Collect real-time metrics for both variants"""
        return {
            "experiment": experiment_name,
            "timestamp": datetime.utcnow().isoformat(),
            "control": {
                "requests": 80000,
                "predictions": 79500,
                "errors": 400,
                "avg_latency_ms": 45,
                "p99_latency_ms": 120,
                "accuracy": 0.89,
                "ctr": 0.052,
                "conversion_rate": 0.031,
                "revenue_per_user": 2.45,
            },
            "treatment": {
                "requests": 20000,
                "predictions": 19900,
                "errors": 80,
                "avg_latency_ms": 52,
                "p99_latency_ms": 145,
                "accuracy": 0.92,
                "ctr": 0.058,
                "conversion_rate": 0.035,
                "revenue_per_user": 2.78,
            },
        }

    def make_decision(self, metrics):
        """Automated decision engine"""
        control = metrics["control"]
        treatment = metrics["treatment"]
        # Guardrail checks
        guardrail_ok = True
        issues = []
        if treatment["p99_latency_ms"] > 200:
            guardrail_ok = False
            issues.append(f"Latency too high: {treatment['p99_latency_ms']}ms")
        error_rate = treatment["errors"] / max(treatment["requests"], 1) * 100
        if error_rate > 1.0:
            guardrail_ok = False
            issues.append(f"Error rate too high: {error_rate:.2f}%")
        if not guardrail_ok:
            return {"decision": "rollback", "reason": issues}
        # Performance comparison
        ctr_lift = (treatment["ctr"] - control["ctr"]) / control["ctr"] * 100
        conv_lift = (treatment["conversion_rate"] - control["conversion_rate"]) / control["conversion_rate"] * 100
        rev_lift = (treatment["revenue_per_user"] - control["revenue_per_user"]) / control["revenue_per_user"] * 100
        min_samples = 10000
        has_enough_data = treatment["requests"] >= min_samples
        return {
            "decision": "advance" if (ctr_lift > 0 and has_enough_data) else "wait",
            "guardrails_passed": guardrail_ok,
            "has_sufficient_data": has_enough_data,
            "lifts": {
                "ctr_lift_pct": round(ctr_lift, 2),
                "conversion_lift_pct": round(conv_lift, 2),
                "revenue_lift_pct": round(rev_lift, 2),
            },
            "treatment_samples": treatment["requests"],
            "min_samples_required": min_samples,
        }
monitor = ABTestMonitor()
metrics = monitor.collect_metrics("recommendation_model_v2")
decision = monitor.make_decision(metrics)
print("Metrics Summary:")
print(f" Control CTR: {metrics['control']['ctr']}")
print(f" Treatment CTR: {metrics['treatment']['ctr']}")
print(f"\nDecision: {json.dumps(decision, indent=2)}")
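One check worth adding to any A/B monitor (not included in the class above) is a sample-ratio-mismatch (SRM) test: if the observed traffic split drifts from the configured 80/20, the assignment mechanism is broken and every metric comparison becomes suspect. A sketch using a one-proportion z-test; the `srm_check` name and the strict alpha of 0.001 are illustrative choices:

```python
import math
from statistics import NormalDist

def srm_check(control_n: int, treatment_n: int,
              expected_treatment_share: float = 0.2, alpha: float = 0.001) -> dict:
    """Flag sample ratio mismatch between observed and configured split."""
    total = control_n + treatment_n
    observed = treatment_n / total
    se = math.sqrt(expected_treatment_share * (1 - expected_treatment_share) / total)
    z = (observed - expected_treatment_share) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return {"observed_share": round(observed, 4), "p_value": round(p_value, 6),
            "srm_detected": p_value < alpha}

print(srm_check(80_000, 20_000))  # split matches config: no SRM
print(srm_check(80_000, 17_000))  # treatment under-delivered: SRM detected
```

A very small alpha is conventional here because with production-scale sample sizes even tiny, harmless fluctuations would otherwise trigger false alarms.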
FAQ: Frequently Asked Questions
Q: How does A/B Testing differ from Shadow Testing?
A: A/B testing sends real traffic to the new model: users actually see the new model's output, so business metrics (CTR, conversion, revenue) can be measured, but there is risk if the new model performs poorly. Shadow testing (dark launch) sends traffic to both the old and new models simultaneously, but users only ever see the old model's output; the new model's results are logged for comparison, so there is no user-facing risk. A recommended approach is to run shadow testing first, then follow with an A/B test.
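The shadow-testing pattern described above can be sketched in a few lines: both models score every request, but only the champion's answer is returned and the challenger's is logged. The two model functions here are stand-ins for real model endpoints:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("shadow")

def champion_predict(features):  # stand-in for the live model
    return {"score": 0.61}

def challenger_predict(features):  # stand-in for the shadow model
    return {"score": 0.74}

def predict_with_shadow(features):
    """Serve the champion; run the challenger in shadow and log both outputs."""
    served = champion_predict(features)
    try:
        shadow = challenger_predict(features)  # never shown to the user
        log.info("shadow comparison: champion=%s challenger=%s", served, shadow)
    except Exception:  # a shadow failure must never affect serving
        log.exception("shadow model failed")
    return served

result = predict_with_shadow({"user_id": "u1"})
print(result)  # always the champion's output
```

In production the shadow call would typically run asynchronously (or via traffic mirroring at the mesh layer, e.g. Istio's `mirror` field) so it adds no latency to the serving path.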
Q: How much traffic is needed for reliable results?
A: It depends on the baseline conversion rate and the Minimum Detectable Effect (MDE). As a rough guide: with a baseline CTR of 5% and a target of detecting a 10% relative improvement (0.5% absolute), you need roughly 30,000-50,000 samples per variant at 95% confidence and 80% power. With a baseline CTR of 1%, you need far more: 150,000+ per variant. Use a sample size calculator or the `calculate_sample_size` function in this article. Do not stop a test before reaching the required sample size, or the results will not be reliable.
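The figures quoted above can be reproduced with the same two-proportion formula used in `calculate_sample_size`; a standalone check:

```python
import math

def sample_size(p1: float, mde: float, z_alpha: float = 1.96, z_beta: float = 0.84) -> int:
    """Per-variant sample size for a two-proportion test (95% confidence, 80% power)."""
    p2 = p1 + mde
    p_avg = (p1 + p2) / 2
    num = (z_alpha * math.sqrt(2 * p_avg * (1 - p_avg)) +
           z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(num / (p2 - p1) ** 2)

print(sample_size(0.05, 0.005))  # about 31,200 per variant: 5% baseline, 0.5% absolute MDE
print(sample_size(0.01, 0.001))  # about 163,000 per variant: 1% baseline, 0.1% absolute MDE
```

Note how halving the absolute MDE roughly quadruples the required sample size, which is why low-baseline metrics need so much more traffic.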
Q: Is Progressive Delivery different from a Canary Release?
A: A canary release is one component of progressive delivery. A canary release focuses on deploying the new version to a small subset first, verifying there are no errors, and then rolling it out fully. Progressive delivery is broader: it covers canaries, A/B testing, feature flags, automated analysis, and automated rollback as a complete framework for managing gradual rollouts. For ML models, progressive delivery is recommended because its statistical analysis supports the rollout decisions.
Q: Which tools are recommended for ML A/B testing?
A: Argo Rollouts is an open-source progressive delivery controller for Kubernetes that supports canary, blue-green, and A/B testing with analysis templates. Flagger is another open-source Kubernetes option that works with Istio, Linkerd, and AWS App Mesh. LaunchDarkly is a feature flag platform with support for ML experiments; Optimizely is an enterprise A/B testing platform; Statsig is an ML-focused experimentation platform. For self-hosted setups, Argo Rollouts plus custom analysis scripts is recommended; for a managed service, Statsig or LaunchDarkly.