A/B Testing and Machine Learning
A/B Testing is a controlled experiment that compares two or more variants to determine which variant performs better on a target metric. It is widely used in web optimization, product development, and marketing campaigns.
Machine Learning extends classical A/B Testing in several ways: Multi-Armed Bandits (MAB) shift traffic toward better-performing variants during the experiment to reduce regret; Bayesian optimization finds optimal parameters faster than grid search; causal inference separates the true effect of a treatment from confounders; automated sample size calculation can use ML to predict effect sizes; and heterogeneous treatment effects analysis reveals which user segments respond differently to the same treatment.
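To make the last idea concrete, here is a minimal sketch of why heterogeneous treatment effects matter: an overall lift can hide opposite effects in different segments. The segment names and all counts below are made up purely for illustration.

```python
# Hypothetical per-segment results: the same treatment can help one
# segment and hurt another even when the pooled lift looks healthy.
segments = {
    # segment: (control_conversions, control_n, treatment_conversions, treatment_n)
    "mobile":  (120, 1000, 156, 1000),
    "desktop": (150, 1000, 135, 1000),
}

for name, (xc, nc, xt, nt) in segments.items():
    p_c, p_t = xc / nc, xt / nt
    lift = (p_t - p_c) / p_c * 100
    print(f"{name}: control={p_c:.1%}, treatment={p_t:.1%}, lift={lift:+.1f}%")

# Pooled: control 270/2000 (13.5%) vs treatment 291/2000 (14.6%) --
# a modest overall lift that masks a +30% / -10% split by segment.
```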
This Home Lab builds an A/B testing platform you can run end to end: store experiment data, implement the algorithms, and prototype before deploying to production. It uses Docker, Python, Redis, and PostgreSQL, all running on a local machine.
Setting Up a Home Lab for A/B Testing
Set up the infrastructure for the A/B testing platform:
# === A/B Testing Home Lab Setup ===
# 1. Docker Compose Stack
cat > docker-compose.yml << 'EOF'
version: '3.8'
services:
  # PostgreSQL (experiment data)
  postgres:
    image: postgres:16-alpine
    ports:
      - "5432:5432"
    environment:
      POSTGRES_DB: ab_testing
      POSTGRES_USER: abtest
      POSTGRES_PASSWORD: password123
    volumes:
      - pgdata:/var/lib/postgresql/data
      - ./init.sql:/docker-entrypoint-initdb.d/init.sql

  # Redis (feature flags & assignment cache)
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"

  # Jupyter Lab (analysis)
  jupyter:
    image: jupyter/scipy-notebook:latest
    ports:
      - "8888:8888"
    volumes:
      - ./notebooks:/home/jovyan/work
    environment:
      JUPYTER_TOKEN: "abtest123"

  # GrowthBook (open-source A/B testing platform)
  growthbook:
    image: growthbook/growthbook:latest
    ports:
      - "3000:3000"
      - "3100:3100"
    environment:
      - MONGODB_URI=mongodb://mongo:27017/growthbook
      - APP_ORIGIN=http://localhost:3000
      - API_HOST=http://localhost:3100
    depends_on:
      - mongo

  mongo:
    image: mongo:7
    volumes:
      - mongodata:/data/db

volumes:
  pgdata:
  mongodata:
EOF
# 2. Database Schema
cat > init.sql << 'EOF'
CREATE TABLE experiments (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    description TEXT,
    status VARCHAR(20) DEFAULT 'draft',
    start_date TIMESTAMP,
    end_date TIMESTAMP,
    created_at TIMESTAMP DEFAULT NOW()
);

CREATE TABLE variants (
    id SERIAL PRIMARY KEY,
    experiment_id INT REFERENCES experiments(id),
    name VARCHAR(50) NOT NULL,
    weight FLOAT DEFAULT 0.5,
    is_control BOOLEAN DEFAULT FALSE
);

CREATE TABLE assignments (
    id SERIAL PRIMARY KEY,
    experiment_id INT REFERENCES experiments(id),
    variant_id INT REFERENCES variants(id),
    user_id VARCHAR(100) NOT NULL,
    assigned_at TIMESTAMP DEFAULT NOW()
);

CREATE TABLE events (
    id SERIAL PRIMARY KEY,
    experiment_id INT REFERENCES experiments(id),
    variant_id INT REFERENCES variants(id),
    user_id VARCHAR(100) NOT NULL,
    event_type VARCHAR(50) NOT NULL,
    value FLOAT,
    created_at TIMESTAMP DEFAULT NOW()
);

CREATE INDEX idx_assignments_user ON assignments(user_id, experiment_id);
CREATE INDEX idx_events_experiment ON events(experiment_id, variant_id);
EOF
# 3. Start Stack
docker compose up -d
echo "Home lab ready at:"
echo " Jupyter: http://localhost:8888 (token: abtest123)"
echo " GrowthBook: http://localhost:3000"
echo " PostgreSQL: localhost:5432"
Building an ML-Powered A/B Testing System
The Python A/B testing engine:
#!/usr/bin/env python3
# ab_engine.py - ML-Powered A/B Testing Engine
import hashlib
import logging
import math
import random

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("abtest")


class ABTestEngine:
    """A/B Testing Engine with ML capabilities"""

    def __init__(self):
        self.experiments = {}
        self.assignments = {}
        self.events = {}

    def create_experiment(self, name, variants, traffic_pct=100):
        """Create a new experiment"""
        self.experiments[name] = {
            "name": name,
            "variants": variants,
            "traffic_pct": traffic_pct,
            "status": "running",
        }
        self.events[name] = {v: {"impressions": 0, "conversions": 0} for v in variants}
        return self.experiments[name]

    def assign_variant(self, experiment_name, user_id):
        """Deterministic variant assignment using a hash"""
        exp = self.experiments.get(experiment_name)
        if not exp or exp["status"] != "running":
            return None
        # Return the existing assignment if this user was already bucketed
        key = f"{experiment_name}:{user_id}"
        if key in self.assignments:
            return self.assignments[key]
        # Deterministic hash-based assignment
        hash_val = int(hashlib.md5(key.encode()).hexdigest(), 16)
        # Traffic allocation
        if hash_val % 100 >= exp["traffic_pct"]:
            return None
        # Variant selection based on weights
        variants = exp["variants"]
        total_weight = sum(v.get("weight", 1) for v in variants.values())
        threshold = (hash_val % 1000) / 1000 * total_weight
        cumulative = 0
        selected = None
        for name, config in variants.items():
            cumulative += config.get("weight", 1)
            if threshold < cumulative:
                selected = name
                break
        self.assignments[key] = selected
        return selected

    def record_event(self, experiment_name, user_id, event_type, value=1):
        """Record an impression or conversion event"""
        variant = self.assignments.get(f"{experiment_name}:{user_id}")
        if not variant:
            return
        events = self.events[experiment_name][variant]
        if event_type == "impression":
            events["impressions"] += 1
        elif event_type == "conversion":
            events["conversions"] += 1

    def analyze(self, experiment_name):
        """Statistical analysis of the experiment"""
        events = self.events.get(experiment_name, {})
        results = {}
        for variant, data in events.items():
            n = data["impressions"]
            x = data["conversions"]
            p = x / n if n > 0 else 0
            se = math.sqrt(p * (1 - p) / n) if n > 0 else 0
            results[variant] = {
                "impressions": n,
                "conversions": x,
                "conversion_rate": round(p * 100, 3),
                "std_error": round(se * 100, 3),
                "ci_95": [round((p - 1.96 * se) * 100, 3), round((p + 1.96 * se) * 100, 3)],
            }
        # Two-proportion z-test between the first two variants
        variants = list(results.keys())
        if len(variants) >= 2:
            a = results[variants[0]]
            b = results[variants[1]]
            n_a = a["impressions"]
            n_b = b["impressions"]
            p_a = a["conversions"] / n_a if n_a > 0 else 0
            p_b = b["conversions"] / n_b if n_b > 0 else 0
            if n_a > 0 and n_b > 0:
                p_pool = (a["conversions"] + b["conversions"]) / (n_a + n_b)
                se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) if 0 < p_pool < 1 else 1
                z = (p_b - p_a) / se if se > 0 else 0
                results["significance"] = {
                    "z_score": round(z, 3),
                    "significant": abs(z) > 1.96,
                    "lift": round((p_b - p_a) / p_a * 100, 2) if p_a > 0 else 0,
                    "winner": variants[1] if z > 1.96 else variants[0] if z < -1.96 else "none",
                }
        return results


# Demo
engine = ABTestEngine()
engine.create_experiment("checkout_button", {
    "control": {"weight": 1, "description": "Blue button"},
    "treatment": {"weight": 1, "description": "Green button"},
})

# Simulate traffic
random.seed(42)
for i in range(2000):
    user = f"user_{i}"
    variant = engine.assign_variant("checkout_button", user)
    if variant:
        engine.record_event("checkout_button", user, "impression")
        cvr = 0.10 if variant == "control" else 0.13
        if random.random() < cvr:
            engine.record_event("checkout_button", user, "conversion")

results = engine.analyze("checkout_button")
print("Experiment: checkout_button")
for variant in ["control", "treatment"]:
    r = results[variant]
    print(f"  {variant}: CVR={r['conversion_rate']}%, CI={r['ci_95']}, n={r['impressions']}")
sig = results.get("significance", {})
print(f"  Lift: {sig.get('lift', 0)}%, Significant: {sig.get('significant', False)}, Winner: {sig.get('winner', 'N/A')}")
Multi-Armed Bandit Algorithms
ML algorithms for adaptive testing:
#!/usr/bin/env python3
# bandits.py - Multi-Armed Bandit Algorithms
import logging
import math
import random

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("bandit")


class EpsilonGreedy:
    """Epsilon-Greedy bandit: explore with probability epsilon, otherwise exploit"""

    def __init__(self, n_arms, epsilon=0.1):
        self.n_arms = n_arms
        self.epsilon = epsilon
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms

    def select_arm(self):
        if random.random() < self.epsilon:
            return random.randint(0, self.n_arms - 1)
        return self.values.index(max(self.values))

    def update(self, arm, reward):
        self.counts[arm] += 1
        n = self.counts[arm]
        self.values[arm] = ((n - 1) * self.values[arm] + reward) / n


class ThompsonSampling:
    """Thompson Sampling (Bayesian bandit with Beta posteriors)"""

    def __init__(self, n_arms):
        self.n_arms = n_arms
        self.alpha = [1] * n_arms  # successes + 1
        self.beta = [1] * n_arms   # failures + 1

    def select_arm(self):
        samples = [random.betavariate(self.alpha[i], self.beta[i]) for i in range(self.n_arms)]
        return samples.index(max(samples))

    def update(self, arm, reward):
        if reward > 0:
            self.alpha[arm] += 1
        else:
            self.beta[arm] += 1

    def get_probabilities(self):
        """Monte Carlo estimate of the probability that each arm is best"""
        n_sim = 10000
        wins = [0] * self.n_arms
        for _ in range(n_sim):
            samples = [random.betavariate(self.alpha[i], self.beta[i]) for i in range(self.n_arms)]
            wins[samples.index(max(samples))] += 1
        return [round(w / n_sim * 100, 1) for w in wins]


class UCB1:
    """Upper Confidence Bound: optimism in the face of uncertainty"""

    def __init__(self, n_arms):
        self.n_arms = n_arms
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms
        self.total = 0

    def select_arm(self):
        # Play each arm once before applying the UCB formula
        for i in range(self.n_arms):
            if self.counts[i] == 0:
                return i
        ucb_values = [
            self.values[i] + math.sqrt(2 * math.log(self.total) / self.counts[i])
            for i in range(self.n_arms)
        ]
        return ucb_values.index(max(ucb_values))

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.total += 1
        n = self.counts[arm]
        self.values[arm] = ((n - 1) * self.values[arm] + reward) / n


# Compare algorithms
true_rates = [0.10, 0.13, 0.08]  # true conversion rates
n_rounds = 5000
algorithms = {
    "Epsilon-Greedy": EpsilonGreedy(3, epsilon=0.1),
    "Thompson Sampling": ThompsonSampling(3),
    "UCB1": UCB1(3),
}
random.seed(42)
for name, algo in algorithms.items():
    total_reward = 0
    for _ in range(n_rounds):
        arm = algo.select_arm()
        reward = 1 if random.random() < true_rates[arm] else 0
        algo.update(arm, reward)
        total_reward += reward
    avg_reward = total_reward / n_rounds
    print(f"{name}: Total reward={total_reward}, Avg={avg_reward:.4f}")
    if hasattr(algo, 'counts'):
        print(f"  Arm pulls: {algo.counts}")
    if hasattr(algo, 'get_probabilities'):
        print(f"  Win probabilities: {algo.get_probabilities()}")
Statistical Analysis Pipeline
A pipeline for rigorous analysis before, during, and after the experiment:
# === Statistical Analysis Pipeline ===
cat > analysis_pipeline.yaml << 'EOF'
ab_test_analysis:
  pre_experiment:
    sample_size:
      formula: "n = (Z_α/2 + Z_β)² × (p₁(1-p₁) + p₂(1-p₂)) / (p₁ - p₂)²"
      parameters:
        alpha: 0.05
        power: 0.80
        baseline_cvr: 0.10
        minimum_detectable_effect: 0.02
      calculated_n_per_variant: 3842
    duration:
      daily_traffic: 1000
      variants: 2
      estimated_days: 8
  during_experiment:
    guardrails:
      - "Do not peek at results before sample size reached"
      - "Monitor for Sample Ratio Mismatch (SRM)"
      - "Check for novelty/primacy effects"
      - "Monitor key guardrail metrics (error rate, latency)"
    srm_check:
      description: "Chi-squared test for equal split"
      threshold: "p-value < 0.001 indicates SRM"
      action: "Stop experiment, investigate assignment bug"
  post_experiment:
    primary_analysis:
      - "Z-test for proportions (binary outcomes)"
      - "t-test for continuous outcomes"
      - "Mann-Whitney U for non-normal distributions"
    secondary_analysis:
      - "Segment analysis (by device, country, user type)"
      - "Heterogeneous Treatment Effects (CATE)"
      - "Regression adjustment for covariates"
    decision_framework:
      significant_positive: "Ship treatment"
      significant_negative: "Keep control"
      not_significant: "Need more data or accept null"
    multiple_testing:
      correction: "Bonferroni or Benjamini-Hochberg"
      note: "Adjust alpha when testing multiple metrics"
EOF
python3 -c "
import math

# Sample size calculator (two-proportion z-test approximation)
def sample_size(baseline, mde, alpha=0.05, power=0.80):
    z_alpha = 1.96  # two-sided, alpha = 0.05
    z_beta = 0.84   # 80% power
    p1 = baseline
    p2 = baseline + mde
    n = ((z_alpha + z_beta)**2 * (p1*(1-p1) + p2*(1-p2))) / (p2 - p1)**2
    return math.ceil(n)

# Examples
examples = [
    (0.10, 0.02, 'CVR 10% -> 12% (20% lift)'),
    (0.10, 0.01, 'CVR 10% -> 11% (10% lift)'),
    (0.05, 0.01, 'CVR 5% -> 6% (20% lift)'),
    (0.02, 0.005, 'CVR 2% -> 2.5% (25% lift)'),
]
print('Sample Size Calculator (per variant, 95% confidence, 80% power):')
for baseline, mde, desc in examples:
    n = sample_size(baseline, mde)
    print(f'  {desc}: n = {n:,}')
"
echo "Analysis pipeline configured"
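The SRM guardrail in the pipeline above can be implemented with a plain chi-squared goodness-of-fit test. A sketch using only the standard library (for one degree of freedom, the chi-squared survival function is `erfc(sqrt(χ²/2))`); the traffic counts are hypothetical:

```python
import math

def srm_check(n_control, n_treatment, expected_ratio=0.5):
    """Chi-squared goodness-of-fit test for a 50/50 (or other) split."""
    total = n_control + n_treatment
    exp_c = total * expected_ratio
    exp_t = total * (1 - expected_ratio)
    chi2 = (n_control - exp_c) ** 2 / exp_c + (n_treatment - exp_t) ** 2 / exp_t
    # p-value: survival function of chi-squared with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# A 50.4/49.6 split on 100k users passes the p < 0.001 threshold...
chi2, p = srm_check(50_400, 49_600)
print(f"chi2={chi2:.2f}, p={p:.4f}, SRM={'YES' if p < 0.001 else 'no'}")

# ...but a 52/48 split on the same traffic signals an assignment bug
chi2, p = srm_check(52_000, 48_000)
print(f"chi2={chi2:.2f}, p={p:.2e}, SRM={'YES' if p < 0.001 else 'no'}")
```

The 0.001 threshold (rather than 0.05) follows the pipeline config: SRM checks run continuously, so a stricter cutoff avoids false alarms.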
Monitoring and Dashboards
Track the status and results of running experiments:
#!/usr/bin/env python3
# experiment_dashboard.py - Experiment Monitoring Dashboard
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("dashboard")


class ExperimentDashboard:
    def overview(self):
        return {
            "active_experiments": 5,
            "completed_this_month": 8,
            "win_rate": "37.5% (3/8 had significant positive results)",
            "experiments": [
                {
                    "name": "checkout_green_button",
                    "status": "running",
                    "days_active": 7,
                    "traffic": 12500,
                    "control_cvr": 10.2,
                    "treatment_cvr": 12.8,
                    "lift": "+25.5%",
                    "significant": True,
                    "recommendation": "Ship treatment",
                },
                {
                    "name": "pricing_page_redesign",
                    "status": "running",
                    "days_active": 3,
                    "traffic": 4200,
                    "control_cvr": 5.1,
                    "treatment_cvr": 5.4,
                    "lift": "+5.9%",
                    "significant": False,
                    "recommendation": "Continue collecting data",
                },
                {
                    "name": "onboarding_flow_v2",
                    "status": "completed",
                    "days_active": 14,
                    "traffic": 28000,
                    "control_cvr": 22.3,
                    "treatment_cvr": 25.1,
                    "lift": "+12.6%",
                    "significant": True,
                    "recommendation": "Shipped to 100%",
                },
            ],
            "platform_health": {
                "assignment_accuracy": "99.98%",
                "srm_alerts": 0,
                "avg_analysis_latency": "< 1 min",
            },
        }


dashboard = ExperimentDashboard()
data = dashboard.overview()
print("Experiment Dashboard:")
print(f"  Active: {data['active_experiments']}, Completed: {data['completed_this_month']}")
print(f"  Win rate: {data['win_rate']}")
print("\nExperiments:")
for exp in data["experiments"]:
    status = "WINNER" if exp["significant"] and exp["status"] == "running" else exp["status"].upper()
    print(f"  [{status}] {exp['name']}: lift={exp['lift']}, sig={exp['significant']}")
    print(f"    -> {exp['recommendation']}")
FAQ: Frequently Asked Questions
Q: When should I use A/B Testing versus a Multi-Armed Bandit?
A: A/B Testing (fixed allocation) splits traffic at a fixed ratio, e.g. 50/50, for the entire experiment. It is statistically valid once the planned sample size is reached. Use it when you need rigorous statistical proof, can tolerate the exploration cost, and have enough traffic. A Multi-Armed Bandit (adaptive allocation) shifts traffic toward the better-performing variant during the experiment, reducing opportunity cost (less traffic wasted on losing variants), but makes classical significance testing harder. Use it when you want to maximize reward during the test, traffic is scarce, or strict statistical rigor matters less. A common rule of thumb: use A/B Testing for important product decisions (more rigorous) and bandits for continuous optimization (ad serving, recommendations, pricing).
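The opportunity-cost argument can be made concrete with a small simulation, reusing the Thompson Sampling idea from the bandit script above. The conversion rates are hypothetical: a fixed 50/50 split keeps sending half the traffic to the weaker variant for the whole run, while the adaptive policy shifts traffic to the winner as evidence accumulates.

```python
import random

random.seed(0)
rates = [0.10, 0.13]  # hypothetical true conversion rates
n = 20_000

# Fixed 50/50 allocation (classic A/B test): alternate users between arms
ab_conversions = sum(1 for i in range(n) if random.random() < rates[i % 2])

# Thompson Sampling: keep a Beta posterior per arm, sample to pick an arm
alpha, beta = [1, 1], [1, 1]
ts_conversions = 0
for _ in range(n):
    arm = max(range(2), key=lambda i: random.betavariate(alpha[i], beta[i]))
    if random.random() < rates[arm]:
        alpha[arm] += 1
        ts_conversions += 1
    else:
        beta[arm] += 1

print(f"Fixed 50/50:       {ab_conversions} conversions")
print(f"Thompson Sampling: {ts_conversions} conversions")
```

Over 20,000 users the fixed split converts at roughly the average of the two rates, while the bandit approaches the better arm's rate; the gap is the opportunity cost of the fixed design.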
Q: How much traffic do I need before A/B Testing is practical?
A: It depends on the baseline conversion rate and the minimum detectable effect (MDE). At 10% CVR, detecting a 20% lift (10%→12%) needs ~3,850 users per variant; detecting a 10% lift (10%→11%) needs ~14,750 users per variant; at 2% CVR, detecting a 25% lift (2%→2.5%) needs ~13,800 users per variant (about 27,600 in total). For a website with 500 visitors/day, even the first case takes 15+ days. If traffic is low, consider Bayesian A/B testing (can reach decisions with smaller samples), bandit algorithms, or raising the MDE (only aim to detect big changes).
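Converting a sample-size requirement into a run time is simple arithmetic. A small helper (the visitor numbers are hypothetical; `traffic_pct` models enrolling only part of your traffic into the experiment):

```python
import math

def experiment_days(n_per_variant, daily_visitors, n_variants=2, traffic_pct=1.0):
    """Days needed to reach the required sample size in every variant."""
    users_needed = n_per_variant * n_variants
    users_per_day = daily_visitors * traffic_pct
    return math.ceil(users_needed / users_per_day)

# ~3,850 users per variant at 500 visitors/day -> 16 days
print(experiment_days(3850, 500))
# The same test with only 40% of traffic enrolled -> 39 days
print(experiment_days(3850, 500, traffic_pct=0.4))
```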
Q: Is a Home Lab really necessary, or should I just use SaaS?
A: Popular SaaS tools include GrowthBook (open source, free when self-hosted), LaunchDarkly (enterprise feature flags + experiments), Optimizely (web experimentation leader), VWO (visual A/B testing), and Google Optimize (discontinued; Google now points users to GA4). A Home Lab is worth it if you want to learn the statistical concepts hands-on, prototype custom algorithms (bandits, CATE), avoid committing to a SaaS vendor, or skip license costs. For production, start with GrowthBook (free, feature-rich) or whichever SaaS you are comfortable with, and keep the Home Lab as a learning tool.
Q: Why is Thompson Sampling better than Epsilon-Greedy?
A: Epsilon-Greedy explores with a fixed probability ε (e.g., 10%) and exploits the best-known arm the rest of the time. It is easy to implement, but the explore/exploit ratio is fixed rather than adaptive. Thompson Sampling (Bayesian) samples from the posterior distribution of each arm, so arms with high uncertainty get explored automatically and exploration shrinks as evidence accumulates. In most scenarios it achieves lower regret and converges faster. UCB1 scales exploration with the logarithm of total pulls but lacks Thompson Sampling's Bayesian flexibility. For production, Thompson Sampling is usually the best default: adaptive, with good theoretical guarantees, and simple to implement.
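The claim that Thompson Sampling's exploration shrinks automatically can be seen directly in the width of the Beta posterior it samples from. A sketch (the observed counts are hypothetical; all keep the same 10% conversion rate):

```python
import math

def beta_std(alpha, beta):
    """Standard deviation of a Beta(alpha, beta) distribution."""
    n = alpha + beta
    return math.sqrt(alpha * beta / (n ** 2 * (n + 1)))

# The same 10% conversion rate observed at increasing sample sizes:
# the posterior narrows, so Thompson Sampling explores that arm less.
for successes, failures in [(1, 9), (10, 90), (100, 900), (1000, 9000)]:
    a, b = 1 + successes, 1 + failures  # Beta(1,1) prior + observed data
    print(f"n={successes + failures:>5}: posterior std = {beta_std(a, b):.4f}")
```

Epsilon-Greedy, by contrast, keeps exploring at the same fixed rate ε no matter how much data it has seen.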