SonarQube ML Pipeline
Using SonarQube to keep machine-learning pipeline code quality in check: bugs, vulnerabilities, and coverage across data validation, feature engineering, model training, serving, and CI/CD.
| Pipeline Stage | Code Type | SonarQube Checks | Coverage Target |
|---|---|---|---|
| Data Ingestion | Python ETL | Bug, Exception Handling, Null | > 80% |
| Feature Engineering | Python/Spark | Data Leak, Null, Duplication | > 70% |
| Model Training | Python ML | Reproducibility, Hardcode | > 50% |
| Model Serving API | FastAPI/Flask | Vulnerability, Auth, Input | > 80% |
| Pipeline Orchestration | Airflow/Dagster | DAG, Retry, Timeout | > 70% |
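The per-stage coverage targets in the table can be enforced in CI before the SonarQube scan even runs. A minimal sketch, assuming a Cobertura-style coverage.xml (the format produced by `pytest --cov-report=xml`); the package names are illustrative:

```python
# Check per-module line coverage against the stage targets from the table.
import xml.etree.ElementTree as ET

# Illustrative mapping of source packages to the table's coverage targets.
TARGETS = {"ingestion": 0.80, "features": 0.70, "training": 0.50, "serving": 0.80}

# Inline sample standing in for a real coverage.xml report.
SAMPLE = """<coverage>
  <packages>
    <package name="ingestion" line-rate="0.85"/>
    <package name="features" line-rate="0.65"/>
    <package name="training" line-rate="0.55"/>
    <package name="serving" line-rate="0.90"/>
  </packages>
</coverage>"""

def below_target(xml_text: str) -> list[str]:
    """Return the names of packages whose line coverage misses its target."""
    root = ET.fromstring(xml_text)
    failures = []
    for pkg in root.iter("package"):
        target = TARGETS.get(pkg.get("name"))
        if target is not None and float(pkg.get("line-rate")) < target:
            failures.append(pkg.get("name"))
    return failures

print(below_target(SAMPLE))  # ['features']
```

A CI job can exit non-zero when the returned list is non-empty, failing the build before the scanner step.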
ML Code Quality Rules
# === ML-specific Quality Rules ===
from dataclasses import dataclass

@dataclass
class MLQualityRule:
    rule_id: str
    name: str
    severity: str
    category: str
    bad_example: str
    good_example: str

rules = [
    MLQualityRule("ML001",
                  "No hardcoded file paths",
                  "MAJOR",
                  "Data Ingestion",
                  "df = pd.read_csv('/home/user/data/train.csv')",
                  "df = pd.read_csv(config.DATA_PATH / 'train.csv')"),
    MLQualityRule("ML002",
                  "Always set a random seed",
                  "CRITICAL",
                  "Model Training",
                  "model = RandomForestClassifier()",
                  "model = RandomForestClassifier(random_state=config.SEED)"),
    MLQualityRule("ML003",
                  "No print() in production code",
                  "MINOR",
                  "All",
                  "print(f'Training loss: {loss}')",
                  "logger.info(f'Training loss: {loss}')"),
    MLQualityRule("ML004",
                  "Validate input in the serving API",
                  "CRITICAL",
                  "Model Serving",
                  "prediction = model.predict(request.json['data'])",
                  "validated = InputSchema(**request.json); prediction = model.predict(validated.data)"),
    MLQualityRule("ML005",
                  "No data leakage (never touch test data before the split)",
                  "BLOCKER",
                  "Feature Engineering",
                  "scaler.fit(all_data); X_train = scaler.transform(X_train)",
                  "scaler.fit(X_train); X_test = scaler.transform(X_test)"),
    MLQualityRule("ML006",
                  "Handle missing values explicitly",
                  "MAJOR",
                  "Feature Engineering",
                  "features = df[['col1', 'col2']].values",
                  "features = df[['col1', 'col2']].fillna(strategy).values"),
]
print("=== ML Quality Rules ===")
for r in rules:
    print(f"\n[{r.rule_id}] {r.name} ({r.severity})")
    print(f"  Category: {r.category}")
    print(f"  Bad:  {r.bad_example}")
    print(f"  Good: {r.good_example}")
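Rules like ML003 can be prototyped outside SonarQube before investing in a real custom-rule plugin. A minimal sketch using Python's `ast` module; the helper name is illustrative:

```python
# Prototype of rule ML003 (no print() in production code) via AST inspection.
import ast

def find_print_calls(source: str) -> list[int]:
    """Return the line numbers where a bare print() is called."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == "print"):
            hits.append(node.lineno)
    return hits

sample = "x = 1\nprint(x)\nlogger.info(x)\n"
print(find_print_calls(sample))  # [2]
```

Note that `logger.info(...)` is an attribute call, not a bare `print` name, so it is correctly ignored.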
CI/CD Integration
# === ML Pipeline CI/CD with SonarQube ===
# GitHub Actions workflow for the ML pipeline:
#
# name: ML Pipeline Quality
# on: [pull_request]
# jobs:
#   quality:
#     runs-on: ubuntu-latest
#     steps:
#       - uses: actions/checkout@v4
#       - name: Setup Python
#         uses: actions/setup-python@v5
#         with: { python-version: '3.10' }
#       - name: Install Dependencies
#         run: pip install -r requirements.txt
#       - name: Run Tests with Coverage
#         run: pytest tests/ --cov=src --cov-report=xml
#       - name: Data Validation Tests
#         run: great_expectations checkpoint run ml_data_check
#       - name: SonarQube Scan
#         uses: SonarSource/sonarqube-scan-action@v2
#         env:
#           SONAR_TOKEN: ${{ secrets.SONAR_TOKEN }}
#       - name: Quality Gate
#         uses: SonarSource/sonarqube-quality-gate-action@v1
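The Quality Gate step above fails the build based on SonarQube's project status. A hedged sketch of parsing the JSON shape returned by SonarQube's `/api/qualitygates/project_status` endpoint (response simplified; the HTTP fetch itself is omitted):

```python
# Decide whether CI should fail from a quality-gate status response.
import json

def gate_passed(response_text: str) -> bool:
    """True when the project's quality gate status is OK."""
    return json.loads(response_text)["projectStatus"]["status"] == "OK"

sample_response = '{"projectStatus": {"status": "ERROR"}}'
print(gate_passed(sample_response))  # False
```

In a real pipeline the dedicated quality-gate action handles this polling for you; a script like this is only needed for custom CI systems.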
from dataclasses import dataclass

@dataclass
class PipelineStage:
    stage: str
    tools: str
    checks: str
    gate: str

stages = [
    PipelineStage("1. PR Quality Check",
                  "SonarQube + pytest + Great Expectations",
                  "Code Quality, Unit Tests, Data Schema Validation",
                  "Quality Gate Pass + Tests Pass + Data Valid"),
    PipelineStage("2. Integration Test",
                  "pytest + MLflow + Docker",
                  "End-to-end Pipeline Test, Model Training Smoke Test",
                  "All Tests Pass, Model Metrics > Baseline"),
    PipelineStage("3. Model Training",
                  "MLflow + DVC + SonarQube",
                  "Data Quality, Training Metrics, Model Validation",
                  "Accuracy > Threshold, No Data Drift"),
    PipelineStage("4. Model Registry",
                  "MLflow Model Registry",
                  "Model Artifact, Metrics, Lineage, Version",
                  "Model Approved by Reviewer"),
    PipelineStage("5. Staging Deploy",
                  "Kubernetes + Seldon/BentoML",
                  "API Tests, Load Test, Security Scan (DAST)",
                  "Response Time < 100ms, No Vulnerabilities"),
    PipelineStage("6. Production Deploy",
                  "Kubernetes + Canary/Blue-Green",
                  "Canary Metrics, A/B Test, Model Monitor",
                  "No Degradation in Business Metrics"),
]

print("=== ML CI/CD Pipeline ===")
for s in stages:
    print(f"\n[{s.stage}]")
    print(f"  Tools:  {s.tools}")
    print(f"  Checks: {s.checks}")
    print(f"  Gate:   {s.gate}")
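The gate column implies short-circuiting: a stage runs only when the previous gate passed. A toy sketch of that control flow; stage names and gate results here are illustrative:

```python
# Gated pipeline execution: stop at the first failing gate.
from typing import Callable

def run_pipeline(stages: list[tuple[str, Callable[[], bool]]]) -> list[str]:
    """Run stages in order; return the names of stages whose gate passed."""
    completed = []
    for name, gate in stages:
        if not gate():
            break                     # gate failed: later stages never run
        completed.append(name)
    return completed

demo = [
    ("PR Quality Check", lambda: True),
    ("Integration Test", lambda: True),
    ("Model Training",   lambda: False),  # e.g. accuracy below threshold
    ("Model Registry",   lambda: True),   # unreachable in this run
]
print(run_pipeline(demo))  # ['PR Quality Check', 'Integration Test']
```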
Testing Strategy
# === ML Testing Strategy ===
from dataclasses import dataclass

@dataclass
class TestCategory:
    category: str
    what_to_test: str
    tools: str
    coverage_target: str

test_categories = [
    TestCategory("Unit Tests (Data)",
                 "Data Loading, Transformation, Validation, Encoding",
                 "pytest + pandas testing + Great Expectations",
                 "> 80% Coverage"),
    TestCategory("Unit Tests (Feature)",
                 "Feature Engineering, Scaling, Selection, Encoding",
                 "pytest + numpy testing",
                 "> 70% Coverage"),
    TestCategory("Unit Tests (Model)",
                 "Model Init, Predict Shape, Save/Load, Config",
                 "pytest + sklearn metrics",
                 "> 50% Coverage (training code is hard to test)"),
    TestCategory("Integration Tests",
                 "End-to-end Pipeline, Data -> Feature -> Model -> Predict",
                 "pytest + Docker + Sample Data",
                 "> 60% Coverage"),
    TestCategory("API Tests",
                 "Endpoint, Input Validation, Error Handling, Auth",
                 "pytest + httpx + FastAPI TestClient",
                 "> 80% Coverage"),
    TestCategory("Data Quality Tests",
                 "Schema, Null, Range, Distribution, Freshness",
                 "Great Expectations + Monte Carlo",
                 "100% Critical Checks"),
]

print("=== ML Testing Strategy ===")
for t in test_categories:
    print(f"[{t.category}]")
    print(f"  Test:   {t.what_to_test}")
    print(f"  Tools:  {t.tools}")
    print(f"  Target: {t.coverage_target}")
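A data-leak regression test (guarding rule ML005) can be an ordinary unit test. A self-contained sketch with a tiny stand-in scaler, so no ML libraries are assumed; all names are illustrative:

```python
# Unit test asserting the scaler is fit on training data only (no leakage).
class MinMax:
    """Tiny stand-in for a min-max scaler, to keep the example self-contained."""
    def fit(self, xs):
        self.min_, self.max_ = min(xs), max(xs)
        return self
    def transform(self, xs):
        span = self.max_ - self.min_
        return [(x - self.min_) / span for x in xs]

def make_features(train, test):
    scaler = MinMax().fit(train)      # fit on train only: no leakage
    return scaler.transform(train), scaler.transform(test)

def test_no_leak_from_test_set():
    train = [0.0, 10.0]
    test = [20.0]                     # deliberately outside the training range
    tr, te = make_features(train, test)
    assert max(tr) == 1.0             # scaled by train statistics only
    assert te[0] == 2.0               # value > 1 proves test data never set the max

test_no_leak_from_test_set()
print("ok")
```

Had `make_features` fit on train and test combined, the test value would scale to 1.0 and the assertion would fail, catching the leak.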
Tips
- Coverage: ML training code naturally has lower coverage than web apps; target 50-70%
- Custom Rules: create ML-specific custom rules, e.g. for data leakage and random seeds
- Notebooks: convert notebooks to .py with nbconvert before scanning
- Profiles: keep separate Quality Profiles for ML code and application code
- Data Tests: use Great Expectations to check data quality alongside code quality
Why use SonarQube on an ML pipeline?
ML code accumulates bugs, vulnerabilities, and technical debt just like any other code, across data ingestion, feature engineering, training, and serving. Scanning the whole pipeline keeps it maintainable, secure, and backed by test coverage.
What does SonarQube check in ML code?
Data validation and exception handling in ingestion; data leakage and null handling in feature engineering; hardcoded hyperparameters, random seeds, and model save logic in training; input validation, auth, and rate limiting in the serving API; and retry, timeout, and idempotency in pipeline orchestration.
How should the Quality Gate be set?
Coverage at 70% for ML code and 80% for serving APIs, zero bugs and zero vulnerabilities, duplication under 5%, and all security hotspots reviewed. Use a dedicated Quality Profile with ML custom rules, and convert notebooks with nbconvert before scanning.
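Such a gate can also be expressed as data and checked locally. A minimal sketch; the metric keys mirror SonarQube's measure names (coverage, bugs, vulnerabilities, duplicated_lines_density), but the evaluation logic here is purely illustrative:

```python
# Local evaluation of quality-gate conditions against measured values.
GATE = {
    "coverage":                 (">=", 70.0),  # 80 for serving APIs
    "bugs":                     ("==", 0),
    "vulnerabilities":          ("==", 0),
    "duplicated_lines_density": ("<=", 5.0),
}

OPS = {">=": lambda a, b: a >= b,
       "<=": lambda a, b: a <= b,
       "==": lambda a, b: a == b}

def evaluate(measures: dict) -> list[str]:
    """Return the names of failed conditions (empty list means the gate passes)."""
    return [m for m, (op, limit) in GATE.items()
            if not OPS[op](measures.get(m, 0), limit)]

print(evaluate({"coverage": 72.5, "bugs": 0, "vulnerabilities": 1,
                "duplicated_lines_density": 3.1}))  # ['vulnerabilities']
```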
How does the CI/CD pipeline work?
Each pull request triggers SonarQube, pytest, and Great Expectations checks; integration tests run with MLflow; training data and models are versioned with DVC; approved models go through the MLflow Model Registry; staging deploys get DAST security scans; and production uses canary deployment with monitoring. GitHub Actions orchestrates the whole flow.
Summary
SonarQube brings code quality to ML pipelines: custom rules catch data leakage and missing random seeds, coverage targets and API security checks gate each stage, and Great Expectations plus MLflow round out data and model quality on the CI/CD path to production.
