SonarQube ML Pipeline

This article covers using SonarQube in a machine learning pipeline: code quality, bugs, vulnerabilities, and coverage across data validation, feature engineering, model training, serving, and CI/CD.

Pipeline Stage         | Code Type       | SonarQube Checks                  | Coverage Target
Data Ingestion         | Python ETL      | Bug, Exception Handling, Null     | > 80%
Feature Engineering    | Python/Spark    | Data Leak, Null, Duplication      | > 70%
Model Training         | Python ML       | Reproducibility, Hardcode         | > 50%
Model Serving API      | FastAPI/Flask   | Vulnerability, Auth, Input        | > 80%
Pipeline Orchestration | Airflow/Dagster | DAG, Retry, Timeout               | > 70%
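
The per-stage coverage targets in the table above could be checked automatically against the Cobertura-style XML that `pytest --cov-report=xml` writes. The following is a minimal sketch, not part of SonarQube itself. The package names in `STAGE_TARGETS` are hypothetical and would need to match the `name` attributes in your own `coverage.xml`.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from package name (as it appears in coverage.xml)
# to the minimum line coverage from the table, expressed as a fraction.
STAGE_TARGETS = {
    "src.ingestion": 0.80,
    "src.features": 0.70,
    "src.training": 0.50,
    "src.serving": 0.80,
    "src.orchestration": 0.70,
}


def coverage_failures(xml_text, targets=STAGE_TARGETS):
    """Return {package: (actual, target)} for packages not above their target."""
    root = ET.fromstring(xml_text.strip())
    failures = {}
    for package in root.iter("package"):
        name = package.get("name")
        rate = float(package.get("line-rate", 0.0))
        target = targets.get(name)
        # The table uses strict "greater than" targets, so equality fails too.
        if target is not None and rate <= target:
            failures[name] = (rate, target)
    return failures


# A trimmed sample report; real reports carry many more attributes.
SAMPLE_REPORT = """
<coverage line-rate="0.74">
  <packages>
    <package name="src.ingestion" line-rate="0.86"/>
    <package name="src.training" line-rate="0.41"/>
  </packages>
</coverage>
"""

print(coverage_failures(SAMPLE_REPORT))
```

In a CI job, a non-empty result could be turned into a non-zero exit code so the build fails before the SonarQube scan runs.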

ML Code Quality Rules

# === ML-specific Quality Rules ===

from dataclasses import dataclass


@dataclass
class MLQualityRule:
    """One ML-specific code quality rule with a bad and a good example."""
    rule_id: str
    name: str
    severity: str      # BLOCKER, CRITICAL, MAJOR, or MINOR
    category: str      # pipeline stage the rule applies to
    bad_example: str
    good_example: str


rules = [
    MLQualityRule(
        "ML001",
        "Do not hardcode file paths",
        "MAJOR",
        "Data Ingestion",
        "df = pd.read_csv('/home/user/data/train.csv')",
        "df = pd.read_csv(config.DATA_PATH / 'train.csv')",
    ),
    MLQualityRule(
        "ML002",
        "Always set a random seed",
        "CRITICAL",
        "Model Training",
        "model = RandomForestClassifier()",
        "model = RandomForestClassifier(random_state=config.SEED)",
    ),
    MLQualityRule(
        "ML003",
        "No print() in production code",
        "MINOR",
        "All",
        "print(f'Training loss: {loss}')",
        "logger.info(f'Training loss: {loss}')",
    ),
    MLQualityRule(
        "ML004",
        "Validate input in the API",
        "CRITICAL",
        "Model Serving",
        "prediction = model.predict(request.json['data'])",
        "validated = InputSchema(**request.json); prediction = model.predict(validated.data)",
    ),
    MLQualityRule(
        "ML005",
        "No data leakage (using test data before the split)",
        "BLOCKER",
        "Feature Engineering",
        "scaler.fit(all_data); X_train = scaler.transform(X_train)",
        "scaler.fit(X_train); X_train = scaler.transform(X_train); X_test = scaler.transform(X_test)",
    ),
    MLQualityRule(
        "ML006",
        "Handle missing values",
        "MAJOR",
        "Feature Engineering",
        "features = df[['col1', 'col2']].values",
        "features = df[['col1', 'col2']].fillna(fill_values).values",
    ),
]

print("=== ML Quality Rules ===")
for r in rules:
    print(f"\n [{r.rule_id}] {r.name} ({r.severity})")
    print(f" Category: {r.category}")
    print(f" Bad: {r.bad_example}")
    print(f" Good: {r.good_example}")
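
Rules such as ML002 and ML003 are simple enough to detect statically. The sketch below uses Python's standard `ast` module to flag `print()` calls and a few estimator constructors that lack `random_state`. It is a standalone illustration and not how SonarQube custom rules are implemented. The estimator list is an illustrative subset chosen for this example.

```python
import ast

# Estimators checked for a missing random_state; an illustrative subset only.
SEEDED_ESTIMATORS = {"RandomForestClassifier", "RandomForestRegressor", "KMeans"}


def _call_name(node):
    """Return the bare function name of a Call node, or None."""
    func = node.func
    if isinstance(func, ast.Name):
        return func.id
    if isinstance(func, ast.Attribute):
        return func.attr
    return None


def check_source(source):
    """Return sorted (line, rule_id, message) tuples for ML002 and ML003."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        name = _call_name(node)
        if name == "print" and isinstance(node.func, ast.Name):
            findings.append((node.lineno, "ML003", "print() call; use a logger"))
        elif name in SEEDED_ESTIMATORS:
            keywords = {kw.arg for kw in node.keywords}
            # kw.arg is None for **kwargs, which may carry random_state.
            if "random_state" not in keywords and None not in keywords:
                findings.append(
                    (node.lineno, "ML002", f"{name}() without random_state")
                )
    return sorted(findings)


SAMPLE = """\
model = RandomForestClassifier(n_estimators=100)
seeded = RandomForestClassifier(random_state=42)
print("done")
"""

for line, rule_id, message in check_source(SAMPLE):
    print(f"line {line}: [{rule_id}] {message}")
```

A checker like this only sees syntax, so a seed passed through a config object or `**kwargs` is given the benefit of the doubt rather than reported.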

CI/CD Integration

# === ML Pipeline CI/CD with SonarQube ===

# GitHub Actions for ML Pipeline
# name: ML Pipeline Quality
# on: [pull_request]
# jobs:
#   quality:
#     runs-on: ubuntu-latest
#     steps:
#       - uses: actions/checkout@v4
#       - name: Setup Python
#         uses: actions/setup-python@v5
#         with: { python-version: '3.10' }
#       - name: Install Dependencies
#         run: pip install -r requirements.txt
#       - name: Run Tests with Coverage
#         run: pytest tests/ --cov=src --cov-report=xml
#       - name: Data Validation Tests
#         run: great_expectations checkpoint run ml_data_check
#       - name: SonarQube Scan
#         uses: SonarSource/sonarqube-scan-action@v2
#         env:
#           SONAR_TOKEN: ${{ secrets.SONAR_TOKEN }}
#       - name: Quality Gate
#         uses: SonarSource/sonarqube-quality-gate-action@v1

from dataclasses import dataclass


@dataclass
class PipelineStage:
    """One CI/CD stage: the tools it runs, what it checks, and its exit gate."""
    stage: str
    tools: str
    checks: str
    gate: str


stages = [
    PipelineStage(
        "1. PR Quality Check",
        "SonarQube + pytest + Great Expectations",
        "Code Quality, Unit Tests, Data Schema Validation",
        "Quality Gate Pass + Tests Pass + Data Valid",
    ),
    PipelineStage(
        "2. Integration Test",
        "pytest + MLflow + Docker",
        "End-to-end Pipeline Test, Model Training Smoke Test",
        "All Tests Pass, Model Metrics > Baseline",
    ),
    PipelineStage(
        "3. Model Training",
        "MLflow + DVC + SonarQube",
        "Data Quality, Training Metrics, Model Validation",
        "Accuracy > Threshold, No Data Drift",
    ),
    PipelineStage(
        "4. Model Registry",
        "MLflow Model Registry",
        "Model Artifact, Metrics, Lineage, Version",
        "Model Approved by Reviewer",
    ),
    PipelineStage(
        "5. Staging Deploy",
        "Kubernetes + Seldon/BentoML",
        "API Tests, Load Test, Security Scan (DAST)",
        "Response Time < 100ms, No Vulnerabilities",
    ),
    PipelineStage(
        "6. Production Deploy",
        "Kubernetes + Canary/Blue-Green",
        "Canary Metrics, A/B Test, Model Monitor",
        "No Degradation in Business Metrics",
    ),
]

print("=== ML CI/CD Pipeline ===")
for s in stages:
    print(f"\n [{s.stage}]")
    print(f" Tools: {s.tools}")
    print(f" Checks: {s.checks}")
    print(f" Gate: {s.gate}")
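
The gates for stages 2 and 3 ("Model Metrics > Baseline" and "Accuracy > Threshold") can be expressed as a small comparison function. This is a minimal sketch that assumes every metric is higher-is-better and that metrics arrive as plain dictionaries. The function name, metric names, and values are hypothetical and not taken from MLflow or any other tool.

```python
def gate_failures(candidate, baseline, accuracy_threshold):
    """Return human-readable reasons why the candidate model fails the gate.

    Assumes every metric is higher-is-better; an empty list means the gate passes.
    """
    failures = []
    for metric, base_value in baseline.items():
        value = candidate.get(metric)
        if value is None:
            failures.append(f"{metric}: missing from candidate metrics")
        elif value <= base_value:
            failures.append(
                f"{metric}: {value:.3f} is not above baseline {base_value:.3f}"
            )
    accuracy = candidate.get("accuracy")
    if accuracy is not None and accuracy <= accuracy_threshold:
        failures.append(
            f"accuracy: {accuracy:.3f} is not above threshold {accuracy_threshold:.3f}"
        )
    return failures


# Illustrative numbers only.
baseline_metrics = {"accuracy": 0.90, "f1": 0.85}
candidate_metrics = {"accuracy": 0.92, "f1": 0.84}

problems = gate_failures(candidate_metrics, baseline_metrics, accuracy_threshold=0.88)
for problem in problems:
    print("GATE FAILED:", problem)
```

In a pipeline step, a non-empty `problems` list would typically end the job with a non-zero exit code so the model never reaches the registry stage.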

Testing Strategy

# === ML Testing Strategy ===

from dataclasses import dataclass


@dataclass
class TestCategory:
    """One category of ML tests with its scope, tooling, and coverage target."""
    category: str
    what_to_test: str
    tools: str
    coverage_target: str


test_categories = [
    TestCategory(
        "Unit Tests (Data)",
        "Data Loading, Transformation, Validation, Encoding",
        "pytest + pandas testing + Great Expectations",
        "> 80% Coverage",
    ),
    TestCategory(
        "Unit Tests (Feature)",
        "Feature Engineering, Scaling, Selection, Encoding",
        "pytest + numpy testing",
        "> 70% Coverage",
    ),
    TestCategory(
        "Unit Tests (Model)",
        "Model Init, Predict Shape, Save/Load, Config",
        "pytest + sklearn metrics",
        "> 50% Coverage (training is hard to test)",
    ),
    TestCategory(
        "Integration Tests",
        "End-to-end Pipeline, Data→Feature→Model→Predict",
        "pytest + Docker + Sample Data",
        "> 60% Coverage",
    ),
    TestCategory(
        "API Tests",
        "Endpoint, Input Validation, Error Handling, Auth",
        "pytest + httpx + FastAPI TestClient",
        "> 80% Coverage",
    ),
    TestCategory(
        "Data Quality Tests",
        "Schema, Null, Range, Distribution, Freshness",
        "Great Expectations + Monte Carlo",
        "100% Critical Checks",
    ),
]

print("=== ML Testing Strategy ===")
for t in test_categories:
    print(f" [{t.category}]")
    print(f" Test: {t.what_to_test}")
    print(f" Tools: {t.tools}")
    print(f" Target: {t.coverage_target}")
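
To make the "Data Quality Tests" row concrete, here is a minimal sketch of schema, null, and range checks in plain Python. It deliberately avoids the Great Expectations API and only illustrates the kind of assertion such a tool would run. `ColumnCheck`, `validate_rows`, and the column names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnCheck:
    """Expected type, nullability, and optional numeric range for one column."""
    dtype: type
    nullable: bool = False
    min_value: Optional[float] = None
    max_value: Optional[float] = None


def validate_rows(rows, checks):
    """Return a list of (row_index, column, problem) tuples."""
    problems = []
    for i, row in enumerate(rows):
        for column, check in checks.items():
            if column not in row:
                problems.append((i, column, "missing column"))
                continue
            value = row[column]
            if value is None:
                if not check.nullable:
                    problems.append((i, column, "null value"))
                continue
            if not isinstance(value, check.dtype):
                problems.append((i, column, "wrong type"))
                continue
            if check.min_value is not None and value < check.min_value:
                problems.append((i, column, "below minimum"))
            if check.max_value is not None and value > check.max_value:
                problems.append((i, column, "above maximum"))
    return problems


# Hypothetical schema and sample rows.
CHECKS = {
    "age": ColumnCheck(int, min_value=0, max_value=120),
    "income": ColumnCheck(float, nullable=True, min_value=0.0),
}
ROWS = [
    {"age": 34, "income": 52000.0},
    {"age": -1, "income": None},
    {"age": None, "income": 100.0},
    {"income": 1.0},
]

for problem in validate_rows(ROWS, CHECKS):
    print(problem)
```

Wrapped in a pytest test that asserts the result is empty, a check like this could run in the same CI step as the unit tests.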

Tips

  • Coverage: Lower code coverage for ML training code than for a web app is normal; set a target of 50-70%.
  • Custom Rules: Create custom rules for ML-specific issues such as data leakage and missing random seeds.
  • Notebook: Convert notebooks to .py with nbconvert before scanning.
  • Profile: Use separate Quality Profiles for ML code and application code.
  • Data Test: Use Great Expectations to check data quality alongside code quality.
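
The notebook tip refers to nbconvert, whose command line form is `jupyter nbconvert --to script notebook.ipynb`. To show what that conversion amounts to, the sketch below extracts code cells from the `.ipynb` JSON using only the standard library. It is a simplified stand-in for illustration, not nbconvert itself, and it ignores details that nbconvert handles, such as cell metadata and kernels other than Python.

```python
import json


def notebook_to_script(notebook_json):
    """Join the code cells of an .ipynb document into one Python source string."""
    notebook = json.loads(notebook_json)
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", [])
        # The notebook format stores source as a string or a list of lines.
        text = source if isinstance(source, str) else "".join(source)
        # Comment out IPython magics and shell escapes, which are not valid Python.
        lines = [
            f"# {line}" if line.lstrip().startswith(("%", "!")) else line
            for line in text.splitlines()
        ]
        chunks.append("\n".join(lines))
    return "\n\n".join(chunks) + "\n"


# A tiny in-memory notebook used as sample input.
SAMPLE_NOTEBOOK = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "source": ["# Exploration"]},
        {"cell_type": "code", "source": ["%matplotlib inline\n", "import math\n"]},
        {"cell_type": "code", "source": "print(math.sqrt(16))"},
    ],
})

print(notebook_to_script(SAMPLE_NOTEBOOK))
```

The resulting `.py` text can be written next to the notebook so the scanner analyzes it like any other source file.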

Putting It into Practice in an Organization

For medium to large organizations, a Three-Tier Architecture is recommended: a Core Layer that forms the backbone of the system, a Distribution Layer that distributes traffic, and an Access Layer that connects directly to users. Clearly separated layers make troubleshooting easier and let the system scale as needed.

Network security is just as important. Install a Next-Generation Firewall capable of Deep Packet Inspection, use network segmentation with a separate VLAN for each department, deploy IDS/IPS to detect attacks, and run a regular security audit at least twice a year.

Why Use SonarQube with an ML Pipeline

SonarQube finds bugs, vulnerabilities, and technical debt in ML code across data ingestion, feature engineering, training, and serving. That keeps pipeline quality and coverage visible and the code maintainable and secure.

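What to Check in ML Code

Data ingestion: data validation and exception handling. Feature engineering: data leakage and null handling. Training: hyperparameters, random seeds, and model saving. Serving API: input validation, auth, and rate limiting. Pipeline orchestration: retries, timeouts, and idempotency.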

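How to Set the Quality Gate

Require coverage of 70% for ML code and 80% for API code, zero bugs, zero vulnerabilities, duplication of no more than 5%, and Security Hotspots reviewed. Add custom rules and a dedicated Quality Profile for ML code, and convert notebooks with nbconvert before scanning.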
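
The thresholds above can be written down as data and evaluated locally, which helps when reasoning about why a gate fails. This is a minimal sketch, assuming SonarQube-style metric keys (`coverage`, `bugs`, `vulnerabilities`, `duplicated_lines_density`) and its `LT`/`GT` comparators. The `Condition` class and helper functions are hypothetical, and the Hotspots condition is left out because the text gives no number for it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Condition:
    """A gate condition that fails when the value is LT or GT the threshold."""
    metric: str
    operator: str  # "LT": fail if value < threshold; "GT": fail if value > threshold
    threshold: float


def build_gate(min_coverage):
    """Build the gate described in the text for a given minimum coverage."""
    return [
        Condition("coverage", "LT", min_coverage),
        Condition("bugs", "GT", 0),
        Condition("vulnerabilities", "GT", 0),
        Condition("duplicated_lines_density", "GT", 5.0),
    ]


def evaluate(metrics, conditions):
    """Return ("OK" or "ERROR", list of metric keys that breached the gate)."""
    failed = []
    for condition in conditions:
        value = metrics[condition.metric]
        if condition.operator == "LT":
            breached = value < condition.threshold
        else:
            breached = value > condition.threshold
        if breached:
            failed.append(condition.metric)
    return ("ERROR" if failed else "OK", failed)


ML_GATE = build_gate(min_coverage=70.0)
API_GATE = build_gate(min_coverage=80.0)

# Illustrative measurements only.
ml_metrics = {"coverage": 72.5, "bugs": 0, "vulnerabilities": 0,
              "duplicated_lines_density": 3.2}
api_metrics = {"coverage": 78.0, "bugs": 0, "vulnerabilities": 0,
               "duplicated_lines_density": 2.0}

print("ML code:", evaluate(ml_metrics, ML_GATE))
print("API code:", evaluate(api_metrics, API_GATE))
```

Keeping two gate definitions mirrors the advice to separate ML code from application code: the same measurements pass one gate and fail the other.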

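How to Build the CI/CD Pipeline

On each PR, run SonarQube, pytest, and Great Expectations. Follow with integration tests using MLflow, training with DVC, the Model Registry, a staging deploy with DAST, and a production canary deploy with monitoring. GitHub Actions can run the workflow.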
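
When a pipeline step needs the Quality Gate result directly, SonarQube's Web API exposes it at `api/qualitygates/project_status`. The sketch below fetches and summarizes that response with the standard library. It assumes token authentication via HTTP Basic (token as the username, empty password), and the sample payload is a trimmed, illustrative shape, so field names should be checked against your server version.

```python
import base64
import json
import urllib.parse
import urllib.request


def fetch_project_status(host_url, project_key, token):
    """Call the project_status endpoint and return the decoded JSON payload."""
    query = urllib.parse.urlencode({"projectKey": project_key})
    url = f"{host_url.rstrip('/')}/api/qualitygates/project_status?{query}"
    request = urllib.request.Request(url)
    # Token as Basic-auth username with an empty password (an assumption here).
    credentials = base64.b64encode(f"{token}:".encode()).decode()
    request.add_header("Authorization", f"Basic {credentials}")
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def summarize(payload):
    """Return (gate status, metric keys of failed conditions)."""
    status = payload["projectStatus"]
    failed = [
        condition["metricKey"]
        for condition in status.get("conditions", [])
        if condition.get("status") == "ERROR"
    ]
    return status["status"], failed


# Illustrative payload; no network call is made in this example.
SAMPLE_PAYLOAD = {
    "projectStatus": {
        "status": "ERROR",
        "conditions": [
            {"status": "ERROR", "metricKey": "new_coverage",
             "comparator": "LT", "errorThreshold": "80", "actualValue": "71.4"},
            {"status": "OK", "metricKey": "new_duplicated_lines_density",
             "comparator": "GT", "errorThreshold": "5", "actualValue": "1.2"},
        ],
    }
}

print(summarize(SAMPLE_PAYLOAD))
```

In a real job, `summarize(fetch_project_status(...))` would replace the sample payload, with the host URL and token read from CI secrets.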

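Summary

Using SonarQube in an ML pipeline brings code quality checks to ML code: custom rules for data leakage and random seeds, realistic coverage targets, and API security, combined with Great Expectations for data quality and MLflow with CI/CD on the way to production.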