
SonarQube Analysis Machine Learning Pipeline

2026-01-31 · อ. บอม — SiamCafe.net · 10,397 words

[Diagram: SonarQube ML Pipeline — code quality, bug, vulnerability, and coverage checks across data validation, feature engineering, model training, serving, and CI/CD]

Pipeline Stage            Code Type         SonarQube Checks                 Coverage Target
Data Ingestion            Python ETL        Bug, Exception Handling, Null    > 80%
Feature Engineering       Python/Spark      Data Leak, Null, Duplication     > 70%
Model Training            Python ML         Reproducibility, Hardcode        > 50%
Model Serving API         FastAPI/Flask     Vulnerability, Auth, Input       > 80%
Pipeline Orchestration    Airflow/Dagster   DAG, Retry, Timeout              > 70%
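The per-stage coverage targets above can be wired into analysis via a sonar-project.properties file. A minimal sketch — the project key, directory layout, and exclusion paths are assumptions, not taken from the article:

```properties
# sonar-project.properties — minimal sketch for an ML repo
sonar.projectKey=ml-pipeline
sonar.sources=src
sonar.tests=tests
# coverage report produced by: pytest --cov=src --cov-report=xml
sonar.python.coverage.reportPaths=coverage.xml
# notebooks are analyzed only after nbconvert; exclude the raw .ipynb files
sonar.exclusions=notebooks/**,**/*.ipynb
# experimental training scripts get a lower bar, so exclude them from coverage
sonar.coverage.exclusions=src/experiments/**
```

The actual coverage thresholds themselves live in the Quality Gate on the SonarQube server, not in this file.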

ML Code Quality Rules

# === ML-specific Quality Rules ===

from dataclasses import dataclass

@dataclass
class MLQualityRule:
    rule_id: str
    name: str
    severity: str
    category: str
    bad_example: str
    good_example: str

rules = [
    MLQualityRule("ML001",
        "No hardcoded file paths",
        "MAJOR",
        "Data Ingestion",
        "df = pd.read_csv('/home/user/data/train.csv')",
        "df = pd.read_csv(config.DATA_PATH / 'train.csv')"),
    MLQualityRule("ML002",
        "A random seed must be set",
        "CRITICAL",
        "Model Training",
        "model = RandomForestClassifier()",
        "model = RandomForestClassifier(random_state=config.SEED)"),
    MLQualityRule("ML003",
        "No print() in production code",
        "MINOR",
        "All",
        "print(f'Training loss: {loss}')",
        "logger.info(f'Training loss: {loss}')"),
    MLQualityRule("ML004",
        "API input must be validated",
        "CRITICAL",
        "Model Serving",
        "prediction = model.predict(request.json['data'])",
        "validated = InputSchema(**request.json); prediction = model.predict(validated.data)"),
    MLQualityRule("ML005",
        "No data leaks (touching test data before the split)",
        "BLOCKER",
        "Feature Engineering",
        "scaler.fit(all_data); X_train = scaler.transform(X_train)",
        "scaler.fit(X_train); X_test = scaler.transform(X_test)"),
    MLQualityRule("ML006",
        "Missing values must be handled",
        "MAJOR",
        "Feature Engineering",
        "features = df[['col1', 'col2']].values",
        "features = df[['col1', 'col2']].fillna(strategy).values"),
]

print("=== ML Quality Rules ===")
for r in rules:
    print(f"\n  [{r.rule_id}] {r.name} ({r.severity})")
    print(f"    Category: {r.category}")
    print(f"    Bad:  {r.bad_example}")
    print(f"    Good: {r.good_example}")
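Rules like ML002 can be prototyped outside SonarQube with Python's `ast` module before committing to a full custom-rule plugin. A minimal sketch — the `ESTIMATORS` set and message format are illustrative assumptions:

```python
import ast

# estimator classes whose constructors should receive random_state (rule ML002)
ESTIMATORS = {"RandomForestClassifier", "GradientBoostingClassifier", "KMeans"}

def check_random_seed(source: str) -> list[str]:
    """Flag estimator constructor calls that omit random_state."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            # handle both bare names (KMeans) and attributes (cluster.KMeans)
            name = getattr(node.func, "id", getattr(node.func, "attr", None))
            if name in ESTIMATORS:
                kwargs = {kw.arg for kw in node.keywords}
                if "random_state" not in kwargs:
                    violations.append(f"line {node.lineno}: {name}() without random_state")
    return violations

print(check_random_seed("model = RandomForestClassifier()"))
print(check_random_seed("model = RandomForestClassifier(random_state=42)"))
```

The same walk-and-match pattern extends to ML001 (string literals that look like absolute paths) and ML003 (calls to `print`).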

CI/CD Integration

# === ML Pipeline CI/CD with SonarQube ===

# GitHub Actions for ML Pipeline
# name: ML Pipeline Quality
# on: [pull_request]
# jobs:
#   quality:
#     runs-on: ubuntu-latest
#     steps:
#       - uses: actions/checkout@v4
#       - name: Setup Python
#         uses: actions/setup-python@v5
#         with: { python-version: '3.10' }
#       - name: Install Dependencies
#         run: pip install -r requirements.txt
#       - name: Run Tests with Coverage
#         run: pytest tests/ --cov=src --cov-report=xml
#       - name: Data Validation Tests
#         run: great_expectations checkpoint run ml_data_check
#       - name: SonarQube Scan
#         uses: SonarSource/sonarqube-scan-action@v2
#         env:
#           SONAR_TOKEN: ${{ secrets.SONAR_TOKEN }}
#       - name: Quality Gate
#         uses: SonarSource/sonarqube-quality-gate-action@v1

from dataclasses import dataclass

@dataclass
class PipelineStage:
    stage: str
    tools: str
    checks: str
    gate: str

stages = [
    PipelineStage("1. PR Quality Check",
        "SonarQube + pytest + Great Expectations",
        "Code Quality, Unit Tests, Data Schema Validation",
        "Quality Gate Pass + Tests Pass + Data Valid"),
    PipelineStage("2. Integration Test",
        "pytest + MLflow + Docker",
        "End-to-end Pipeline Test, Model Training Smoke Test",
        "All Tests Pass, Model Metrics > Baseline"),
    PipelineStage("3. Model Training",
        "MLflow + DVC + SonarQube",
        "Data Quality, Training Metrics, Model Validation",
        "Accuracy > Threshold, No Data Drift"),
    PipelineStage("4. Model Registry",
        "MLflow Model Registry",
        "Model Artifact, Metrics, Lineage, Version",
        "Model Approved by Reviewer"),
    PipelineStage("5. Staging Deploy",
        "Kubernetes + Seldon/BentoML",
        "API Tests, Load Test, Security Scan (DAST)",
        "Response Time < 100ms, No Vulnerabilities"),
    PipelineStage("6. Production Deploy",
        "Kubernetes + Canary/Blue-Green",
        "Canary Metrics, A/B Test, Model Monitor",
        "No Degradation in Business Metrics"),
]

print("=== ML CI/CD Pipeline ===")
for s in stages:
    print(f"\n  [{s.stage}]")
    print(f"    Tools: {s.tools}")
    print(f"    Checks: {s.checks}")
    print(f"    Gate: {s.gate}")
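The stage-2 gate ("Model Metrics > Baseline") can be enforced as an ordinary pytest smoke test. A minimal sketch using a synthetic dataset — the data, baseline strategy, and model are assumptions standing in for the real pipeline:

```python
# Smoke test for the stage-2 gate: the trained model must beat a naive baseline.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def test_model_beats_baseline():
    # synthetic stand-in for the real training data
    X, y = make_classification(n_samples=500, n_features=10, random_state=42)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
    # baseline: always predict the most frequent class
    baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
    model = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
    assert model.score(X_te, y_te) > baseline.score(X_te, y_te)

test_model_beats_baseline()
print("smoke test passed")
```

Running this under `pytest tests/ --cov=src --cov-report=xml` feeds both the gate and the coverage report that SonarQube consumes.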

Testing Strategy

# === ML Testing Strategy ===

from dataclasses import dataclass

@dataclass
class TestCategory:
    category: str
    what_to_test: str
    tools: str
    coverage_target: str

test_categories = [
    TestCategory("Unit Tests (Data)",
        "Data Loading, Transformation, Validation, Encoding",
        "pytest + pandas testing + Great Expectations",
        "> 80% Coverage"),
    TestCategory("Unit Tests (Feature)",
        "Feature Engineering, Scaling, Selection, Encoding",
        "pytest + numpy testing",
        "> 70% Coverage"),
    TestCategory("Unit Tests (Model)",
        "Model Init, Predict Shape, Save/Load, Config",
        "pytest + sklearn metrics",
        "> 50% Coverage (training code is hard to test)"),
    TestCategory("Integration Tests",
        "End-to-end Pipeline, Data→Feature→Model→Predict",
        "pytest + Docker + Sample Data",
        "> 60% Coverage"),
    TestCategory("API Tests",
        "Endpoint, Input Validation, Error Handling, Auth",
        "pytest + httpx + FastAPI TestClient",
        "> 80% Coverage"),
    TestCategory("Data Quality Tests",
        "Schema, Null, Range, Distribution, Freshness",
        "Great Expectations + Monte Carlo",
        "100% Critical Checks"),
]

print("=== ML Testing Strategy ===")
for t in test_categories:
    print(f"  [{t.category}]")
    print(f"    Test: {t.what_to_test}")
    print(f"    Tools: {t.tools}")
    print(f"    Target: {t.coverage_target}")
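Data-leak prevention (rule ML005) is itself testable: assert that the scaler's statistics come from the training split only. A minimal sketch with hand-picked numbers — the values and test name are illustrative:

```python
# Unit test for rule ML005: the scaler must be fit on training data only.
import numpy as np
from sklearn.preprocessing import StandardScaler

def test_no_data_leak_in_scaling():
    X_train = np.array([[0.0], [2.0], [4.0]])   # train mean = 2.0
    X_test = np.array([[100.0]])                # outlier must not affect the fit
    scaler = StandardScaler().fit(X_train)      # fit on train only (ML005)
    assert np.isclose(scaler.mean_[0], 2.0)     # test data did not leak into the fit
    # transforming test data reuses the train statistics
    z = scaler.transform(X_test)
    assert z[0, 0] > 10  # far outside the train distribution, as expected

test_no_data_leak_in_scaling()
print("data-leak test passed")
```

Had the scaler been fit on the concatenated data (the bad example in ML005), the mean would shift toward the outlier and the first assertion would fail.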


Why use SonarQube with an ML pipeline?

ML code carries the same risks as any production code — bugs, vulnerabilities, and technical debt — across data ingestion, feature engineering, training, and serving. SonarQube keeps the whole pipeline maintainable, secure, and covered by tests.

What to check in ML code

Data validation and exception handling in ingestion; data leaks and null handling in feature engineering; hyperparameters, random seeds, and model saving in training; input validation, auth, and rate limiting in the serving API; retries, timeouts, and idempotency in pipeline orchestration.

How to configure the Quality Gate

Coverage ≥ 70% overall (80% for the API layer, 50% is acceptable for ML training code), 0 new Bugs, 0 new Vulnerabilities, duplication < 5%, and all Security Hotspots reviewed. Add custom rules through a dedicated Quality Profile, and convert notebooks with nbconvert so they can be analyzed.

How to build the CI/CD pipeline

On each PR, run SonarQube, pytest, and Great Expectations; run integration tests with MLflow; version training data with DVC; promote models through the Model Registry; run DAST scans in staging; and deploy to production with canaries and monitoring — all orchestrated with GitHub Actions.

Summary

SonarQube brings code quality discipline to the ML pipeline: custom rules catch data leaks and missing random seeds, the Quality Gate enforces coverage and API security, and Great Expectations plus MLflow complete the CI/CD path to production.
