Soda Data Quality Infrastructure as Code

2025-07-22 · อ. บอม, SiamCafe.net · 8,323 words

Soda Data Quality IaC

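Treating Soda data quality as Infrastructure as Code means writing checks in SodaCL (YAML), keeping them under Git version control, running them automatically in a CI/CD pipeline, and provisioning the supporting infrastructure with Terraform. The same files can also serve as data contracts in a GitOps workflow.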

Approach        | Pros                    | Cons               | Best for
Manual Checks   | Quick to start          | Not reproducible   | Prototyping
Script-based    | Flexible                | No version control | Small team
IaC (Git+CI/CD) | Reproducible, auditable | More setup         | Production
GitOps          | Full automation         | Most complex       | Enterprise

SodaCL Checks in Git

# === Data Quality as Code ===

# Project Structure
# data-quality/
# ├── checks/
# │   ├── orders.yml
# │   ├── products.yml
# │   ├── customers.yml
# │   └── payments.yml
# ├── configuration.yml
# ├── .github/workflows/
# │   └── soda-scan.yml
# ├── terraform/
# │   ├── main.tf
# │   └── variables.tf
# └── README.md

# checks/orders.yml
# checks for orders:
#   - row_count > 0
#   - row_count between 100 and 1000000
#   - missing_count(customer_id) = 0
#   - missing_count(total_amount) = 0
#   - duplicate_count(order_id) = 0
#   - freshness(created_at) < 1d
#   - invalid_count(status) = 0:
#       valid values: ["pending", "confirmed", "shipped", "delivered"]
#   - avg(total_amount) between 100 and 10000
#   - schema:
#       fail:
#         when required column missing:
#           [order_id, customer_id, total_amount, status, created_at]

# configuration.yml
# data_source production:
#   type: postgres
#   host: 
#   port: 5432
#   username: 
#   password: 
#   database: analytics
#   schema: public

from dataclasses import dataclass

@dataclass
class QualityCheck:
    table: str
    check_type: str
    check: str
    threshold: str
    status: str

# Illustrative, hard-coded results (not the output of a real scan)
checks = [
    QualityCheck("orders", "row_count", "row_count > 0", "> 0", "PASS"),
    QualityCheck("orders", "missing", "missing_count(customer_id) = 0", "= 0", "PASS"),
    QualityCheck("orders", "duplicate", "duplicate_count(order_id) = 0", "= 0", "PASS"),
    QualityCheck("orders", "freshness", "freshness(created_at) < 1d", "< 1d", "PASS"),
    QualityCheck("products", "row_count", "row_count > 0", "> 0", "PASS"),
    QualityCheck("products", "missing", "missing_count(price) = 0", "= 0", "FAIL"),
    QualityCheck("customers", "schema", "required columns present", "all", "PASS"),
]

print("=== Soda Scan Results ===")
passed = sum(1 for c in checks if c.status == "PASS")
for c in checks:
    print(f"  [{c.status}] {c.table}.{c.check_type}: {c.check}")
print(f"\n  Total: {passed}/{len(checks)} passed")
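To make the metrics in checks/orders.yml more concrete, the sketch below computes rough SQL equivalents of four of them on an in-memory SQLite table. The sample rows are hypothetical, and the queries are an approximation for illustration only, not the SQL that Soda itself generates. For thresholds of `= 0`, as used above, the pass/fail outcome is the same regardless of how duplicates are counted.

```python
import sqlite3

# Hypothetical sample rows standing in for the orders table:
# (order_id, customer_id, total_amount, status)
ROWS = [
    (1, 10, 250.0, "pending"),
    (2, 11, 900.0, "shipped"),
    (3, None, 120.0, "delivered"),  # missing customer_id
    (3, 12, 75.0, "refunded"),      # repeated order_id, status not in valid values
]

# Rough SQL equivalents of four checks from checks/orders.yml (assumed, not Soda's own SQL).
METRIC_SQL = {
    "row_count": "SELECT COUNT(*) FROM orders",
    "missing_count(customer_id)":
        "SELECT COUNT(*) FROM orders WHERE customer_id IS NULL",
    "duplicate_count(order_id)":
        "SELECT COUNT(*) FROM (SELECT order_id FROM orders "
        "GROUP BY order_id HAVING COUNT(*) > 1)",
    "invalid_count(status)":
        "SELECT COUNT(*) FROM orders WHERE status NOT IN "
        "('pending', 'confirmed', 'shipped', 'delivered')",
}


def measure(rows):
    """Load rows into an in-memory table and evaluate every metric query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders ("
        "order_id INTEGER, customer_id INTEGER, total_amount REAL, status TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", rows)
    result = {name: conn.execute(sql).fetchone()[0] for name, sql in METRIC_SQL.items()}
    conn.close()
    return result


metrics = measure(ROWS)
for name, value in metrics.items():
    print(f"{name} = {value}")
```

Each metric reduces to a single number, and a check is simply that number compared against a threshold, which is why the checks read so naturally as one-line YAML entries.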

CI/CD Pipeline

# === GitHub Actions CI/CD ===

# .github/workflows/soda-scan.yml
# name: Data Quality Scan
# on:
#   push: { branches: [main] }
#   pull_request: { branches: [main] }
#   schedule:
#     - cron: '0 * * * *'
#
# jobs:
#   validate:
#     runs-on: ubuntu-latest
#     if: github.event_name == 'pull_request'
#     steps:
#       - uses: actions/checkout@v4
#       - uses: actions/setup-python@v5
#         with: { python-version: '3.11' }
#       - run: pip install soda-core-postgres
#       - run: |
#           soda test-connection -d staging \
#             -c configuration.yml
#       - run: |
#           soda scan -d staging \
#             -c configuration.yml \
#             checks/*.yml --verbose
#         env:
#           DB_HOST:
#           DB_PASSWORD:
#
#   production-scan:
#     runs-on: ubuntu-latest
#     if: github.event_name == 'schedule' || github.ref == 'refs/heads/main'
#     steps:
#       - uses: actions/checkout@v4
#       - uses: actions/setup-python@v5
#       - run: pip install soda-core-postgres
#       - run: |
#           soda scan -d production \
#             -c configuration.yml \
#             checks/*.yml
#         env:
#           DB_HOST:
#           DB_PASSWORD:
#       - uses: slackapi/slack-github-action@v1
#         if: failure()
#         with:
#           payload: '{"text":"Data Quality FAILED!"}'

pipeline_stages = [
    {"stage": "PR Check", "trigger": "Pull Request", "env": "Staging", "action": "Validate new checks"},
    {"stage": "Merge", "trigger": "Push to main", "env": "Production", "action": "Deploy checks"},
    {"stage": "Hourly Scan", "trigger": "Cron 0 * * * *", "env": "Production", "action": "Run all checks"},
    {"stage": "Alert", "trigger": "Scan Failure", "env": "—", "action": "Slack + PagerDuty"},
]

print("\nCI/CD Pipeline:")
for s in pipeline_stages:
    print(f"  [{s['stage']}] Trigger: {s['trigger']}")
    print(f"    Env: {s['env']} | Action: {s['action']}")
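The routing logic in the workflow above can be sketched in Python. This is a minimal illustration, assuming the same two `if:` conditions and the same `soda scan` flags (`-d`, `-c`, `--verbose`) that appear in the YAML. The function names `target_data_source` and `scan_command` are hypothetical helpers, not part of Soda or GitHub Actions.

```python
from typing import List, Optional


def target_data_source(event_name: str, ref: str) -> Optional[str]:
    """Mirror the `if:` conditions of the validate and production-scan jobs."""
    if event_name == "pull_request":
        return "staging"
    if event_name == "schedule" or ref == "refs/heads/main":
        return "production"
    return None  # e.g. a push to a feature branch: no scan runs


def scan_command(data_source: str, check_files: List[str], verbose: bool = False) -> List[str]:
    """Build the argument list for the `soda scan` call used in the workflow."""
    cmd = ["soda", "scan", "-d", data_source, "-c", "configuration.yml", *check_files]
    if verbose:
        cmd.append("--verbose")
    return cmd


# Example: a pull request scans staging with verbose output.
source = target_data_source("pull_request", "refs/pull/7/merge")
print(" ".join(scan_command(source, ["checks/orders.yml"], verbose=True)))
```

Returning an argument list rather than a shell string means the command could be passed directly to `subprocess.run` without quoting concerns.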

Terraform Infrastructure

# === Terraform for Soda Infrastructure ===

# terraform/main.tf
# provider "google" {
#   project = var.project_id
#   region  = var.region
# }
#
# # Service Account for Soda
# resource "google_service_account" "soda" {
#   account_id   = "soda-scanner"
#   display_name = "Soda Data Quality Scanner"
# }
#
# # BigQuery Read Access
# resource "google_project_iam_member" "soda_bq" {
#   project = var.project_id
#   role    = "roles/bigquery.dataViewer"
#   member  = "serviceAccount:${google_service_account.soda.email}"
# }
#
# # Cloud Scheduler for Automated Scans
# resource "google_cloud_scheduler_job" "soda_scan" {
#   name     = "soda-hourly-scan"
#   schedule = "0 * * * *"
#
#   http_target {
#     uri         = google_cloud_run_service.soda.status[0].url
#     http_method = "POST"
#     body        = base64encode("{\"scan\": \"all\"}")
#   }
# }
#
# # Secret Manager for Credentials
# resource "google_secret_manager_secret" "db_password" {
#   secret_id = "soda-db-password"
#   replication { automatic = true }
# }

infra_components = {
    "Service Account": "Authentication for the Soda scanner",
    "IAM Roles": "BigQuery dataViewer, Cloud SQL Client",
    "Cloud Scheduler": "Triggers a scan every hour",
    "Cloud Run": "Runs the Soda scanner container",
    "Secret Manager": "Stores database credentials",
    "Pub/Sub": "Alert topic for scan results",
    "Cloud Monitoring": "Dashboard + alerting",
}

print("Terraform Components:")
for component, purpose in infra_components.items():
    print(f"  [{component}]: {purpose}")

# GitOps Workflow
gitops = [
    "1. A developer writes or edits a SodaCL check on a branch",
    "2. PR review: the data engineering team checks the change",
    "3. CI runs the checks against the staging database",
    "4. Merge to main → deploy to production",
    "5. A scheduled scan runs every hour",
    "6. Alert when a check fails",
    "7. Roll back with git revert if needed",
]

print("\n\nGitOps Workflow:")
for step in gitops:
    print(f"  {step}")
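The Cloud Scheduler job in the Terraform config posts the body `{"scan": "all"}` to the Cloud Run service. The article does not show the service itself, so the following is only a sketch of how a handler might map that body to SodaCL files. It assumes the container receives the decoded JSON body; the function name `select_check_files` and the per-table form (for example `{"scan": "orders"}`) are hypothetical.

```python
import json
from pathlib import Path
from typing import List


def select_check_files(body: bytes, checks_dir: Path) -> List[Path]:
    """Map a scheduler request body to the SodaCL files to scan.

    {"scan": "all"} selects every .yml file in checks_dir. Any other value is
    treated as a table name (a hypothetical extension of the request format).
    """
    request = json.loads(body)
    if not isinstance(request, dict):
        raise ValueError('expected a JSON object like {"scan": "all"}')
    target = request.get("scan")
    if not isinstance(target, str) or not target:
        raise ValueError('expected a JSON object like {"scan": "all"}')
    if target == "all":
        return sorted(checks_dir.glob("*.yml"))
    candidate = checks_dir / f"{target}.yml"
    # Reject names that would resolve outside checks_dir, such as "../x".
    if candidate.parent != checks_dir or not candidate.is_file():
        raise ValueError(f"no check file for {target!r}")
    return [candidate]
```

Calling `select_check_files(b'{"scan": "all"}', Path("checks"))` inside the project layout shown earlier would return the four files under checks/ in sorted order, ready to be passed to a scan.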

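Tips

What is Soda Data Quality with IaC?

It means writing data quality checks in SodaCL (YAML), keeping them under Git version control, reviewing changes through pull requests, and deploying them automatically via CI/CD, with the supporting infrastructure defined in Terraform. The resulting setup is reproducible and auditable.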

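How do you write SodaCL checks?

Checks are written in YAML using built-in metrics and check types such as row_count, missing_count, duplicate_count, freshness, schema, and valid_count, plus custom SQL where needed. The syntax is simple and easy to read.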
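Because a basic SodaCL file is just a `checks for <table>:` header followed by a list of one-line checks, generating one programmatically is straightforward. The sketch below uses a hypothetical helper, `render_checks`, and handles only single-line checks; nested configuration such as `valid values` or `schema` blocks is out of scope.

```python
from typing import Iterable


def render_checks(table: str, checks: Iterable[str]) -> str:
    """Render single-line SodaCL checks for one table as YAML text."""
    lines = [f"checks for {table}:"]
    lines += [f"  - {check}" for check in checks]
    return "\n".join(lines) + "\n"


# Example using two checks that appear elsewhere in this article.
print(render_checks("products", ["row_count > 0", "missing_count(price) = 0"]))
```

Writing the returned text to a file such as checks/products.yml produces something that can be committed and reviewed like any hand-written check file.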

Why keep data quality checks in Git?

Git provides version control with full change history, code review, an audit trail, rollback, collaboration, branching, and CI/CD integration. A GitOps workflow makes the most of these.

How is Terraform used with Soda?

Terraform defines all of the infrastructure around Soda as code: the service account, IAM roles, Cloud Scheduler, Cloud Run, Secret Manager, Pub/Sub, and monitoring.

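Summary

Soda data quality as Infrastructure as Code combines SodaCL checks written in YAML, Git version control with PR review, and a CI/CD pipeline that runs automated checks against staging and production. Terraform provisions the infrastructure, a GitOps workflow ties the steps together, and alerting and monitoring cover failed scans.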
