SiamCafe · Blog
Soda Data Quality Infrastructure as Code

Published 28 May 2026 (2569 BE)

Soda Data Quality IaC

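Soda Data Quality as Infrastructure as Code (IaC) means writing checks in SodaCL, a YAML-based language, and keeping them in Git under version control. A CI/CD pipeline then runs the checks automatically, Terraform provisions the supporting infrastructure, and a GitOps workflow ties the pieces together so the checks can serve as data contracts.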

Approach | Pros | Cons | Best for
Manual checks | Quick to start | Not reproducible | Prototyping
Script-based | Flexible | No version control | Small teams
IaC (Git + CI/CD) | Reproducible, auditable | More setup | Production
GitOps | Full automation | Most complex | Enterprise

SodaCL Checks in Git

Data Quality as Code: project structure

    data-quality/
    ├── checks/
    │   ├── orders.yml
    │   ├── products.yml
    │   ├── customers.yml
    │   └── payments.yml
    ├── configuration.yml
    ├── .github/workflows/
    │   └── soda-scan.yml
    ├── terraform/
    │   ├── main.tf
    │   └── variables.tf
    └── README.md

checks/orders.yml

    checks for orders:
      - row_count > 0
      - row_count between 100 and 1000000
      - missing_count(customer_id) = 0
      - missing_count(total_amount) = 0
      - duplicate_count(order_id) = 0
      - freshness(created_at) < 1d
      - invalid_count(status) = 0:
          valid values: ["pending", "confirmed", "shipped", "delivered"]
      - avg(total_amount) between 100 and 10000
      - schema:
          fail:
            when required column missing:
              [order_id, customer_id, total_amount, status, created_at]
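To make the metrics in these checks concrete, here is a minimal pure-Python sketch of what `row_count`, `missing_count`, `duplicate_count`, and `invalid_count` measure. It is not Soda's implementation (Soda computes these with SQL against the data source), and the sample rows are hypothetical. `duplicate_count` is counted here as the number of distinct values that occur more than once; Soda's exact definition may differ, but for a `= 0` threshold the result is the same either way.

```python
from collections import Counter

# Hypothetical sample rows standing in for the orders table.
orders = [
    {"order_id": 1, "customer_id": "c1", "total_amount": 250.0, "status": "pending"},
    {"order_id": 2, "customer_id": "c2", "total_amount": 900.0, "status": "shipped"},
    {"order_id": 2, "customer_id": None, "total_amount": 120.0, "status": "lost"},
]

VALID_STATUSES = {"pending", "confirmed", "shipped", "delivered"}


def row_count(rows):
    """Total number of rows."""
    return len(rows)


def missing_count(rows, column):
    """Number of rows where the column is NULL (None in this sketch)."""
    return sum(1 for r in rows if r.get(column) is None)


def duplicate_count(rows, column):
    """Number of distinct non-missing values that appear more than once."""
    counts = Counter(r[column] for r in rows if r.get(column) is not None)
    return sum(1 for n in counts.values() if n > 1)


def invalid_count(rows, column, valid_values):
    """Number of non-missing values outside the allowed set."""
    return sum(
        1 for r in rows
        if r.get(column) is not None and r[column] not in valid_values
    )


metrics = {
    "row_count": row_count(orders),
    "missing_count(customer_id)": missing_count(orders, "customer_id"),
    "duplicate_count(order_id)": duplicate_count(orders, "order_id"),
    "invalid_count(status)": invalid_count(orders, "status", VALID_STATUSES),
}

for name, value in metrics.items():
    print(f"{name} = {value}")
```

With these sample rows, the three `= 0` checks would all fail, which is the situation the scan is meant to surface before bad data reaches downstream consumers.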

configuration.yml

    data_source production:
      type: postgres
      host:
      port: 5432
      username:
      password:
      database: analytics
      schema: public
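Credentials should not be committed to Git with the rest of the configuration. Soda Core can resolve `${VAR}` placeholders in `configuration.yml` from environment variables, so the file in Git holds only variable names. The sketch below imitates that substitution with Python's `string.Template` to show the idea; it is not Soda's code. `DB_HOST` and `DB_PASSWORD` are the variable names used in the CI workflow, while `DB_USER` and the example values are assumptions made for illustration.

```python
from string import Template

# Configuration as it might be stored in Git: placeholders only, no secrets.
# DB_USER is a hypothetical variable name.
CONFIG_TEMPLATE = """\
data_source production:
  type: postgres
  host: ${DB_HOST}
  port: 5432
  username: ${DB_USER}
  password: ${DB_PASSWORD}
  database: analytics
  schema: public
"""


def render_config(template: str, env: dict) -> str:
    """Fill ${VAR} placeholders from env.

    Template.substitute raises KeyError when a variable is missing,
    so a misconfigured environment fails loudly instead of silently
    producing an empty host or password.
    """
    return Template(template).substitute(env)


# Example values; in CI these would come from os.environ.
example_env = {
    "DB_HOST": "db.example.internal",
    "DB_USER": "soda_reader",
    "DB_PASSWORD": "not-a-real-password",
}

rendered = render_config(CONFIG_TEMPLATE, example_env)
print(rendered)
```

Failing on a missing variable is a deliberate choice here: a scan that connects with a blank credential tends to produce a confusing connection error much later.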

    from dataclasses import dataclass

    @dataclass
    class QualityCheck:
        table: str
        check_type: str
        check: str
        threshold: str
        status: str

    # Simulated scan results for illustration; a real scan reads them from the database
    checks = [
        QualityCheck("orders", "row_count", "row_count > 0", "> 0", "PASS"),
        QualityCheck("orders", "missing", "missing_count(email) = 0", "= 0", "PASS"),
        QualityCheck("orders", "duplicate", "duplicate_count(order_id) = 0", "= 0", "PASS"),
        QualityCheck("orders", "freshness", "freshness(created_at) < 1d", "< 1d", "PASS"),
        QualityCheck("products", "row_count", "row_count > 0", "> 0", "PASS"),
        QualityCheck("products", "missing", "missing_count(price) = 0", "= 0", "FAIL"),
        QualityCheck("customers", "schema", "required columns present", "all", "PASS"),
    ]

    print("=== Soda Scan Results ===")
    passed = sum(1 for c in checks if c.status == "PASS")
    for c in checks:
        print(f"  [{c.status}] {c.table}.{c.check_type}: {c.check}")
    print(f"\n  Total: {passed}/{len(checks)} passed")
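A pass/fail tally only helps a pipeline if it turns into an exit code that CI can act on. The sketch below groups results per table and derives a process exit code. The 0/1 convention is chosen for this example only and is an assumption; the Soda CLI has its own exit codes, so consult its documentation before relying on specific values.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class QualityCheck:
    table: str
    check: str
    status: str  # "PASS" or "FAIL"


def summarize(checks):
    """Return {table: (passed, total)} for a list of check results."""
    totals = defaultdict(lambda: [0, 0])
    for c in checks:
        totals[c.table][1] += 1
        if c.status == "PASS":
            totals[c.table][0] += 1
    return {table: tuple(counts) for table, counts in totals.items()}


def exit_code(checks):
    """0 when every check passed, 1 otherwise (convention of this sketch)."""
    return 0 if all(c.status == "PASS" for c in checks) else 1


# Simulated results, mirroring the kind of output shown above.
results = [
    QualityCheck("orders", "row_count > 0", "PASS"),
    QualityCheck("orders", "duplicate_count(order_id) = 0", "PASS"),
    QualityCheck("products", "row_count > 0", "PASS"),
    QualityCheck("products", "missing_count(price) = 0", "FAIL"),
]

summary = summarize(results)
for table, (ok, total) in summary.items():
    print(f"{table}: {ok}/{total} passed")

code = exit_code(results)
print(f"exit code: {code}")
```

In a CI job the last step would be `sys.exit(code)`, so that a single failed check marks the job as failed and triggers the alert step.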

CI/CD Pipeline

GitHub Actions CI/CD: .github/workflows/soda-scan.yml

    name: Data Quality Scan

    on:
      push: { branches: [main] }
      pull_request: { branches: [main] }
      schedule:
        - cron: '0 * * * *'

    jobs:
      # Pull requests: validate new or changed checks against staging
      validate:
        runs-on: ubuntu-latest
        if: github.event_name == 'pull_request'
        steps:
          - uses: actions/checkout@v4
          - uses: actions/setup-python@v5
            with: { python-version: '3.11' }
          - run: pip install soda-core-postgres
          - run: |
              soda test-connection -d staging \
                -c configuration.yml
          - run: |
              soda scan -d staging \
                -c configuration.yml \
                checks/*.yml --verbose
            env:
              DB_HOST: ${{ secrets.DB_HOST }}          # secret name assumed
              DB_PASSWORD: ${{ secrets.DB_PASSWORD }}  # secret name assumed

      # Hourly schedule and pushes to main: scan production
      production-scan:
        runs-on: ubuntu-latest
        if: github.event_name == 'schedule' || github.ref == 'refs/heads/main'
        steps:
          - uses: actions/checkout@v4
          - uses: actions/setup-python@v5
          - run: pip install soda-core-postgres
          - run: |
              soda scan -d production \
                -c configuration.yml \
                checks/*.yml
            env:
              DB_HOST: ${{ secrets.DB_HOST }}          # secret name assumed
              DB_PASSWORD: ${{ secrets.DB_PASSWORD }}  # secret name assumed
          # Notify Slack only when an earlier step failed
          - uses: slackapi/slack-github-action@v1
            if: failure()
            with:
              payload: '{"text":"Data Quality FAILED!"}'
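The two `if:` conditions decide which job runs for which event, and the interaction is easy to misread. The sketch below restates them as a plain Python function so the routing can be checked by hand. The function name is hypothetical; the conditions are copied from the workflow. It relies on the fact that for `pull_request` events GitHub sets `github.ref` to the pull request merge ref (`refs/pull/<number>/merge`), not to the target branch.

```python
def jobs_to_run(event_name: str, ref: str) -> list[str]:
    """Return the workflow jobs whose `if:` condition is true."""
    jobs = []
    # validate: if: github.event_name == 'pull_request'
    if event_name == "pull_request":
        jobs.append("validate")
    # production-scan: if: github.event_name == 'schedule' || github.ref == 'refs/heads/main'
    if event_name == "schedule" or ref == "refs/heads/main":
        jobs.append("production-scan")
    return jobs


# The three triggers declared under `on:` in the workflow.
scenarios = [
    ("pull_request", "refs/pull/12/merge"),
    ("push", "refs/heads/main"),
    ("schedule", "refs/heads/main"),
]

for event_name, ref in scenarios:
    print(f"{event_name:13} {ref:20} -> {jobs_to_run(event_name, ref)}")
```

So a pull request touches only staging, while both a merge to main and the hourly cron run the production scan.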

    # Summary of the pipeline stages and what triggers each one
    pipeline_stages = [
        {"stage": "PR Check", "trigger": "Pull Request", "env": "Staging", "action": "Validate new checks"},
        {"stage": "Merge", "trigger": "Push to main", "env": "Production", "action": "Deploy checks"},
        {"stage": "Hourly Scan", "trigger": "Cron 0 * * * *", "env": "Production", "action": "Run all checks"},
        {"stage": "Alert", "trigger": "Scan Failure", "env": "n/a", "action": "Slack + PagerDuty"},
    ]

    print("\nCI/CD Pipeline:")
    for s in pipeline_stages:
        print(f"  [{s['stage']}] Trigger: {s['trigger']}")
        print(f"    Env: {s['env']} | Action: {s['action']}")
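The Alert stage is more useful when the message says which checks failed. Below is a minimal sketch that builds a Slack-style JSON payload with a `text` field, the same shape as the static payload in the workflow. The function name, the `(table, check)` tuple format, and the item limit are assumptions for illustration.

```python
import json


def build_alert_payload(failed_checks, max_items=5):
    """Build a JSON payload listing failed checks.

    failed_checks: list of (table, check) tuples.
    Only the first max_items are listed so the message stays short.
    """
    lines = [f"Data Quality FAILED: {len(failed_checks)} check(s)"]
    for table, check in failed_checks[:max_items]:
        lines.append(f"- {table}: {check}")
    remaining = len(failed_checks) - max_items
    if remaining > 0:
        lines.append(f"... and {remaining} more")
    return json.dumps({"text": "\n".join(lines)})


# Hypothetical failure taken from the simulated results.
payload = build_alert_payload([("products", "missing_count(price) = 0")])
print(payload)
```

The resulting string could replace the hard-coded `payload` value in the Slack step, for example by writing it to a step output in the scan job.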

Terraform Infrastructure

Terraform for Soda infrastructure: terraform/main.tf

    provider "google" {
      project = var.project_id
      region  = var.region
    }

    # Service account for Soda
    resource "google_service_account" "soda" {
      account_id   = "soda-scanner"
      display_name = "Soda Data Quality Scanner"
    }

    # BigQuery read access
    resource "google_project_iam_member" "soda_bq" {
      project = var.project_id
      role    = "roles/bigquery.dataViewer"
      # Email of the service account defined above
      member  = "serviceAccount:${google_service_account.soda.email}"
    }

    # Cloud Scheduler for automated scans
    # (the google_cloud_run_service.soda resource is not shown here)
    resource "google_cloud_scheduler_job" "soda_scan" {
      name     = "soda-hourly-scan"
      schedule = "0 * * * *"

      http_target {
        uri         = google_cloud_run_service.soda.status[0].url
        http_method = "POST"
        body        = base64encode("{\"scan\": \"all\"}")
      }
    }

    # Secret Manager for credentials
    # Note: google provider 5.x replaces `automatic = true` with `auto {}`
    resource "google_secret_manager_secret" "db_password" {
      secret_id = "soda-db-password"
      replication { automatic = true }
    }
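The scheduler job posts a small JSON body, `{"scan": "all"}`, to the Cloud Run service. The service itself is not shown in the Terraform above, so the following is only a sketch of how its handler might interpret that body, assuming the service receives the decoded JSON. The function names, the `checks/<table>.yml` layout taken from the project structure, and any `scan` value other than `"all"` are assumptions.

```python
import base64
import json


def encode_body(payload: dict) -> str:
    """Mirror Terraform's base64encode(...) for the scheduler job body."""
    return base64.b64encode(json.dumps(payload).encode("utf-8")).decode("ascii")


def select_check_files(raw_body: bytes, available: list[str]) -> list[str]:
    """Decide which check files to scan from the request body.

    {"scan": "all"} selects every file; {"scan": "<table>"} selects
    checks/<table>.yml if it exists. An empty body means "all".
    """
    request = json.loads(raw_body or b"{}")
    target = request.get("scan", "all")
    if target == "all":
        return sorted(available)
    wanted = f"checks/{target}.yml"
    return [wanted] if wanted in available else []


available_files = [
    "checks/orders.yml",
    "checks/products.yml",
    "checks/customers.yml",
    "checks/payments.yml",
]

# Round trip: what Terraform encodes is what the handler decodes.
encoded = encode_body({"scan": "all"})
decoded = base64.b64decode(encoded)

print(select_check_files(decoded, available_files))
print(select_check_files(b'{"scan": "orders"}', available_files))
```

Keeping the selection logic in a pure function like this makes it testable without deploying the container.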

    # Infrastructure components managed by Terraform and their purpose
    infra_components = {
        "Service Account": "Authentication for the Soda scanner",
        "IAM Roles": "BigQuery dataViewer, Cloud SQL Client",
        "Cloud Scheduler": "Triggers a scan every hour",
        "Cloud Run": "Runs the Soda scanner container",
        "Secret Manager": "Stores database credentials",
        "Pub/Sub": "Alert topic for scan results",
        "Cloud Monitoring": "Dashboard + alerting",
    }

    print("Terraform Components:")
    for component, purpose in infra_components.items():
        print(f"  [{component}]: {purpose}")

GitOps Workflow

    gitops = [
        "1. A developer writes or edits a SodaCL check on a branch",
        "2. The data engineering team reviews the pull request",
        "3. CI runs the checks against the staging database",
        "4. Merge to main → Deploy to Production",
        "5. The scheduled scan runs every hour",
        "6. An alert fires when a check fails",
        "7. Roll back with git revert if needed",
    ]

    print("\n\nGitOps Workflow:")
    for step in gitops:
        print(f"  {step}")
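Before step 3 touches the staging database, the same pull request job can run a cheap static check on the changed files. The sketch below is one possible lint: it verifies that each check file contains a `checks for <table>:` section and that the table matches the file name, as `checks/orders.yml` does. This file-naming rule is a hypothetical convention inferred from the project structure, not something Soda requires.

```python
import re
from pathlib import PurePosixPath

# Matches a top-level SodaCL section header such as "checks for orders:".
HEADER = re.compile(r"^checks for (\w+):\s*$", re.MULTILINE)


def lint_check_file(path: str, text: str) -> list[str]:
    """Return a list of problems found in one check file (empty if clean)."""
    problems = []
    tables = HEADER.findall(text)
    expected = PurePosixPath(path).stem
    if not tables:
        problems.append(f"{path}: no 'checks for <table>:' section found")
    for table in tables:
        if table != expected:
            problems.append(
                f"{path}: section for '{table}' does not match file name '{expected}'"
            )
    return problems


good = "checks for orders:\n  - row_count > 0\n"
bad = "checks for order:\n  - row_count > 0\n"

print(lint_check_file("checks/orders.yml", good))
print(lint_check_file("checks/orders.yml", bad))
```

A typo in a table name would otherwise surface only as a failed scan against staging, so catching it statically shortens the review loop.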

Tips

  • Git: keep every check in Git so that every change is version-controlled.
  • PR Review: every new check must pass code review.
  • Staging: test checks against staging before production.
  • Terraform: manage all infrastructure with Terraform.
  • Alert: notify Slack immediately when a check fails.

What is Soda Data Quality with IaC?

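It is the practice of writing data quality checks as SodaCL YAML, keeping them in Git under version control, reviewing changes through pull requests, and deploying them automatically through CI/CD, with the supporting infrastructure managed in Terraform. The result is a setup that is reproducible and auditable.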