Soda Data Quality Infrastructure as Code
Soda Data Quality IaC
Topics covered: SodaCL checks written in YAML, Git version control, CI/CD pipelines, GitOps, automated checks, Terraform, and data contracts.
| Approach | Pros | Cons | Best for |
|---|---|---|---|
| Manual Checks | Quick to start | Not reproducible | Prototyping |
| Script-based | Flexible | No version control | Small teams |
| IaC (Git+CI/CD) | Reproducible, auditable | More setup | Production |
| GitOps | Full automation | Most complex | Enterprise |
SodaCL Checks in Git
=== Data Quality as Code ===
Project Structure
data-quality/
├── checks/
│   ├── orders.yml
│   ├── products.yml
│   ├── customers.yml
│   └── payments.yml
├── configuration.yml
├── .github/workflows/
│   └── soda-scan.yml
├── terraform/
│   ├── main.tf
│   └── variables.tf
└── README.md
checks/orders.yml
checks for orders:
  - row_count > 0
  - row_count between 100 and 1000000
  - missing_count(customer_id) = 0
  - missing_count(total_amount) = 0
  - duplicate_count(order_id) = 0
  - freshness(created_at) < 1d
  - invalid_count(status) = 0:
      valid values: ["pending", "confirmed", "shipped", "delivered"]
  - avg(total_amount) between 100 and 10000
  - schema:
      fail:
        when required column missing:
          [order_id, customer_id, total_amount, status, created_at]
configuration.yml
data_source production:
  type: postgres
  host: ${DB_HOST}
  port: 5432
  username: ${DB_USERNAME}  # assumed variable name
  password: ${DB_PASSWORD}
  database: analytics
  schema: public
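The threshold part of each check in `orders.yml` (`> 0`, `= 0`, `between 100 and 1000000`) is a comparison against a measured value. The sketch below illustrates only that idea. `evaluate_threshold` is a hypothetical helper, not part of Soda, and it handles just the plain numeric forms; a duration such as `< 1d` is deliberately rejected.

```python
import operator
import re

# Comparison operators for simple numeric thresholds (hypothetical helper, not Soda code).
_OPS = {
    ">=": operator.ge,
    "<=": operator.le,
    "!=": operator.ne,
    ">": operator.gt,
    "<": operator.lt,
    "=": operator.eq,
}

_NUMBER = r"-?\d+(?:\.\d+)?"
_BETWEEN = re.compile(rf"^between\s+({_NUMBER})\s+and\s+({_NUMBER})$")
_COMPARE = re.compile(rf"^(>=|<=|!=|>|<|=)\s*({_NUMBER})$")


def evaluate_threshold(value: float, threshold: str) -> bool:
    """Return True when `value` satisfies a threshold such as '> 0'."""
    text = threshold.strip()
    between = _BETWEEN.match(text)
    if between:
        low, high = float(between.group(1)), float(between.group(2))
        return low <= value <= high  # bounds treated as inclusive in this sketch
    compare = _COMPARE.match(text)
    if compare:
        return _OPS[compare.group(1)](value, float(compare.group(2)))
    raise ValueError(f"unsupported threshold: {threshold!r}")


# A table with 250 rows satisfies both row_count checks.
print(evaluate_threshold(250, "> 0"))
print(evaluate_threshold(250, "between 100 and 1000000"))
```

Keeping the parser this narrow is intentional: anything it does not recognise raises an error instead of silently passing.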
from dataclasses import dataclass


@dataclass
class QualityCheck:
    """One check result: the table, the kind of check, and its outcome."""
    table: str
    check_type: str
    check: str
    threshold: str
    status: str


# Illustrative, hard-coded results used to show the shape of a scan summary.
checks = [
    QualityCheck("orders", "row_count", "row_count > 0", "> 0", "PASS"),
    QualityCheck("orders", "missing", "missing_count(email) = 0", "= 0", "PASS"),
    QualityCheck("orders", "duplicate", "duplicate_count(order_id) = 0", "= 0", "PASS"),
    QualityCheck("orders", "freshness", "freshness(created_at) < 1d", "< 1d", "PASS"),
    QualityCheck("products", "row_count", "row_count > 0", "> 0", "PASS"),
    QualityCheck("products", "missing", "missing_count(price) = 0", "= 0", "FAIL"),
    QualityCheck("customers", "schema", "required columns present", "all", "PASS"),
]

print("=== Soda Scan Results ===")
passed = sum(1 for c in checks if c.status == "PASS")
for c in checks:
    print(f"  [{c.status}] {c.table}.{c.check_type}: {c.check}")
print(f"\n  Total: {passed}/{len(checks)} passed")
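A CI job gates on a process exit status, not on printed text. As a minimal sketch of that step, the hypothetical `exit_code` helper below maps results shaped like the `QualityCheck` list above to an exit status. The 0/1 convention is this sketch's own and is not claimed to match the `soda` CLI.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class CheckResult:
    """Hypothetical, trimmed-down result record for this sketch."""
    table: str
    check: str
    status: str  # "PASS" or "FAIL"


def failed_checks(results: Iterable[CheckResult]) -> List[CheckResult]:
    """Return every result that did not pass."""
    return [r for r in results if r.status != "PASS"]


def exit_code(results: Iterable[CheckResult]) -> int:
    """0 when everything passed, 1 when at least one check failed."""
    return 1 if failed_checks(results) else 0


results = [
    CheckResult("orders", "row_count > 0", "PASS"),
    CheckResult("products", "missing_count(price) = 0", "FAIL"),
    CheckResult("customers", "required columns present", "PASS"),
]

code = exit_code(results)
for r in failed_checks(results):
    print(f"FAILED {r.table}: {r.check}")
print(f"exit code: {code}")
```

In a real script the value would be handed to `sys.exit(code)` so that the CI step fails when any check fails.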
CI/CD Pipeline
=== GitHub Actions CI/CD ===
.github/workflows/soda-scan.yml
name: Data Quality Scan
on:
  push: { branches: [main] }
  pull_request: { branches: [main] }
  schedule:
    - cron: '0 * * * *'
jobs:
  validate:
    runs-on: ubuntu-latest
    if: github.event_name == 'pull_request'
    # Job-level env so every step, including test-connection, sees the credentials.
    # Secret names are assumed to match the variable names.
    env:
      DB_HOST: ${{ secrets.DB_HOST }}
      DB_PASSWORD: ${{ secrets.DB_PASSWORD }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.11' }
      - run: pip install soda-core-postgres
      - run: |
          soda test-connection -d staging \
            -c configuration.yml
      - run: |
          soda scan -d staging \
            -c configuration.yml \
            checks/*.yml --verbose
  production-scan:
    runs-on: ubuntu-latest
    if: github.event_name == 'schedule' || github.ref == 'refs/heads/main'
    env:
      DB_HOST: ${{ secrets.DB_HOST }}
      DB_PASSWORD: ${{ secrets.DB_PASSWORD }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
      - run: pip install soda-core-postgres
      - run: |
          soda scan -d production \
            -c configuration.yml \
            checks/*.yml
      - uses: slackapi/slack-github-action@v1
        if: failure()
        with:
          payload: '{"text":"Data Quality FAILED!"}'
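The two `if:` expressions in the workflow decide which job runs for which event. The sketch below restates them in Python to make the routing easy to check. `jobs_to_run` is a hypothetical helper, and the pull request ref used in the example is illustrative.

```python
from typing import List


def jobs_to_run(event_name: str, ref: str) -> List[str]:
    """Mirror the two `if:` expressions from soda-scan.yml."""
    jobs = []
    # validate: if github.event_name == 'pull_request'
    if event_name == "pull_request":
        jobs.append("validate")
    # production-scan: if github.event_name == 'schedule' || github.ref == 'refs/heads/main'
    if event_name == "schedule" or ref == "refs/heads/main":
        jobs.append("production-scan")
    return jobs


events = [
    ("pull_request", "refs/pull/42/merge"),  # illustrative PR ref
    ("push", "refs/heads/main"),
    ("schedule", "refs/heads/main"),
]
for event_name, ref in events:
    print(f"{event_name:<13} {ref:<20} -> {jobs_to_run(event_name, ref)}")
```

Under these conditions a pull request only touches staging, while pushes to `main` and the hourly schedule run the production scan.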
pipeline_stages = [
    {"stage": "PR Check", "trigger": "Pull Request", "env": "Staging", "action": "Validate new checks"},
    {"stage": "Merge", "trigger": "Push to main", "env": "Production", "action": "Deploy checks"},
    {"stage": "Hourly Scan", "trigger": "Cron 0 * * * *", "env": "Production", "action": "Run all checks"},
    {"stage": "Alert", "trigger": "Scan Failure", "env": "n/a", "action": "Slack + PagerDuty"},
]

print("\nCI/CD Pipeline:")
for s in pipeline_stages:
    print(f"  [{s['stage']}] Trigger: {s['trigger']}")
    print(f"    Env: {s['env']} | Action: {s['action']}")
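The cron expression `0 * * * *` means "minute 0 of every hour". As a small worked example, the hypothetical `next_hourly_run` helper below computes the next matching time after a given moment. It covers only this one expression and is not a general cron parser.

```python
from datetime import datetime, timedelta


def next_hourly_run(now: datetime) -> datetime:
    """Next time matching cron '0 * * * *' strictly after `now`."""
    top_of_hour = now.replace(minute=0, second=0, microsecond=0)
    return top_of_hour + timedelta(hours=1)


now = datetime(2024, 1, 1, 10, 17, 30)
print(f"now:      {now}")
print(f"next run: {next_hourly_run(now)}")
```

Scheduled workflows on hosted CI can start later than the nominal time, so a freshness threshold such as `< 1d` should leave some margin.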
Terraform Infrastructure
=== Terraform for Soda Infrastructure ===
terraform/main.tf
provider "google" {
  project = var.project_id
  region  = var.region
}

# Service Account for Soda
resource "google_service_account" "soda" {
  account_id   = "soda-scanner"
  display_name = "Soda Data Quality Scanner"
}

# BigQuery Read Access
resource "google_project_iam_member" "soda_bq" {
  project = var.project_id
  role    = "roles/bigquery.dataViewer"
  member  = "serviceAccount:${google_service_account.soda.email}"
}

# Cloud Scheduler for Automated Scans
# google_cloud_run_service.soda must be defined elsewhere in the configuration.
resource "google_cloud_scheduler_job" "soda_scan" {
  name     = "soda-hourly-scan"
  schedule = "0 * * * *"

  http_target {
    uri         = google_cloud_run_service.soda.status[0].url
    http_method = "POST"
    body        = base64encode("{\"scan\": \"all\"}")
  }
}

# Secret Manager for Credentials
# Google provider 5.x replaces `automatic = true` with an `auto {}` block.
resource "google_secret_manager_secret" "db_password" {
  secret_id = "soda-db-password"
  replication { automatic = true }
}
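The scheduler job passes its request body through `base64encode`. The sketch below reproduces that encoding in Python and decodes it again, which is roughly what a receiving service would see once the body is delivered. It assumes the scanner service parses the body as JSON, which the source does not state.

```python
import base64
import json

# Same string literal that main.tf passes to base64encode().
raw_body = '{"scan": "all"}'

# Encode the way Terraform's base64encode() does: UTF-8 bytes to Base64 text.
encoded = base64.b64encode(raw_body.encode("utf-8")).decode("ascii")

# Decode and parse, as a hypothetical scanner service might.
decoded = json.loads(base64.b64decode(encoded))

print(f"encoded: {encoded}")
print(f"decoded: {decoded}")
```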
infra_components = {
    "Service Account": "Authentication for the Soda scanner",
    "IAM Roles": "BigQuery dataViewer, Cloud SQL Client",
    "Cloud Scheduler": "Triggers a scan every hour",
    "Cloud Run": "Runs the Soda scanner container",
    "Secret Manager": "Stores database credentials",
    "Pub/Sub": "Alert topic for scan results",
    "Cloud Monitoring": "Dashboard + alerting",
}

print("Terraform Components:")
for component, purpose in infra_components.items():
    print(f"  [{component}]: {purpose}")
GitOps Workflow
gitops = [
    "1. Developer writes or edits a SodaCL check in a branch",
    "2. PR review by the data engineering team",
    "3. CI runs the checks against the staging database",
    "4. Merge to main → deploy to production",
    "5. Scheduled scan runs every hour",
    "6. Alert when a check fails",
    "7. Roll back with git revert if needed",
]

print("\n\nGitOps Workflow:")
for step in gitops:
    print(f"  {step}")
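Step 6 of the workflow is more useful when the alert names the checks that failed. As a sketch, the hypothetical `build_alert_payload` helper below builds a JSON body with a single `text` field, the same shape as the payload in the GitHub Actions workflow. How the message is delivered is left out.

```python
import json
from typing import List


def build_alert_payload(failed: List[str]) -> str:
    """Build a JSON alert body listing the failed checks."""
    if not failed:
        raise ValueError("no failed checks to report")
    lines = [f"Data Quality FAILED! ({len(failed)} check(s))"]
    lines += [f"- {name}" for name in failed]
    return json.dumps({"text": "\n".join(lines)})


payload = build_alert_payload(["products: missing_count(price) = 0"])
print(payload)
```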
Tips
- Git: keep every check in Git so that every change is version-controlled
- PR Review: every new check must go through code review
- Staging: test checks against staging before production
- Terraform: manage all infrastructure with Terraform
- Alert: notify Slack immediately when a check fails
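Part of the review and staging tips can be automated with a cheap pre-merge lint. The sketch below is a text-level check only, not a YAML or SodaCL parser. `lint_check_file` and `lint_checks_dir` are hypothetical helpers that verify each file under a checks directory contains a `checks for <table>:` section.

```python
import tempfile
from pathlib import Path
from typing import List


def lint_check_file(path: Path) -> List[str]:
    """Return a list of problems found in one check file (empty if none)."""
    lines = [
        ln for ln in path.read_text(encoding="utf-8").splitlines()
        if ln.strip() and not ln.lstrip().startswith("#")
    ]
    if not lines:
        return [f"{path.name}: file is empty"]
    has_section = any(
        ln.startswith("checks for ") and ln.rstrip().endswith(":") for ln in lines
    )
    if not has_section:
        return [f"{path.name}: no 'checks for <table>:' section"]
    return []


def lint_checks_dir(directory: Path) -> List[str]:
    """Lint every *.yml file in `directory`, in name order."""
    problems: List[str] = []
    for path in sorted(directory.glob("*.yml")):
        problems.extend(lint_check_file(path))
    return problems


# Demonstration against a throwaway directory with one good and one bad file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "orders.yml").write_text(
        "checks for orders:\n  - row_count > 0\n", encoding="utf-8"
    )
    (root / "bad.yml").write_text("- row_count > 0\n", encoding="utf-8")
    problems = lint_checks_dir(root)

for problem in problems:
    print(problem)
```

A lint like this catches structural slips early, but it does not replace running the checks against staging.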
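What is Soda Data Quality with IaC?
It means writing data quality checks in SodaCL YAML, keeping them under Git version control, deploying them automatically through CI/CD after PR review, and managing the supporting infrastructure with Terraform, so that checks are reproducible and auditable.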