Soda Data Quality IaC
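Soda data quality checks can be managed as Infrastructure as Code: SodaCL checks and data contracts are written in YAML, versioned in Git, run as automated checks in a CI/CD pipeline, and backed by Terraform-managed infrastructure in a GitOps workflow.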
| Approach | Pros | Cons | Best for |
|---|---|---|---|
| Manual checks | Quick to start | Not reproducible | Prototyping |
| Script-based | Flexible | No version control | Small teams |
| IaC (Git + CI/CD) | Reproducible, auditable | More setup | Production |
| GitOps | Full automation | Most complex | Enterprise |
SodaCL Checks in Git
# === Data Quality as Code ===
# Project Structure
# data-quality/
# ├── checks/
# │   ├── orders.yml
# │   ├── products.yml
# │   ├── customers.yml
# │   └── payments.yml
# ├── configuration.yml
# ├── .github/workflows/
# │   └── soda-scan.yml
# ├── terraform/
# │   ├── main.tf
# │   └── variables.tf
# └── README.md

# checks/orders.yml
# checks for orders:
#   - row_count > 0
#   - row_count between 100 and 1000000
#   - missing_count(customer_id) = 0
#   - missing_count(total_amount) = 0
#   - duplicate_count(order_id) = 0
#   - freshness(created_at) < 1d
#   - invalid_count(status) = 0:
#       valid values: ["pending", "confirmed", "shipped", "delivered"]
#   - avg(total_amount) between 100 and 10000
#   - schema:
#       fail:
#         when required column missing:
#           [order_id, customer_id, total_amount, status, created_at]

# configuration.yml
# data_source production:
#   type: postgres
#   host:
#   port: 5432
#   username:
#   password:
#   database: analytics
#   schema: public
from dataclasses import dataclass


@dataclass
class QualityCheck:
    table: str
    check_type: str
    check: str
    threshold: str
    status: str


# Illustrative scan results (hard-coded sample data, not output from a real scan)
checks = [
    QualityCheck("orders", "row_count", "row_count > 0", "> 0", "PASS"),
    QualityCheck("orders", "missing", "missing_count(customer_id) = 0", "= 0", "PASS"),
    QualityCheck("orders", "duplicate", "duplicate_count(order_id) = 0", "= 0", "PASS"),
    QualityCheck("orders", "freshness", "freshness(created_at) < 1d", "< 1d", "PASS"),
    QualityCheck("products", "row_count", "row_count > 0", "> 0", "PASS"),
    QualityCheck("products", "missing", "missing_count(price) = 0", "= 0", "FAIL"),
    QualityCheck("customers", "schema", "required columns present", "all", "PASS"),
]

print("=== Soda Scan Results ===")
passed = sum(1 for c in checks if c.status == "PASS")
for c in checks:
    print(f" [{c.status}] {c.table}.{c.check_type}: {c.check}")
print(f"\n Total: {passed}/{len(checks)} passed")
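Because SodaCL files are plain YAML text, simple checks can also be generated from a data structure, for example when many tables share the same baseline checks. The following is a minimal sketch under that assumption: `render_sodacl` is a hypothetical helper, not part of Soda, and it only handles one-line checks such as those in `checks/orders.yml`.

```python
from typing import Dict, List


def render_sodacl(checks_by_table: Dict[str, List[str]]) -> str:
    """Render simple one-line SodaCL checks as YAML text.

    Hypothetical helper for illustration only; nested check options
    (for example `valid values`) are out of scope.
    """
    lines: List[str] = []
    for table, table_checks in checks_by_table.items():
        lines.append(f"checks for {table}:")
        for check in table_checks:
            lines.append(f"  - {check}")
    return "\n".join(lines) + "\n"


orders_yaml = render_sodacl(
    {
        "orders": [
            "row_count > 0",
            "duplicate_count(order_id) = 0",
            "freshness(created_at) < 1d",
        ]
    }
)
print(orders_yaml)
```

The generated text would still be committed to Git and reviewed like a hand-written checks file, so the review and audit trail stay intact.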
CI/CD Pipeline
# === GitHub Actions CI/CD ===
# .github/workflows/soda-scan.yml
# name: Data Quality Scan
# on:
#   push: { branches: [main] }
#   pull_request: { branches: [main] }
#   schedule:
#     - cron: '0 * * * *'
#
# jobs:
#   validate:
#     runs-on: ubuntu-latest
#     if: github.event_name == 'pull_request'
#     steps:
#       - uses: actions/checkout@v4
#       - uses: actions/setup-python@v5
#         with: { python-version: '3.11' }
#       - run: pip install soda-core-postgres
#       - run: |
#           soda test-connection -d staging \
#             -c configuration.yml
#       - run: |
#           soda scan -d staging \
#             -c configuration.yml \
#             checks/*.yml --verbose
#         env:
#           DB_HOST: ${{ secrets.DB_HOST }}
#           DB_PASSWORD: ${{ secrets.DB_PASSWORD }}
#
#   production-scan:
#     runs-on: ubuntu-latest
#     if: github.event_name == 'schedule' || github.ref == 'refs/heads/main'
#     steps:
#       - uses: actions/checkout@v4
#       - uses: actions/setup-python@v5
#       - run: pip install soda-core-postgres
#       - run: |
#           soda scan -d production \
#             -c configuration.yml \
#             checks/*.yml
#         env:
#           DB_HOST: ${{ secrets.DB_HOST }}
#           DB_PASSWORD: ${{ secrets.DB_PASSWORD }}
#       - uses: slackapi/slack-github-action@v1
#         if: failure()
#         with:
#           payload: '{"text":"Data Quality FAILED!"}'
pipeline_stages = [
    {"stage": "PR Check", "trigger": "Pull Request", "env": "Staging", "action": "Validate new checks"},
    {"stage": "Merge", "trigger": "Push to main", "env": "Production", "action": "Deploy checks"},
    {"stage": "Hourly Scan", "trigger": "Cron 0 * * * *", "env": "Production", "action": "Run all checks"},
    {"stage": "Alert", "trigger": "Scan Failure", "env": "-", "action": "Slack + PagerDuty"},
]

print("\nCI/CD Pipeline:")
for s in pipeline_stages:
    print(f" [{s['stage']}] Trigger: {s['trigger']}")
    print(f" Env: {s['env']} | Action: {s['action']}")
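The two `if:` conditions in the workflow decide which job runs for each trigger. As a rough illustration, the same routing can be written as a small Python function. `jobs_for_event` is a hypothetical name, and the pull request ref `refs/pull/42/merge` is an invented example of the ref format GitHub uses for pull request events.

```python
from typing import List


def jobs_for_event(event_name: str, ref: str) -> List[str]:
    """Return the workflow jobs whose `if:` condition would be true."""
    jobs: List[str] = []
    # validate: if github.event_name == 'pull_request'
    if event_name == "pull_request":
        jobs.append("validate")
    # production-scan: if github.event_name == 'schedule' || github.ref == 'refs/heads/main'
    if event_name == "schedule" or ref == "refs/heads/main":
        jobs.append("production-scan")
    return jobs


for event, ref in [
    ("pull_request", "refs/pull/42/merge"),  # illustrative PR number
    ("push", "refs/heads/main"),
    ("schedule", "refs/heads/main"),
]:
    print(f"{event:<13} {ref:<20} -> {jobs_for_event(event, ref)}")
```

Under these conditions a pull request only touches staging, while pushes to `main` and the hourly schedule both run the production scan.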
Terraform Infrastructure
# === Terraform for Soda Infrastructure ===
# terraform/main.tf
# provider "google" {
#   project = var.project_id
#   region  = var.region
# }
#
# # Service Account for Soda
# resource "google_service_account" "soda" {
#   account_id   = "soda-scanner"
#   display_name = "Soda Data Quality Scanner"
# }
#
# # BigQuery Read Access
# resource "google_project_iam_member" "soda_bq" {
#   project = var.project_id
#   role    = "roles/bigquery.dataViewer"
#   member  = "serviceAccount:${google_service_account.soda.email}"
# }
#
# # Cloud Scheduler for Automated Scans
# resource "google_cloud_scheduler_job" "soda_scan" {
#   name     = "soda-hourly-scan"
#   schedule = "0 * * * *"
#
#   http_target {
#     uri         = google_cloud_run_service.soda.status[0].url
#     http_method = "POST"
#     body        = base64encode("{\"scan\": \"all\"}")
#   }
# }
#
# # Secret Manager for Credentials
# resource "google_secret_manager_secret" "db_password" {
#   secret_id = "soda-db-password"
#   replication {
#     auto {} # google provider v5+; earlier versions used `automatic = true`
#   }
# }
infra_components = {
    "Service Account": "Authentication for the Soda scanner",
    "IAM Roles": "BigQuery dataViewer, Cloud SQL Client",
    "Cloud Scheduler": "Triggers a scan every hour",
    "Cloud Run": "Runs the Soda scanner container",
    "Secret Manager": "Stores database credentials",
    "Pub/Sub": "Alert topic for scan results",
    "Cloud Monitoring": "Dashboard + alerting",
}

print("Terraform Components:")
for component, purpose in infra_components.items():
    print(f" [{component}]: {purpose}")

# GitOps Workflow
gitops = [
    "1. Developer writes or edits a SodaCL check in a branch",
    "2. PR review: the data engineering team reviews the change",
    "3. CI runs the checks against the staging database",
    "4. Merge to main → deploy to production",
    "5. Scheduled scan runs every hour",
    "6. Alert when a check fails",
    "7. Roll back with git revert if needed",
]

print("\n\nGitOps Workflow:")
for step in gitops:
    print(f" {step}")
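The Cloud Scheduler job in the Terraform snippet posts `{"scan": "all"}` to the Cloud Run service. Terraform takes that body base64-encoded, and the service is expected to receive the decoded JSON. The article does not show the Cloud Run service itself, so the following is only a sketch of how a handler might interpret the body. `parse_scan_request` and `KNOWN_TABLES` are hypothetical names, and the table list is taken from the `checks/` directory in the project layout.

```python
import base64
import json
from typing import List

# Hypothetical table list, mirroring the files under checks/ in the project layout.
KNOWN_TABLES = ["orders", "products", "customers", "payments"]


def parse_scan_request(body: bytes) -> List[str]:
    """Map a request body such as b'{"scan": "all"}' to the tables to scan."""
    payload = json.loads(body)
    target = payload.get("scan", "all")
    if target == "all":
        return list(KNOWN_TABLES)
    if target in KNOWN_TABLES:
        return [target]
    raise ValueError(f"unknown scan target: {target!r}")


# Equivalent of Terraform's base64encode(...) for the scheduler body.
encoded = base64.b64encode(b'{"scan": "all"}').decode("ascii")
print(encoded)
# What the handler would work with once the body has been decoded again.
print(parse_scan_request(base64.b64decode(encoded)))
```

Rejecting unknown targets keeps a malformed scheduler payload from silently scanning nothing.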
Tips
- Git: Keep every check in Git so that every change is version-controlled.
- PR Review: Every new check must pass code review.
- Staging: Test checks against staging before production.
- Terraform: Manage all infrastructure with Terraform.
- Alert: Notify Slack immediately when a check fails.
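The workflow sends a fixed `{"text":"Data Quality FAILED!"}` payload to Slack. An alert is usually more useful when it names the checks that failed. A minimal sketch of building such a payload, assuming failed checks are available as simple dictionaries; `build_slack_payload` and the input shape are hypothetical and not something Soda emits in this form:

```python
import json
from typing import Dict, List


def build_slack_payload(failed_checks: List[Dict[str, str]]) -> str:
    """Build a JSON body for a Slack message that lists the failed checks."""
    header = f"Data Quality FAILED! ({len(failed_checks)} check(s))"
    details = [f"- {c['table']}: {c['check']}" for c in failed_checks]
    return json.dumps({"text": "\n".join([header, *details])})


payload = build_slack_payload(
    [{"table": "products", "check": "missing_count(price) = 0"}]
)
print(payload)
```

The resulting string could replace the static `payload` value in the Slack step, for example by writing it to a step output first.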
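What is Soda Data Quality with IaC?
It means writing SodaCL checks in YAML, keeping them under Git version control, deploying them automatically through CI/CD after PR review, and managing the supporting infrastructure with Terraform, so that the whole setup is reproducible and auditable.
How do you write SodaCL checks?
SodaCL checks are written in YAML using built-in metrics and checks such as row_count, missing_count, duplicate_count, freshness, schema, and valid_count, plus custom SQL where needed. The syntax is simple and easy to read.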
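To make the metrics concrete, here is roughly what a few of them evaluate, written as plain Python over in-memory rows. This only approximates the semantics for illustration and is not how Soda computes them (Soda runs SQL against the data source). The sample rows and the fixed scan time are invented.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# Fixed "scan time" so the example is repeatable.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
rows = [
    {"order_id": 1, "customer_id": "c1", "created_at": now - timedelta(hours=3)},
    {"order_id": 2, "customer_id": None, "created_at": now - timedelta(hours=30)},
    {"order_id": 2, "customer_id": "c3", "created_at": now - timedelta(hours=50)},
]

row_count = len(rows)
missing_count = sum(1 for r in rows if r["customer_id"] is None)
id_counts = Counter(r["order_id"] for r in rows)
duplicated_ids = [value for value, n in id_counts.items() if n > 1]
# Freshness: time between the scan and the most recent timestamp.
age = now - max(r["created_at"] for r in rows)

results = {
    "row_count > 0": row_count > 0,
    "missing_count(customer_id) = 0": missing_count == 0,
    "duplicate_count(order_id) = 0": not duplicated_ids,
    "freshness(created_at) < 1d": age < timedelta(days=1),
}
for check, ok in results.items():
    print(f"[{'PASS' if ok else 'FAIL'}] {check}")
```

With these sample rows the row count and freshness checks pass, while the missing and duplicate checks fail because of the empty `customer_id` and the repeated `order_id`.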
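Why keep data quality checks in Git?
Git provides version control with full history, code review, an audit trail, rollback, collaboration, CI/CD integration, and branching. A GitOps workflow makes the most of these.
How is Terraform used with Soda?
Terraform provisions the service account, IAM roles, Cloud Scheduler, Cloud Run, Secret Manager, Pub/Sub, and monitoring, so all of the infrastructure is defined as code.
Summary
Treating Soda data quality as Infrastructure as Code combines SodaCL checks in YAML, Git version control, PR review, a CI/CD pipeline that runs automated checks against staging and production, Terraform-managed infrastructure, and alerting and monitoring into one GitOps workflow.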
