BigQuery Observability
An observability stack for BigQuery scheduled queries in production: monitoring, alerting, logging, and dashboards for data pipelines and ETL.
| Component | Tool | Data Source | Use Case |
|---|---|---|---|
| Metrics | Cloud Monitoring | BigQuery Metrics API | Duration, Bytes, Slots |
| Logs | Cloud Logging | Audit Logs, Job Logs | Error, Status, Details |
| Job History | INFORMATION_SCHEMA | BigQuery Internal | Performance Analysis |
| Alerts | Cloud Monitoring | Metrics + Logs | Fail, Slow, Cost |
| Dashboard | Looker Studio / Grafana | All above | Overview, Trend |
Scheduled Query Setup
# === BigQuery Scheduled Query with Terraform ===
# resource "google_bigquery_data_transfer_config" "daily_etl" {
#   display_name           = "daily_orders_etl"
#   data_source_id         = "scheduled_query"
#   schedule               = "every day 02:00"
#   destination_dataset_id = "analytics"
#   location               = "asia-southeast1"
#
#   params = {
#     destination_table_name_template = "orders_daily_{run_date}"
#     write_disposition               = "WRITE_TRUNCATE"
#     query                           = <<-SQL
#       SELECT
#         DATE(created_at) AS order_date,
#         COUNT(*) AS order_count,
#         SUM(total_amount) AS total_revenue,
#         COUNT(DISTINCT customer_id) AS unique_customers
#       FROM `project.raw.orders`
#       WHERE DATE(created_at) = @run_date
#       GROUP BY 1
#     SQL
#   }
#
#   email_preferences {
#     enable_failure_email = true
#   }
# }
#
# bq CLI equivalent:
# bq mk --transfer_config \
#   --target_dataset=analytics \
#   --display_name='daily_orders_etl' \
#   --data_source=scheduled_query \
#   --schedule='every day 02:00' \
#   --params='{
#     "query": "SELECT ... FROM ... WHERE DATE(created_at) = @run_date",
#     "destination_table_name_template": "orders_daily_{run_date}",
#     "write_disposition": "WRITE_TRUNCATE"
#   }'
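BigQuery substitutes `{run_date}` in `destination_table_name_template` with the run date formatted as `YYYYMMDD`, so each run writes to its own table. A minimal sketch of that substitution, handy for predicting which table a given run targets (the helper name is ours; the template service also supports date offsets, which this sketch does not handle):

```python
from datetime import date

def render_destination_table(template: str, run_date: date) -> str:
    """Render a destination_table_name_template the way the transfer
    service does: {run_date} becomes the run date as YYYYMMDD."""
    return template.replace("{run_date}", run_date.strftime("%Y%m%d"))

print(render_destination_table("orders_daily_{run_date}", date(2024, 1, 15)))
# orders_daily_20240115
```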
from dataclasses import dataclass

@dataclass
class ScheduledQuery:
    name: str
    schedule: str
    purpose: str
    destination: str
    alert: str

queries = [
    ScheduledQuery("daily_orders_etl",
                   "every day 02:00",
                   "Aggregate daily orders: revenue, customers",
                   "analytics.orders_daily_{run_date}",
                   "Fail → Slack #data-alerts"),
    ScheduledQuery("hourly_metrics_agg",
                   "every 1 hours",
                   "Aggregate application metrics every hour",
                   "analytics.app_metrics_hourly",
                   "Fail + Duration > 10min → PagerDuty"),
    ScheduledQuery("data_quality_check",
                   "every day 03:00 (after the ETL)",
                   "Check row count, NULLs, freshness",
                   "monitoring.dq_results",
                   "Quality fail → Slack + Jira ticket"),
    ScheduledQuery("weekly_report",
                   "every monday 06:00",
                   "Generate the weekly summary report",
                   "reports.weekly_summary",
                   "Fail → email to data team"),
]

print("=== Scheduled Queries ===")
for q in queries:
    print(f"  [{q.name}] Schedule: {q.schedule}")
    print(f"    Purpose: {q.purpose}")
    print(f"    Destination: {q.destination}")
    print(f"    Alert: {q.alert}")
Monitoring & Alerting
# === Observability Setup ===
# INFORMATION_SCHEMA query for job monitoring:
# SELECT
#   job_id,
#   creation_time,
#   end_time,
#   TIMESTAMP_DIFF(end_time, start_time, SECOND) AS duration_sec,
#   total_bytes_processed,
#   total_slot_ms,
#   state,
#   error_result.reason AS error_reason,
#   error_result.message AS error_message
# FROM `region-asia-southeast1`.INFORMATION_SCHEMA.JOBS
# WHERE job_type = 'QUERY'
#   AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
# ORDER BY creation_time DESC
# LIMIT 100;
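The "Query Duration > SLA" alert rule below compares each job against twice the average duration. Given rows shaped like the INFORMATION_SCHEMA query's result (same column aliases), that check can be sketched as follows; the 2x factor mirrors the alert rule and the helper name is ours:

```python
from statistics import mean

def slow_jobs(rows, factor=2.0):
    """Flag jobs whose duration exceeds `factor` times the average
    duration of the sample, matching the Query Duration > SLA rule."""
    threshold = factor * mean(r["duration_sec"] for r in rows)
    return [r["job_id"] for r in rows if r["duration_sec"] > threshold]

sample = [
    {"job_id": "job_a", "duration_sec": 30},
    {"job_id": "job_b", "duration_sec": 35},
    {"job_id": "job_c", "duration_sec": 150},  # an outlier run
]
print(slow_jobs(sample))  # ['job_c']
```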
# Data quality check query:
# SELECT
#   'orders' AS table_name,
#   COUNT(*) AS row_count,
#   COUNTIF(customer_id IS NULL) AS null_customer_id,
#   COUNTIF(total_amount IS NULL) AS null_total_amount,
#   MAX(created_at) AS latest_record,
#   TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), MAX(created_at), HOUR) AS freshness_hours
# FROM `project.raw.orders`
# WHERE DATE(created_at) = CURRENT_DATE();
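The "Data Quality Failure" alert condition (row_count = 0, null_rate > 5%, or freshness > 4 hours) can be evaluated over one result row of the query above. A minimal sketch; the thresholds come straight from the alert rule and the helper name is ours:

```python
def dq_violations(row, max_null_rate=0.05, max_freshness_hours=4):
    """Return the list of data-quality rules a DQ result row violates."""
    problems = []
    if row["row_count"] == 0:
        problems.append("empty table")
    else:
        null_rate = row["null_customer_id"] / row["row_count"]
        if null_rate > max_null_rate:
            problems.append(f"null rate {null_rate:.1%}")
    if row["freshness_hours"] > max_freshness_hours:
        problems.append(f"stale by {row['freshness_hours']}h")
    return problems

row = {"row_count": 1000, "null_customer_id": 80, "freshness_hours": 6}
print(dq_violations(row))  # ['null rate 8.0%', 'stale by 6h']
```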
@dataclass
class AlertRule:
    alert: str
    condition: str
    severity: str
    channel: str
    runbook: str

alerts = [
    AlertRule("Scheduled Query Failed",
              "transfer_config state = FAILED",
              "Critical",
              "Slack #data-alerts + PagerDuty",
              "Check error message → fix query → manual rerun"),
    AlertRule("Query Duration > SLA",
              "duration_sec > 2x average",
              "Warning",
              "Slack #data-alerts",
              "Check query plan → optimize partitioning/clustering"),
    AlertRule("Bytes Processed Spike",
              "total_bytes_processed > 3x average",
              "Warning",
              "Slack #data-costs",
              "Check for full table scans → add WHERE clauses"),
    AlertRule("Data Quality Failure",
              "row_count = 0 OR null_rate > 5% OR freshness > 4hr",
              "Critical",
              "Slack + Jira ticket",
              "Check source data → notify the data producer"),
    AlertRule("Cost Budget Exceeded",
              "daily_bytes > budget_bytes",
              "Warning",
              "Email to data lead",
              "Review expensive queries → optimize or limit"),
]

print("=== Alert Rules ===")
for a in alerts:
    print(f"  [{a.alert}] Severity: {a.severity}")
    print(f"    Condition: {a.condition}")
    print(f"    Channel: {a.channel}")
    print(f"    Runbook: {a.runbook}")
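Routing these rules to Slack usually means posting JSON to an incoming webhook, which accepts a body with a `text` field. A sketch of building that payload from a fired rule; the message layout is our own choice, and the actual HTTP POST to the webhook URL is left out:

```python
import json

def slack_payload(rule_name: str, severity: str, detail: str, runbook: str) -> dict:
    """Build a Slack incoming-webhook payload for a fired alert rule.
    POST this as JSON to the webhook URL of the target channel."""
    return {
        "text": (
            f":rotating_light: [{severity}] {rule_name}\n"
            f"Detail: {detail}\n"
            f"Runbook: {runbook}"
        )
    }

payload = slack_payload(
    "Scheduled Query Failed", "Critical",
    "daily_orders_etl state=FAILED",
    "Check error message → fix query → manual rerun",
)
print(json.dumps(payload, ensure_ascii=False, indent=2))
```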
Dashboard
# === Observability Dashboard ===
@dataclass
class DashboardPanel:
    panel: str
    source: str
    visualization: str
    refresh: str

panels = [
    DashboardPanel("Scheduled Query Status",
                   "INFORMATION_SCHEMA.JOBS + Transfer Config API",
                   "Status table: green=success, red=failed, yellow=running",
                   "Every 5 min"),
    DashboardPanel("Query Duration Trend",
                   "INFORMATION_SCHEMA.JOBS duration_sec",
                   "Line chart: duration per query over time + SLA line",
                   "Every 15 min"),
    DashboardPanel("Bytes Processed Daily",
                   "INFORMATION_SCHEMA.JOBS total_bytes_processed",
                   "Bar chart: daily bytes + cost estimate + budget line",
                   "Hourly"),
    DashboardPanel("Slot Utilization",
                   "Cloud Monitoring bigquery.googleapis.com/slots",
                   "Area chart: slot usage over time + capacity line",
                   "Every 5 min"),
    DashboardPanel("Data Quality Scores",
                   "monitoring.dq_results table",
                   "Scorecard: per-table completeness, freshness, accuracy",
                   "After each DQ check"),
    DashboardPanel("Error Log",
                   "Cloud Logging BigQuery audit log",
                   "Log table: recent errors with details and job_id",
                   "Real-time"),
]

print("=== Dashboard Panels ===")
for p in panels:
    print(f"  [{p.panel}] Source: {p.source}")
    print(f"    Viz: {p.visualization}")
    print(f"    Refresh: {p.refresh}")
Tips
- Alert: set up an alert on every scheduled-query failure first; it is the highest-value signal.
- DQ: run a data quality check after every ETL job, every time.
- Cost: monitor bytes processed to catch cost spikes before they hit the bill.
- Terraform: manage scheduled queries as code in Git.
- INFORMATION_SCHEMA: better for performance analysis than the console.
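For the cost tip, on-demand BigQuery bills per bytes scanned, so total_bytes_processed converts directly to a dollar estimate. A sketch assuming the US on-demand rate of $6.25 per TiB; rates differ by region and change over time, so check current pricing before using this number:

```python
ON_DEMAND_USD_PER_TIB = 6.25  # assumed US on-demand rate; varies by region

def estimate_query_cost_usd(bytes_processed: int) -> float:
    """Estimate on-demand cost from bytes scanned (1 TiB = 2**40 bytes)."""
    return bytes_processed / 2**40 * ON_DEMAND_USD_PER_TIB

print(round(estimate_query_cost_usd(5 * 2**40), 2))  # 31.25
```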
What is a BigQuery Scheduled Query?
A SQL query that runs automatically on a cron-style schedule, writing results to a destination table with WRITE_TRUNCATE or WRITE_APPEND and parameterized with @run_date. Typical uses are ETL, aggregation, and reporting, with retry on failure.
What is an observability stack?
Metrics (duration, bytes, slots), logs (audit and error logs), traces across the pipeline, alerts (failures, slowness, cost), and dashboards, built on Cloud Monitoring, Cloud Logging, Grafana, or Datadog.
How do you set up monitoring?
Build a Cloud Monitoring dashboard and alert policies, enable Cloud Logging audit logs, query INFORMATION_SCHEMA.JOBS for duration, bytes, state, and errors, and run data quality checks on row count and freshness.
What are the best practices?
Alert on every failure, set a duration SLA and a bytes budget, run data quality checks, manage scheduled queries as Terraform code, analyze performance via INFORMATION_SCHEMA, and attach a runbook to every alert routed to Slack, PagerDuty, or Jira.
Summary
BigQuery scheduled queries need full observability in production: monitoring, alerting, logging, and dashboards backed by INFORMATION_SCHEMA and data quality checks give you cost control and a reliable pipeline.
