SQLite Litestream Observability Stack

This guide covers an observability stack for SQLite replicated with Litestream: WAL replication to S3, metrics in Prometheus, dashboards in Grafana, alert routing with Alertmanager, and logs in Loki.

Component	Role	Port	Data
Litestream	SQLite Replication + Metrics	9090 (metrics)	WAL → S3/GCS
Prometheus	Metrics Collection	9090	Time Series Metrics
Grafana	Dashboard + Alerting	3000	Visualization
Alertmanager	Alert Routing	9093	Slack, Email, PagerDuty
Loki	Log Aggregation	3100	Litestream Logs
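
Note that the table lists port 9090 for both the Litestream metrics endpoint and Prometheus, and 9090 is also the default Prometheus listen port. If both run on the same host, one of them has to move. A minimal sketch of a collision check follows, assuming all components share one host; the `PORTS` mapping simply restates the table and `find_port_collisions` is a hypothetical helper name.

```python
from collections import defaultdict

# Ports as listed in the component table; assumes everything runs on one host.
PORTS = {
    "Litestream (metrics)": 9090,
    "Prometheus": 9090,
    "Grafana": 3000,
    "Alertmanager": 9093,
    "Loki": 3100,
}

def find_port_collisions(ports):
    """Return {port: [components]} for every port claimed more than once."""
    by_port = defaultdict(list)
    for component, port in ports.items():
        by_port[port].append(component)
    return {port: names for port, names in by_port.items() if len(names) > 1}

collisions = find_port_collisions(PORTS)
for port, names in collisions.items():
    print(f"Port {port} is claimed by: {', '.join(names)}")
```

When the components run in separate containers or hosts, the shared number is harmless; on a single host, changing the Litestream `addr` value to a free port is the smaller change.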

Litestream Configuration

# === Litestream Setup ===

# litestream.yml
# dbs:
#   - path: /data/app.db
#     replicas:
#       - type: s3
#         bucket: my-backup-bucket
#         path: litestream/app.db
#         region: ap-southeast-1
#         retention: 720h  # 30 days
#         sync-interval: 1s
#         snapshot-interval: 24h
#       - type: gcs
#         bucket: my-gcs-backup
#         path: litestream/app.db
#         retention: 720h
#
# # Metrics endpoint (Prometheus also defaults to 9090; pick another port if both share a host)
# addr: ":9090"

# systemd service
# [Unit]
# Description=Litestream Replication
# After=network.target
#
# [Service]
# ExecStart=/usr/local/bin/litestream replicate -config /etc/litestream.yml
# Restart=always
# RestartSec=5
# Environment=AWS_ACCESS_KEY_ID=xxx
# Environment=AWS_SECRET_ACCESS_KEY=xxx
#
# [Install]
# WantedBy=multi-user.target

# Kubernetes Sidecar
# containers:
#   - name: app
#     image: my-app:latest
#     volumeMounts:
#       - name: data
#         mountPath: /data
#   - name: litestream
#     image: litestream/litestream:latest
#     args: ["replicate", "-config", "/etc/litestream.yml"]
#     volumeMounts:
#       - name: data
#         mountPath: /data
#       - name: config
#         mountPath: /etc/litestream.yml
#         subPath: litestream.yml

from dataclasses import dataclass

@dataclass
class LitestreamConfig:
    setting: str
    value: str
    purpose: str
    production_tip: str

configs = [
    LitestreamConfig("replicas.type",
        "s3 / gcs / abs / sftp",
        "Storage backend for the replica",
        "Use S3 on AWS, GCS on GCP"),
    LitestreamConfig("retention",
        "720h (30 days)",
        "How long WAL segments are kept",
        "Increase for point-in-time recovery further back"),
    LitestreamConfig("sync-interval",
        "1s (default 1s)",
        "How often the WAL is synced to the replica",
        "1s is fine; raise to 10s on a slow network"),
    LitestreamConfig("snapshot-interval",
        "24h",
        "Create a full snapshot every 24 hours",
        "Snapshots speed up restore by avoiding a full WAL replay"),
    LitestreamConfig("addr",
        ":9090",
        "Prometheus metrics endpoint",
        "Firewall it; do not expose it publicly"),
]

print("=== Litestream Config ===")
for c in configs:
    print(f"  [{c.setting}] = {c.value}")
    print(f"    Purpose: {c.purpose}")
    print(f"    Tip: {c.production_tip}")
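
The interval settings above are duration strings such as `720h` and `1s`. As a rough sanity check, the sketch below converts them to seconds and compares them. It only handles the `h`, `m`, and `s` units used in this article, and the ordering rule (sync-interval at most snapshot-interval at most retention) is a heuristic assumed for this sketch, not something Litestream enforces. `parse_duration` is a hypothetical helper.

```python
import re

# Multipliers for the duration units used in this article's config (h, m, s).
_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}
_TOKEN = re.compile(r"(\d+(?:\.\d+)?)([hms])")

def parse_duration(text):
    """Convert a duration such as '720h' or '1h30m' to seconds."""
    pos, total = 0, 0.0
    for match in _TOKEN.finditer(text):
        if match.start() != pos:  # reject gaps or unknown characters
            raise ValueError(f"invalid duration: {text!r}")
        total += float(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        pos = match.end()
    if pos != len(text) or pos == 0:  # reject trailing junk and empty input
        raise ValueError(f"invalid duration: {text!r}")
    return total

settings = {"sync-interval": "1s", "snapshot-interval": "24h", "retention": "720h"}
seconds = {name: parse_duration(value) for name, value in settings.items()}

# Heuristic ordering (an assumption of this sketch, not a Litestream rule).
ordering_ok = (
    seconds["sync-interval"]
    <= seconds["snapshot-interval"]
    <= seconds["retention"]
)

for name, value in seconds.items():
    print(f"{name}: {value:.0f} s")
print("ordering ok:", ordering_ok)
```

Read loosely, the sync interval bounds how much recent data may be missing from the replica after a crash, and the snapshot interval bounds how much WAL a restore has to replay.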

Grafana Dashboard

# === Grafana Dashboard Panels ===

@dataclass
class DashboardPanel:
    panel: str
    metric: str
    query: str
    alert: str

# Metric names are as listed in this article; confirm them against the
# /metrics output of your Litestream version before building panels.
panels = [
    DashboardPanel("Database Size",
        "litestream_db_size_bytes",
        "litestream_db_size_bytes / 1024 / 1024",
        "> 1GB Warning, > 5GB Critical"),
    DashboardPanel("WAL Size",
        "litestream_wal_size_bytes",
        "litestream_wal_size_bytes / 1024 / 1024",
        "> 100MB Warning (WAL is not being checkpointed)"),
    DashboardPanel("Replication Lag",
        "litestream_replica_lag_seconds",
        "litestream_replica_lag_seconds",
        "> 30s Warning, > 60s Critical"),
    DashboardPanel("Replica Throughput",
        "litestream_replica_bytes_total",
        "rate(litestream_replica_bytes_total[5m])",
        "= 0 for 5m Critical (replication stopped)"),
    DashboardPanel("WAL Operations",
        "litestream_wal_operations_total",
        "rate(litestream_wal_operations_total[5m])",
        "Unusual spike: check the application"),
    DashboardPanel("Snapshot Count",
        "litestream_snapshot_count",
        "litestream_snapshot_count",
        "No increase in 48h Warning (snapshots not running)"),
]

print("=== Dashboard Panels ===")
for p in panels:
    print(f"  [{p.panel}] Metric: {p.metric}")
    print(f"    Query: {p.query}")
    print(f"    Alert: {p.alert}")
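
The panel thresholds above can also be checked outside Grafana, for example in a smoke test. The sketch below parses a small sample in the Prometheus text exposition format and classifies each value. The sample text is made up for illustration, the metric names are the ones used in this article (not verified against a real Litestream build), and the `db` label is an assumption. The parser is deliberately simple: it keeps one value per metric name and does not handle label values that contain `}`.

```python
SAMPLE = """\
# HELP litestream_db_size_bytes Illustrative sample, not real output.
# TYPE litestream_db_size_bytes gauge
litestream_db_size_bytes{db="/data/app.db"} 1610612736
litestream_wal_size_bytes{db="/data/app.db"} 4194304
litestream_replica_lag_seconds{db="/data/app.db"} 42
"""

def parse_metrics(text):
    """Map metric name -> value for simple 'name{labels} value' lines."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        if "{" in line:
            name = line[: line.index("{")]
            rest = line[line.rindex("}") + 1 :]
        else:
            name, rest = line.split(None, 1)
        values[name] = float(rest.split()[0])  # ignore an optional timestamp
    return values

# (warning, critical) thresholds from the panel list; None means no such level.
THRESHOLDS = {
    "litestream_db_size_bytes": (1e9, 5e9),
    "litestream_wal_size_bytes": (100e6, None),
    "litestream_replica_lag_seconds": (30, 60),
}

def classify(value, warning, critical):
    """Return 'critical', 'warning', or 'ok' for a value and its thresholds."""
    if critical is not None and value > critical:
        return "critical"
    if warning is not None and value > warning:
        return "warning"
    return "ok"

metrics = parse_metrics(SAMPLE)
status = {name: classify(metrics[name], *levels) for name, levels in THRESHOLDS.items()}
for name, level in status.items():
    print(f"{name}: {metrics[name]:g} -> {level}")
```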

Alert & Recovery

# === Alert Rules & Recovery Procedure ===

@dataclass
class AlertRule:
    alert: str
    expr: str
    duration: str
    severity: str
    runbook: str

alerts = [
    AlertRule("LitestreamDown",
        "up{job='litestream'} == 0",
        "1m",
        "critical",
        "Check process status, restart the systemd service, check logs"),
    AlertRule("ReplicaLagHigh",
        "litestream_replica_lag_seconds > 60",
        "5m",
        "critical",
        "Check network, S3 connectivity, disk I/O, Litestream logs"),
    AlertRule("DatabaseSizeLarge",
        "litestream_db_size_bytes > 5e9",
        "30m",
        "warning",
        "Check data growth, VACUUM the database, archive old data"),
    AlertRule("WALSizeLarge",
        "litestream_wal_size_bytes > 100e6",
        "10m",
        "warning",
        "Check WAL checkpointing and long-running transactions"),
    # Can also fire when the database is simply idle (no writes to replicate).
    AlertRule("ReplicationStopped",
        "rate(litestream_replica_bytes_total[10m]) == 0",
        "10m",
        "critical",
        "Check the Litestream process, S3 permissions, network"),
]

@dataclass
class RecoveryStep:
    step: int
    action: str
    command: str
    duration: str

recovery = [
    RecoveryStep(1, "Stop Application and Litestream",
        "systemctl stop my-app litestream",
        "5 seconds"),
    # litestream restore will not overwrite an existing file, so move the old
    # database (and any app.db-wal / app.db-shm files) out of the way first.
    RecoveryStep(2, "Move Aside Damaged Database",
        "mv /data/app.db /data/app.db.bak",
        "5 seconds"),
    RecoveryStep(3, "Restore Database",
        "litestream restore -config /etc/litestream.yml /data/app.db",
        "1-10 minutes (depends on size)"),
    RecoveryStep(4, "Verify Database",
        "sqlite3 /data/app.db 'PRAGMA integrity_check;'",
        "10 seconds"),
    RecoveryStep(5, "Start Application",
        "systemctl start my-app",
        "5 seconds"),
    RecoveryStep(6, "Start Litestream",
        "systemctl start litestream",
        "5 seconds"),
    RecoveryStep(7, "Verify Replication",
        "curl -s localhost:9090/metrics | grep replica_lag",
        "30 seconds"),
]

print("=== Alert Rules ===")
for a in alerts:
    print(f"  [{a.alert}] Severity: {a.severity}")
    print(f"    Expr: {a.expr} for {a.duration}")
    print(f"    Runbook: {a.runbook}")

print("\n=== Recovery Steps ===")
for r in recovery:
    print(f"  Step {r.step}: {r.action} ({r.duration})")
    print(f"    Command: {r.command}")
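
The alert list above is plain Python data, so Prometheus cannot load it directly. One way to bridge the gap is to render it into the rule-file layout (`groups`, then `rules` with `alert`, `expr`, `for`, `labels`, `annotations`). The sketch below does that with string formatting and no YAML library. The group name `litestream` and the use of a `description` annotation for the runbook text are choices made for this sketch, and `render_rule_file` is a hypothetical helper. Only two of the rules are repeated to keep it short.

```python
import json
from dataclasses import dataclass

@dataclass
class AlertRule:
    alert: str
    expr: str
    duration: str
    severity: str
    runbook: str

def render_rule_file(group, rules):
    """Render rules in the Prometheus rule-file layout (groups -> rules)."""
    lines = ["groups:", f"  - name: {group}", "    rules:"]
    for r in rules:
        lines += [
            f"      - alert: {r.alert}",
            # json.dumps yields a double-quoted string, which is valid YAML
            # and keeps characters such as braces and colons from being misread.
            f"        expr: {json.dumps(r.expr)}",
            f"        for: {r.duration}",
            "        labels:",
            f"          severity: {r.severity}",
            "        annotations:",
            f"          description: {json.dumps(r.runbook)}",
        ]
    return "\n".join(lines) + "\n"

rules = [
    AlertRule("LitestreamDown", "up{job='litestream'} == 0", "1m", "critical",
              "Check process status, restart the systemd service, check logs"),
    AlertRule("ReplicaLagHigh", "litestream_replica_lag_seconds > 60", "5m",
              "critical", "Check network, S3 connectivity, disk I/O, Litestream logs"),
]

rule_file = render_rule_file("litestream", rules)
print(rule_file)
```

Running `promtool check rules` on the generated file before loading it is a cheap way to catch syntax mistakes.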

Tips

Sidecar: run Litestream as a sidecar container in Kubernetes
Multi-replica: replicate to several storage backends at once, for example S3 + GCS
Snapshot: set the snapshot interval to 24h to speed up restores
VACUUM: run VACUUM periodically to reduce the database size
Test: test a restore every month, not just the backup
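
The monthly restore test from the tips can be scripted. The sketch below restores into a scratch directory with the `-o` output flag of `litestream restore`, so the live database is never touched, and then runs the same `PRAGMA integrity_check` used in the recovery steps. The paths are the ones from this article, and `restore_drill` and `integrity_ok` are hypothetical helper names. The restore function is defined but not called here, because it needs Litestream and a real replica; only the verification step is exercised, on a throwaway database.

```python
import sqlite3
import subprocess
import tempfile
from pathlib import Path

def integrity_ok(db_path):
    """Return True when PRAGMA integrity_check reports a single 'ok' row."""
    conn = sqlite3.connect(str(db_path))
    try:
        rows = conn.execute("PRAGMA integrity_check;").fetchall()
    finally:
        conn.close()
    return rows == [("ok",)]

def restore_drill(source_db="/data/app.db", config="/etc/litestream.yml"):
    """Restore the replica into a scratch directory and verify the copy."""
    with tempfile.TemporaryDirectory() as scratch:
        target = Path(scratch) / "restore-test.db"
        # Requires the litestream binary on PATH and a reachable replica.
        subprocess.run(
            ["litestream", "restore", "-config", config,
             "-o", str(target), source_db],
            check=True,
        )
        return integrity_ok(target)

# Self-check of the verification step on a throwaway database (no Litestream needed).
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.db"
    conn = sqlite3.connect(str(demo))
    conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, note TEXT)")
    conn.execute("INSERT INTO t (note) VALUES ('hello')")
    conn.commit()
    conn.close()
    demo_ok = integrity_ok(demo)
print("demo integrity ok:", demo_ok)
```

Calling `restore_drill()` from a scheduled job and alerting on a `False` result or a non-zero exit turns the tip into a routine check.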

Best Practices for Developers

Writing good code is not just about making the program work; it must also be readable, maintainable, and scalable. The SOLID principles are a foundation every developer should understand: Single Responsibility (each class does one job), Open-Closed (open for extension, closed for modification), Liskov Substitution (a subclass must be usable in place of its parent), Interface Segregation (keep interfaces small), and Dependency Inversion (depend on abstractions, not implementations).

Testing is equally essential. Write unit tests covering at least 80% of the code base, use integration tests to check that modules work together, and add E2E tests for critical user flows. Popular tools such as Jest, Pytest, and JUnit make writing tests easier.

For version control with Git, choose a branch strategy that fits the team, such as Git Flow for large projects or Trunk-Based Development for teams that deploy often. Review every pull request, and use a CI/CD pipeline for automated testing and deployment.

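What is Litestream?

Litestream is an open-source tool for streaming replication of SQLite: it ships WAL changes to S3 or GCS as a near-real-time backup. It runs as a sidecar with low resource usage, gives an RPO close to zero, restores with a single command, and suits edge and IoT deployments.

What does the observability stack consist of?

Litestream exposes metrics, Prometheus collects them, Grafana provides dashboards, Alertmanager routes alerts, and Loki aggregates logs. The key signals are db_size, wal_size, replica_lag, throughput, and snapshot activity.

How do you configure replication?

Define the database path under dbs in litestream.yml, then add replicas of type s3 or gcs with retention, sync-interval, and snapshot-interval. Run Litestream as a systemd service or as a Kubernetes sidecar that shares the data volume with the application, and use litestream restore to recover.

How do you set up alerts?

Write Prometheus alert rules such as replica_lag > 60s, LitestreamDown, DatabaseSizeLarge, WALSizeLarge, and ReplicationStopped. Route them to Slack or PagerDuty through Alertmanager, and attach a runbook that covers recovery and restore.

Summary

An observability stack for SQLite with Litestream combines WAL replication to S3 with Prometheus, Grafana, Alertmanager, and Loki, giving you dashboards, alerts, and a recovery procedure suitable for production.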