SiamCafe.net Blog
Cybersecurity

SQLite Litestream Post-mortem Analysis

sqlite litestream post mortem analysis
SQLite Litestream Post-mortem Analysis | SiamCafe Blog
2025-09-27· อ. บอม — SiamCafe.net· 11,693 คำ

SQLite Litestream Post-mortem

SQLite Litestream WAL Streaming Real-time Replication S3 Backup Recovery Post-mortem Analysis Root Cause Incident RPO RTO Durability Embedded Database

Backup MethodRPORTOComplexityCost
Litestream~1 second~10 secondsต่ำต่ำมาก
sqlite3 .backupHoursMinutesต่ำต่ำ
File Copy (cron)HoursMinutesต่ำต่ำ
PostgreSQL Streaming~1 second~30 secondsสูงสูง

Litestream Setup

# === Litestream Installation & Configuration ===

# Install
# wget https://github.com/benbjohnson/litestream/releases/latest/download/litestream-linux-amd64.deb
# sudo dpkg -i litestream-linux-amd64.deb

# Configuration — /etc/litestream.yml
# dbs:
#   - path: /data/app.db
#     replicas:
#       - type: s3
#         bucket: my-backup-bucket
#         path: app-db
#         region: ap-southeast-1
#         access-key-id: 
#         secret-access-key: 
#         retention: 72h
#         sync-interval: 1s
#
#       - type: sftp
#         host: backup-server:22
#         user: backup
#         key-path: /etc/litestream/backup-key
#         path: /backups/app.db

# Run as Service
# sudo systemctl enable litestream
# sudo systemctl start litestream

# Docker Compose
# services:
#   app:
#     build: .
#     volumes: [data:/data]
#   litestream:
#     image: litestream/litestream:latest
#     volumes:
#       - data:/data
#       - ./litestream.yml:/etc/litestream.yml
#     command: replicate
#     environment:
#       - AWS_ACCESS_KEY_ID=
#       - AWS_SECRET_ACCESS_KEY=

# Verify Replication
# litestream replicas list /data/app.db
# litestream wal list /data/app.db

from dataclasses import dataclass

@dataclass
class ReplicaStatus:
    db_path: str
    replica_type: str
    destination: str
    last_sync: str
    lag_ms: int
    status: str

replicas = [
    ReplicaStatus("/data/app.db", "S3", "s3://backup/app-db", "10:30:01", 150, "Syncing"),
    ReplicaStatus("/data/app.db", "SFTP", "backup-server:/backups", "10:30:00", 1200, "Syncing"),
    ReplicaStatus("/data/blog.db", "S3", "s3://backup/blog-db", "10:30:01", 80, "Syncing"),
]

print("=== Litestream Replica Status ===")
for r in replicas:
    print(f"  [{r.status}] {r.db_path} -> {r.replica_type}")
    print(f"    Destination: {r.destination} | Lag: {r.lag_ms}ms | Last: {r.last_sync}")

Recovery และ Testing

# === Disaster Recovery ===

# Restore Latest
# litestream restore -o /data/restored.db s3://backup/app-db

# Point-in-time Recovery
# litestream restore -o /data/restored.db \
#   -timestamp "2024-03-15T10:30:00Z" \
#   s3://backup/app-db

# Verify Integrity
# sqlite3 /data/restored.db "PRAGMA integrity_check;"
# sqlite3 /data/restored.db "SELECT count(*) FROM users;"

# Automated Recovery Test (cron weekly)
# #!/bin/bash
# RESTORE_PATH="/tmp/recovery-test-$(date +%Y%m%d).db"
# litestream restore -o "$RESTORE_PATH" s3://backup/app-db
#
# INTEGRITY=$(sqlite3 "$RESTORE_PATH" "PRAGMA integrity_check;")
# ROW_COUNT=$(sqlite3 "$RESTORE_PATH" "SELECT count(*) FROM users;")
#
# if [ "$INTEGRITY" = "ok" ] && [ "$ROW_COUNT" -gt 0 ]; then
#   echo "RECOVERY TEST PASSED: $ROW_COUNT rows, integrity OK"
#   curl -X POST "$SLACK_WEBHOOK" \
#     -d "{\"text\":\"Recovery test PASSED: $ROW_COUNT rows\"}"
# else
#   echo "RECOVERY TEST FAILED"
#   curl -X POST "$PAGERDUTY_WEBHOOK" \
#     -d "{\"text\":\"CRITICAL: Recovery test FAILED\"}"
# fi
# rm -f "$RESTORE_PATH"

@dataclass
class RecoveryTest:
    date: str
    db: str
    rows_restored: int
    integrity: str
    duration_sec: float
    status: str

tests = [
    RecoveryTest("2024-03-15", "app.db", 150000, "OK", 8.5, "PASSED"),
    RecoveryTest("2024-03-08", "app.db", 148500, "OK", 7.2, "PASSED"),
    RecoveryTest("2024-03-01", "app.db", 145000, "OK", 6.8, "PASSED"),
    RecoveryTest("2024-03-15", "blog.db", 5200, "OK", 1.2, "PASSED"),
]

print("\n=== Recovery Test History ===")
for t in tests:
    print(f"  [{t.status}] {t.date} — {t.db}")
    print(f"    Rows: {t.rows_restored:,} | Integrity: {t.integrity} | Duration: {t.duration_sec}s")

Post-mortem Template

# === Post-mortem: Database Corruption Incident ===

# ## Incident: SQLite Database Corruption
# **Date:** 2024-03-10
# **Duration:** 25 minutes
# **Severity:** P1 - Critical
# **Author:** Platform Team
#
# ### Timeline
# - 14:00 — Alert: App returning 500 errors
# - 14:03 — On-call acknowledges, starts investigation
# - 14:05 — Found: SQLite database file corrupted
# - 14:08 — Root cause: Disk full caused WAL corruption
# - 14:10 — Decision: Restore from Litestream backup
# - 14:12 — litestream restore started
# - 14:13 — Database restored (8.5 seconds)
# - 14:15 — App restarted with restored database
# - 14:20 — Verified: All data intact, 0 rows lost
# - 14:25 — All services healthy, incident resolved
#
# ### Root Cause
# Disk reached 100% capacity due to unrotated log files
# SQLite WAL could not be checkpointed, causing corruption
#
# ### Impact
# - 25 minutes downtime
# - ~500 failed requests
# - 0 data loss (Litestream RPO < 1 second)
#
# ### Action Items
# 1. [DONE] Add disk usage monitoring alert at 80%
# 2. [DONE] Configure log rotation (logrotate)
# 3. [TODO] Add automatic disk cleanup script
# 4. [TODO] Schedule weekly recovery tests
# 5. [DONE] Document recovery runbook

@dataclass
class ActionItem:
    id: int
    action: str
    owner: str
    status: str
    due_date: str

actions = [
    ActionItem(1, "Disk usage alert at 80%", "SRE Team", "Done", "2024-03-11"),
    ActionItem(2, "Configure logrotate", "SRE Team", "Done", "2024-03-11"),
    ActionItem(3, "Auto disk cleanup script", "Platform", "In Progress", "2024-03-20"),
    ActionItem(4, "Weekly recovery test cron", "SRE Team", "Done", "2024-03-12"),
    ActionItem(5, "Recovery runbook doc", "SRE Team", "Done", "2024-03-12"),
]

print("Post-mortem Action Items:")
for a in actions:
    print(f"  [{a.status}] #{a.id}: {a.action}")
    print(f"    Owner: {a.owner} | Due: {a.due_date}")

# Metrics After Incident
improvements = {
    "RPO": "< 1 second (unchanged)",
    "RTO": "25 min -> 2 min (with runbook)",
    "MTTR": "25 min -> 5 min (automated recovery)",
    "Recovery Tests": "0 -> Weekly automated",
    "Disk Alerts": "None -> 80% threshold",
    "Data Loss": "0 rows (Litestream saved us)",
}

print(f"\n\nImprovements After Incident:")
for metric, value in improvements.items():
    print(f"  {metric}: {value}")

เคล็ดลับ

Best Practices สำหรับนักพัฒนา

การเขียนโค้ดที่ดีไม่ใช่แค่ทำให้โปรแกรมทำงานได้ แต่ต้องเขียนให้อ่านง่าย ดูแลรักษาง่าย และ Scale ได้ หลัก SOLID Principles เป็นพื้นฐานสำคัญที่นักพัฒนาทุกู้คืนควรเข้าใจ ได้แก่ Single Responsibility ที่แต่ละ Class ทำหน้าที่เดียว Open-Closed ที่เปิดให้ขยายแต่ปิดการแก้ไข Liskov Substitution ที่ Subclass ต้องใช้แทน Parent ได้ Interface Segregation ที่แยก Interface ให้เล็ก และ Dependency Inversion ที่พึ่งพา Abstraction ไม่ใช่ Implementation

เรื่อง Testing ก็ขาดไม่ได้ ควรเขียน Unit Test ครอบคลุมอย่างน้อย 80% ของ Code Base ใช้ Integration Test ทดสอบการทำงานร่วมกันของ Module ต่างๆ และ E2E Test สำหรับ Critical User Flow เครื่องมือยอดนิยมเช่น Jest, Pytest, JUnit ช่วยให้การเขียน Test เป็นเรื่องง่าย

เรื่อง Version Control ด้วย Git ใช้ Branch Strategy ที่เหมาะกับทีม เช่น Git Flow สำหรับโปรเจคใหญ่ หรือ Trunk-Based Development สำหรับทีมที่ Deploy บ่อย ทำ Code Review ทุก Pull Request และใช้ CI/CD Pipeline ทำ Automated Testing และ Deployment

Litestream คืออะไร

Real-time SQLite Replication S3 GCS WAL Streaming ไม่หยุด App Recovery วินาที Binary เดียว Durability ง่าย

ทำไมใช้ Litestream กับ SQLite

SQLite ไม่มี Replication Disk เสียข้อมูลหาย Litestream WAL ไป S3 RPO ~0 RTO วินาที ราคาถูก ไม่ต้อง PostgreSQL

Post-mortem Analysis คืออะไร

วิเคราะห์หลัง Incident Blameless Timeline Root Cause Impact Action Items Follow-up DevOps SRE Culture ป้องกันเกิดซ้ำ

Recovery จาก Litestream ทำอย่างไร

litestream restore -o ดึงจาก S3 Point-in-time -timestamp WAL Segments integrity_check Verify ทดสอบเป็นประจำ

สรุป

SQLite Litestream WAL Real-time Replication S3 Backup Recovery RPO RTO Post-mortem Blameless Root Cause Action Items Disk Monitoring logrotate Durability Integrity

📖 บทความที่เกี่ยวข้อง

Fivetran Connector Post-mortem Analysisอ่านบทความ → Mintlify Docs Post-mortem Analysisอ่านบทความ → SQLite Litestream Docker Container Deployอ่านบทความ → SQLite Litestream DNS Managementอ่านบทความ → Go GORM Post-mortem Analysisอ่านบทความ →

📚 ดูบทความทั้งหมด →