SQLite Litestream Post-mortem Analysis — Backup

SQLite Litestream Post-mortem

SQLite Litestream WAL Streaming Real-time Replication S3 Backup Recovery Post-mortem Analysis Root Cause Incident RPO RTO Durability Embedded Database

Backup Method	RPO	RTO	Complexity	Cost
Litestream	~1 second	~10 seconds	ต่ำ	ต่ำมาก
sqlite3 .backup	Hours	Minutes	ต่ำ	ต่ำ
File Copy (cron)	Hours	Minutes	ต่ำ	ต่ำ
PostgreSQL Streaming	~1 second	~30 seconds	สูง	สูง

Litestream Setup

=== Litestream Installation & Configuration ===

อ่านเพิ่ม: Voice Cloning Low Code No Code | SiamCafe Blog · อ่านเพิ่ม: Ansible Automation สำหรับมือใหม่ จัดการ Server อัตโนมัติ · อ่านเพิ่ม: Proxmox VE Cluster ทำ High Availability สำหรับ Home Lab

Install

wget https://github.com/benbjohnson/litestream/releases/latest/download/litestream-linux-amd64.deb

sudo dpkg -i litestream-linux-amd64.deb

Configuration — /etc/litestream.yml

dbs:

path: /data/app.db

replicas:

type: s3

bucket: my-backup-bucket

path: app-db

region: ap-southeast-1

access-key-id:

secret-access-key:

retention: 72h

sync-interval: 1s

type: sftp

host: backup-server:22

user: backup

key-path: /etc/litestream/backup-key

path: /backups/app.db

Run as Service

sudo systemctl enable litestream

sudo systemctl start litestream

เนื้อหาเกี่ยวข้อง — อ่านต่อ: cấy ốc tai điện tử bao nhiêu tiền

Docker Compose

services:

app:

build: .

volumes: [data:/data]

litestream:

image: litestream/litestream:latest

volumes:

data:/data
./litestream.yml:/etc/litestream.yml

command: replicate

environment:

AWS_ACCESS_KEY_ID=
AWS_SECRET_ACCESS_KEY=

Verify Replication

แนะนำเพิ่มเติม — XM Signal

litestream replicas list /data/app.db

litestream wal list /data/app.db

from dataclasses import dataclass

@dataclass

class ReplicaStatus:

db_path: str

replica_type: str

destination: str

last_sync: str

lag_ms: int

status: str

replicas = [

ReplicaStatus("/data/app.db", "S3", "s3://backup/app-db", "10:30:01", 150, "Syncing"),

ReplicaStatus("/data/app.db", "SFTP", "backup-server:/backups", "10:30:00", 1200, "Syncing"),

เนื้อหาเกี่ยวข้อง — data structure w3schools

ReplicaStatus("/data/blog.db", "S3", "s3://backup/blog-db", "10:30:01", 80, "Syncing"),

]

print("=== Litestream Replica Status ===")

for r in replicas:

print(f" [{r.status}] {r.db_path} -> {r.replica_type}")

print(f" Destination: {r.destination} | Lag: {r.lag_ms}ms | Last: {r.last_sync}")

Recovery และ Testing

=== Disaster Recovery ===

Restore Latest

litestream restore -o /data/restored.db s3://backup/app-db

Point-in-time Recovery

litestream restore -o /data/restored.db \

-timestamp "2024-03-15T10:30:00Z" \

แนะนำเพิ่มเติม — คู่มือเทรดจาก SiamCafeBook

s3://backup/app-db

Verify Integrity

sqlite3 /data/restored.db "PRAGMA integrity_check;"

sqlite3 /data/restored.db "SELECT count(*) FROM users;"

Automated Recovery Test (cron weekly)

#!/bin/bash

RESTORE_PATH="/tmp/recovery-test-$(date +%Y%m%d).db"

litestream restore -o "$RESTORE_PATH" s3://backup/app-db

เนื้อหาเกี่ยวข้อง — ทำความเข้าใจ ArgoCD ApplicationSet Database Migration

INTEGRITY=$(sqlite3 "$RESTORE_PATH" "PRAGMA integrity_check;")

ROW_COUNT=$(sqlite3 "$RESTORE_PATH" "SELECT count(*) FROM users;")

if [ "$INTEGRITY" = "ok" ] && [ "$ROW_COUNT" -gt 0 ]; then

echo "RECOVERY TEST PASSED: $ROW_COUNT rows, integrity OK"

curl -X POST "$SLACK_WEBHOOK" \

-d "{\"text\":\"Recovery test PASSED: $ROW_COUNT rows\"}"

else

echo "RECOVERY TEST FAILED"

curl -X POST "$PAGERDUTY_WEBHOOK" \

-d "{\"text\":\"CRITICAL: Recovery test FAILED\"}"

rm -f "$RESTORE_PATH"

@dataclass

class RecoveryTest:

date: str

db: str

rows_restored: int

integrity: str

duration_sec: float

status: str

เนื้อหาเกี่ยวข้อง — บทความที่เกี่ยวข้อง: บอร์ด iot คืออะไร — ทุกสิ่งที่ต้องรู้ในปี 2026

tests = [

RecoveryTest("2024-03-15", "app.db", 150000, "OK", 8.5, "PASSED"),

RecoveryTest("2024-03-08", "app.db", 148500, "OK", 7.2, "PASSED"),

RecoveryTest("2024-03-01", "app.db", 145000, "OK", 6.8, "PASSED"),

RecoveryTest("2024-03-15", "blog.db", 5200, "OK", 1.2, "PASSED"),

]

print("\n=== Recovery Test History ===")

for t in tests:

print(f" [{t.status}] {t.date} — {t.db}")

print(f" Rows: {t.rows_restored:,} | Integrity: {t.integrity} | Duration: {t.duration_sec}s")

Post-mortem Template

# === Post-mortem: Database Corruption Incident ===



# ## Incident: SQLite Database Corruption

# **Date:** 2024-03-10

# **Duration:** 25 minutes

# **Severity:** P1 - Critical

# **Author:** Platform Team

#

# ### Timeline

# - 14:00 — Alert: App returning 500 errors

# - 14:03 — On-call acknowledges, starts investigation

# - 14:05 — Found: SQLite database file corrupted

# - 14:08 — Root cause: Disk full caused WAL corruption

# - 14:10 — Decision: Restore from Litestream backup

# - 14:12 — litestream restore started

# - 14:13 — Database restored (8.5 seconds)

# - 14:15 — App restarted with restored database

# - 14:20 — Verified: All data intact, 0 rows lost

# - 14:25 — All services healthy, incident resolved

#

# ### Root Cause

# Disk reached 100% capacity due to unrotated log files

# SQLite WAL could not be checkpointed, causing corruption

#

# ### Impact

# - 25 minutes downtime

# - ~500 failed requests

# - 0 data loss (Litestream RPO < 1 second)

#

# ### Action Items

# 1. [DONE] Add disk usage monitoring alert at 80%

# 2. [DONE] Configure log rotation (logrotate)

# 3. [TODO] Add automatic disk cleanup script

# 4. [TODO] Schedule weekly recovery tests

# 5. [DONE] Document recovery runbook



@dataclass

class ActionItem:

    id: int

    action: str

    owner: str

    status: str

    due_date: str



actions = [

    ActionItem(1, "Disk usage alert at 80%", "SRE Team", "Done", "2024-03-11"),

    ActionItem(2, "Configure logrotate", "SRE Team", "Done", "2024-03-11"),

    ActionItem(3, "Auto disk cleanup script", "Platform", "In Progress", "2024-03-20"),

    ActionItem(4, "Weekly recovery test cron", "SRE Team", "Done", "2024-03-12"),

    ActionItem(5, "Recovery runbook doc", "SRE Team", "Done", "2024-03-12"),

]



print("Post-mortem Action Items:")

for a in actions:

    print(f"  [{a.status}] #{a.id}: {a.action}")

    print(f"    Owner: {a.owner} | Due: {a.due_date}")



# Metrics After Incident

improvements = {

    "RPO": "< 1 second (unchanged)",

    "RTO": "25 min -> 2 min (with runbook)",

    "MTTR": "25 min -> 5 min (automated recovery)",

    "Recovery Tests": "0 -> Weekly automated",

    "Disk Alerts": "None -> 80% threshold",

    "Data Loss": "0 rows (Litestream saved us)",

}



print(f"\n\nImprovements After Incident:")

for metric, value in improvements.items():

    print(f"  {metric}: {value}")

เคล็ดลับ

sync-interval: ตั้ง 1s สำหรับ RPO ต่ำสุด
retention: เก็บ 72h+ สำหรับ Point-in-time Recovery
Test: ทดสอบ Recovery ทุกสัปดาห์อัตโนมัติ
Monitor: Monitor Replication Lag และ Disk Usage
Blameless: Post-mortem ต้อง Blameless โฟกัสที่ระบบ

Best Practices สำหรับนักพัฒนา

การเขียนโค้ดที่ดีไม่ใช่แค่ทำให้โปรแกรมทำงานได้ แต่ต้องเขียนให้อ่านง่าย ดูแลรักษาง่าย และ Scale ได้ หลัก SOLID Principles เป็นพื้นฐานสำคัญที่นักพัฒนาทุกคนควรเข้าใจ ได้แก่ Single Responsibility ที่แต่ละ Class ทำหน้าที่เดียว Open-Closed ที่เปิดให้ขยายแต่ปิดการแก้ไข Liskov Substitution ที่ Subclass ต้องใช้แทน Parent ได้ Interface Segregation ที่แยก Interface ให้เล็ก และ Dependency Inversion ที่พึ่งพา Abstraction ไม่ใช่ Implementation

เรื่อง Testing ก็ขาดไม่ได้ ควรเขียน Unit Test ครอบคลุมอย่างน้อย 80% ของ Code Base ใช้ Integration Test ทดสอบการทำงานร่วมกันของ Module ต่างๆ และ E2E Test สำหรับ Critical User Flow เครื่องมือยอดนิยมเช่น Jest, Pytest, JUnit ช่วยให้การเขียน Test เป็นเรื่องง่าย

เรื่อง Version Control ด้วย Git ใช้ Branch Strategy ที่เหมาะกับทีม เช่น Git Flow สำหรับโปรเจคใหญ่ หรือ Trunk-Based Development สำหรับทีมที่ Deploy บ่อย ทำ Code Review ทุก Pull Request และใช้ CI/CD Pipeline ทำ Automated Testing และ Deployment

Litestream คืออะไร

Real-time SQLite Replication S3 GCS WAL Streaming ไม่หยุด App Recovery วินาที Binary เดียว Durability ง่าย

ทำไมใช้ Litestream กับ SQLite

SQLite ไม่มี Replication Disk เสียข้อมูลหาย Litestream WAL ไป S3 RPO ~0 RTO วินาที ราคาถูก ไม่ต้อง PostgreSQL

Post-mortem Analysis คืออะไร

วิเคราะห์หลัง Incident Blameless Timeline Root Cause Impact Action Items Follow-up DevOps SRE Culture ป้องกันเกิดซ้ำ

Recovery จาก Litestream ทำอย่างไร

litestream restore -o ดึงจาก S3 Point-in-time -timestamp WAL Segments integrity_check Verify ทดสอบเป็นประจำ

สรุป

SQLite Litestream WAL Real-time Replication S3 Backup Recovery RPO RTO Post-mortem Blameless Root Cause Action Items Disk Monitoring logrotate Durability Integrity