
Web Components Disaster Recovery Plan

2025-08-26 · อ. บอม, SiamCafe.net · 9,651 words

DR Plan

This disaster recovery plan for a production web components stack covers backups, RTO and RPO targets, failover architecture, recovery testing, the 3-2-1 rule, replication, off-site storage, and chaos engineering.

| Component | RTO | RPO | Backup Method | Recovery Method |
|---|---|---|---|---|
| Database | 15 min | 5 min | WAL + Streaming Replica | Promote Replica |
| Application | 5 min | 0 (stateless) | Container Image Registry | Redeploy from Registry |
| Static Files | 10 min | 1 hr | S3 Cross-region Replication | Switch CDN Origin |
| Config/Secrets | 5 min | 0 | Git + Vault | Reapply from Git/Vault |
| DNS | 5 min | N/A | Multiple NS providers | Update DNS records |
| SSL Certs | 5 min | N/A | Cert Manager + Vault | Reissue or restore |
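
The targets in the table can be turned into an automated check after each drill or incident. The following is a minimal sketch rather than part of the original plan: `Target`, `TARGETS`, and `check_recovery` are hypothetical names, and the measurements passed in at the end are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Target:
    component: str
    rto_min: float            # maximum acceptable downtime, in minutes
    rpo_min: Optional[float]  # maximum acceptable data loss, in minutes; None = N/A


# Targets copied from the table above.
TARGETS = {
    "Database": Target("Database", 15, 5),
    "Application": Target("Application", 5, 0),
    "Static Files": Target("Static Files", 10, 60),
    "Config/Secrets": Target("Config/Secrets", 5, 0),
    "DNS": Target("DNS", 5, None),
    "SSL Certs": Target("SSL Certs", 5, None),
}


def check_recovery(component: str, downtime_min: float, data_loss_min: float = 0.0) -> list[str]:
    """Return the violated objectives (empty list if the recovery met its targets)."""
    target = TARGETS[component]
    violations = []
    if downtime_min > target.rto_min:
        violations.append(f"RTO exceeded: {downtime_min} min > {target.rto_min} min")
    if target.rpo_min is not None and data_loss_min > target.rpo_min:
        violations.append(f"RPO exceeded: {data_loss_min} min > {target.rpo_min} min")
    return violations


# Hypothetical drill measurements, for illustration only.
print(check_recovery("Database", downtime_min=12, data_loss_min=3))       # []
print(check_recovery("Static Files", downtime_min=14, data_loss_min=20))  # RTO violation only
```

Components whose RPO is N/A are checked against RTO only.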

Backup Strategy

# === Backup Configuration ===

# PostgreSQL WAL Archiving
# postgresql.conf:
# wal_level = replica
# archive_mode = on
# archive_command = 'aws s3 cp %p s3://db-backup/wal/%f'
# max_wal_senders = 5

# Streaming Replication Setup
# primary: pg_hba.conf
# host replication replicator 10.0.0.0/8 md5
#
# replica:
# pg_basebackup -h primary -D /var/lib/postgresql/data -U replicator -P -R

# Daily Full Backup
# #!/bin/bash
# DATE=$(date +%Y%m%d)
# pg_dump -Fc mydb > /backup/mydb_${DATE}.dump
# aws s3 cp /backup/mydb_${DATE}.dump s3://db-backup/daily/
# find /backup -name "*.dump" -mtime +30 -delete

# File Backup with rsync
# rsync -avz --delete /var/www/html/ backup-server:/backup/www/
# aws s3 sync /var/www/uploads s3://file-backup/uploads/ --delete

# Config Backup
# git add -A && git commit -m "config backup $(date)" && git push

from dataclasses import dataclass

@dataclass
class BackupJob:
    component: str
    method: str
    frequency: str
    retention: str
    destination: str
    encryption: str
    test_frequency: str

jobs = [
    BackupJob("PostgreSQL Full", "pg_dump", "Daily 02:00", "30 days", "S3 ap-southeast-1", "AES-256", "Monthly"),
    BackupJob("PostgreSQL WAL", "WAL Archiving", "Continuous", "7 days", "S3 ap-southeast-1", "AES-256", "Monthly"),
    BackupJob("PostgreSQL Replica", "Streaming", "Real-time", "N/A", "DR Site Tokyo", "TLS", "Quarterly"),
    BackupJob("Uploads/Media", "S3 Sync", "Hourly", "90 days", "S3 cross-region", "SSE-S3", "Monthly"),
    BackupJob("Config/IaC", "Git Push", "On change", "Unlimited", "GitHub Private", "SSH", "On deploy"),
    BackupJob("Secrets", "Vault Snapshot", "Daily", "30 days", "S3 encrypted", "AES-256", "Monthly"),
]

print("=== Backup Jobs ===")
for j in jobs:
    print(f"  [{j.component}] Method: {j.method}")
    print(f"    Freq: {j.frequency} | Retention: {j.retention}")
    print(f"    Dest: {j.destination} | Encrypt: {j.encryption}")
    print(f"    Test: {j.test_frequency}")
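
A backup schedule only helps if someone notices when a job silently stops running. As a rough sketch, the freshness check below flags jobs whose last successful run is older than their schedule allows. `MAX_AGE`, `stale_jobs`, the slack values, and the timestamps are all assumptions for illustration, not part of the original setup.

```python
from datetime import datetime, timedelta

# Maximum allowed age per job, derived from the frequencies listed above
# plus some slack (the slack values are assumptions).
MAX_AGE = {
    "PostgreSQL Full": timedelta(hours=26),  # daily + 2 h slack
    "Uploads/Media": timedelta(hours=2),     # hourly + 1 h slack
    "Secrets": timedelta(hours=26),          # daily + 2 h slack
}


def stale_jobs(last_success: dict[str, datetime], now: datetime) -> list[str]:
    """Return the names of jobs whose last successful run is too old or missing."""
    stale = []
    for name, max_age in MAX_AGE.items():
        finished = last_success.get(name)
        if finished is None or now - finished > max_age:
            stale.append(name)
    return stale


# Hypothetical timestamps, for illustration only.
now = datetime(2025, 1, 10, 9, 0)
last_success = {
    "PostgreSQL Full": datetime(2025, 1, 10, 2, 5),  # about 7 h ago -> fresh
    "Uploads/Media": datetime(2025, 1, 10, 5, 0),    # 4 h ago -> stale
    # "Secrets" has never reported success -> stale
}
print(stale_jobs(last_success, now))  # ['Uploads/Media', 'Secrets']
```

In practice the timestamps would come from wherever the backup jobs record completion, such as object metadata in the bucket or a monitoring system.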

Failover Architecture

# === Multi-region Failover ===

# DNS Failover (Route53 / Cloudflare)
# Primary: ap-southeast-1 (Singapore) — Active
# DR: ap-northeast-1 (Tokyo) — Standby
# Health Check: HTTP /health every 30s
# Failover: Auto switch DNS when primary unhealthy

# Kubernetes Multi-cluster
# kubectl config use-context primary-cluster
# kubectl get pods  # Running workloads
#
# kubectl config use-context dr-cluster
# kubectl get pods  # Standby workloads (scaled to 0 or warm)
#
# Failover script:
# kubectl --context dr-cluster scale deploy/web --replicas=3
# kubectl --context dr-cluster scale deploy/api --replicas=3
# # Update DNS to DR cluster IP

# Database Failover
# On DR site:
# pg_ctl promote -D /var/lib/postgresql/data
# # Replica becomes new primary

from dataclasses import dataclass

@dataclass
class FailoverStep:
    step: int
    action: str
    responsible: str
    time_estimate: str
    automated: bool
    verification: str

steps = [
    FailoverStep(1, "Detect failure — monitoring alert", "Monitoring System", "1-2 min", True, "Alert received in Opsgenie"),
    FailoverStep(2, "Assess scope — determine affected components", "On-call Engineer", "2-5 min", False, "Impact assessment complete"),
    FailoverStep(3, "Activate DR — switch DNS to DR site", "On-call Engineer", "1-2 min", True, "DNS propagated"),
    FailoverStep(4, "Promote DB replica", "DBA / Script", "2-5 min", True, "DB accepting writes"),
    FailoverStep(5, "Scale DR workloads", "DevOps / Script", "2-3 min", True, "All pods running"),
    FailoverStep(6, "Verify services — health checks pass", "On-call Engineer", "3-5 min", True, "All health checks green"),
    FailoverStep(7, "Notify stakeholders", "Incident Commander", "5 min", True, "Status page updated"),
    FailoverStep(8, "Monitor DR site", "Team", "Ongoing", False, "Metrics stable 30 min"),
]

print("=== Failover Runbook ===")
for s in steps:
    auto = "Auto" if s.automated else "Manual"
    print(f"  Step {s.step}: {s.action}")
    print(f"    Who: {s.responsible} | Time: {s.time_estimate} | {auto}")
    print(f"    Verify: {s.verification}")
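
The per-step estimates in the runbook can be added up to see how the whole procedure compares with the RTO targets. The sketch below assumes the steps run strictly one after another and that estimates use the "N min" or "N-M min" form shown above. `parse_minutes` is a hypothetical helper.

```python
import re


def parse_minutes(estimate: str):
    """Parse '5 min' or '1-2 min' into (low, high); return None for open-ended values."""
    match = re.fullmatch(r"(\d+)(?:-(\d+))? min", estimate.strip())
    if match is None:
        return None  # e.g. "Ongoing"
    low = int(match.group(1))
    high = int(match.group(2)) if match.group(2) else low
    return low, high


# Time estimates copied from steps 1-8 of the runbook above.
estimates = ["1-2 min", "2-5 min", "1-2 min", "2-5 min", "2-3 min", "3-5 min", "5 min", "Ongoing"]

best = worst = 0
for estimate in estimates:
    parsed = parse_minutes(estimate)
    if parsed is None:
        continue  # open-ended steps do not count toward the total
    best += parsed[0]
    worst += parsed[1]

print(f"Sequential total: {best}-{worst} min")  # Sequential total: 16-27 min
```

Run sequentially, the upper bound is well above the 15-minute database RTO, which suggests running independent steps (DNS switch, replica promotion, workload scaling) in parallel where that is safe.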

DR Testing

# === DR Test Plan ===

from dataclasses import dataclass

@dataclass
class DRTest:
    test_type: str
    frequency: str
    scope: str
    duration: str
    participants: str
    success_criteria: str

tests = [
    DRTest("Tabletop Exercise", "Monthly", "Discuss scenarios, review runbook", "1 hour",
        "All engineers + management", "Everyone knows their role"),
    DRTest("Backup Restore", "Monthly", "Restore DB from latest backup", "30 min",
        "DBA + DevOps", "DB restored, data verified"),
    DRTest("Partial Failover", "Quarterly", "Failover one component to DR", "2 hours",
        "DevOps + Backend", "Component running on DR, traffic flowing"),
    DRTest("Full Failover", "Bi-annually", "Complete switch to DR site", "4 hours",
        "All teams", "All services running on DR, RTO met"),
    DRTest("Chaos Engineering", "Monthly", "Kill random pods/nodes", "1 hour",
        "DevOps + SRE", "System self-heals, no customer impact"),
]

print("DR Test Plan:")
for t in tests:
    print(f"  [{t.test_type}] {t.frequency}")
    print(f"    Scope: {t.scope}")
    print(f"    Duration: {t.duration} | Team: {t.participants}")
    print(f"    Pass: {t.success_criteria}")

dr_metrics = {
    "Last Full DR Test": "2024-10-15 — Passed (RTO: 18 min)",
    "Last Backup Restore": "2024-11-01 — Passed (12 min)",
    "Backup Success Rate": "99.8% (2 failures in 365 days)",
    "Actual RTO (last incident)": "22 minutes",
    "Actual RPO (last incident)": "3 minutes",
    "DR Site Readiness": "Warm standby, auto-scale ready",
    "Runbook Last Updated": "2024-11-15",
}

print("\n\nDR Metrics:")
for k, v in dr_metrics.items():
    print(f"  [{k}]: {v}")
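
The test frequencies and the last-run dates in the metrics can be combined to flag tests that are overdue. This is a small sketch under stated assumptions: the day counts in `INTERVAL_DAYS` are approximations, `is_overdue` is a hypothetical helper, and `today` is an arbitrary review date.

```python
from datetime import date, timedelta

# Approximate intervals for the frequencies used in the test plan (assumed day counts).
INTERVAL_DAYS = {"Monthly": 30, "Quarterly": 91, "Bi-annually": 182}


def is_overdue(frequency: str, last_run: date, today: date) -> bool:
    """Return True if more than one interval has passed since the last run."""
    return today - last_run > timedelta(days=INTERVAL_DAYS[frequency])


# Last-run dates taken from the metrics above; `today` is a hypothetical review date.
today = date(2024, 12, 1)
print(is_overdue("Bi-annually", date(2024, 10, 15), today))            # False (47 days ago)
print(is_overdue("Monthly", date(2024, 11, 1), today))                 # False (exactly 30 days ago)
print(is_overdue("Monthly", date(2024, 11, 1), date(2024, 12, 15)))    # True (44 days ago)
```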

Tips

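What is a Disaster Recovery Plan?

A Disaster Recovery Plan describes how to restore systems after events such as a server outage, a fire, or ransomware. It defines the RTO and RPO targets and the recovery steps. Test it every quarter and update it whenever the business impact changes.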

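What are RTO and RPO?

RTO (Recovery Time Objective) is the maximum acceptable downtime. RPO (Recovery Point Objective) is the amount of data you are willing to lose. The lower either target is, the more investment it requires. Set both through a Business Impact Analysis, according to how critical each component is.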

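How do you build a backup strategy?

Follow the 3-2-1 rule: keep 3 copies, on 2 types of media, with 1 copy off-site. Use WAL archiving and streaming replication for the database, rsync and S3 for files, and Git for configuration. Encrypt backups with AES-256 and test a restore every month.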
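
The 3-2-1 rule is simple enough to verify mechanically against an inventory of backup copies. The following is an illustrative sketch only: `Copy` and `satisfies_3_2_1` are hypothetical names, and the inventory is an assumed example rather than the actual layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    location: str  # where the copy lives
    media: str     # storage type, e.g. "disk" or "object-storage"
    offsite: bool  # True if stored away from the primary site


def satisfies_3_2_1(copies: list[Copy]) -> bool:
    """At least 3 copies, on at least 2 media types, with at least 1 off-site."""
    media_types = {c.media for c in copies}
    return len(copies) >= 3 and len(media_types) >= 2 and any(c.offsite for c in copies)


# Hypothetical inventory for the database, for illustration only.
db_copies = [
    Copy("primary server", "disk", offsite=False),
    Copy("local /backup dump", "disk", offsite=False),
    Copy("S3 bucket", "object-storage", offsite=True),
]
print(satisfies_3_2_1(db_copies))      # True
print(satisfies_3_2_1(db_copies[:2]))  # False: only 2 copies, 1 media type, none off-site
```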

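How do you test a DR plan?

Run tabletop exercises, partial failovers, full failovers, chaos engineering experiments, and backup restores. Test every quarter, record how long each step takes, use the results to improve the plan, and make sure everyone knows their role.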

Summary

A Web Components disaster recovery plan combines 3-2-1 backups, RTO and RPO targets, DNS and database replica failover, chaos engineering, regular DR testing, and an up-to-date runbook for production operations.
