BetterUptime Testing Strategy for QA

BetterUptime QA

This guide covers a QA testing strategy for BetterUptime: uptime monitoring with HTTP, TCP, DNS, and heartbeat checks, incident management, status pages, on-call alerting, SLA/SLO/SLI tracking, and chaos testing with game days.

Monitor Tool	Free Plan	Monitor Types	Status Page	On-call
BetterUptime	10 monitors	HTTP, TCP, DNS	Yes	Yes
UptimeRobot	50 monitors	HTTP, TCP, Ping	Yes	No
Pingdom	None	HTTP, TCP	Yes	No
Datadog Synthetics	None	HTTP, API, Browser	No	Yes

Monitor Setup

# === BetterUptime Configuration ===

# Monitor Types:
# 1. HTTP Monitor: checks status code and response time
# 2. TCP Monitor: checks that a port is open
# 3. DNS Monitor: checks DNS resolution
# 4. Heartbeat: checks that a cron job is running
# 5. Keyword: checks for a keyword in the page

# API — Create Monitor
# curl -X POST https://betteruptime.com/api/v2/monitors \
#   -H "Authorization: Bearer TOKEN" \
#   -H "Content-Type: application/json" \
#   -d '{
#     "monitor_type": "expected_status_code",
#     "url": "https://api.example.com/health",
#     "pronounceable_name": "API Health Check",
#     "check_frequency": 30,
#     "request_timeout": 15,
#     "confirmation_period": 3,
#     "regions": ["us", "eu", "as"],
#     "expected_status_codes": [200],
#     "domain_expiration": 30,
#     "ssl_expiration": 14,
#     "follow_redirects": true,
#     "policy_id": "escalation-policy-id"
#   }'

# Terraform — Infrastructure as Code
# resource "betteruptime_monitor" "api_health" {
#   monitor_type     = "status"
#   url              = "https://api.example.com/health"
#   check_frequency  = 30
#   request_timeout  = 15
#   regions          = ["us", "eu", "as"]
#   policy_id        = betteruptime_policy.default.id
# }

from dataclasses import dataclass

@dataclass
class Monitor:
    name: str
    type: str
    url: str
    frequency_sec: int
    uptime_30d: float
    avg_response_ms: int
    status: str

monitors = [
    Monitor("API Health", "HTTP", "https://api.example.com/health", 30, 99.95, 120, "Up"),
    Monitor("Web Frontend", "HTTP", "https://www.example.com", 60, 99.99, 250, "Up"),
    Monitor("Database", "TCP", "db.example.com:5432", 30, 99.90, 15, "Up"),
    Monitor("DNS Resolution", "DNS", "example.com", 60, 100.0, 25, "Up"),
    Monitor("Cron Backup", "Heartbeat", "Expected every 6h", 21600, 99.8, 0, "Up"),
    Monitor("SSL Certificate", "HTTP", "https://example.com", 86400, 100.0, 0, "Valid 45d"),
]

print("=== Monitors ===")
for m in monitors:
    print(f"  [{m.status}] {m.name} ({m.type})")
    print(f"    URL: {m.url}")
    print(f"    Frequency: {m.frequency_sec}s | Uptime: {m.uptime_30d}% | Response: {m.avg_response_ms}ms")
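
The curl call shown earlier can also be scripted. The following is a minimal sketch, assuming the endpoint and field names from that curl example; `build_monitor_request` is a hypothetical helper and the token is a placeholder. It only constructs the request, so the payload can be inspected or unit-tested before anything is sent.

```python
import json
import urllib.request

# Endpoint taken from the curl example; verify it against the current API docs.
API_URL = "https://betteruptime.com/api/v2/monitors"


def build_monitor_request(token: str, url: str, name: str,
                          check_frequency: int = 30,
                          request_timeout: int = 15) -> urllib.request.Request:
    """Build (but do not send) a POST request that creates an HTTP monitor."""
    payload = {
        "monitor_type": "status",
        "url": url,
        "pronounceable_name": name,
        "check_frequency": check_frequency,
        "request_timeout": request_timeout,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_monitor_request("TOKEN", "https://api.example.com/health", "API Health Check")
print(req.get_method(), req.full_url)
print(json.loads(req.data))
# Sending is left to the caller, e.g. urllib.request.urlopen(req, timeout=10)
```

Keeping construction separate from sending makes it possible to check monitor definitions in CI without creating real monitors.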

Testing Strategy

# === QA Testing for Monitoring ===

from dataclasses import dataclass

@dataclass
class TestCase:
    id: str
    category: str
    description: str
    method: str
    expected: str
    status: str

test_cases = [
    TestCase("TC-01", "Detection", "Monitor detects HTTP 500", "Stop the service and watch for the alert", "Alert within 3 minutes", "Pass"),
    TestCase("TC-02", "Detection", "Monitor detects a timeout", "Set a response delay above the timeout", "Alert within 3 minutes", "Pass"),
    TestCase("TC-03", "Alert", "Alert reaches Slack", "Trigger Downtime", "Slack message received", "Pass"),
    TestCase("TC-04", "Alert", "Alert reaches email", "Trigger Downtime", "Email received < 1 min", "Pass"),
    TestCase("TC-05", "Alert", "SMS alert works", "Trigger Downtime", "SMS received < 2 min", "Pass"),
    TestCase("TC-06", "Escalation", "Escalates if not acknowledged", "Leave the alert unanswered for 10 minutes", "Escalate to Level 2", "Pass"),
    TestCase("TC-07", "Status Page", "Status page updates automatically", "Trigger Downtime", "Status = Degraded", "Pass"),
    TestCase("TC-08", "Recovery", "Recovery is detected", "Start the service again", "Alert resolved < 2 min", "Pass"),
    TestCase("TC-09", "SSL", "SSL expiry alert", "Certificate < 14 days", "Alert sent", "Pass"),
    TestCase("TC-10", "Heartbeat", "Cron job missing", "Send no heartbeat for 6h", "Alert sent", "Pass"),
]

print("\n=== QA Test Cases ===")
passed = sum(1 for t in test_cases if t.status == "Pass")
print(f"  Results: {passed}/{len(test_cases)} Passed\n")
for t in test_cases:
    print(f"  [{t.status}] {t.id} — {t.category}: {t.description}")
    print(f"    Method: {t.method}")
    print(f"    Expected: {t.expected}")

# Chaos Testing Schedule
chaos_tests = [
    "Weekly: shut down a non-critical service and check alerting and recovery",
    "Monthly: game day to rehearse the full incident response",
    "Quarterly: failover test, move regions and check monitoring",
    "On change: test monitors every time one is added or modified",
]

print("\n\nChaos Testing Schedule:")
for i, c in enumerate(chaos_tests, 1):
    print(f"  {i}. {c}")
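
The "alert within 3 minutes" expectation in TC-01 and TC-02 can be sanity-checked against the monitor settings before running a live test. This is a rough sketch, assuming worst-case detection time is the sum of one check interval, one request timeout, and the confirmation period. That additive model is a simplification, `worst_case_detection_sec` is a hypothetical helper, and notification delivery time is ignored.

```python
def worst_case_detection_sec(check_frequency: int, request_timeout: int,
                             confirmation_period: int) -> int:
    """Upper bound on time from failure to incident under a simple additive model."""
    # A failure right after a check waits a full interval for the next check,
    # that check may take up to the timeout to fail, then the confirmation
    # period has to pass before an incident is opened.
    return check_frequency + request_timeout + confirmation_period


def meets_alert_target(detection_sec: int, target_min: int) -> bool:
    """True when the detection bound fits inside the alert target in minutes."""
    return detection_sec <= target_min * 60


# Values from the API example: 30 s frequency, 15 s timeout, 3 s confirmation.
detection = worst_case_detection_sec(30, 15, 3)
print(f"Worst-case detection: {detection}s")
print("Meets 3-minute target:", meets_alert_target(detection, 3))
```

If the bound exceeds the target, the test case expectation or the monitor settings need to change before the live test is meaningful.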

SLA Management

# === SLA/SLO/SLI ===

# SLI (Service Level Indicator): what is measured
# - Availability: % time service is up
# - Latency: p50, p95, p99 response time
# - Error Rate: % of 5xx responses
# - Throughput: requests per second

# SLO (Service Level Objective): the internal target
# - Availability: 99.9%
# - Latency p99: < 500ms
# - Error Rate: < 0.1%

# SLA (Service Level Agreement): the contractual commitment
# - 99.9% uptime = max 8h 46m downtime/year
# - Credit if SLA breached

from dataclasses import dataclass

@dataclass
class SLATarget:
    service: str
    sla_pct: float
    max_downtime_month: str
    current_uptime: float
    error_budget_remaining: str
    status: str

sla_targets = [
    SLATarget("API Gateway", 99.95, "21 min", 99.97, "65%", "Healthy"),
    SLATarget("Web App", 99.9, "43 min", 99.95, "80%", "Healthy"),
    SLATarget("Database", 99.99, "4 min", 99.995, "90%", "Healthy"),
    SLATarget("CDN", 99.9, "43 min", 99.85, "15%", "At Risk"),
    SLATarget("Auth Service", 99.95, "21 min", 99.93, "35%", "Warning"),
]

print("SLA Dashboard:")
for s in sla_targets:
    label = "OK" if s.status == "Healthy" else s.status.upper()
    print(f"  [{label}] {s.service}")
    print(f"    SLA: {s.sla_pct}% | Current: {s.current_uptime}%")
    print(f"    Max Down: {s.max_downtime_month}/mo | Budget: {s.error_budget_remaining}")

# Uptime Calculation
print("\n\nUptime Reference:")
uptimes = {
    "99%": "7h 18m/month, 3.65 days/year",
    "99.9%": "43m 50s/month, 8h 46m/year",
    "99.95%": "21m 55s/month, 4h 23m/year",
    "99.99%": "4m 23s/month, 52m 36s/year",
    "99.999%": "26s/month, 5m 16s/year",
}
for pct, downtime in uptimes.items():
    print(f"  {pct}: {downtime}")
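
The reference figures above are hard-coded, but they follow from simple arithmetic. This sketch derives them, assuming an average month of 365.25 / 12 days and a year of 365.25 days; `allowed_downtime_sec` and `fmt` are hypothetical helper names.

```python
def allowed_downtime_sec(sla_pct: float, period_days: float) -> float:
    """Seconds of downtime permitted in a period at the given availability target."""
    return (1 - sla_pct / 100) * period_days * 24 * 3600


def fmt(seconds: float) -> str:
    """Format a duration: hours and minutes from one hour up, else minutes and seconds."""
    if seconds >= 3600:
        minutes = round(seconds / 60)
        h, m = divmod(minutes, 60)
        return f"{h}h {m}m"
    total = round(seconds)
    m, s = divmod(total, 60)
    return f"{m}m {s}s" if m else f"{s}s"


MONTH_DAYS = 365.25 / 12  # assumed average month length
YEAR_DAYS = 365.25        # assumed year length

for pct in (99.0, 99.9, 99.95, 99.99, 99.999):
    month = allowed_downtime_sec(pct, MONTH_DAYS)
    year = allowed_downtime_sec(pct, YEAR_DAYS)
    print(f"{pct}%: {fmt(month)}/month, {fmt(year)}/year")
```

Under these assumptions the monthly values agree with the reference table. The yearly value for 99% prints in hours rather than days.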

Tips

Multi-region: monitor from several regions to guard against false positives.
Confirmation: set a confirmation period before alerting.
Game Day: rehearse incident response every month.
Error Budget: track the error budget so it does not run out.
Status Page: publish a status page for customers.
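
The multi-region and confirmation tips work together: a failure seen from one region, or for one check only, is often a network blip rather than an outage. The sketch below illustrates the idea with a hypothetical `should_alert` function. It is a simplified model for reasoning about false positives, not BetterUptime's actual algorithm; the majority threshold, the number of consecutive rounds, and the region labels are all assumptions.

```python
from typing import Dict, List


def region_majority_down(results: Dict[str, bool]) -> bool:
    """True when more than half of the regions report a failed check (ok is False)."""
    failures = sum(1 for ok in results.values() if not ok)
    return failures > len(results) / 2


def should_alert(history: List[Dict[str, bool]], confirm_rounds: int = 2) -> bool:
    """Alert only if the last `confirm_rounds` check rounds all show a majority failure."""
    if len(history) < confirm_rounds:
        return False
    return all(region_majority_down(r) for r in history[-confirm_rounds:])


# One region failing once: treated as a blip.
blip = [{"us": True, "eu": False, "as": True}]
# Two consecutive rounds with most regions failing: treated as an outage.
outage = [
    {"us": False, "eu": False, "as": True},
    {"us": False, "eu": False, "as": False},
]
print(should_alert(blip))
print(should_alert(outage))
```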

Putting This Knowledge into Practice

Recommended learning resources include the official documentation, which is the most up to date; online courses on Coursera, Udemy, and edX; good YouTube channels in Thai and English; and communities such as Discord, Reddit, and Stack Overflow, where developers worldwide share their experience.

Pros and Cons

Pros	Cons
High performance: fast and accurate, and cuts down on repetitive work	Takes some time to learn at first; the learning curve is steep
Large community with plenty of help and learning resources	Some features may be unstable or change often between versions
Integrates with a wide range of other tools and services	Cost can be high for an enterprise license or cloud service
Open source or offers a free version to get started	Needs adequate hardware or infrastructure

The table suggests that the pros outweigh the cons, especially in performance and the ability to scale. Most of the cons can be addressed through systematic learning and sensible resource planning.

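What is BetterUptime?

BetterUptime is an uptime monitoring service that checks websites and APIs and alerts on downtime. It offers HTTP, TCP, DNS, keyword, and heartbeat monitors, along with status pages, incident management, and on-call scheduling, and it has a free plan.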

What is a testing strategy for monitoring?

It means testing the monitoring itself: detection, alerting, escalation, status page updates, and recovery, plus chaos testing, game days, and integrations such as Slack, email, SMS, and PagerDuty.
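
Detection tests also cover jobs that report in rather than being polled. As a sketch of the check behind TC-10 (a cron job that stops sending its heartbeat), the hypothetical `heartbeat_missing` function below flags a job once the expected period plus a grace window has elapsed. The 6-hour period comes from the sample monitor list; the 5-minute grace value is an assumption.

```python
from datetime import datetime, timedelta, timezone


def heartbeat_missing(last_seen: datetime, now: datetime,
                      period: timedelta, grace: timedelta) -> bool:
    """True when no heartbeat has arrived within period + grace."""
    return now - last_seen > period + grace


PERIOD = timedelta(hours=6)    # heartbeat expected every 6h
GRACE = timedelta(minutes=5)   # assumed tolerance for slow runs

# Fixed timestamps keep the example deterministic.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
on_time = now - timedelta(hours=5, minutes=30)
late = now - timedelta(hours=6, minutes=10)

print(heartbeat_missing(on_time, now, PERIOD, GRACE))
print(heartbeat_missing(late, now, PERIOD, GRACE))
```

Passing `now` in as a parameter lets a test simulate a missed heartbeat without waiting six hours.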

How do you design QA for monitoring?

Write test cases for HTTP status, response time, keyword, SSL, and DNS checks, then for alert routing, escalation, status page updates, the incident timeline, integrations, and on-call handoff.
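
Escalation rules are easier to test when the expected behaviour is written down as a function of time. This is a minimal sketch, assuming a two-level policy in which level 2 is paged after 10 minutes without acknowledgement (the threshold used in TC-06). `EscalationStep` and `level_to_page` are hypothetical names, not part of any BetterUptime API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EscalationStep:
    level: int
    after_min: int  # minutes the alert has gone unacknowledged


# Assumed policy: level 1 immediately, level 2 after 10 minutes.
POLICY = [EscalationStep(1, 0), EscalationStep(2, 10)]


def level_to_page(policy: List[EscalationStep], minutes_unacked: int,
                  acknowledged: bool) -> Optional[int]:
    """Highest level reached for an unacknowledged alert; None once acknowledged."""
    if acknowledged:
        return None  # acknowledgement stops further escalation
    reached = [s.level for s in policy if minutes_unacked >= s.after_min]
    return max(reached) if reached else None


print(level_to_page(POLICY, 5, acknowledged=False))
print(level_to_page(POLICY, 10, acknowledged=False))
print(level_to_page(POLICY, 12, acknowledged=True))
```

A live escalation test can then compare who was actually notified at each minute mark with what this function predicts.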

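How do you monitor an SLA?

Set an SLA target, for example 99.9% (about 8.76 hours of downtime per year). Measure SLIs such as availability, latency, and error rate, set SLOs as internal targets, track the error budget, and report through the status page and alerts.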
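
Error budget tracking follows directly from the SLA target and the measured uptime. This is a sketch, assuming both figures are measured over the same window; `error_budget_remaining_pct` is a hypothetical helper. Under that assumption the result goes negative when measured uptime falls below the target, so the budget strings in the sample SLA dashboard are best read as illustrative rather than derived.

```python
def error_budget_remaining_pct(sla_pct: float, current_uptime_pct: float) -> float:
    """Share of the error budget still unused, in percent (negative if overspent)."""
    budget = 100.0 - sla_pct            # allowed downtime, in percentage points
    if budget <= 0:
        raise ValueError("SLA target must be below 100%")
    used = 100.0 - current_uptime_pct   # observed downtime, same units
    return (budget - used) / budget * 100.0


for service, sla, uptime in [("Web App", 99.9, 99.95), ("CDN", 99.9, 99.85)]:
    remaining = error_budget_remaining_pct(sla, uptime)
    state = "within budget" if remaining >= 0 else "budget exhausted"
    print(f"{service}: {remaining:.0f}% remaining ({state})")
```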

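Summary

A BetterUptime QA testing strategy covers uptime monitoring, incidents, status pages, and on-call alerting. It tracks SLA, SLO, and SLI figures alongside the error budget, and it uses chaos testing and game days to prove that multi-region detection, alerts, and escalation work.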