BetterUptime QA
This guide covers a QA testing strategy for BetterUptime: uptime monitoring with HTTP, TCP, DNS, and heartbeat checks; incident management; status pages; on-call alerts; SLA, SLO, and SLI tracking; and chaos testing with game days.
| Monitoring tool | Free plan | Monitor types | Status page | On-call |
|---|---|---|---|---|
| BetterUptime | 10 monitors | HTTP, TCP, DNS | Yes | Yes |
| UptimeRobot | 50 monitors | HTTP, TCP, Ping | Yes | No |
| Pingdom | None | HTTP, TCP | Yes | No |
| Datadog Synthetics | None | HTTP, API, Browser | No | Yes |
Monitor Setup
# === BetterUptime Configuration ===
# Monitor types:
# 1. HTTP monitor: checks status code and response time
# 2. TCP monitor: checks that a port is open
# 3. DNS monitor: checks DNS resolution
# 4. Heartbeat: checks that a cron job ran
# 5. Keyword: checks for a keyword in the page
# API: create a monitor
# curl -X POST https://betteruptime.com/api/v2/monitors \
#   -H "Authorization: Bearer TOKEN" \
#   -H "Content-Type: application/json" \
#   -d '{
#     "monitor_type": "status",
#     "url": "https://api.example.com/health",
#     "pronounceable_name": "API Health Check",
#     "check_frequency": 30,
#     "request_timeout": 15,
#     "confirmation_period": 3,
#     "regions": ["us", "eu", "as"],
#     "expected_status_codes": [200],
#     "domain_expiration": 30,
#     "ssl_expiration": 14,
#     "follow_redirects": true,
#     "policy_id": "escalation-policy-id"
#   }'
# Terraform: infrastructure as code
# resource "betteruptime_monitor" "api_health" {
#   monitor_type    = "status"
#   url             = "https://api.example.com/health"
#   check_frequency = 30
#   request_timeout = 15
#   regions         = ["us", "eu", "as"]
#   policy_id       = betteruptime_policy.default.id
# }
from dataclasses import dataclass


@dataclass
class Monitor:
    name: str
    type: str
    url: str
    frequency_sec: int
    uptime_30d: float
    avg_response_ms: int
    status: str


monitors = [
    Monitor("API Health", "HTTP", "https://api.example.com/health", 30, 99.95, 120, "Up"),
    Monitor("Web Frontend", "HTTP", "https://www.example.com", 60, 99.99, 250, "Up"),
    Monitor("Database", "TCP", "db.example.com:5432", 30, 99.90, 15, "Up"),
    Monitor("DNS Resolution", "DNS", "example.com", 60, 100.0, 25, "Up"),
    Monitor("Cron Backup", "Heartbeat", "Expected every 6h", 21600, 99.8, 0, "Up"),
    Monitor("SSL Certificate", "HTTP", "https://example.com", 86400, 100.0, 0, "Valid 45d"),
]

# Print a short report for each monitor.
print("=== Monitors ===")
for m in monitors:
    print(f" [{m.status}] {m.name} ({m.type})")
    print(f" URL: {m.url}")
    print(f" Frequency: {m.frequency_sec}s | Uptime: {m.uptime_30d}% | Response: {m.avg_response_ms}ms")
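The curl call in the configuration listing can also be prepared from Python. The following is a minimal sketch using only the standard library. It reuses the endpoint and field names from the curl example; the helper name `build_create_monitor_request` is hypothetical, and the request is only built, not sent, so it runs without a real token.

```python
import json
import urllib.request

# Endpoint taken from the curl example above.
API_URL = "https://betteruptime.com/api/v2/monitors"


def build_create_monitor_request(token: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request that creates a monitor."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# A subset of the fields from the curl example.
payload = {
    "monitor_type": "status",
    "url": "https://api.example.com/health",
    "pronounceable_name": "API Health Check",
    "check_frequency": 30,
    "request_timeout": 15,
}

request = build_create_monitor_request("TOKEN", payload)
print(request.get_method(), request.full_url)
```

To send it for real, pass the request to `urllib.request.urlopen(request)` with a valid API token. Keeping the build step separate makes the payload easy to check in a unit test before anything touches the live account.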
Testing Strategy
# === QA Testing for Monitoring ===
@dataclass
class TestCase:
    id: str
    category: str
    description: str
    method: str
    expected: str
    status: str


test_cases = [
    TestCase("TC-01", "Detection", "Monitor detects HTTP 500", "Stop the service and watch for the alert", "Alert within 3 minutes", "Pass"),
    TestCase("TC-02", "Detection", "Monitor detects a timeout", "Set a delay longer than the timeout", "Alert within 3 minutes", "Pass"),
    TestCase("TC-03", "Alert", "Alert reaches Slack", "Trigger downtime", "Slack message received", "Pass"),
    TestCase("TC-04", "Alert", "Alert reaches email", "Trigger downtime", "Email received < 1 min", "Pass"),
    TestCase("TC-05", "Alert", "SMS alert works", "Trigger downtime", "SMS received < 2 min", "Pass"),
    TestCase("TC-06", "Escalation", "Escalates if not acknowledged", "Leave the alert unanswered for 10 minutes", "Escalate to Level 2", "Pass"),
    TestCase("TC-07", "Status Page", "Status page updates automatically", "Trigger downtime", "Status = Degraded", "Pass"),
    TestCase("TC-08", "Recovery", "Detects recovery", "Start the service again", "Alert resolved < 2 min", "Pass"),
    TestCase("TC-09", "SSL", "SSL expiry alert", "Certificate < 14 days", "Alert sent", "Pass"),
    TestCase("TC-10", "Heartbeat", "Cron job missing", "Send no heartbeat for 6h", "Alert sent", "Pass"),
]
# Summarize the results, then list each test case.
print("\n=== QA Test Cases ===")
passed = sum(1 for t in test_cases if t.status == "Pass")
print(f" Results: {passed}/{len(test_cases)} Passed\n")
for t in test_cases:
    print(f" [{t.status}] {t.id} - {t.category}: {t.description}")
    print(f" Method: {t.method}")
    print(f" Expected: {t.expected}")
# Chaos Testing Schedule
chaos_tests = [
    "Weekly: shut down a non-critical service and observe the alert and recovery",
    "Monthly: game day to rehearse full incident response",
    "Quarterly: failover test, move regions and observe monitoring",
    "On change: test every monitor whenever one is added or modified",
]
print("\n\nChaos Testing Schedule:")
for i, c in enumerate(chaos_tests, 1):
    print(f" {i}. {c}")
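Several of the test cases above set a deadline, such as "Alert within 3 minutes". One way to automate that kind of check is a polling helper that fails if the alert never shows up in time. The sketch below is illustrative only: `wait_until`, `FakeClock`, and the `alert_seen` predicate are hypothetical names, and the fake clock stands in for a real check against Slack, email, or the incident list so the example runs instantly.

```python
import time
from typing import Callable


def wait_until(
    predicate: Callable[[], bool],
    timeout_s: float,
    interval_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Poll `predicate` until it returns True and return the elapsed seconds.

    Raises TimeoutError if the deadline passes first.
    """
    start = clock()
    while True:
        if predicate():
            return clock() - start
        if clock() - start >= timeout_s:
            raise TimeoutError(f"condition not met within {timeout_s}s")
        sleep(interval_s)


class FakeClock:
    """Simulated time source so the demo does not really wait."""

    def __init__(self) -> None:
        self.now = 0.0

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


fake = FakeClock()


def alert_seen() -> bool:
    # Hypothetical stand-in: pretend the alert arrives 90 s after the outage.
    return fake.now >= 90


elapsed = wait_until(alert_seen, timeout_s=180, interval_s=5,
                     clock=fake.time, sleep=fake.sleep)
print(f"TC-01: alert observed after {elapsed:.0f}s")
```

In a real run, drop the `clock` and `sleep` arguments and replace `alert_seen` with a function that queries wherever the alert is expected to land.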
SLA Management
# === SLA/SLO/SLI ===
# SLI (Service Level Indicator): what you measure
# - Availability: % time service is up
# - Latency: p50, p95, p99 response time
# - Error Rate: % of 5xx responses
# - Throughput: requests per second
# SLO (Service Level Objective): the target
# - Availability: 99.9%
# - Latency p99: < 500ms
# - Error Rate: < 0.1%
# SLA (Service Level Agreement): the contract
# - 99.9% uptime = max 8h 46m downtime/year
# - Credit if SLA breached
@dataclass
class SLATarget:
    service: str
    sla_pct: float
    max_downtime_month: str
    current_uptime: float
    error_budget_remaining: str
    status: str


# Sample values for illustration.
sla_targets = [
    SLATarget("API Gateway", 99.95, "21 min", 99.97, "65%", "Healthy"),
    SLATarget("Web App", 99.9, "43 min", 99.95, "80%", "Healthy"),
    SLATarget("Database", 99.99, "4 min", 99.995, "90%", "Healthy"),
    SLATarget("CDN", 99.9, "43 min", 99.85, "15%", "At Risk"),
    SLATarget("Auth Service", 99.95, "21 min", 99.93, "35%", "Warning"),
]

print("SLA Dashboard:")
for s in sla_targets:
    # Show "OK" for healthy services, otherwise the status in upper case.
    label = "OK" if s.status == "Healthy" else s.status.upper()
    print(f" [{label}] {s.service}")
    print(f" SLA: {s.sla_pct}% | Current: {s.current_uptime}%")
    print(f" Max Down: {s.max_downtime_month}/mo | Budget: {s.error_budget_remaining}")
# Uptime reference: allowed downtime per uptime percentage
print("\n\nUptime Reference:")
uptimes = {
    "99%": "7h 18m/month, 3.65 days/year",
    "99.9%": "43m 50s/month, 8h 46m/year",
    "99.95%": "21m 55s/month, 4h 23m/year",
    "99.99%": "4m 23s/month, 52m 36s/year",
    "99.999%": "26s/month, 5m 16s/year",
}
for pct, downtime in uptimes.items():
    print(f" {pct}: {downtime}")
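The reference values above can be derived instead of hard-coded. The sketch below assumes an average year of 365.25 days and a month of one twelfth of a year; those two assumptions are not stated in the original, but they give results close to the listed figures. The names `allowed_downtime` and `to_whole_seconds` are hypothetical.

```python
from datetime import timedelta

# Assumption: a year averages 365.25 days and a month is one twelfth of a year.
YEAR = timedelta(days=365.25)
MONTH = YEAR / 12


def allowed_downtime(uptime_pct: float, period: timedelta) -> timedelta:
    """Return the downtime allowed within `period` at the given uptime target."""
    return period * (1 - uptime_pct / 100)


def to_whole_seconds(duration: timedelta) -> timedelta:
    """Round to whole seconds so the printed value is easier to read."""
    return timedelta(seconds=round(duration.total_seconds()))


for pct in (99.0, 99.9, 99.95, 99.99, 99.999):
    per_month = to_whole_seconds(allowed_downtime(pct, MONTH))
    per_year = to_whole_seconds(allowed_downtime(pct, YEAR))
    print(f"{pct}%: {per_month} per month, {per_year} per year")
```

Computing the values this way also makes it easy to check a custom target, for example `allowed_downtime(99.5, MONTH)`, without extending the lookup table.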
Tips
- Multi-region: monitor from several regions to prevent false positives
- Confirmation: set a confirmation period before alerting
- Game Day: rehearse incident response every month
- Error Budget: track the error budget so it does not run out
- Status Page: create a status page for customers
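The first two tips, multi-region checks and a confirmation period, can be combined into one alerting rule. The sketch below is a simplified model of that idea, not BetterUptime's actual implementation: it alerts only when more than half of the regions have failed their last few checks in a row. The function name, the region names, and the default thresholds are all assumptions made for illustration.

```python
from typing import Dict, List


def should_alert(
    history: Dict[str, List[bool]],
    confirmations: int = 3,
    quorum: float = 0.5,
) -> bool:
    """Decide whether to alert based on per-region check history.

    `history` maps a region name to its check results, oldest first
    (True means the check passed). A region counts as failing only when
    its last `confirmations` checks all failed. The function returns True
    when the share of failing regions is greater than `quorum`.
    """
    if not history:
        return False
    failing = 0
    for results in history.values():
        recent = results[-confirmations:]
        if len(recent) == confirmations and not any(recent):
            failing += 1
    return failing / len(history) > quorum


# A single failed check in one region: likely a network blip, no alert.
blip = {
    "region_a": [True, True, False],
    "region_b": [True, True, True],
    "region_c": [True, True, True],
}

# Two of three regions failing three checks in a row: alert.
outage = {
    "region_a": [False, False, False],
    "region_b": [True, False, False, False],
    "region_c": [True, True, True],
}

print("blip   ->", should_alert(blip))
print("outage ->", should_alert(outage))
```

Raising `confirmations` reduces false positives at the cost of slower detection, which is the trade-off to keep in mind when setting deadlines in the detection test cases.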
Applying this knowledge in practice
Recommended learning resources include the official documentation, which is kept up to date; online courses on Coursera, Udemy, and edX; quality YouTube channels in Thai and English; and communities such as Discord, Reddit, and Stack Overflow, where developers around the world share their experience.
Comparing advantages and disadvantages
From the comparison table, the advantages outweigh the disadvantages, especially in performance and scalability. Most of the disadvantages can be addressed through systematic learning and appropriate resource planning.
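What is BetterUptime?
BetterUptime is an uptime monitoring service for websites and APIs. It alerts on downtime and provides status pages, incident management, and on-call scheduling. It supports HTTP, TCP, DNS, keyword, and heartbeat monitors and has a free plan.
What is a testing strategy for monitoring?
It means testing the monitoring itself: detection, alerting, escalation, status page updates, and recovery, along with chaos testing, game days, and integrations such as Slack, email, SMS, and PagerDuty.
How do you design QA for monitoring?
Write test cases for HTTP status, response time, keyword, SSL, and DNS checks, then for alert routing, escalation, status page updates, the incident timeline, integrations, and on-call.
How do you do SLA monitoring?
Set an SLA target such as 99.9% (about 8.76 hours of downtime per year), measure SLIs (availability, latency, error rate), define SLOs as targets, track the error budget, and report through the status page, reports, and alerts.
Summary
A BetterUptime QA testing strategy covers uptime monitoring, incident handling, status pages, on-call alerts and escalation, multi-region checks, SLA, SLO, and SLI tracking with an error budget, and regular chaos testing and game days.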
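Tracking the error budget comes down to comparing the downtime used against the downtime the target allows. The sketch below shows one common way to compute it, assuming the target and the observed uptime are measured over the same window. The function name `error_budget_remaining` is hypothetical.

```python
def error_budget_remaining(target_pct: float, observed_uptime_pct: float) -> float:
    """Return the fraction of the error budget still unspent.

    1.0 means no downtime so far, 0.0 means the budget is exactly used up,
    and a negative value means the target has been missed.
    Assumes both percentages cover the same measurement window.
    """
    budget = 100 - target_pct           # allowed downtime, in percentage points
    spent = 100 - observed_uptime_pct   # downtime so far, in percentage points
    if budget <= 0:
        raise ValueError("target must be below 100%")
    return 1 - spent / budget


# 99.95% observed against a 99.9% target: about half the budget is left.
print(f"{error_budget_remaining(99.9, 99.95):.0%}")

# 99.85% observed against a 99.9% target: the budget is overspent.
print(f"{error_budget_remaining(99.9, 99.85):.0%}")
```

A negative result is a useful signal for alerting: under this model, any service whose observed uptime falls below its target has already spent more than its whole budget for the window.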
