Uptime Kuma for Service Mesh Monitoring
Uptime Kuma can monitor a service mesh end to end: HTTP, TCP, DNS, and gRPC health checks on every service, alerting through Telegram, Slack, and PagerDuty, and a public status page, complementing Linkerd or Istio observability in production.
| Monitor Type | Target | Check | Interval | Alert Threshold |
|---|---|---|---|---|
| HTTP | Service /health | Status 200 + Body | 20s | 3 failures |
| TCP | Database Port | Port Open | 30s | 2 failures |
| DNS | Service Discovery | Record Resolve | 60s | 2 failures |
| gRPC | gRPC Service | Health RPC | 20s | 3 failures |
| Push | CronJob | Heartbeat | Custom | 1 miss |
| Docker | Sidecar Proxy | Container Up | 30s | 1 failure |
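The Alert Threshold column means an alert fires only after that many consecutive failed checks, so a single dropped packet does not page anyone. A minimal sketch of that logic in Python (the class and names are illustrative, not Uptime Kuma internals):

```python
from dataclasses import dataclass

@dataclass
class CheckState:
    """Tracks consecutive failures for one monitor (illustrative sketch)."""
    alert_threshold: int
    consecutive_failures: int = 0

    def record(self, check_ok: bool) -> bool:
        """Record one check result; return True when an alert should fire."""
        if check_ok:
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        # Alert exactly once, when the failure streak reaches the threshold;
        # this is what filters out one-off network blips (false positives).
        return self.consecutive_failures == self.alert_threshold

state = CheckState(alert_threshold=3)
results = [state.record(ok) for ok in [True, False, False, False, False]]
print(results)  # only the third consecutive failure triggers the alert
```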
Service Mesh Monitoring
# === Uptime Kuma for Service Mesh ===
# Kubernetes Deployment — Uptime Kuma in the monitoring namespace
# apiVersion: apps/v1
# kind: Deployment
# metadata:
#   name: uptime-kuma
#   namespace: monitoring
# spec:
#   replicas: 1
#   selector:
#     matchLabels:
#       app: uptime-kuma
#   template:
#     metadata:
#       labels:
#         app: uptime-kuma
#       annotations:
#         linkerd.io/inject: enabled
#     spec:
#       containers:
#         - name: uptime-kuma
#           image: louislam/uptime-kuma:latest
#           ports:
#             - containerPort: 3001
#           volumeMounts:
#             - name: data
#               mountPath: /app/data
#           resources:
#             requests:
#               memory: "256Mi"
#               cpu: "100m"
#             limits:
#               memory: "512Mi"
#               cpu: "500m"
#       volumes:
#         - name: data
#           persistentVolumeClaim:
#             claimName: uptime-kuma-pvc
# Service for internal access
# apiVersion: v1
# kind: Service
# metadata:
#   name: uptime-kuma
#   namespace: monitoring
# spec:
#   selector:
#     app: uptime-kuma
#   ports:
#     - port: 3001
#       targetPort: 3001
# Ingress for external access
# apiVersion: networking.k8s.io/v1
# kind: Ingress
# metadata:
#   name: uptime-kuma
#   namespace: monitoring
#   annotations:
#     cert-manager.io/cluster-issuer: letsencrypt
# spec:
#   tls:
#     - hosts: [status.example.com]
#       secretName: status-tls
#   rules:
#     - host: status.example.com
#       http:
#         paths:
#           - path: /
#             pathType: Prefix
#             backend:
#               service:
#                 name: uptime-kuma
#                 port:
#                   number: 3001
from dataclasses import dataclass

@dataclass
class MeshMonitor:
    service: str
    namespace: str
    monitor_type: str
    url: str
    interval: int
    alert_channel: str

monitors = [
    MeshMonitor("API Gateway", "production", "HTTP", "http://api-gateway.production:8080/health", 20, "PagerDuty"),
    MeshMonitor("Order Service", "production", "HTTP", "http://order-svc.production:8080/health", 20, "Slack"),
    MeshMonitor("Payment Service", "production", "HTTP", "http://payment-svc.production:8080/health", 20, "PagerDuty"),
    MeshMonitor("PostgreSQL", "data", "TCP", "postgresql.data:5432", 30, "PagerDuty"),
    MeshMonitor("Redis", "data", "TCP", "redis.data:6379", 30, "Slack"),
    MeshMonitor("NATS", "messaging", "TCP", "nats.messaging:4222", 30, "Slack"),
    MeshMonitor("Linkerd Control", "linkerd", "HTTP", "http://linkerd-web.linkerd:8084/ready", 60, "PagerDuty"),
    MeshMonitor("Backup CronJob", "production", "Push", "Push URL", 3600, "Telegram"),
]

print("=== Service Mesh Monitors ===")
for m in monitors:
    print(f"  [{m.service}] NS: {m.namespace} | Type: {m.monitor_type}")
    print(f"    URL: {m.url}")
    print(f"    Interval: {m.interval}s | Alert: {m.alert_channel}")
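The TCP monitors in the list above (PostgreSQL, Redis, NATS) reduce to a plain connect test: the port either accepts a connection or it does not. A small sketch of such a check, demonstrated against a local listener since the in-cluster hostnames only resolve inside the mesh:

```python
import socket

def tcp_check(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds.

    This is essentially what a TCP monitor does on every interval:
    open a connection, then close it immediately.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Demo against a local listener (stand-in for e.g. postgresql.data:5432).
server = socket.socket()
server.bind(("127.0.0.1", 0))  # port 0 = let the OS pick a free port
server.listen(1)
port = server.getsockname()[1]
print(tcp_check("127.0.0.1", port))  # port open
server.close()
print(tcp_check("127.0.0.1", port))  # port closed now
```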
Alerting Configuration
# === Alert Configuration ===
# Notification Setup in Uptime Kuma:
# 1. Telegram:
# Bot Token: 123456:ABC-DEF
# Chat ID: -1001234567890
# Message: "🔴 {{NAME}} is {{STATUS}} - {{MSG}}"
#
# 2. Slack:
# Webhook: https://hooks.slack.com/services/T.../B.../xxx
# Channel: #alerts-production
#
# 3. PagerDuty:
# Integration Key: xxxxx
# Severity: critical (for P0 services)
#
# 4. Discord:
# Webhook: https://discord.com/api/webhooks/xxx/yyy
# Alert Escalation Matrix
# P0 (Critical): API Gateway, Payment → PagerDuty + Slack + Telegram
# P1 (High): Order, Auth → Slack + Telegram
# P2 (Medium): Cache, Queue → Slack only
# P3 (Low): CronJob, Batch → Telegram only
@dataclass
class AlertRule:
    priority: str
    services: str
    channels: str
    response_time: str
    retry_before_alert: int
    escalation: str

rules = [
    AlertRule("P0 Critical", "API Gateway, Payment, Database", "PagerDuty + Slack + Telegram", "5 min", 2, "Escalate to Manager after 15min"),
    AlertRule("P1 High", "Order, Auth, User Service", "Slack + Telegram", "15 min", 3, "Escalate to Lead after 30min"),
    AlertRule("P2 Medium", "Redis, NATS, Search", "Slack", "30 min", 3, "Review in daily standup"),
    AlertRule("P3 Low", "CronJob, Batch, Reports", "Telegram", "4 hours", 1, "Review weekly"),
]

print("=== Alert Escalation Matrix ===")
for r in rules:
    print(f"  [{r.priority}] Services: {r.services}")
    print(f"    Channels: {r.channels}")
    print(f"    Response: {r.response_time} | Retry: {r.retry_before_alert}")
    print(f"    Escalation: {r.escalation}")
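The escalation matrix can also be expressed as a small lookup table, which is handy when provisioning notification assignments programmatically. The service-to-priority assignments below are assumptions mirroring the matrix, not an Uptime Kuma API:

```python
# Priority -> notification channels, taken from the matrix above.
ESCALATION = {
    "P0": ["PagerDuty", "Slack", "Telegram"],
    "P1": ["Slack", "Telegram"],
    "P2": ["Slack"],
    "P3": ["Telegram"],
}

# Assumed service-to-priority assignments (illustrative).
SERVICE_PRIORITY = {
    "API Gateway": "P0", "Payment Service": "P0", "PostgreSQL": "P0",
    "Order Service": "P1", "Auth Service": "P1",
    "Redis": "P2", "NATS": "P2",
    "Backup CronJob": "P3",
}

def channels_for(service: str) -> list:
    """Resolve channels for a service; unknown services default to P2 (Slack only)."""
    return ESCALATION[SERVICE_PRIORITY.get(service, "P2")]

print(channels_for("Payment Service"))  # ['PagerDuty', 'Slack', 'Telegram']
print(channels_for("Search"))           # ['Slack']
```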
Status Page and Operations
# === Status Page Configuration ===
# Status Page Groups:
# 1. Frontend Services
# - Website (HTTP)
# - CDN (HTTP)
# - Static Assets (HTTP)
#
# 2. API Services
# - API Gateway (HTTP)
# - GraphQL (HTTP)
# - WebSocket (TCP)
#
# 3. Backend Services
# - Order Service (HTTP)
# - Payment Service (HTTP)
# - Notification (HTTP)
#
# 4. Infrastructure
# - Database (TCP)
# - Cache (TCP)
# - Message Broker (TCP)
# - Service Mesh (HTTP)
@dataclass
class OperationalMetric:
    metric: str
    current: str
    target: str
    trend: str

ops_metrics = [
    OperationalMetric("Overall Uptime (30d)", "99.95%", "99.9%", "Above target"),
    OperationalMetric("API Gateway Uptime", "99.99%", "99.95%", "Excellent"),
    OperationalMetric("Avg Response Time", "145ms", "<200ms", "Good"),
    OperationalMetric("P99 Response Time", "820ms", "<1000ms", "Good"),
    OperationalMetric("Incidents (30d)", "2", "<5", "Good"),
    OperationalMetric("MTTR", "12 min", "<30 min", "Excellent"),
    OperationalMetric("False Positive Rate", "0.5%", "<2%", "Good"),
    OperationalMetric("Monitor Count", "24", "N/A", "All active"),
]

print("Operational Metrics:")
for m in ops_metrics:
    print(f"  [{m.metric}] Current: {m.current} | Target: {m.target} | {m.trend}")
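The uptime targets translate directly into a downtime budget, which makes them easier to reason about during incident response: 99.9% over 30 days allows roughly 43 minutes of downtime, while 99.99% allows only about 4.

```python
def downtime_budget_minutes(uptime_pct: float, days: int = 30) -> float:
    """Minutes of allowed downtime in the window for a given uptime target."""
    return days * 24 * 60 * (1 - uptime_pct / 100)

for target in (99.9, 99.95, 99.99):
    print(f"{target}%: {downtime_budget_minutes(target):.1f} min / 30d")
```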
runbook = {
    "Service Down": "1. Check Uptime Kuma alert 2. kubectl get pods 3. Check logs 4. Restart if needed 5. Verify recovery",
    "High Latency": "1. Check Linkerd Viz dashboard 2. Identify slow service 3. Check resource usage 4. Scale if needed",
    "Database Down": "1. PagerDuty alert 2. Check PostgreSQL status 3. Failover if replica available 4. Restore from backup",
    "Mesh Control Down": "1. Check linkerd check 2. Restart control plane 3. Verify data plane healthy",
    "Certificate Expiry": "1. Check cert-manager 2. Renew certificate 3. Verify mTLS working",
}

print("\n\nRunbook:")
for k, v in runbook.items():
    print(f"  [{k}]: {v}")
Tips
- Internal: deploy Uptime Kuma inside the cluster so it can check internal Services directly
- Retry: configure 2-3 retries before alerting to reduce false positives
- Group: organize monitors by namespace and priority
- Backup: back up the Uptime Kuma data directory (/app/data) daily
- Runbook: write a runbook for every alert scenario
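The backup and push-monitor ideas combine naturally: archive /app/data daily and report a heartbeat to a push monitor, so a missed backup raises an alert after one miss. A sketch, where the paths and the push token are placeholders (Uptime Kuma push monitors record a heartbeat on a GET to their unique URL, with optional status/msg query parameters):

```python
import datetime
import pathlib
import tarfile
import urllib.parse

def backup_kuma_data(data_dir: str, out_dir: str) -> pathlib.Path:
    """Archive the Uptime Kuma data directory into a dated tarball."""
    stamp = datetime.date.today().isoformat()
    archive = pathlib.Path(out_dir) / f"uptime-kuma-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(data_dir, arcname="data")
    return archive

def heartbeat_url(base: str, msg: str) -> str:
    """Build the push-monitor heartbeat URL.

    `base` is the unique push URL shown in the Uptime Kuma UI, e.g.
    http://uptime-kuma.monitoring:3001/api/push/<token> (token is a placeholder).
    """
    return f"{base}?{urllib.parse.urlencode({'status': 'up', 'msg': msg})}"

# Demo with temporary directories (a real CronJob would target /app/data).
import os
import tempfile
src, dst = tempfile.mkdtemp(), tempfile.mkdtemp()
open(os.path.join(src, "kuma.db"), "w").close()  # stand-in for the SQLite DB
archive = backup_kuma_data(src, dst)
url = heartbeat_url("http://uptime-kuma.monitoring:3001/api/push/TOKEN",
                    f"backup ok: {archive.name}")
# In production, fire the heartbeat after a successful backup:
# urllib.request.urlopen(url, timeout=10)
print(archive.name, "status=up" in url)
```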
How do you use Uptime Kuma with a service mesh?
Monitor every service over its internal DNS name or ClusterIP using HTTP, TCP, DNS, and gRPC checks, tracking response time and status codes and alerting on failures. This complements Linkerd or Istio observability rather than replacing it.
What should you monitor in a service mesh?
HTTP /health and /ready endpoints, TCP ports for databases and Redis, gRPC health RPCs, DNS service discovery, push heartbeats from CronJobs, sidecar containers via the Docker monitor, mTLS certificate expiry, and the mesh control plane itself.
How do you set up alerting?
Configure Telegram, Slack, PagerDuty, Discord, or email notifications for down events, slow response times, and expiring certificates. Use retries to suppress false positives, and define an escalation policy that prioritizes critical services.
How do you build a status page for the mesh?
Group services (frontend, backend, data, infrastructure) by namespace and show uptime percentage, response time, and incident history. Status pages can be public or internal, served on a custom domain, with maintenance windows announced in advance.
Summary
Uptime Kuma brings lightweight health checks (HTTP, TCP, DNS, gRPC), alerting through PagerDuty and Slack, and a status page to a service mesh, complementing Linkerd or Istio on Kubernetes, with runbooks supporting production operations.
