Uptime Kuma Monitoring Service Mesh Setup

2026-04-04 · อ. บอม (SiamCafe.net) · 9,979 words

Uptime Kuma Mesh

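This article covers using Uptime Kuma to monitor a service mesh in production: HTTP, TCP, DNS, and gRPC health checks; alerting through Telegram, Slack, and PagerDuty; a status page; and how this fits alongside the observability that Linkerd or Istio provides.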

Monitor Type | Target            | Check             | Interval | Alert Threshold
HTTP         | Service /health   | Status 200 + Body | 20s      | 3 failures
TCP          | Database Port     | Port Open         | 30s      | 2 failures
DNS          | Service Discovery | Record Resolve    | 60s      | 2 failures
gRPC         | gRPC Service      | Health RPC        | 20s      | 3 failures
Push         | CronJob           | Heartbeat         | Custom   | 1 miss
Docker       | Sidecar Proxy     | Container Up      | 30s      | 1 failure
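
The Alert Threshold column means a monitor should only raise an alert after that many consecutive failed checks. A minimal sketch of that counting logic follows. `FailureTracker` is a hypothetical helper written for illustration, not Uptime Kuma's internal code. In Uptime Kuma itself the closest setting is the Retries field on each monitor.

```python
from dataclasses import dataclass


@dataclass
class FailureTracker:
    """Counts consecutive failed checks for one monitor (hypothetical helper)."""
    threshold: int        # e.g. 3 for the HTTP row in the table above
    consecutive: int = 0  # failed checks since the last success

    def record(self, ok: bool) -> bool:
        """Record one check result; return True when an alert should fire."""
        if ok:
            self.consecutive = 0
            return False
        self.consecutive += 1
        # Fire exactly once, on the check that reaches the threshold.
        return self.consecutive == self.threshold


tracker = FailureTracker(threshold=3)
results = [True, False, False, True, False, False, False, False]
alerts = [tracker.record(ok) for ok in results]
print(alerts)
```

A success resets the counter, so the two failures before the recovery do not count toward the later run of failures. Only the third failure in a row triggers the alert.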

Service Mesh Monitoring

# === Uptime Kuma for Service Mesh ===

# Kubernetes Deployment: Uptime Kuma in the monitoring namespace
# apiVersion: apps/v1
# kind: Deployment
# metadata:
#   name: uptime-kuma
#   namespace: monitoring
# spec:
#   replicas: 1
#   selector:
#     matchLabels:
#       app: uptime-kuma
#   template:
#     metadata:
#       labels:
#         app: uptime-kuma
#       annotations:
#         linkerd.io/inject: enabled
#     spec:
#       containers:
#         - name: uptime-kuma
#           image: louislam/uptime-kuma:latest
#           ports:
#             - containerPort: 3001
#           volumeMounts:
#             - name: data
#               mountPath: /app/data
#           resources:
#             requests:
#               memory: "256Mi"
#               cpu: "100m"
#             limits:
#               memory: "512Mi"
#               cpu: "500m"
#       volumes:
#         - name: data
#           persistentVolumeClaim:
#             claimName: uptime-kuma-pvc

# Service for internal access
# apiVersion: v1
# kind: Service
# metadata:
#   name: uptime-kuma
#   namespace: monitoring
# spec:
#   selector:
#     app: uptime-kuma
#   ports:
#     - port: 3001
#       targetPort: 3001

# Ingress for external access
# apiVersion: networking.k8s.io/v1
# kind: Ingress
# metadata:
#   name: uptime-kuma
#   namespace: monitoring
#   annotations:
#     cert-manager.io/cluster-issuer: letsencrypt
# spec:
#   tls:
#     - hosts: [status.example.com]
#       secretName: status-tls
#   rules:
#     - host: status.example.com
#       http:
#         paths:
#           - path: /
#             pathType: Prefix
#             backend:
#               service:
#                 name: uptime-kuma
#                 port: {number: 3001}

from dataclasses import dataclass

@dataclass
class MeshMonitor:
    service: str
    namespace: str
    monitor_type: str
    url: str
    interval: int
    alert_channel: str

monitors = [
    MeshMonitor("API Gateway", "production", "HTTP", "http://api-gateway.production:8080/health", 20, "PagerDuty"),
    MeshMonitor("Order Service", "production", "HTTP", "http://order-svc.production:8080/health", 20, "Slack"),
    MeshMonitor("Payment Service", "production", "HTTP", "http://payment-svc.production:8080/health", 20, "PagerDuty"),
    MeshMonitor("PostgreSQL", "data", "TCP", "postgresql.data:5432", 30, "PagerDuty"),
    MeshMonitor("Redis", "data", "TCP", "redis.data:6379", 30, "Slack"),
    MeshMonitor("NATS", "messaging", "TCP", "nats.messaging:4222", 30, "Slack"),
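    # Assumption to verify: the dashboard's service name and namespace depend on the Linkerd version
    # (newer releases ship it in the linkerd-viz extension), so confirm with `kubectl get svc -A`.
    MeshMonitor("Linkerd Control", "linkerd", "HTTP", "http://linkerd-web.linkerd:8084/ready", 60, "PagerDuty"),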
    MeshMonitor("Backup CronJob", "production", "Push", "Push URL", 3600, "Telegram"),
]

print("=== Service Mesh Monitors ===")
for m in monitors:
    print(f"  [{m.service}] NS: {m.namespace} | Type: {m.monitor_type}")
    print(f"    URL: {m.url}")
    print(f"    Interval: {m.interval}s | Alert: {m.alert_channel}")
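
The TCP monitors in the list use `host:port` targets such as `postgresql.data:5432`. As a rough illustration of what a "Port Open" check does, here is a sketch using only the standard library. `parse_target` and `tcp_port_open` are hypothetical names, and the 3-second timeout is an assumption rather than a value from Uptime Kuma.

```python
import socket


def parse_target(target: str) -> tuple[str, int]:
    """Split a 'host:port' string such as 'postgresql.data:5432'."""
    host, _, port = target.rpartition(":")
    if not host or not port.isdigit():
        raise ValueError(f"expected host:port, got {target!r}")
    return host, int(port)


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


# Parsing only; the connection check itself needs to run inside the cluster.
for target in ("postgresql.data:5432", "redis.data:6379", "nats.messaging:4222"):
    print(parse_target(target))
```

From a pod inside the cluster, `tcp_port_open(*parse_target("redis.data:6379"))` would return `True` when the Redis port accepts connections. An open port only shows that something is listening, not that the service behind it is healthy.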

Alerting Configuration

# === Alert Configuration ===

# Notification Setup in Uptime Kuma:
# 1. Telegram:
#    Bot Token: 123456:ABC-DEF
#    Chat ID: -1001234567890
#    Message: "🔴 {{NAME}} is {{STATUS}} - {{MSG}}"
#
# 2. Slack:
#    Webhook: https://hooks.slack.com/services/T.../B.../xxx
#    Channel: #alerts-production
#
# 3. PagerDuty:
#    Integration Key: xxxxx
#    Severity: critical (for P0 services)
#
# 4. Discord:
#    Webhook: https://discord.com/api/webhooks/xxx/yyy

# Alert Escalation Matrix
# P0 (Critical): API Gateway, Payment → PagerDuty + Slack + Telegram
# P1 (High): Order, Auth → Slack + Telegram
# P2 (Medium): Cache, Queue → Slack only
# P3 (Low): CronJob, Batch → Telegram only

from dataclasses import dataclass

@dataclass
class AlertRule:
    priority: str
    services: str
    channels: str
    response_time: str
    retry_before_alert: int
    escalation: str

rules = [
    AlertRule("P0 Critical", "API Gateway, Payment, Database", "PagerDuty + Slack + Telegram", "5 min", 2, "Escalate to Manager after 15min"),
    AlertRule("P1 High", "Order, Auth, User Service", "Slack + Telegram", "15 min", 3, "Escalate to Lead after 30min"),
    AlertRule("P2 Medium", "Redis, NATS, Search", "Slack", "30 min", 3, "Review in daily standup"),
    AlertRule("P3 Low", "CronJob, Batch, Reports", "Telegram", "4 hours", 1, "Review weekly"),
]

print("=== Alert Escalation Matrix ===")
for r in rules:
    print(f"  [{r.priority}] Services: {r.services}")
    print(f"    Channels: {r.channels}")
    print(f"    Response: {r.response_time} | Retry: {r.retry_before_alert}")
    print(f"    Escalation: {r.escalation}")
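
The escalation matrix can also be expressed as a lookup from service to notification channels. The sketch below uses the priorities and channels from the matrix comments above. The `channels_for` function and the fallback to P2 for unlisted services are assumptions added for illustration.

```python
# Channels per priority, taken from the escalation matrix.
ESCALATION = {
    "P0": ["PagerDuty", "Slack", "Telegram"],
    "P1": ["Slack", "Telegram"],
    "P2": ["Slack"],
    "P3": ["Telegram"],
}

# Service-to-priority mapping, also from the matrix.
SERVICE_PRIORITY = {
    "API Gateway": "P0", "Payment": "P0",
    "Order": "P1", "Auth": "P1",
    "Cache": "P2", "Queue": "P2",
    "CronJob": "P3", "Batch": "P3",
}


def channels_for(service: str, default_priority: str = "P2") -> list[str]:
    """Return the channels to notify for a service.

    Unlisted services fall back to default_priority (an assumption).
    """
    priority = SERVICE_PRIORITY.get(service, default_priority)
    return ESCALATION[priority]


for name in ("Payment", "Order", "Cache", "CronJob"):
    print(f"{name}: {', '.join(channels_for(name))}")
```

Keeping the mapping in one place makes it easier to see when a monitor's configured notification does not match its priority.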

Status Page and Operations

# === Status Page Configuration ===

# Status Page Groups:
# 1. Frontend Services
#    - Website (HTTP)
#    - CDN (HTTP)
#    - Static Assets (HTTP)
#
# 2. API Services
#    - API Gateway (HTTP)
#    - GraphQL (HTTP)
#    - WebSocket (TCP)
#
# 3. Backend Services
#    - Order Service (HTTP)
#    - Payment Service (HTTP)
#    - Notification (HTTP)
#
# 4. Infrastructure
#    - Database (TCP)
#    - Cache (TCP)
#    - Message Broker (TCP)
#    - Service Mesh (HTTP)

from dataclasses import dataclass

@dataclass
class OperationalMetric:
    metric: str
    current: str
    target: str
    trend: str

ops_metrics = [
    OperationalMetric("Overall Uptime (30d)", "99.95%", "99.9%", "Above target"),
    OperationalMetric("API Gateway Uptime", "99.99%", "99.95%", "Excellent"),
    OperationalMetric("Avg Response Time", "145ms", "<200ms", "Good"),
    OperationalMetric("P99 Response Time", "820ms", "<1000ms", "Good"),
    OperationalMetric("Incidents (30d)", "2", "<5", "Good"),
    OperationalMetric("MTTR", "12 min", "<30 min", "Excellent"),
    OperationalMetric("False Positive Rate", "0.5%", "<2%", "Good"),
    OperationalMetric("Monitor Count", "24", "N/A", "All active"),
]

print("Operational Metrics:")
for m in ops_metrics:
    print(f"  [{m.metric}] Current: {m.current} | Target: {m.target} | {m.trend}")

runbook = {
    "Service Down": "1. Check Uptime Kuma alert 2. kubectl get pods 3. Check logs 4. Restart if needed 5. Verify recovery",
    "High Latency": "1. Check Linkerd Viz dashboard 2. Identify slow service 3. Check resource usage 4. Scale if needed",
    "Database Down": "1. PagerDuty alert 2. Check PostgreSQL status 3. Failover if replica available 4. Restore from backup",
    "Mesh Control Down": "1. Check linkerd check 2. Restart control plane 3. Verify data plane healthy",
    "Certificate Expiry": "1. Check cert-manager 2. Renew certificate 3. Verify mTLS working",
}

print("\n\nRunbook:")
for k, v in runbook.items():
    print(f"  [{k}]: {v}")
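
To put the uptime targets in the metrics list into concrete terms, it helps to convert a percentage into allowed downtime. This is plain arithmetic, not an Uptime Kuma feature, and `allowed_downtime_minutes` is a hypothetical helper name.

```python
def allowed_downtime_minutes(target_pct: float, days: int = 30) -> float:
    """Minutes of downtime permitted by an uptime target over the given window."""
    total_minutes = days * 24 * 60  # 30 days = 43,200 minutes
    return total_minutes * (1 - target_pct / 100)


for target in (99.9, 99.95):
    print(f"{target}% over 30d -> {allowed_downtime_minutes(target):.1f} min")
```

A 99.9% target over 30 days allows about 43.2 minutes of downtime, while 99.95% allows about 21.6 minutes.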

Tips

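How do you use Uptime Kuma with a service mesh?

Create an HTTP health, TCP port, DNS, or gRPC monitor for every service, addressed by its internal DNS name or ClusterIP. Track response time and status code, and alert on failures. This complements the observability that Linkerd or Istio already provides.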

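What should you monitor in a service mesh?

Monitor the HTTP /health and /ready endpoints, TCP ports for databases and Redis, gRPC health checks, DNS service discovery, Push heartbeats from CronJobs, Docker sidecar containers, certificates and mTLS, and the mesh control plane.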
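
A Push monitor works the other way round from the rest: the CronJob calls Uptime Kuma when it finishes, and a missed call counts as a failure. The sketch below shows how a job might send that heartbeat. The `/api/push/<token>` path and the `status`, `msg`, and `ping` parameters follow the URL that Uptime Kuma displays for Push monitors, but copy the exact URL from your monitor's settings rather than relying on this. `build_push_url`, `send_heartbeat`, and the token value are hypothetical.

```python
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen


def build_push_url(base_url: str, token: str, status: str = "up",
                   msg: str = "OK", ping_ms: Optional[int] = None) -> str:
    """Build a heartbeat URL for a Push monitor (token is a placeholder)."""
    query = {
        "status": status,
        "msg": msg,
        "ping": "" if ping_ms is None else str(ping_ms),
    }
    return f"{base_url.rstrip('/')}/api/push/{token}?{urlencode(query)}"


def send_heartbeat(url: str, timeout: float = 10.0) -> bool:
    """Call the push URL; return True on HTTP 200, False on any network error."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError and HTTPError are subclasses of OSError.
        return False


# Service name and port match the Service manifest in this article.
url = build_push_url("http://uptime-kuma.monitoring:3001", "YOUR_PUSH_TOKEN")
print(url)
```

Call `send_heartbeat(url)` as the last step of the job, so a crash partway through results in a missed heartbeat and therefore an alert.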

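How do you set up alerts?

Send notifications through Telegram, Slack, PagerDuty, Discord, or email. Alert on downtime, slow response time, and expiring certificates. Use retries to reduce false positives, and define an escalation policy that ranks critical services by priority.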

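How do you build a status page for the mesh?

Group services into Frontend, Backend, Data, and Infrastructure, or by namespace. Show uptime percentage, response time, and incidents. Serve the page on a custom domain, decide whether it is public or internal, and announce maintenance windows on it.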

Summary

Uptime Kuma can monitor a service mesh on Kubernetes with HTTP, TCP, DNS, and gRPC health checks, alert through PagerDuty and Slack, and publish a status page. Combined with Linkerd or Istio and a runbook, it supports day-to-day production operations.
