Uptime Kuma for Service Mesh Monitoring
Uptime Kuma can monitor a service mesh end to end: HTTP, TCP, DNS, and gRPC health checks on every service, alerting through Telegram, Slack, and PagerDuty, and a public status page, complementing Linkerd or Istio observability in production.
| Monitor Type | Target | Check | Interval | Alert Threshold |
|---|---|---|---|---|
| HTTP | Service /health | Status 200 + Body | 20s | 3 failures |
| TCP | Database Port | Port Open | 30s | 2 failures |
| DNS | Service Discovery | Record Resolve | 60s | 2 failures |
| gRPC | gRPC Service | Health RPC | 20s | 3 failures |
| Push | CronJob | Heartbeat | Custom | 1 miss |
| Docker | Sidecar Proxy | Container Up | 30s | 1 failure |
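The Alert Threshold column means an alert fires only after that many consecutive failed checks, so a single dropped packet does not page anyone. A minimal sketch of that logic in Python (the class and names are illustrative, not Uptime Kuma internals):

```python
from dataclasses import dataclass

@dataclass
class CheckState:
    """Tracks consecutive failures for one monitor (illustrative sketch)."""
    alert_threshold: int
    consecutive_failures: int = 0

    def record(self, check_ok: bool) -> bool:
        """Record one check result; return True when an alert should fire."""
        if check_ok:
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        # Alert exactly once, when the failure streak reaches the threshold;
        # this is what filters out one-off network blips (false positives).
        return self.consecutive_failures == self.alert_threshold

state = CheckState(alert_threshold=3)
results = [state.record(ok) for ok in [True, False, False, False, False]]
print(results)  # only the third consecutive failure triggers the alert
```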
Service Mesh Monitoring
# === Uptime Kuma for Service Mesh ===
# Kubernetes Deployment — Uptime Kuma in the monitoring namespace
# apiVersion: apps/v1
# kind: Deployment
# metadata:
#   name: uptime-kuma
#   namespace: monitoring
# spec:
#   replicas: 1
#   selector:
#     matchLabels:
#       app: uptime-kuma
#   template:
#     metadata:
#       labels:
#         app: uptime-kuma
#       annotations:
#         linkerd.io/inject: enabled
#     spec:
#       containers:
#         - name: uptime-kuma
#           image: louislam/uptime-kuma:latest
#           ports:
#             - containerPort: 3001
#           volumeMounts:
#             - name: data
#               mountPath: /app/data
#           resources:
#             requests:
#               memory: "256Mi"
#               cpu: "100m"
#             limits:
#               memory: "512Mi"
#               cpu: "500m"
#       volumes:
#         - name: data
#           persistentVolumeClaim:
#             claimName: uptime-kuma-pvc
# Service for internal access
# apiVersion: v1
# kind: Service
# metadata:
#   name: uptime-kuma
#   namespace: monitoring
# spec:
#   selector:
#     app: uptime-kuma
#   ports:
#     - port: 3001
#       targetPort: 3001
# Ingress for external access
# apiVersion: networking.k8s.io/v1
# kind: Ingress
# metadata:
#   name: uptime-kuma
#   namespace: monitoring
#   annotations:
#     cert-manager.io/cluster-issuer: letsencrypt
# spec:
#   tls:
#     - hosts: [status.example.com]
#       secretName: status-tls
#   rules:
#     - host: status.example.com
#       http:
#         paths:
#           - path: /
#             pathType: Prefix
#             backend:
#               service:
#                 name: uptime-kuma
#                 port:
#                   number: 3001
from dataclasses import dataclass

@dataclass
class MeshMonitor:
    service: str
    namespace: str
    monitor_type: str
    url: str
    interval: int
    alert_channel: str

monitors = [
    MeshMonitor("API Gateway", "production", "HTTP", "http://api-gateway.production:8080/health", 20, "PagerDuty"),
    MeshMonitor("Order Service", "production", "HTTP", "http://order-svc.production:8080/health", 20, "Slack"),
    MeshMonitor("Payment Service", "production", "HTTP", "http://payment-svc.production:8080/health", 20, "PagerDuty"),
    MeshMonitor("PostgreSQL", "data", "TCP", "postgresql.data:5432", 30, "PagerDuty"),
    MeshMonitor("Redis", "data", "TCP", "redis.data:6379", 30, "Slack"),
    MeshMonitor("NATS", "messaging", "TCP", "nats.messaging:4222", 30, "Slack"),
    MeshMonitor("Linkerd Control", "linkerd", "HTTP", "http://linkerd-web.linkerd:8084/ready", 60, "PagerDuty"),
    MeshMonitor("Backup CronJob", "production", "Push", "Push URL", 3600, "Telegram"),
]

print("=== Service Mesh Monitors ===")
for m in monitors:
    print(f"  [{m.service}] NS: {m.namespace} | Type: {m.monitor_type}")
    print(f"    URL: {m.url}")
    print(f"    Interval: {m.interval}s | Alert: {m.alert_channel}")
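The TCP monitors in the list above (PostgreSQL, Redis, NATS) reduce to a plain connect test: the port either accepts a connection or it does not. A small sketch of such a check, demonstrated against a local listener since the in-cluster hostnames only resolve inside the mesh:

```python
import socket

def tcp_check(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds.

    This is essentially what a TCP monitor does on every interval:
    open a connection, then close it immediately.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Demo against a local listener (stand-in for e.g. postgresql.data:5432).
server = socket.socket()
server.bind(("127.0.0.1", 0))  # port 0 = let the OS pick a free port
server.listen(1)
port = server.getsockname()[1]
print(tcp_check("127.0.0.1", port))  # port open
server.close()
print(tcp_check("127.0.0.1", port))  # port closed now
```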
Alerting Configuration
# === Alert Configuration ===
# Notification Setup in Uptime Kuma:
# 1. Telegram:
# Bot Token: 123456:ABC-DEF
# Chat ID: -1001234567890
# Message: "🔴 {{NAME}} is {{STATUS}} - {{MSG}}"
#
# 2. Slack:
# Webhook: https://hooks.slack.com/services/T.../B.../xxx
# Channel: #alerts-production
#
# 3. PagerDuty:
# Integration Key: xxxxx
# Severity: critical (for P0 services)
#
# 4. Discord:
# Webhook: https://discord.com/api/webhooks/xxx/yyy
# Alert Escalation Matrix
# P0 (Critical): API Gateway, Payment → PagerDuty + Slack + Telegram
# P1 (High): Order, Auth → Slack + Telegram
# P2 (Medium): Cache, Queue → Slack only
# P3 (Low): CronJob, Batch → Telegram only
@dataclass
class AlertRule:
    priority: str
    services: str
    channels: str
    response_time: str
    retry_before_alert: int
    escalation: str

rules = [
    AlertRule("P0 Critical", "API Gateway, Payment, Database", "PagerDuty + Slack + Telegram", "5 min", 2, "Escalate to Manager after 15min"),
    AlertRule("P1 High", "Order, Auth, User Service", "Slack + Telegram", "15 min", 3, "Escalate to Lead after 30min"),
    AlertRule("P2 Medium", "Redis, NATS, Search", "Slack", "30 min", 3, "Review in daily standup"),
    AlertRule("P3 Low", "CronJob, Batch, Reports", "Telegram", "4 hours", 1, "Review weekly"),
]

print("=== Alert Escalation Matrix ===")
for r in rules:
    print(f"  [{r.priority}] Services: {r.services}")
    print(f"    Channels: {r.channels}")
    print(f"    Response: {r.response_time} | Retry: {r.retry_before_alert}")
    print(f"    Escalation: {r.escalation}")
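The escalation matrix can also be expressed as a small lookup table, which is handy when provisioning notification assignments programmatically. The service-to-priority assignments below are assumptions mirroring the matrix, not an Uptime Kuma API:

```python
# Priority -> notification channels, taken from the matrix above.
ESCALATION = {
    "P0": ["PagerDuty", "Slack", "Telegram"],
    "P1": ["Slack", "Telegram"],
    "P2": ["Slack"],
    "P3": ["Telegram"],
}

# Assumed service-to-priority assignments (illustrative).
SERVICE_PRIORITY = {
    "API Gateway": "P0", "Payment Service": "P0", "PostgreSQL": "P0",
    "Order Service": "P1", "Auth Service": "P1",
    "Redis": "P2", "NATS": "P2",
    "Backup CronJob": "P3",
}

def channels_for(service: str) -> list:
    """Resolve channels for a service; unknown services default to P2 (Slack only)."""
    return ESCALATION[SERVICE_PRIORITY.get(service, "P2")]

print(channels_for("Payment Service"))  # ['PagerDuty', 'Slack', 'Telegram']
print(channels_for("Search"))           # ['Slack']
```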
Status Page and Operations
# === Status Page Configuration ===
# Status Page Groups:
# 1. Frontend Services
# - Website (HTTP)
# - CDN (HTTP)
# - Static Assets (HTTP)
#
# 2. API Services
# - API Gateway (HTTP)
# - GraphQL (HTTP)
# - WebSocket (TCP)
#
# 3. Backend Services
# - Order Service (HTTP)
# - Payment Service (HTTP)
# - Notification (HTTP)
#
# 4. Infrastructure
# - Database (TCP)
# - Cache (TCP)
# - Message Broker (TCP)
# - Service Mesh (HTTP)
@dataclass
class OperationalMetric:
    metric: str
    current: str
    target: str
    trend: str

ops_metrics = [
    OperationalMetric("Overall Uptime (30d)", "99.95%", "99.9%", "Above target"),
    OperationalMetric("API Gateway Uptime", "99.99%", "99.95%", "Excellent"),
    OperationalMetric("Avg Response Time", "145ms", "<200ms", "Good"),
    OperationalMetric("P99 Response Time", "820ms", "<1000ms", "Good"),
    OperationalMetric("Incidents (30d)", "2", "<5", "Good"),
    OperationalMetric("MTTR", "12 min", "<30 min", "Excellent"),
    OperationalMetric("False Positive Rate", "0.5%", "<2%", "Good"),
    OperationalMetric("Monitor Count", "24", "N/A", "All active"),
]

print("Operational Metrics:")
for m in ops_metrics:
    print(f"  [{m.metric}] Current: {m.current} | Target: {m.target} | {m.trend}")
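The uptime targets translate directly into a downtime budget, which makes them easier to reason about during incident response: 99.9% over 30 days allows roughly 43 minutes of downtime, while 99.99% allows only about 4.

```python
def downtime_budget_minutes(uptime_pct: float, days: int = 30) -> float:
    """Minutes of allowed downtime in the window for a given uptime target."""
    return days * 24 * 60 * (1 - uptime_pct / 100)

for target in (99.9, 99.95, 99.99):
    print(f"{target}%: {downtime_budget_minutes(target):.1f} min / 30d")
```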
runbook = {
    "Service Down": "1. Check Uptime Kuma alert 2. kubectl get pods 3. Check logs 4. Restart if needed 5. Verify recovery",
    "High Latency": "1. Check Linkerd Viz dashboard 2. Identify slow service 3. Check resource usage 4. Scale if needed",
    "Database Down": "1. PagerDuty alert 2. Check PostgreSQL status 3. Failover if replica available 4. Restore from backup",
    "Mesh Control Down": "1. Check linkerd check 2. Restart control plane 3. Verify data plane healthy",
    "Certificate Expiry": "1. Check cert-manager 2. Renew certificate 3. Verify mTLS working",
}

print("\n\nRunbook:")
for k, v in runbook.items():
    print(f"  [{k}]: {v}")
Tips
- Internal: deploy Uptime Kuma inside the cluster so it can check internal Services directly
- Retry: configure 2-3 retries before alerting to reduce false positives
- Group: organize monitors by namespace and priority
- Backup: back up the Uptime Kuma data directory (/app/data) daily
- Runbook: write a runbook for every alert scenario
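The backup and push-monitor ideas combine naturally: archive /app/data daily and report a heartbeat to a push monitor, so a missed backup raises an alert after one miss. A sketch, where the paths and the push token are placeholders (Uptime Kuma push monitors record a heartbeat on a GET to their unique URL, with optional status/msg query parameters):

```python
import datetime
import pathlib
import tarfile
import urllib.parse

def backup_kuma_data(data_dir: str, out_dir: str) -> pathlib.Path:
    """Archive the Uptime Kuma data directory into a dated tarball."""
    stamp = datetime.date.today().isoformat()
    archive = pathlib.Path(out_dir) / f"uptime-kuma-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(data_dir, arcname="data")
    return archive

def heartbeat_url(base: str, msg: str) -> str:
    """Build the push-monitor heartbeat URL.

    `base` is the unique push URL shown in the Uptime Kuma UI, e.g.
    http://uptime-kuma.monitoring:3001/api/push/<token> (token is a placeholder).
    """
    return f"{base}?{urllib.parse.urlencode({'status': 'up', 'msg': msg})}"

# Demo with temporary directories (a real CronJob would target /app/data).
import os
import tempfile
src, dst = tempfile.mkdtemp(), tempfile.mkdtemp()
open(os.path.join(src, "kuma.db"), "w").close()  # stand-in for the SQLite DB
archive = backup_kuma_data(src, dst)
url = heartbeat_url("http://uptime-kuma.monitoring:3001/api/push/TOKEN",
                    f"backup ok: {archive.name}")
# In production, fire the heartbeat after a successful backup:
# urllib.request.urlopen(url, timeout=10)
print(archive.name, "status=up" in url)
```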
How do you use Uptime Kuma with a service mesh?
Monitor every service over its internal DNS name or ClusterIP using HTTP, TCP, DNS, and gRPC checks, tracking response time and status codes and alerting on failures. This complements Linkerd or Istio observability rather than replacing it.
What should you monitor in a service mesh?
HTTP /health and /ready endpoints, TCP ports for databases and Redis, gRPC health RPCs, DNS service discovery, push heartbeats from CronJobs, sidecar containers via the Docker monitor, mTLS certificate expiry, and the mesh control plane itself.
How do you set up alerting?
Configure Telegram, Slack, PagerDuty, Discord, or email notifications for down events, slow response times, and expiring certificates. Use retries to suppress false positives, and define an escalation policy that prioritizes critical services.
How do you build a status page for the mesh?
Group services (frontend, backend, data, infrastructure) by namespace and show uptime percentage, response time, and incident history. Status pages can be public or internal, served on a custom domain, with maintenance windows announced in advance.
Summary
Uptime Kuma brings lightweight health checks (HTTP, TCP, DNS, gRPC), alerting through PagerDuty and Slack, and a status page to a service mesh, complementing Linkerd or Istio on Kubernetes, with runbooks supporting production operations.
