Skaffold Dev Monitoring
| Tool | Purpose | Data Type | Query Language | Best For |
|---|---|---|---|---|
| Prometheus | Metrics Collection | Time Series | PromQL | Metrics |
| Grafana | Visualization | Dashboard | Multiple | Dashboard |
| Loki | Log Aggregation | Logs | LogQL | Logs |
| Jaeger | Distributed Tracing | Traces | Search | Tracing |
| Alertmanager | Alert Routing | Alerts | Rules | Notification |
Prometheus Setup
# === Prometheus + Grafana for Skaffold Dev ===
# Install with Helm
# helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
# helm install monitoring prometheus-community/kube-prometheus-stack \
#   --namespace monitoring --create-namespace \
#   --set grafana.adminPassword=admin123
# Skaffold config — include monitoring
# apiVersion: skaffold/v4beta7
# kind: Config
# metadata:
#   name: my-app
# build:
#   artifacts:
#     - image: my-app
#       docker:
#         dockerfile: Dockerfile
# manifests:
#   rawYaml:
#     - k8s/*.yaml
# deploy:
#   kubectl: {}
# portForward:
#   - resourceType: service
#     resourceName: my-app
#     port: 8080
#     localPort: 8080
#   - resourceType: service
#     resourceName: monitoring-grafana
#     namespace: monitoring
#     port: 80
#     localPort: 3000
#   - resourceType: service
#     resourceName: monitoring-kube-prometheus-prometheus
#     namespace: monitoring
#     port: 9090
#     localPort: 9090
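With the port-forwards above active, Prometheus answers on localhost:9090. A minimal sketch of calling its instant-query HTTP API from Python — it assumes `skaffold dev` and the forward are running, so only the URL builder is exercised at import time:

```python
import json
import urllib.parse
import urllib.request

def prometheus_query_url(base: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base}/api/v1/query?{urllib.parse.urlencode({'query': promql})}"

def instant_query(promql: str, base: str = "http://localhost:9090") -> dict:
    """Run an instant query against the forwarded Prometheus."""
    with urllib.request.urlopen(prometheus_query_url(base, promql)) as resp:
        return json.load(resp)

# e.g. instant_query('rate(http_requests_total[5m])') once the forward is up
print(prometheus_query_url("http://localhost:9090", "up"))
```

The builder URL-encodes the PromQL expression, so brackets and braces in queries survive the round trip.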
# Application Metrics — Python Flask
# import time
# from prometheus_client import Counter, Histogram, generate_latest
# from flask import Flask, Response, request
#
# app = Flask(__name__)
#
# REQUEST_COUNT = Counter('http_requests_total',
#     'Total HTTP requests', ['method', 'endpoint', 'status'])
# REQUEST_LATENCY = Histogram('http_request_duration_seconds',
#     'HTTP request latency', ['method', 'endpoint'])
#
# @app.before_request
# def before_request():
#     request.start_time = time.time()
#
# @app.after_request
# def after_request(response):
#     latency = time.time() - request.start_time
#     REQUEST_COUNT.labels(request.method, request.path, response.status_code).inc()
#     REQUEST_LATENCY.labels(request.method, request.path).observe(latency)
#     return response
#
# @app.route('/metrics')
# def metrics():
#     return Response(generate_latest(), mimetype='text/plain')
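To make the labelled-Counter semantics concrete, here is a toy counter that mimics what `prometheus_client` does for `.labels(...).inc()` and the text exposition format — an illustration only, not the library's actual implementation:

```python
from collections import defaultdict

class _Child:
    """Handle returned by labels(); increments one label combination."""
    def __init__(self, store, key):
        self._store, self._key = store, key

    def inc(self, amount: float = 1.0):
        self._store[self._key] += amount

class LabelledCounter:
    def __init__(self, name: str, label_names: tuple):
        self.name = name
        self.label_names = label_names
        self._values = defaultdict(float)  # label-value tuple -> count

    def labels(self, *label_values) -> _Child:
        return _Child(self._values, tuple(label_values))

    def expose(self) -> str:
        """Render in the Prometheus text exposition format."""
        lines = []
        for key, value in sorted(self._values.items()):
            labels = ",".join(f'{n}="{v}"'
                              for n, v in zip(self.label_names, key))
            lines.append(f"{self.name}{{{labels}}} {value}")
        return "\n".join(lines)

REQUEST_COUNT = LabelledCounter("http_requests_total",
                                ("method", "endpoint", "status"))
REQUEST_COUNT.labels("GET", "/", "200").inc()
REQUEST_COUNT.labels("GET", "/", "200").inc()
REQUEST_COUNT.labels("POST", "/login", "500").inc()
print(REQUEST_COUNT.expose())
```

Each unique label combination becomes its own time series — which is why high-cardinality labels (user IDs, request IDs) should never be used as metric labels.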
# ServiceMonitor — Prometheus scrape config
# apiVersion: monitoring.coreos.com/v1
# kind: ServiceMonitor
# metadata:
#   name: my-app
#   labels:
#     release: monitoring
# spec:
#   selector:
#     matchLabels:
#       app: my-app
#   endpoints:
#     - port: http
#       path: /metrics
#       interval: 15s
from dataclasses import dataclass

@dataclass
class MetricConfig:
    metric: str
    type_val: str
    labels: str
    alert_threshold: str
    dashboard: str

metrics = [
    MetricConfig("http_requests_total", "Counter", "method, endpoint, status", "> 100 errors/min", "Request Rate"),
    MetricConfig("http_request_duration_seconds", "Histogram", "method, endpoint", "p99 > 500ms", "Latency"),
    MetricConfig("container_cpu_usage_seconds_total", "Counter", "pod, namespace", "> 80% limit", "CPU Usage"),
    MetricConfig("container_memory_working_set_bytes", "Gauge", "pod, namespace", "> 80% limit", "Memory"),
    MetricConfig("kube_pod_status_phase", "Gauge", "pod, phase", "phase != Running", "Pod Status"),
]

print("=== Monitoring Metrics ===")
for m in metrics:
    print(f" [{m.metric}] Type: {m.type_val}")
    print(f" Labels: {m.labels}")
    print(f" Alert: {m.alert_threshold} | Dashboard: {m.dashboard}")
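The Counter-type thresholds in this table are evaluated as rates, not raw values. A sketch of how PromQL's `rate()` derives a per-second value from two scrapes of a monotonic counter (simplified: real PromQL also reconstructs counter resets):

```python
def counter_rate(v_start: float, v_end: float, seconds: float) -> float:
    """Per-second increase of a monotonic counter between two scrapes.
    A drop means the counter reset (e.g. pod restart); clamped to 0 here."""
    return max(v_end - v_start, 0.0) / seconds

# 1200 new errors counted over a 60 s window -> 20 errors/s
errors_per_sec = counter_rate(3_000, 4_200, 60)
print(f"{errors_per_sec:.1f} errors/s")
print("ALERT" if errors_per_sec * 60 > 100 else "ok")  # "> 100 errors/min" threshold
```

This is why restarting a pod does not produce a huge negative spike: the delta is computed on the increase, not the absolute counter value.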
Log Aggregation
# === Loki Log Aggregation ===
# Install Loki Stack
# helm repo add grafana https://grafana.github.io/helm-charts
# helm install loki grafana/loki-stack \
#   --namespace monitoring \
#   --set promtail.enabled=true \
#   --set grafana.enabled=false   # reuse the existing Grafana
# Structured Logging — Python
# import logging
# import json
# from datetime import datetime, timezone
#
# class JSONFormatter(logging.Formatter):
#     def format(self, record):
#         log_entry = {
#             "timestamp": datetime.now(timezone.utc).isoformat(),
#             "level": record.levelname,
#             "message": record.getMessage(),
#             "logger": record.name,
#             "module": record.module,
#             "function": record.funcName,
#             "line": record.lineno,
#         }
#         if hasattr(record, 'correlation_id'):
#             log_entry['correlation_id'] = record.correlation_id
#         if record.exc_info:
#             log_entry['exception'] = self.formatException(record.exc_info)
#         return json.dumps(log_entry)
#
# handler = logging.StreamHandler()
# handler.setFormatter(JSONFormatter())
# logger = logging.getLogger('my-app')
# logger.addHandler(handler)
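A runnable, stripped-down version of the formatter idea above — format one record and parse it back the way Loki's `| json` stage would (field set reduced for brevity):

```python
import json
import logging

class JSONFormatter(logging.Formatter):
    """Reduced field set; the full version is shown above."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            "logger": record.name,
        })

# Wire it up exactly as in the snippet above
handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
logger = logging.getLogger("my-app")
logger.addHandler(handler)

# Format one record directly and parse it back, as `| json` would
record = logging.LogRecord("my-app", logging.ERROR, __file__, 1,
                           "db connection failed", None, None)
parsed = json.loads(JSONFormatter().format(record))
print(parsed)  # {'level': 'error', 'message': 'db connection failed', 'logger': 'my-app'}
```

Because every line is one JSON object, Loki can extract `level`, `logger`, etc. as queryable labels without any custom parsing rules.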
# LogQL Queries in Grafana
# {app="my-app"} | json | level="error"
# {app="my-app"} | json | latency_ms > 500
# {namespace="production"} |= "Exception"
# rate({app="my-app"} | json | level="error" [5m])
@dataclass
class LogQuery:
    name: str
    logql: str
    purpose: str
    alert: bool

queries = [
    LogQuery("Error Logs", '{app="my-app"} | json | level="error"', "View all errors", True),
    LogQuery("Slow Requests", '{app="my-app"} | json | latency_ms > 500', "Find slow requests", True),
    LogQuery("Exception Trace", '{app="my-app"} |= "Exception"', "View stack traces", False),
    LogQuery("Error Rate", 'rate({app="my-app"} | json | level="error" [5m])', "Error rate", True),
    LogQuery("Request Volume", 'rate({app="my-app"} [1m])', "Request volume", False),
]

print("\n=== LogQL Queries ===")
for q in queries:
    alert_tag = "[ALERT]" if q.alert else ""
    print(f" [{q.name}] {alert_tag}")
    print(f" Query: {q.logql}")
    print(f" Purpose: {q.purpose}")
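What the Error Rate query computes can be emulated in plain Python: parse the JSON lines, keep the ones with `level="error"`, and divide by the `[5m]` window:

```python
import json

WINDOW_SECONDS = 300  # the [5m] range

log_lines = [
    '{"ts": 10, "level": "info",  "message": "started"}',
    '{"ts": 40, "level": "error", "message": "timeout"}',
    '{"ts": 95, "level": "error", "message": "timeout"}',
]

# `| json` stage: parse each line; `level="error"`: label filter
errors = [e for e in (json.loads(line) for line in log_lines)
          if e["level"] == "error"]
rate = len(errors) / WINDOW_SECONDS  # errors per second over the window
print(f"{len(errors)} errors -> {rate:.4f} errors/s")
```

Loki does the same work at query time, streaming over the stored log chunks instead of an in-memory list.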
Alerting
# === Alerting Configuration ===
# PrometheusRule — Alert Definitions
# apiVersion: monitoring.coreos.com/v1
# kind: PrometheusRule
# metadata:
#   name: my-app-alerts
#   labels:
#     release: monitoring
# spec:
#   groups:
#     - name: my-app
#       rules:
#         - alert: HighErrorRate
#           expr: |
#             sum(rate(http_requests_total{status=~"5.."}[5m]))
#               / sum(rate(http_requests_total[5m])) > 0.05
#           for: 5m
#           labels:
#             severity: critical
#           annotations:
#             summary: "High error rate on {{ $labels.instance }}"
#             description: "Error rate is {{ $value | humanizePercentage }}"
#
#         - alert: HighLatency
#           expr: histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 0.5
#           for: 5m
#           labels:
#             severity: warning
#
#         - alert: PodCrashLooping
#           expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
#           for: 5m
#           labels:
#             severity: critical
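The HighLatency rule leans on `histogram_quantile`. A sketch of the linear interpolation it performs over cumulative `le` buckets (simplified: Prometheus treats the `+Inf` bucket specially and this version assumes non-empty buckets):

```python
def histogram_quantile(q: float, buckets: list) -> float:
    """buckets: sorted (upper_bound_seconds, cumulative_count) pairs."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            # Linear interpolation within the bucket, as Prometheus does
            return prev_bound + (bound - prev_bound) * (
                (rank - prev_count) / (count - prev_count))
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# Cumulative request-duration buckets: 990 of 1000 requests took <= 0.5s
buckets = [(0.1, 800), (0.25, 950), (0.5, 990), (1.0, 1000)]
print(f"p99 = {histogram_quantile(0.99, buckets):.3f}s")  # right at the 0.5s threshold
```

The interpolation is why bucket boundaries matter: a p99 between two coarse buckets is only an estimate, so choose bucket edges close to your SLO thresholds.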
# Alertmanager — Slack Notification
# apiVersion: monitoring.coreos.com/v1alpha1
# kind: AlertmanagerConfig
# metadata:
#   name: my-app-alertmanager
# spec:
#   route:
#     receiver: slack
#     groupBy: [alertname, namespace]
#     groupWait: 30s
#     groupInterval: 5m
#     repeatInterval: 4h
#   receivers:
#     - name: slack
#       slackConfigs:
#         - channel: '#alerts'
#           apiURL:   # webhook URL comes from a Secret, not inline YAML
#             name: slack-webhook-secret   # hypothetical Secret name
#             key: url
#           title: '[{{ .Status }}] {{ .CommonLabels.alertname }}'
#           text: '{{ .CommonAnnotations.description }}'
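`groupBy: [alertname, namespace]` means Alertmanager sends one notification per unique label pair, not one per alert. A sketch of that grouping:

```python
from collections import defaultdict

GROUP_BY = ("alertname", "namespace")  # mirrors groupBy above

firing_alerts = [
    {"alertname": "HighErrorRate", "namespace": "prod", "pod": "api-1"},
    {"alertname": "HighErrorRate", "namespace": "prod", "pod": "api-2"},
    {"alertname": "HighLatency",   "namespace": "prod", "pod": "api-1"},
]

groups = defaultdict(list)
for alert in firing_alerts:
    groups[tuple(alert[label] for label in GROUP_BY)].append(alert)

for key, members in groups.items():
    print(f"{key}: one Slack message covering {len(members)} alert(s)")
```

`groupWait` then delays the first message of each group so late-arriving alerts with the same key ride along instead of triggering separate pings.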
@dataclass
class AlertRule:
    name: str
    condition: str
    severity: str
    action: str
    notification: str

alerts = [
    AlertRule("HighErrorRate", "5xx > 5% for 5min", "Critical", "Investigate immediately", "Slack + PagerDuty"),
    AlertRule("HighLatency", "p99 > 500ms for 5min", "Warning", "Check slow endpoints", "Slack"),
    AlertRule("PodCrashLooping", "Restart > 0 in 15min", "Critical", "Check pod logs", "Slack + PagerDuty"),
    AlertRule("HighCPU", "CPU > 80% for 10min", "Warning", "Scale up or optimize", "Slack"),
    AlertRule("HighMemory", "Memory > 85% for 5min", "Warning", "Check memory leaks", "Slack"),
    AlertRule("DiskFull", "Disk > 90%", "Critical", "Clean up or expand", "Slack + PagerDuty"),
]

print("Alert Rules:")
for a in alerts:
    print(f" [{a.severity}] {a.name}")
    print(f" Condition: {a.condition}")
    print(f" Action: {a.action} | Notify: {a.notification}")
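The "for 5min" in each condition means the expression must hold continuously before the alert fires. A sketch of that pending/firing state machine (timestamps in seconds; `alert_state` is a hypothetical helper, not a Prometheus API):

```python
def alert_state(true_since, now, for_seconds=300):
    """true_since: when the expr first became true (None = currently false)."""
    if true_since is None:
        return "inactive"
    return "firing" if now - true_since >= for_seconds else "pending"

print(alert_state(None, 1000))  # inactive
print(alert_state(900, 1000))   # pending: true for only 100s of the 300s window
print(alert_state(600, 1000))   # firing: continuously true for 400s
```

If the expression flips back to false at any point, the pending timer resets — that is what keeps short blips from paging anyone.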
Tips
- Metrics: collect RED metrics (Rate, Errors, Duration) for every service
- Structured logs: use a JSON log format so logs are easy to search
- Alerts: alert only on conditions that require action — too many alerts breed alert fatigue
- Dashboards: build a dashboard per service so you can see the big picture
- Dev parity: use the same monitoring stack in dev and production
What is Skaffold Dev Monitoring?
Watching the health of your application while it runs under `skaffold dev`: pod status, logs, and port-forwarded access to Prometheus, Grafana, Loki, and Alertmanager, so bugs and performance problems surface on metrics dashboards as you develop.
How do you set up Prometheus and Grafana?
Install the kube-prometheus-stack Helm chart, add a ServiceMonitor for your app, and build dashboards for CPU, memory, request rate, error rate, and latency. Define alert rules (e.g. CPU > 80%, error rate > 5%) and route them through Alertmanager to Slack or email.
How does log aggregation work?
Promtail ships each pod's stdout to Loki, and Grafana queries it with LogQL using labels such as app and namespace. Use structured JSON logging with a log level and a correlation ID so a single request can be traced across services.
How do you configure health checks?
A liveness probe restarts the container when it fails, a readiness probe gates traffic, and a startup probe covers slow boots. Typically each is an httpGet against /health or /ready with initialDelaySeconds and periodSeconds, where the readiness handler also checks dependencies such as the database and Redis.
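The readiness handler described above can be sketched as follows — `check_database` and `check_redis` are hypothetical stand-ins for real dependency pings:

```python
def check_database() -> bool:
    return True  # stand-in: e.g. run `SELECT 1` against the real DB

def check_redis() -> bool:
    return True  # stand-in: e.g. send PING to the real Redis

def readiness():
    """Status code + body the /ready endpoint would serve."""
    checks = {"database": check_database(), "redis": check_redis()}
    ok = all(checks.values())
    return 200 if ok else 503, {"status": "ready" if ok else "not ready",
                                "checks": checks}

status, body = readiness()
print(status, body)
```

Returning 503 while any dependency is down makes the readiness probe fail, so Kubernetes stops routing traffic to the pod without restarting it — that is the key difference from a liveness probe.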
Summary
Skaffold dev monitoring combines Prometheus metrics (PromQL), Loki log aggregation (LogQL), Grafana dashboards, health checks, and Alertmanager alerting — giving the dev loop the same observability you rely on in production.
