What Are Prometheus and Grafana? Building a Monitoring Stack for Infrastructure and Applications in 2026

In DevOps and infrastructure work in 2026, monitoring is no longer optional; it is central to running production systems. Prometheus and Grafana are one of the most widely used monitoring pairings, adopted everywhere from startups to Fortune 500 enterprises.

This article covers Prometheus and Grafana from the ground up: core concepts, installation, PromQL for querying data, building dashboards, and setting up alerts, through to Grafana Loki for logs and Tempo for traces, along with best practices used in real production systems.

What Is Prometheus?

Prometheus is an open-source monitoring and alerting system that was started at SoundCloud in 2012 and is now a project of the Cloud Native Computing Foundation (CNCF), like Kubernetes. Prometheus is pull-based: the Prometheus server scrapes metrics from its targets at regular intervals instead of having applications send metrics to it.

Why pull-based rather than push-based?

Centralized control: Prometheus decides what to scrape, when, and how often.
Built-in health check: a failed scrape means the target is down or unreachable, so every scrape doubles as a health check.
No Prometheus-specific configuration in the application: the application only exposes a metrics endpoint (/metrics) and does not need to know where Prometheus runs.
Easy to debug: you can open /metrics in a browser and see exactly what the application exposes.
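To make the pull model concrete, here is a minimal sketch of what an application exposes at /metrics, written with only the Python standard library. The `REQUESTS` dictionary, the `serve` helper, and port 8080 are assumptions made for illustration; a real application would normally use an official client library rather than rendering the text format by hand.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory counters, keyed by (method, endpoint).
REQUESTS = {("GET", "/api/users"): 12, ("POST", "/api/orders"): 3}


def render_metrics(counters):
    """Render counters in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests",
        "# TYPE http_requests_total counter",
    ]
    for (method, endpoint), value in sorted(counters.items()):
        lines.append(
            f'http_requests_total{{method="{method}",endpoint="{endpoint}"}} {value}'
        )
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the current counter values whenever a scraper asks for them."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUESTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Block and serve /metrics; the port is an arbitrary choice for this sketch."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling `serve()` and opening http://localhost:8080/metrics in a browser shows the same plain text that Prometheus would receive on each scrape. The application never needs to know where Prometheus runs.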

Prometheus Architecture

Prometheus is made up of several components that work together:

# Prometheus Architecture Overview
#
# [Targets] ← scrape ← [Prometheus Server] → [Alertmanager] → [Slack/Email/PagerDuty]
#   ├── App /metrics              ├── TSDB (Time Series DB)
#   ├── Node Exporter             ├── HTTP API
#   ├── cAdvisor                  └── PromQL Engine
#   └── Custom Exporters
#                                 [Grafana] ← query ← [Prometheus Server]
#                                     └── Dashboards & Visualizations
#
# [Pushgateway] ← push ← [Short-lived Jobs]  (Batch jobs, Cron jobs)
#                  scrape ↗
#            [Prometheus Server]

Key Components

Prometheus Server: the core of the system; it scrapes, stores, and queries metrics.
Exporters: programs that translate metrics from an application or system into a format Prometheus understands.
Alertmanager: handles alerts (routing, grouping, silencing, notification).
Pushgateway: for short-lived jobs that do not live long enough to be scraped (such as cron jobs).
TSDB: the time series database that stores metric data efficiently.

Installing the Prometheus Stack with Docker Compose

# docker-compose.yml - Prometheus + Grafana + Alertmanager
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:v2.52.0
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/alert_rules.yml:/etc/prometheus/alert_rules.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    restart: unless-stopped

  grafana:
    image: grafana/grafana:11.0.0
    container_name: grafana
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=your_secure_password
      - GF_USERS_ALLOW_SIGN_UP=false
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager:v0.27.0
    container_name: alertmanager
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter:v1.8.0
    container_name: node-exporter
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:

# prometheus/prometheus.yml - Prometheus Configuration
global:
  scrape_interval: 15s          # Scrape every 15 seconds
  evaluation_interval: 15s      # Evaluate rules every 15 seconds
  scrape_timeout: 10s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - "alert_rules.yml"

scrape_configs:
  # Prometheus monitors itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node Exporter - Server metrics
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
        labels:
          instance: 'web-server-01'

  # Application metrics
  - job_name: 'my-app'
    metrics_path: '/metrics'
    scrape_interval: 10s
    static_configs:
      - targets: ['app:8080']
        labels:
          environment: 'production'
          team: 'backend'

  # Kubernetes Service Discovery
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)

Metric Types

Prometheus has four core metric types:

1. Counter

A value that only ever increases, such as the total number of requests or the total number of errors. It resets to zero only when the application restarts.

# Python - Counter
from prometheus_client import Counter

REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status_code']
)

# Increment
REQUEST_COUNT.labels(method='GET', endpoint='/api/users', status_code='200').inc()
REQUEST_COUNT.labels(method='POST', endpoint='/api/orders', status_code='201').inc()

2. Gauge

A value that can go up and down, such as temperature, CPU usage, the number of active connections, or the number of items in a queue.

# Python - Gauge
from prometheus_client import Gauge

ACTIVE_CONNECTIONS = Gauge(
    'active_connections',
    'Number of active connections',
    ['server']
)

TEMPERATURE = Gauge('cpu_temperature_celsius', 'CPU Temperature')

# Set / increment / decrement
ACTIVE_CONNECTIONS.labels(server='web-01').set(42)
ACTIVE_CONNECTIONS.labels(server='web-01').inc()     # +1
ACTIVE_CONNECTIONS.labels(server='web-01').dec(5)     # -5
TEMPERATURE.set(72.5)

3. Histogram

Measures the distribution of values, such as request duration, by counting observations into buckets, which makes it possible to estimate percentiles (p50, p90, p99).

# Python - Histogram
from prometheus_client import Histogram

REQUEST_DURATION = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration in seconds',
    ['method', 'endpoint'],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
)

# Time a call manually
import time

def process_request():
    """Placeholder for the real request handling."""

start = time.time()
process_request()
duration = time.time() - start
REQUEST_DURATION.labels(method='GET', endpoint='/api/users').observe(duration)

# Or use the decorator
@REQUEST_DURATION.labels(method='GET', endpoint='/api/users').time()
def handle_request():
    process_request()

4. Summary

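Similar to a Histogram, but quantiles are calculated on the client (application) side rather than in Prometheus. The drawback is that those quantiles cannot be aggregated across multiple instances, so Histograms are generally preferred today. Note that the Python client's Summary exposes only the count and sum of observations, not quantiles.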

# Python - Summary
from prometheus_client import Summary

REQUEST_LATENCY = Summary(
    'request_latency_seconds',
    'Request latency',
    ['endpoint']
)

REQUEST_LATENCY.labels(endpoint='/api').observe(0.123)

Best Practice: prefer Histogram over Summary in almost every case, because histograms can be aggregated across instances and percentiles are computed at query time, which is more flexible.
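The reason histograms aggregate cleanly is that bucket counts are plain counters: they can be summed per upper bound across instances before any percentile is estimated. The sketch below illustrates the idea with a simplified version of the linear interpolation that histogram_quantile() performs. It assumes cumulative buckets keyed by upper bound, a +Inf bucket, and a lowest bound of 0; the function names are illustrative and this is not the Prometheus implementation.

```python
import math


def merge_buckets(*instances):
    """Sum cumulative bucket counts per upper bound (le) across instances."""
    merged = {}
    for buckets in instances:
        for le, count in buckets.items():
            merged[le] = merged.get(le, 0) + count
    return merged


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 <= q <= 1) from cumulative buckets {le: count}."""
    total = buckets[math.inf]
    if total == 0:
        return math.nan
    rank = q * total
    prev_le, prev_count = 0.0, 0
    for le in sorted(buckets):
        count = buckets[le]
        if count >= rank:
            if le == math.inf:
                # The quantile falls in the open-ended bucket: report the
                # highest finite bound instead of guessing.
                return prev_le
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_le
            # Assume observations are spread evenly inside the bucket.
            return prev_le + (le - prev_le) * (rank - prev_count) / in_bucket
        prev_le, prev_count = le, count
    return prev_le


# Two instances of the same service, same bucket layout.
web_01 = {0.1: 30, 0.5: 50, 1.0: 50, math.inf: 50}
web_02 = {0.1: 20, 0.5: 40, 1.0: 50, math.inf: 50}
combined = merge_buckets(web_01, web_02)
p95 = histogram_quantile(0.95, combined)
```

A Summary cannot be combined this way, because averaging two precomputed p95 values does not give the p95 of the combined traffic.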

PromQL: The Prometheus Query Language

PromQL (Prometheus Query Language) is the language for querying time series data in Prometheus. It is central to everything Prometheus does, so you need to understand it to use Prometheus effectively.

Basic Selectors

# Instant vector - the most recent sample of each series at evaluation time
http_requests_total

# Label Matching
http_requests_total{method="GET"}
http_requests_total{method="GET", status_code="200"}
http_requests_total{method!="DELETE"}                 # Not equal
http_requests_total{endpoint=~"/api/.*"}              # Regex match
http_requests_total{endpoint!~"/health|/metrics"}     # Regex does not match

# Range vector - samples over a time window
http_requests_total[5m]     # Last 5 minutes
http_requests_total[1h]     # Last 1 hour

# Offset - look back in time
http_requests_total offset 1h    # Value 1 hour ago
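One detail of the label matchers that is easy to miss: regex matchers are fully anchored, so `endpoint=~"/api"` matches only the exact value /api, not /api/users. The sketch below mimics the four matcher operators in Python under that assumption. It uses Python's `re` module, whose syntax differs slightly from the RE2 syntax Prometheus uses, and the `matches` helper is a hypothetical name for illustration.

```python
import re


def matches(labels, matchers):
    """Return True if a series' labels satisfy every (name, op, value) matcher."""
    for name, op, value in matchers:
        # A label that is not set behaves like an empty string.
        actual = labels.get(name, "")
        if op == "=":
            ok = actual == value
        elif op == "!=":
            ok = actual != value
        elif op == "=~":
            ok = re.fullmatch(value, actual) is not None
        elif op == "!~":
            ok = re.fullmatch(value, actual) is None
        else:
            raise ValueError(f"unknown matcher operator: {op}")
        if not ok:
            return False
    return True


series = {"method": "GET", "endpoint": "/api/users", "status_code": "200"}
prefix_match = matches(series, [("endpoint", "=~", "/api/.*")])
unanchored_attempt = matches(series, [("endpoint", "=~", "/api")])
```

Here `prefix_match` is true while `unanchored_attempt` is false, which is why the selector examples write `/api/.*` rather than `/api`.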

Commonly Used Functions

# rate() - per-second rate of increase of a counter (used very often)
rate(http_requests_total[5m])
# Result: requests per second, averaged over 5 minutes

# irate() - instant rate (uses the last 2 data points)
irate(http_requests_total[5m])
# Result: the most recent requests-per-second value (reacts faster than rate)

# increase() - how much a counter grew over the window
increase(http_requests_total[1h])
# Result: number of requests in 1 hour

# histogram_quantile() - estimate a percentile from a histogram
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
# Result: p99 latency

histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# Result: p95 latency

# avg, sum, min, max, count - Aggregation
sum(rate(http_requests_total[5m]))                          # Total across all instances
avg(rate(http_requests_total[5m])) by (method)              # Average per method
sum(rate(http_requests_total[5m])) by (status_code)         # Total per status code
topk(5, rate(http_requests_total[5m]))                      # The 5 highest series

# predict_linear() - predict a future value
predict_linear(node_filesystem_avail_bytes[6h], 24*3600)
# Result: predicted free disk space 24 hours from now

# delta() - change in a gauge
delta(cpu_temperature_celsius[10m])
# Result: how much the CPU temperature changed over 10 minutes

# absent() - detect a missing metric
absent(up{job="my-app"})
# Result: 1 when no up series exists for the job (for example, it was dropped from
# service discovery). A target that fails its scrape still reports up == 0, so alert on that too.
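The difference between rate(), irate(), and increase() is easier to see on raw samples. The sketch below is a simplified model of the three functions over a list of (timestamp, value) pairs. It treats any drop in value as a counter reset, as Prometheus does, but it does not extrapolate to the window boundaries, so real query results will differ slightly. The sample data is made up for illustration.

```python
def increase(samples):
    """Total growth of a counter over (timestamp, value) samples.

    A drop in value is treated as a counter reset (for example a process
    restart), so the post-reset value counts as new growth.
    """
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        total += cur - prev if cur >= prev else cur
    return total


def rate(samples):
    """Average per-second growth between the first and last sample."""
    if len(samples) < 2:
        return None
    elapsed = samples[-1][0] - samples[0][0]
    return increase(samples) / elapsed


def irate(samples):
    """Per-second growth using only the last two samples."""
    if len(samples) < 2:
        return None
    (t0, v0), (t1, v1) = samples[-2:]
    growth = v1 - v0 if v1 >= v0 else v1
    return growth / (t1 - t0)


# Scraped every 15 seconds; the application restarts between t=30 and t=45.
samples = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
total_growth = increase(samples)   # 30 + 30 + 10 + 30
average_rate = rate(samples)       # growth spread over the 60-second window
latest_rate = irate(samples)       # only the last 15 seconds
```

Because the reset is handled inside these functions, a restart does not show up as a negative spike. This is also why rate() should be applied before sum(), never the other way round: summing first hides individual resets.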

Operators

# Arithmetic
http_requests_total / 1000                              # Divide by 1000
node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes  # Memory in use

# Percentage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
  / node_memory_MemTotal_bytes * 100
# Result: % of memory in use

# Error Rate
sum(rate(http_requests_total{status_code=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) * 100
# Result: % error rate

# Comparison
http_requests_total > 1000                              # Only series above 1000
avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) < 0.1   # CPU idle below 10%

Recording Rules

Recording rules precompute complex PromQL expressions and store the results as new metrics, which makes dashboards load faster.

# prometheus/recording_rules.yml
groups:
  - name: http_rules
    interval: 15s
    rules:
      # Pre-compute request rate per endpoint
      - record: job:http_requests_total:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job, method, endpoint)

      # Pre-compute error rate
      - record: job:http_error_rate:ratio5m
        expr: |
          sum(rate(http_requests_total{status_code=~"5.."}[5m])) by (job)
          / sum(rate(http_requests_total[5m])) by (job)

      # Pre-compute p99 latency
      - record: job:http_request_duration_seconds:p99
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le)
          )

Node Exporter: Monitoring Servers

Node Exporter is the exporter for Linux/Unix servers. It collects metrics for CPU, memory, disk, network, and more, and is one of the most commonly used exporters.

# PromQL for server monitoring

# CPU Usage (%)
100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory Usage (%)
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Disk Usage (%)
(1 - node_filesystem_avail_bytes{mountpoint="/"}
     / node_filesystem_size_bytes{mountpoint="/"}) * 100

# Network Traffic (bytes/sec)
rate(node_network_receive_bytes_total{device="eth0"}[5m])
rate(node_network_transmit_bytes_total{device="eth0"}[5m])

# System Load
node_load1    # 1 minute load average
node_load5    # 5 minute load average
node_load15   # 15 minute load average

# Disk I/O
rate(node_disk_read_bytes_total[5m])
rate(node_disk_written_bytes_total[5m])

Application Metrics: Instrumenting Your Application

Python (Flask/FastAPI)

# pip install fastapi prometheus-client prometheus-fastapi-instrumentator

from fastapi import FastAPI
from prometheus_client import Counter, Histogram, Gauge
from prometheus_fastapi_instrumentator import Instrumentator
from pydantic import BaseModel

app = FastAPI()

# Auto instrumentation: default HTTP metrics, exposed at /metrics
Instrumentator().instrument(app).expose(app, endpoint="/metrics")

# Custom Metrics
ORDERS_CREATED = Counter(
    'orders_created_total', 'Total orders created', ['payment_method']
)
ORDER_PROCESSING_TIME = Histogram(
    'order_processing_seconds', 'Order processing time',
    buckets=[0.1, 0.5, 1, 2, 5, 10]
)
ACTIVE_USERS = Gauge('active_users', 'Currently active users')

class OrderRequest(BaseModel):
    """Illustrative request model; replace with your own schema."""
    payment_method: str

def process_order(order: OrderRequest) -> dict:
    """Placeholder for the real business logic."""
    return {"status": "created"}

@app.post("/orders")
async def create_order(order: OrderRequest):
    # Time the processing step, then count the order by payment method
    with ORDER_PROCESSING_TIME.time():
        result = process_order(order)
    ORDERS_CREATED.labels(payment_method=order.payment_method).inc()
    return result

Go

// Go - Prometheus Client
package main

import (
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    // Label names match the PromQL examples in this article (status_code).
    httpRequests = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total HTTP requests",
        },
        []string{"method", "endpoint", "status_code"},
    )

    requestDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "HTTP request duration",
            Buckets: prometheus.DefBuckets,
        },
        []string{"method", "endpoint"},
    )
)

func init() {
    // Register both collectors with the default registry.
    prometheus.MustRegister(httpRequests, requestDuration)
}

func main() {
    // Expose the registered metrics for Prometheus to scrape.
    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":8080", nil))
}

Service Discovery

Instead of configuring targets one by one, Prometheus supports several service discovery mechanisms:

# 1. Static Config
scrape_configs:
  - job_name: 'web-servers'
    static_configs:
      - targets: ['web-01:9100', 'web-02:9100', 'web-03:9100']

# 2. DNS Service Discovery
  - job_name: 'dns-discovery'
    dns_sd_configs:
      - names: ['_prometheus._tcp.example.com']
        type: SRV

# 3. Kubernetes Service Discovery
  - job_name: 'k8s-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true

# 4. Consul Service Discovery
  - job_name: 'consul'
    consul_sd_configs:
      - server: 'consul:8500'
        services: ['web', 'api', 'worker']

# 5. File-based Service Discovery (dynamic)
  - job_name: 'file-discovery'
    file_sd_configs:
      - files:
          - '/etc/prometheus/targets/*.json'
        refresh_interval: 30s
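File-based discovery is the easiest mechanism to drive from your own tooling: any script that can write JSON can manage targets. The sketch below generates a target file in the format file_sd_configs reads, a list of groups each with `targets` and optional `labels`. The host names, labels, and helper names are assumptions for illustration.

```python
import json
import os
import tempfile


def build_target_groups(hosts, port, labels):
    """Return the list-of-groups structure that file_sd_configs reads."""
    targets = [f"{host}:{port}" for host in hosts]
    return [{"targets": targets, "labels": labels}]


def write_targets(path, groups):
    """Write the file atomically so a half-written file is never picked up.

    The temporary file ends in .tmp, so a '*.json' glob does not match it.
    """
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(groups, handle, indent=2)
    os.replace(tmp_path, path)


# Hypothetical inventory; in practice this might come from a CMDB or cloud API.
groups = build_target_groups(
    hosts=["web-01", "web-02", "web-03"],
    port=9100,
    labels={"environment": "production"},
)
```

Calling `write_targets("/etc/prometheus/targets/web.json", groups)` from a cron job or deployment hook would then update the target list without reloading Prometheus, since the file is re-read on change and at each refresh interval.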

Alerting Rules and Alertmanager

Alert Rules

# prometheus/alert_rules.yml
groups:
  - name: infrastructure
    rules:
      # Server Down
      - alert: InstanceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} has been down for more than 2 minutes."

      # High CPU
      - alert: HighCPU
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80% for 5 minutes. Current: {{ $value }}%"

      # Disk Almost Full
      - alert: DiskSpaceLow
        expr: (1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk usage is {{ $value }}%"

      # Disk Predicted Full
      - alert: DiskWillFillIn24h
        expr: predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 24*3600) < 0
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Disk predicted to fill within 24h on {{ $labels.instance }}"

  - name: application
    rules:
      # High Error Rate
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status_code=~"5.."}[5m])) by (job)
          / sum(rate(http_requests_total[5m])) by (job) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.job }}"
          description: "Error rate is above 5%. Current: {{ $value | humanizePercentage }}"

      # High Latency
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le)
          ) > 1.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High p99 latency on {{ $labels.job }}"
          description: "p99 latency is above 1 second. Current: {{ $value }}s"
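The DiskWillFillIn24h rule relies on predict_linear(), which fits a straight line to the samples in the window and extends it into the future. The sketch below shows the underlying least-squares calculation on made-up data. It is a simplification: it extrapolates from the last sample rather than from the rule's evaluation time, and the function name simply mirrors the PromQL one.

```python
def predict_linear(samples, seconds_ahead):
    """Fit a least-squares line to (timestamp, value) samples and
    extrapolate it seconds_ahead past the last sample."""
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    if variance == 0:
        return None  # all samples share one timestamp; no slope to fit
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    return slope * (samples[-1][0] + seconds_ahead) + intercept


# Free disk space shrinking by 1 GB per hour (hypothetical readings).
disk_free = [(0, 10e9), (3600, 9e9), (7200, 8e9)]
predicted = predict_linear(disk_free, 24 * 3600)
will_fill = predicted < 0
```

A negative prediction means the line crosses zero free bytes within the horizon, which is the condition the alert expression checks. The `for: 1h` clause then guards against a short burst of writes producing a misleading slope.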

Alertmanager Configuration

# alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'alerts@example.com'
  smtp_auth_username: 'alerts@example.com'
  smtp_auth_password: 'app_password'

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s           # Wait 30s before the first notification for a new group (lets alerts batch up)
  group_interval: 5m        # Wait 5m before notifying about new alerts added to an existing group
  repeat_interval: 4h       # Resend every 4h while the alert is still firing
  receiver: 'default'

  routes:
    # Critical alerts → PagerDuty + Slack
    - match:
        severity: critical
      receiver: 'critical-alerts'
      group_wait: 10s        # Notify sooner for critical alerts

    # Warning alerts → Slack only
    - match:
        severity: warning
      receiver: 'warning-alerts'

receivers:
  - name: 'default'
    email_configs:
      - to: 'team@example.com'

  - name: 'critical-alerts'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx/yyy/zzz'
        channel: '#alerts-critical'
        title: '[CRITICAL] {{ .GroupLabels.alertname }}'
        text: "{{ range .Alerts }}{{ .Annotations.description }}\n{{ end }}"
    pagerduty_configs:
      - service_key: 'your-pagerduty-key'

  - name: 'warning-alerts'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx/yyy/zzz'
        channel: '#alerts-warning'

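# Inhibition - suppress notifications for some alerts while related alerts are firing
# (silences for maintenance windows are created separately, in the Alertmanager UI or with amtool)
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']   # If a matching critical alert is firing, skip the warning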
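The inhibit rule above can be read as a small predicate: a warning is muted when some firing critical alert has the same values for every label listed under `equal`. The sketch below expresses that logic in Python with the severities hard-coded to match this one rule. It is an illustration of the behaviour, not Alertmanager's implementation, and the alert dictionaries are made up.

```python
def is_inhibited(target, firing, equal=("alertname", "instance")):
    """Return True if a firing critical alert shares the `equal` labels
    with the given warning alert."""
    if target.get("severity") != "warning":
        return False
    for source in firing:
        if source.get("severity") != "critical":
            continue
        # A label missing on both sides counts as equal.
        if all(source.get(name) == target.get(name) for name in equal):
            return True
    return False


firing = [
    {"alertname": "DiskSpaceLow", "instance": "web-01", "severity": "critical"},
]
same_disk = {"alertname": "DiskSpaceLow", "instance": "web-01", "severity": "warning"}
other_host = {"alertname": "DiskSpaceLow", "instance": "web-02", "severity": "warning"}
muted = is_inhibited(same_disk, firing)
still_sent = not is_inhibited(other_host, firing)
```

Note that `alertname` is part of the `equal` list, so a critical alert only mutes warnings with the same alert name on the same instance. If critical and warning levels are defined as separately named alerts, the list would need to change.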

Grafana: Dashboards and Visualization

Configuring Data Sources

# grafana/provisioning/datasources/datasources.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true

  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100

  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo:3200

Building a Dashboard

A Grafana dashboard is made up of panels of several types:

Time Series: line graphs of values over time (the most commonly used)
Stat: a single value, such as uptime or current CPU %
Gauge: a value shown as a circular meter with colored thresholds
Table: data shown as a table
Heatmap: the distribution of data over time; a good fit for histograms
Logs: log lines from Loki

# Dashboard JSON Model (key parts)
# Build the dashboard in the Grafana UI, then export it as JSON
{
  "dashboard": {
    "title": "Application Overview",
    "panels": [
      {
        "title": "Request Rate",
        "type": "timeseries",
        "targets": [{
          "expr": "sum(rate(http_requests_total[5m])) by (method)",
          "legendFormat": "{{method}}"
        }]
      },
      {
        "title": "Error Rate (%)",
        "type": "stat",
        "targets": [{
          "expr": "sum(rate(http_requests_total{status_code=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100"
        }],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "steps": [
                {"value": 0, "color": "green"},
                {"value": 1, "color": "yellow"},
                {"value": 5, "color": "red"}
              ]
            }
          }
        }
      },
      {
        "title": "p99 Latency",
        "type": "timeseries",
        "targets": [{
          "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))"
        }]
      }
    ]
  }
}

Variables for Dynamic Dashboards

# Grafana Variables (configured in Dashboard Settings > Variables)

# Variable: instance
# Type: Query
# Query: label_values(up, instance)
# Result: a dropdown for choosing an instance

# Variable: job
# Type: Query
# Query: label_values(up, job)

# Use in a panel query:
rate(http_requests_total{instance="$instance", job="$job"}[5m])
# $instance and $job change with the dropdown selection

Grafana Alerting

In addition to Alertmanager, Grafana 11.x has a capable built-in alerting system that can send alerts through Slack, email, PagerDuty, webhooks, and other channels directly from Grafana.

# Grafana Alert Rule Example (YAML format)
# Create it in the Grafana UI: Alerting > Alert Rules > New
#
# Condition: sum(rate(http_requests_total{status_code=~"5.."}[5m]))
#            / sum(rate(http_requests_total[5m])) > 0.05
# For: 5m
# Labels: severity=critical, team=backend
# Annotations:
#   summary: High error rate detected
#   description: Error rate is above 5%
# Notification: Slack #alerts channel

Loki: Log Aggregation

Grafana Loki is a log aggregation system designed to work alongside Grafana and Prometheus. Its distinguishing feature is that it indexes only labels, not log content, which typically makes storage much cheaper than Elasticsearch.

# docker-compose.yml - add Loki + Promtail (also declare loki_data under the top-level volumes)
  loki:
    image: grafana/loki:3.0.0
    ports:
      - "3100:3100"
    volumes:
      - loki_data:/loki
    command: -config.file=/etc/loki/local-config.yaml

  promtail:
    image: grafana/promtail:3.0.0
    volumes:
      - /var/log:/var/log:ro
      - ./promtail/config.yml:/etc/promtail/config.yml
    command: -config.file=/etc/promtail/config.yml

# LogQL - Loki's query language
# Log Stream Selector
{job="my-app"}
{job="my-app", level="error"}
{job=~"api-.*"}                              # Regex match

# Filter Expression
{job="my-app"} |= "error"                    # contains "error"
{job="my-app"} != "healthcheck"              # does not contain "healthcheck"
{job="my-app"} |~ "timeout|connection refused"  # Regex match

# JSON Parsing
{job="my-app"} | json | status_code >= 500
{job="my-app"} | json | duration > 1s

# Metric Queries from Logs
rate({job="my-app"} |= "error" [5m])         # Error log rate
count_over_time({job="my-app"} |= "error" [1h])  # Error count in 1h
sum by (level) (rate({job="my-app"} | json [5m]))  # Rate by log level

Tempo: Distributed Tracing

Grafana Tempo is a distributed tracing backend that scales well, uses object storage (S3/GCS) as its backend, and works with OpenTelemetry, Jaeger, and Zipkin.

# docker-compose.yml - add Tempo (also declare tempo_data under the top-level volumes)
  tempo:
    image: grafana/tempo:2.4.0
    ports:
      - "3200:3200"     # HTTP
      - "4317:4317"     # OTLP gRPC
      - "4318:4318"     # OTLP HTTP
    volumes:
      - ./tempo/config.yaml:/etc/tempo/config.yaml   # Assumed path: supply your own Tempo config file
      - tempo_data:/tmp/tempo
    command: -config.file=/etc/tempo/config.yaml

# Application - OpenTelemetry instrumentation
# pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Setup
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://tempo:4317"))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("my-app")

# Usage
with tracer.start_as_current_span("process-order") as span:
    span.set_attribute("order.id", order_id)
    result = process_order(order)
    span.set_attribute("order.total", result.total)

Mimir: Long-Term Storage

Prometheus stores data on the local machine (local TSDB), which limits storage capacity and high availability. Grafana Mimir (which began as a fork of Cortex) provides long-term storage for Prometheus metrics, supports multi-tenancy, and scales horizontally.

# prometheus.yml - send metrics to Mimir
remote_write:
  - url: http://mimir:9009/api/v1/push
    headers:
      X-Scope-OrgID: my-tenant

Prometheus vs Datadog vs New Relic

Criteria	Prometheus + Grafana	Datadog	New Relic
Price	Free software (open source); you pay for infrastructure and operations	Expensive (about $15-23/host/month)	Expensive (usage-based)
Hosting	Self-hosted	SaaS	SaaS
Customization	Very high	Medium	Medium
Learning Curve	Steep (PromQL)	Low	Low
K8s Integration	Excellent (native)	Very good	Good
APM	Requires Tempo/Jaeger	Built-in	Built-in
Logs	Loki	Built-in	Built-in
Community	Very large (CNCF)	Medium	Medium
Best for	Teams with DevOps capacity	Teams that want a turnkey solution	Dev-focused teams

Recommendation: if your team has DevOps engineers and wants maximum control, choose Prometheus + Grafana. If your team is small and does not want to manage infrastructure, choose Datadog or Grafana Cloud (managed Prometheus + Grafana).

Monitoring Best Practices

USE Method (for Infrastructure)

Created by Brendan Gregg, for monitoring every resource:

Utilization: how busy the resource is (% CPU, % memory, % disk)
Saturation: how much work is queued or waiting (load average, disk I/O wait)
Errors: how many errors occur (disk errors, network errors)

RED Method (for Applications/Services)

Created by Tom Wilkie (Grafana), for monitoring microservices:

Rate: requests per second
Errors: the number of requests that fail
Duration: how long requests take to process (latency)

Golden Signals (Google SRE)

From the Google SRE book, for monitoring any service:

Latency: response time (for both successful and failed requests)
Traffic: the volume of incoming requests
Errors: the error rate
Saturation: how full a resource is (CPU, memory, queue)

# Golden Signals PromQL
# Latency
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Traffic
sum(rate(http_requests_total[5m]))

# Errors
sum(rate(http_requests_total{status_code=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# Saturation
avg(1 - rate(node_cpu_seconds_total{mode="idle"}[5m]))  # CPU busy fraction (a proxy for saturation)
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes  # Memory available ratio

SLO, SLI, and Error Budgets

SLI (Service Level Indicator): the metric being measured, such as availability or latency.

SLO (Service Level Objective): the target for that metric, such as 99.9% availability.

Error Budget: the allowance for failure. For example, a 99.9% SLO leaves a 0.1% error budget, which is about 43 minutes per 30-day month.
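The arithmetic behind an error budget is short enough to check by hand. The sketch below converts an SLO percentage into allowed downtime for a given period; the function name and the choice of period lengths are assumptions for illustration.

```python
def error_budget_minutes(slo_percent, period_days):
    """Allowed downtime in minutes for an availability SLO over a period."""
    budget_fraction = (100 - slo_percent) / 100
    return budget_fraction * period_days * 24 * 60


thirty_day_budget = error_budget_minutes(99.9, 30)          # about 43.2 minutes
average_month_budget = error_budget_minutes(99.9, 365 / 12)  # about 43.8 minutes
yearly_budget_hours = error_budget_minutes(99.9, 365) / 60   # about 8.76 hours
```

The small gap between the two monthly figures comes only from the period length: a 30-day month versus an average month of 365/12 days. Which one applies depends on how the SLO window is defined.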

# SLO PromQL Examples

# Availability SLI (% of successful requests)
sum(rate(http_requests_total{status_code!~"5.."}[30d]))
  / sum(rate(http_requests_total[30d]))

# SLO: 99.9% availability
# Error Budget consumed:
1 - (
  sum(rate(http_requests_total{status_code!~"5.."}[30d]))
  / sum(rate(http_requests_total[30d]))
  - 0.999
) / 0.001

# Latency SLI (% of requests under 500ms)
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[30d]))
  / sum(rate(http_request_duration_seconds_count[30d]))

# Alert when the error budget is close to exhausted
# Use multi-window, multi-burn-rate alerts (Google SRE)
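Burn rate is the ratio between the observed error rate and the error rate the SLO allows; a burn rate of 1 spends the budget exactly over the SLO window. The sketch below shows the core check of a multi-window alert. The threshold of 14.4 corresponds to spending 2% of a 30-day budget in one hour (0.02 × 720 hours), a pairing commonly used with 1-hour and 5-minute windows. The function names and sample ratios are assumptions for illustration.

```python
def burn_rate(error_ratio, slo):
    """How many times faster than allowed the error budget is being spent."""
    return error_ratio / (1 - slo)


def should_page(long_window_ratio, short_window_ratio, slo, threshold=14.4):
    """Page only when both windows burn fast.

    The long window shows the problem is significant; the short window
    shows it is still happening, so the alert resets quickly after recovery.
    """
    return (
        burn_rate(long_window_ratio, slo) > threshold
        and burn_rate(short_window_ratio, slo) > threshold
    )


SLO = 0.999
ongoing_incident = should_page(0.02, 0.02, SLO)     # 2% errors in both windows
already_recovered = should_page(0.02, 0.001, SLO)   # short window is healthy again
```

In Prometheus, each ratio would come from an error-rate expression evaluated over the matching range, for example the 5xx ratio over `[1h]` and over `[5m]`, with both conditions joined by `and` in one alert rule.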

SLO	Downtime per month	Downtime per year
99%	7.3 hours	3.65 days
99.9%	43.8 minutes	8.76 hours
99.95%	21.9 minutes	4.38 hours
99.99%	4.38 minutes	52.6 minutes

Summary

Prometheus and Grafana remain one of the most capable monitoring stacks in 2026: they can scrape metrics from virtually any infrastructure or application, offer a flexible query language in PromQL, provide comprehensive alerting and interactive dashboards, and sit within an ecosystem that covers logs (Loki), traces (Tempo), and long-term storage (Mimir).

Start by installing Prometheus, Node Exporter, and Grafana to monitor your servers. Then add application metrics with the client libraries, set alert rules for what matters, and gradually add Loki for logs and Tempo for traces. Good monitoring is key to keeping production systems stable and to resolving problems quickly.