Grafana Dashboard Monitoring: A Build Guide

Why Grafana for Monitoring

I have worked with nearly every monitoring tool, from MRTG, Cacti, Nagios, and Zabbix through to Grafana, which is what I mainly use today. What makes Grafana win for me is its flexibility: it lets me build dashboards that are both better looking and more informative than those of the other monitoring tools I have used.

Grafana is an open-source visualization platform. It does not store data itself; instead it connects to a wide range of data sources such as Prometheus, InfluxDB, Elasticsearch, MySQL, PostgreSQL, Loki, and many more, so you can view data from several sources on a single dashboard. For example, I have a dashboard that shows server metrics from Prometheus, application logs from Loki, and business metrics from PostgreSQL, all on one screen.

Grafana Open Source vs Grafana Cloud

Grafana comes in two flavors. Grafana OSS (Open Source) is installed on your own server; it is free with no usage limits, but you have to maintain it yourself. Grafana Cloud is a managed service with a free tier of 10,000 metrics, 50 GB of logs, and 50 GB of traces per month, which is enough for a small team. I use OSS in production because monitoring data is sensitive and I would rather not send it off-site.

Installing the Grafana + Prometheus Stack

I will walk through installing a full monitoring stack: Prometheus to store metrics, Node Exporter to collect system metrics from each server, Grafana for visualization, and Alertmanager to handle alerts.

Installing with Docker Compose

# docker-compose.yml: Full Monitoring Stack
services:
  prometheus:
    image: prom/prometheus:v2.51.0
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/rules:/etc/prometheus/rules
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    ports:
      - "9090:9090"
    restart: unless-stopped

  grafana:
    image: grafana/grafana:10.4.0
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    environment:
      GF_SECURITY_ADMIN_USER: admin
      GF_SECURITY_ADMIN_PASSWORD: SecurePassword123
      GF_USERS_ALLOW_SIGN_UP: "false"
      GF_SERVER_ROOT_URL: https://grafana.example.com
      GF_SMTP_ENABLED: "true"
      GF_SMTP_HOST: smtp.gmail.com:587
      GF_SMTP_USER: alerts@example.com
      GF_SMTP_PASSWORD: app-password
    ports:
      - "3000:3000"
    depends_on:
      - prometheus
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager:v0.27.0
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports:
      - "9093:9093"
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter:v1.7.0
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    ports:
      - "9100:9100"
    restart: unless-stopped

volumes:
  prometheus-data:
  grafana-data:

Prometheus Configuration

# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - /etc/prometheus/rules/*.yml

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets:
          - 'node-exporter:9100'
          - '10.10.10.12:9100'
          - '10.10.10.13:9100'
        labels:
          env: production

  # The three exporters below are not defined in the Compose file above.
  # Add them as services or remove these jobs, otherwise the targets show as down.
  - job_name: 'nginx'
    static_configs:
      - targets: ['nginx-exporter:9113']

  - job_name: 'mysql'
    static_configs:
      - targets: ['mysqld-exporter:9104']

  - job_name: 'docker'
    static_configs:
      - targets: ['cadvisor:8080']
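
Once the stack is running, one way to confirm that every scrape target is reachable is to ask Prometheus for the `up` metric through its HTTP API (`/api/v1/query`). The following is a minimal Python sketch, assuming Prometheus answers at http://localhost:9090 (the port published in the Compose file). The helper names `build_query_url`, `down_targets`, and `fetch` are made up for illustration and are not part of any library.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    query = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def down_targets(payload: dict) -> list[str]:
    """Return 'job/instance' for every series of an `up` query whose value is 0."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    down = []
    for series in payload["data"]["result"]:
        _timestamp, value = series["value"]  # the value arrives as a string, e.g. "1"
        if float(value) == 0:
            labels = series["metric"]
            down.append(f"{labels.get('job', '?')}/{labels.get('instance', '?')}")
    return sorted(down)


def fetch(url: str) -> dict:
    """Perform the HTTP GET; this needs a running Prometheus."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)
```

With the stack up, `down_targets(fetch(build_query_url("http://localhost:9090", "up")))` should return an empty list when all jobs in prometheus.yml are being scraped successfully.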

Installing from a Native Package

# Install Grafana on Ubuntu (run as root, or prefix each command with sudo)
apt install -y apt-transport-https software-properties-common wget
wget -q -O /usr/share/keyrings/grafana.key https://apt.grafana.com/gpg.key
echo "deb [signed-by=/usr/share/keyrings/grafana.key] https://apt.grafana.com stable main" | \
  tee /etc/apt/sources.list.d/grafana.list
apt update && apt install -y grafana

# Enable and start the service
systemctl enable --now grafana-server

# Open http://server-ip:3000
# Default login: admin / admin (change it immediately!)

Prometheus Data Source

# Grafana Provisioning: auto-configure data sources
# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false

  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100

  - name: InfluxDB
    type: influxdb
    access: proxy
    url: http://influxdb:8086
    database: telegraf
    user: grafana
    secureJsonData:
      password: grafana-password

MySQL Data Source for Business Metrics

# Connect to MySQL to view business metrics
# such as number of users, orders, revenue
# grafana/provisioning/datasources/mysql.yml
apiVersion: 1
datasources:
  - name: MySQL-Business
    type: mysql
    url: mysql-server:3306
    database: production
    user: grafana_reader
    secureJsonData:
      password: readonly-password

# For MySQL performance monitoring,
# see my MySQL Performance Tuning article

I cover MySQL performance tuning in a separate article, which I recommend reading alongside this one.

Building a Dashboard from Scratch

I will show how to build the Server Monitoring Dashboard I actually use, which shows CPU, memory, disk, and network for all servers.

Commonly Used Panel Types

Time Series: a line graph of metrics over time; suits CPU usage, request rate, response time

Gauge: shows the current value as a dial; suits disk usage %, memory usage %

Stat: shows a single number, such as uptime, total requests, error count

Table: shows data in tabular form; suits top processes, slow queries

Logs: shows log entries from Loki

Heatmap: shows a distribution, such as the request latency distribution

Example Dashboard Panels

# Panel 1: CPU Usage per Server (Time Series)
# PromQL:
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Panel 2: Memory Usage % (Gauge)
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Panel 3: Disk Usage % (Gauge)
100 - ((node_filesystem_avail_bytes{mountpoint="/", fstype!="rootfs"} * 100) / node_filesystem_size_bytes{mountpoint="/", fstype!="rootfs"})

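# Panel 4: Network Traffic in bits per second (Time Series)
# rate() yields bytes/s; multiplying by 8 converts to bits/s.
# Replace eth0 with your interface name if it differs.
# Inbound
rate(node_network_receive_bytes_total{device="eth0"}[5m]) * 8
# Outbound
rate(node_network_transmit_bytes_total{device="eth0"}[5m]) * 8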

# Panel 5: System Load (Time Series)
node_load1
node_load5
node_load15

# Panel 6: Disk I/O (Time Series)
rate(node_disk_read_bytes_total[5m])
rate(node_disk_written_bytes_total[5m])

# Panel 7: Open File Descriptors (Stat)
node_filefd_allocated

# Panel 8: Uptime (Stat)
time() - node_boot_time_seconds
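
To make the arithmetic behind the CPU panel concrete: `node_cpu_seconds_total{mode="idle"}` is a counter of seconds spent idle, `rate(...[5m])` turns it into idle seconds per second (a fraction between 0 and 1 per core), and `100 - idle * 100` converts that into a busy percentage. Below is a simplified Python sketch of the same calculation for a single core. It ignores counter resets and the extrapolation that Prometheus's real `rate()` performs, and the sample numbers are invented.

```python
def simple_rate(first: float, last: float, window_seconds: float) -> float:
    """Per-second increase of a counter between two samples (no reset handling)."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return (last - first) / window_seconds


def cpu_busy_percent(idle_first: float, idle_last: float, window_seconds: float) -> float:
    """Mirror of `100 - rate(idle[5m]) * 100` for one core."""
    idle_fraction = simple_rate(idle_first, idle_last, window_seconds)
    return 100.0 - idle_fraction * 100.0


# The idle counter went from 1000 s to 1240 s over a 300 s (5m) window:
# the core was idle for 240 of 300 seconds, so it was busy about 20% of the time.
print(round(cpu_busy_percent(1000.0, 1240.0, 300.0), 1))
```

In the real query, `avg by(instance)` then averages this per-core figure across all cores of each server.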

Commonly Used PromQL Queries

PromQL is the Prometheus query language, and it is worth learning well. I have collected the queries I use most often.

Server Metrics

# Average CPU usage over 5 minutes
100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage in GiB
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / 1024^3

# Disk space remaining in GiB
node_filesystem_avail_bytes{mountpoint="/"} / 1024^3

# Currently established TCP connections
node_netstat_Tcp_CurrEstab

# Processes in the runnable state (not the total process count)
node_procs_running

Application Metrics (HTTP)

# Request rate per second
rate(http_requests_total[5m])

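# Error rate (5xx) as a percentage of all requests
# sum() on both sides so the division is not matched series by series
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100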

# Response time 95th percentile
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Response time 99th percentile
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
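
It helps to know what `histogram_quantile` does with the `_bucket` series: each bucket holds a cumulative count of observations at or below its upper bound (`le`), and the function interpolates linearly inside the bucket where the requested rank falls. The Python sketch below is a simplified version of that idea, assuming classic histograms with positive bucket bounds. The bucket values in the example are invented, and the real implementation handles more edge cases.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile q (0 < q <= 1) from cumulative (upper_bound, count) buckets.

    The bucket with the largest bound must be the +Inf bucket, as in Prometheus.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("buckets must include the +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if math.isinf(upper):
        # The quantile lies beyond the last finite bound: report that bound.
        return buckets[-2][0] if len(buckets) > 1 else math.nan
    lower = buckets[i - 1][0] if i > 0 else 0.0
    below = buckets[i - 1][1] if i > 0 else 0.0
    # Linear interpolation inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


latency_buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
print(histogram_quantile(0.95, latency_buckets))
```

Because of the interpolation, the result is only as precise as the bucket layout: with buckets at 0.5 s and 1.0 s, any p95 between those bounds is an estimate, not a measured value.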

# Requests per endpoint
sum by(handler)(rate(http_requests_total[5m]))

Docker/Container Metrics

# Container CPU usage
rate(container_cpu_usage_seconds_total{name!=""}[5m]) * 100

# Container Memory usage
container_memory_usage_bytes{name!=""} / 1024^2

# Container Network I/O
rate(container_network_receive_bytes_total{name!=""}[5m])
rate(container_network_transmit_bytes_total{name!=""}[5m])

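# Container restart count
# If your exporter does not expose container_restart_count, one alternative is
# counting start-time changes: changes(container_start_time_seconds{name!=""}[1h])
increase(container_restart_count{name!=""}[1h])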

Alerting: Automatic Notifications

However good a dashboard looks, it is of little use without alerting, because nobody watches dashboards 24 hours a day. I set up alerts for every important event.

Prometheus Alert Rules

# prometheus/rules/server-alerts.yml
groups:
  - name: server-alerts
    rules:
      - alert: HighCpuUsage
        expr: 100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value | printf \"%.1f\" }}% for 5 minutes"

      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value | printf \"%.1f\" }}%"

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 15
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"
          description: "Only {{ $value | printf \"%.1f\" }}% disk space remaining"

      - alert: InstanceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
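
The `for:` clause in these rules is what keeps a short spike from paging anyone: the expression has to stay true at every evaluation for the whole duration before the alert moves from pending to firing, and a single false evaluation resets the timer. Here is a small Python sketch of that state machine, assuming a "greater than" threshold as in HighCpuUsage. The function name and the sample values are invented for illustration.

```python
def alert_states(samples: list[tuple[float, float]], threshold: float,
                 for_seconds: float) -> list[str]:
    """Return 'inactive', 'pending' or 'firing' for each (timestamp, value) evaluation."""
    states = []
    active_since = None  # when the condition first became true in the current streak
    for timestamp, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = timestamp
            held = timestamp - active_since
            states.append("firing" if held >= for_seconds else "pending")
        else:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
    return states


# CPU % evaluated once a minute; threshold 85, for: 5m (300 s).
cpu = [(0, 90), (60, 91), (120, 95), (180, 92), (240, 90), (300, 93), (360, 50), (420, 90)]
print(alert_states(cpu, threshold=85, for_seconds=300))
```

In this run the alert is pending for the first five evaluations, fires at the 300 s mark, goes inactive when CPU drops to 50%, and starts over as pending afterwards.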

Alertmanager Configuration

# alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'alerts@example.com'
  smtp_auth_username: 'alerts@example.com'
  smtp_auth_password: 'app-password'

route:
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
      repeat_interval: 1h

receivers:
  - name: 'default'
    email_configs:
      - to: 'admin@example.com'

  - name: 'critical-alerts'
    email_configs:
      - to: 'admin@example.com'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK'
        channel: '#alerts-critical'
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
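
The routing tree above can be read as "first matching child route wins, otherwise fall back to the parent's receiver". The Python sketch below shows that lookup under simplifying assumptions: it handles only exact-equality `match` blocks and ignores `continue`, regex matchers, and grouping. `pick_receiver` is a hypothetical helper, not an Alertmanager API.

```python
def pick_receiver(labels: dict[str, str], route: dict) -> str:
    """Return the receiver for an alert by walking the routing tree depth-first."""
    for child in route.get("routes", []):
        wanted = child.get("match", {})
        if all(labels.get(name) == value for name, value in wanted.items()):
            # A child without its own receiver inherits the parent's.
            return pick_receiver(labels, {"receiver": route["receiver"], **child})
    return route["receiver"]


route = {
    "receiver": "default",
    "routes": [
        {"match": {"severity": "critical"}, "receiver": "critical-alerts"},
    ],
}

print(pick_receiver({"alertname": "HighCpuUsage", "severity": "warning"}, route))
print(pick_receiver({"alertname": "InstanceDown", "severity": "critical"}, route))
```

With the rules defined earlier, HighCpuUsage (warning) goes to the email-only default receiver, while the three critical alerts go to both email and Slack.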

1. Use Variables for Filtering

Create a variable in Dashboard Settings

Name: instance

Type: Query

Data source: Prometheus

Query: label_values(node_cpu_seconds_total, instance)

Multi-value: true

Include All option: true

Use the variable in a panel query

100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle", instance=~"$instance"}[5m])) * 100)

This lets users pick which servers they want to view.
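
The query uses `=~` (regex match) rather than `=` because a multi-value variable is substituted as a regex alternation such as `(a|b)`. The Python sketch below approximates that substitution for a Prometheus data source, including the escaping of regex metacharacters. It is my reading of the behavior, not Grafana's actual code, and the function names are invented.

```python
_REGEX_SPECIALS = set("$^*{}[]'+?.()|")


def escape_value(value: str) -> str:
    """Escape regex metacharacters for use inside a double-quoted PromQL string."""
    out = []
    for ch in value:
        if ch == "\\":
            out.append("\\" * 4)       # a literal backslash needs four
        elif ch in _REGEX_SPECIALS:
            out.append("\\\\" + ch)    # one backslash for PromQL, one for the regex
        else:
            out.append(ch)
    return "".join(out)


def expand_multi_value(values: list[str]) -> str:
    """Render the selected values the way `$instance` appears after substitution."""
    escaped = [escape_value(v) for v in values]
    if len(escaped) == 1:
        return escaped[0]
    return "(" + "|".join(escaped) + ")"


selected = ["node-exporter:9100", "10.10.10.12:9100"]
print(f'instance=~"{expand_multi_value(selected)}"')
```

To see exactly what Grafana sent, open the panel's Query inspector and look at the expanded expression.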

2. Group Panels into Rows

I organize the dashboard into the following rows:

Overview Row: stat panels for total servers, uptime, alert count

CPU/Memory Row: time series of CPU and memory for every server

Disk Row: gauges for disk usage plus I/O graphs

Network Row: bandwidth, connections, errors

Application Row: request rate, error rate, latency

3. Import Community Dashboards

Dashboards I recommend from grafana.com/grafana/dashboards/

ID: 1860 - Node Exporter Full (excellent, the most complete)

ID: 893 - Docker and System Monitoring

ID: 7362 - MySQL Overview

ID: 12708 - Nginx Ingress Controller

ID: 13659 - Blackbox Exporter (HTTP probe)

How to import: Dashboards > Import > enter the dashboard ID > Load > select the data source > Import

4. Dashboard as Code

# Use Grafana provisioning to keep dashboards under version control
# grafana/provisioning/dashboards/dashboards.yml
apiVersion: 1
providers:
  - name: 'default'
    orgId: 1
    folder: 'Provisioned'
    type: file
    updateIntervalSeconds: 30
    options:
      path: /etc/grafana/provisioning/dashboards/json
      # Per the Grafana provisioning docs, foldersFromFilesStructure only takes
      # effect when `folder` is left empty, so it is disabled here.
      foldersFromFilesStructure: false

# Export a dashboard as JSON from the UI
# and place it in provisioning/dashboards/json/
# Grafana loads these files at startup and rescans the path
# every updateIntervalSeconds (30 s here)
# Keep the JSON files in a Git repo for version control
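
Besides exporting from the UI, the JSON files can also be generated by a script, which keeps repetitive panels consistent. The Python sketch below builds a minimal dashboard model with one time series panel per query. The field set is my guess at a minimal model (it relies on the default data source, and `schemaVersion` 39 is an assumption for Grafana 10.x), so compare it against a dashboard exported from your own instance before relying on it. `make_dashboard` and the uid are hypothetical.

```python
import json


def make_dashboard(title: str, uid: str, panels: list[tuple[str, str]]) -> dict:
    """Build a dashboard dict with one time series panel per (title, PromQL) pair."""
    return {
        "uid": uid,
        "title": title,
        "schemaVersion": 39,  # assumption; Grafana migrates older schemas on load
        "refresh": "30s",
        "time": {"from": "now-6h", "to": "now"},
        "panels": [
            {
                "id": index + 1,
                "type": "timeseries",
                "title": panel_title,
                # Two panels per row on Grafana's 24-column grid.
                "gridPos": {"h": 8, "w": 12, "x": (index % 2) * 12, "y": (index // 2) * 8},
                "targets": [{"refId": "A", "expr": expr}],
            }
            for index, (panel_title, expr) in enumerate(panels)
        ],
    }


dashboard = make_dashboard(
    "Server Overview",
    "server-overview",
    [
        ("CPU Usage %", '100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)'),
        ("Memory Usage %", "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100"),
    ],
)
print(json.dumps(dashboard, indent=2))
```

Redirecting the output into provisioning/dashboards/json/server-overview.json would let the provider defined above pick it up on its next scan.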

How do Grafana and Zabbix differ, and which should I use?

Zabbix is an all-in-one monitoring system that stores data, alerts, and visualizes in a single product, which suits traditional infrastructure monitoring. Grafana is a visualization layer that has to be paired with a separate data source (Prometheus, InfluxDB); the upside is dashboards that look better and are far more flexible. I use both: Zabbix for network device monitoring (SNMP) and Grafana + Prometheus for server and application monitoring.

Prometheus uses a lot of disk. What can I do?

Set a retention period that fits your needs; I keep 30 days of high-resolution data. If you need to keep data longer than that, use Thanos or Cortex as long-term storage. Lengthen the scrape interval from 15s to 30s for targets that do not need near-real-time data, and drop metrics you do not need with metric_relabel_configs (relabel_configs acts on targets before the scrape, so it cannot filter individual metrics).
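
For a rough idea of how retention and scrape interval trade off, the Prometheus storage documentation gives the rule of thumb: disk needed = retention time in seconds x ingested samples per second x bytes per sample, with roughly 1 to 2 bytes per sample after compression. Here is a Python sketch of that estimate. The 100,000 active series figure is an invented example, and real usage also depends on churn and the WAL.

```python
def estimate_disk_bytes(active_series: int, scrape_interval_s: float,
                        retention_days: float, bytes_per_sample: float = 2.0) -> float:
    """Rule-of-thumb TSDB size: retention seconds x samples per second x bytes per sample."""
    samples_per_second = active_series / scrape_interval_s
    return retention_days * 86_400 * samples_per_second * bytes_per_sample


for interval in (15, 30):
    size_gb = estimate_disk_bytes(100_000, interval, retention_days=30) / 1e9
    print(f"{interval}s scrape interval: about {size_gb:.1f} GB for 30 days")
```

Under these assumptions, doubling the scrape interval halves the estimate, which is why relaxing the interval for non-critical targets is worth doing before buying more disk.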

Grafana logs me out often because the login session expires. What can I do?

Set the session timeouts in grafana.ini or through environment variables: GF_AUTH_LOGIN_MAXIMUM_INACTIVE_LIFETIME_DURATION=7d and GF_AUTH_LOGIN_MAXIMUM_LIFETIME_DURATION=30d. For TV mode, where a dashboard stays on a monitor permanently, use GF_AUTH_ANONYMOUS_ENABLED=true together with GF_AUTH_ANONYMOUS_ORG_ROLE=Viewer.

Are there other data sources worth looking at besides Prometheus?

InfluxDB suits IoT data and time series with high write throughput. Loki stores logs cost-effectively and can be browsed directly in Grafana. Tempo handles distributed tracing. PostgreSQL and MySQL work well for business metrics. I use all of these from a single Grafana instance.

Summary

In my view Grafana is the best open-source visualization platform available today; I have not found anything that matches it for the look and flexibility of its dashboards. Combining it with Prometheus for metrics, Loki for logs, and Tempo for traces gives you a powerful full observability stack.

The important thing is not to build so many dashboards that nobody looks at them. Pick the metrics that really matter, set up alerting that covers them, and use Dashboard as Code for version control. I believe a good monitoring system can cut troubleshooting time by more than 80%.