SiamCafe.net Blog
Technology

Tailscale Mesh Metric Collection: Collecting Metrics from a Mesh Network with Prometheus

2025-10-10 · Ajarn Bom — SiamCafe.net · 1,522 words

What Is Tailscale Mesh Metric Collection?

Tailscale Mesh Metric Collection means gathering metrics from every node in a Tailscale mesh network so you can monitor the health of the network: the status of connections, latency between nodes, and resource usage on each node. This data is essential for troubleshooting, capacity planning, and performance optimization.

The metrics worth collecting from a mesh network fall into four groups: Network Metrics such as latency, packet loss, and throughput between nodes; Connection Metrics such as direct vs. relay connections, handshake time, and connection duration; Node Metrics such as CPU, memory, and disk usage on each node; and Application Metrics such as request rate, error rate, and response time of services running in the mesh.

The architecture in this article combines Prometheus for time-series metrics, Grafana for dashboards, and Alertmanager for alert routing. Everything runs inside the Tailscale mesh, so Prometheus can scrape metrics from every node over the encrypted tunnels without exposing any ports to the internet.

Installing Tailscale and the Metric Exporters

Set up the Tailscale mesh together with the metric exporters:

# === Tailscale Mesh Metric Setup ===

# 1. Install Tailscale on all nodes
curl -fsSL https://tailscale.com/install.sh | sh
tailscale up --authkey=tskey-auth-xxxxx --hostname=monitor-node

# 2. Install Node Exporter (system metrics)
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xzf node_exporter-1.7.0.linux-amd64.tar.gz
sudo mv node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/

# Create a dedicated user and the systemd service
sudo useradd --no-create-home --shell /usr/sbin/nologin node_exporter 2>/dev/null || true
sudo tee /etc/systemd/system/node_exporter.service > /dev/null << 'EOF'
[Unit]
Description=Node Exporter
After=network.target

[Service]
User=node_exporter
ExecStart=/usr/local/bin/node_exporter \
    --web.listen-address=:9100 \
    --collector.systemd \
    --collector.processes \
    --collector.tcpstat \
    --collector.netstat.fields="^(.*)"
Restart=always

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl enable --now node_exporter

# 3. Install Tailscale Exporter (custom)
sudo tee /usr/local/bin/tailscale_exporter.py > /dev/null << 'PYEOF'
#!/usr/bin/env python3
"""Tailscale metrics exporter for Prometheus"""
import json
import subprocess
import time
from http.server import HTTPServer, BaseHTTPRequestHandler

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_response(404)
            self.end_headers()
            return
        
        metrics = self._collect_metrics()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(metrics.encode())
    
    def _collect_metrics(self):
        result = subprocess.run(
            ["tailscale", "status", "--json"],
            capture_output=True, text=True
        )
        data = json.loads(result.stdout)
        
        lines = []
        lines.append("# HELP tailscale_peers_total Total number of peers")
        lines.append("# TYPE tailscale_peers_total gauge")
        
        peers = data.get("Peer", {})
        online = sum(1 for p in peers.values() if p.get("Online"))
        total = len(peers)
        
        lines.append(f'tailscale_peers_total {total}')
        lines.append(f'tailscale_peers_online {online}')
        
        for key, peer in peers.items():
            hostname = peer.get("HostName", "unknown")
            is_online = 1 if peer.get("Online") else 0
            # CurAddr is non-empty when the peer has a direct endpoint; empty means DERP relay
            is_direct = 1 if peer.get("CurAddr") else 0
            
            lines.append(f'tailscale_peer_online{{hostname="{hostname}"}} {is_online}')
            lines.append(f'tailscale_peer_direct{{hostname="{hostname}"}} {is_direct}')
        
        return "\n".join(lines) + "\n"

if __name__ == "__main__":
    server = HTTPServer(("0.0.0.0", 9191), MetricsHandler)
    print("Tailscale exporter listening on :9191")
    server.serve_forever()
PYEOF

sudo chmod +x /usr/local/bin/tailscale_exporter.py
# Run as systemd service similarly

# 4. Install Blackbox Exporter (probe latency)
wget https://github.com/prometheus/blackbox_exporter/releases/download/v0.25.0/blackbox_exporter-0.25.0.linux-amd64.tar.gz
tar xzf blackbox_exporter-0.25.0.linux-amd64.tar.gz
sudo mv blackbox_exporter-0.25.0.linux-amd64/blackbox_exporter /usr/local/bin/

sudo tee /etc/blackbox.yml > /dev/null << 'EOF'
modules:
  icmp:
    prober: icmp
    timeout: 5s
  tcp_connect:
    prober: tcp
    timeout: 5s
  http_2xx:
    prober: http
    timeout: 10s
    http:
      valid_status_codes: [200, 301, 302]
      follow_redirects: true
EOF

echo "Metric exporters installed"
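Before wiring Prometheus to the custom exporter, it can help to sanity-check that its output follows the Prometheus text exposition format. Below is a minimal parser sketch for that check (illustrative only and deliberately simplified — it assumes no spaces inside label values; the official `prometheus_client` library ships a proper parser):

```python
# Quick sanity check for the exporter's text exposition output (illustrative helper)

def parse_exposition(text):
    """Parse Prometheus text format into {metric_name: [(labels, value), ...]}."""
    metrics = {}
    for line in text.strip().splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip HELP/TYPE comments and blank lines
        name_part, _, value = line.rpartition(" ")
        if "{" in name_part:
            name, _, labels = name_part.partition("{")
            labels = labels.rstrip("}")
        else:
            name, labels = name_part, ""
        metrics.setdefault(name, []).append((labels, float(value)))
    return metrics

sample = """\
# HELP tailscale_peers_total Total number of peers
# TYPE tailscale_peers_total gauge
tailscale_peers_total 5
tailscale_peers_online 4
tailscale_peer_online{hostname="edge-bkk-01"} 1
"""
parsed = parse_exposition(sample)
print(parsed["tailscale_peers_total"])   # [('', 5.0)]
print(parsed["tailscale_peer_online"])
```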

Building the Metric Collection Pipeline

Configure Prometheus for mesh metric collection:

# === Prometheus Configuration ===

cat > /etc/prometheus/prometheus.yml << 'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  scrape_timeout: 10s

rule_files:
  - "alerts/*.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:9093"]

scrape_configs:
  # Prometheus self-monitoring
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # Node Exporter (system metrics via Tailscale IPs)
  - job_name: "node"
    static_configs:
      - targets:
          - "100.64.0.1:9100"   # cloud-api-01
          - "100.64.0.2:9100"   # cloud-api-02
          - "100.64.0.10:9100"  # edge-bkk-01
          - "100.64.0.11:9100"  # edge-cnx-01
          - "100.64.0.12:9100"  # edge-hdy-01
    relabel_configs:
      - source_labels: [__address__]
        regex: "100\\.64\\.0\\.1:.*"
        target_label: hostname
        replacement: "cloud-api-01"
      - source_labels: [__address__]
        regex: "100\\.64\\.0\\.2:.*"
        target_label: hostname
        replacement: "cloud-api-02"

  # Tailscale Exporter
  - job_name: "tailscale"
    static_configs:
      - targets: ["localhost:9191"]

  # Blackbox Exporter (mesh latency probes)
  - job_name: "mesh_latency"
    metrics_path: /probe
    params:
      module: [icmp]
    static_configs:
      - targets:
          - "100.64.0.1"
          - "100.64.0.2"
          - "100.64.0.10"
          - "100.64.0.11"
          - "100.64.0.12"
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: "localhost:9115"

  # Application Metrics
  - job_name: "applications"
    static_configs:
      - targets:
          - "100.64.0.1:8080"   # API server metrics
          - "100.64.0.2:8080"   # API server metrics
    metrics_path: /metrics
EOF

# Start Prometheus
prometheus --config.file=/etc/prometheus/prometheus.yml \
    --storage.tsdb.retention.time=90d \
    --storage.tsdb.retention.size=50GB

echo "Prometheus configured"
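The three blackbox `relabel_configs` rules are easy to misread. Modeled in plain Python on a single hypothetical target, they do the following: copy the address into the probe's `target` parameter, keep it as the `instance` label, and redirect the actual scrape to the local blackbox exporter:

```python
# Simulates the blackbox relabel_configs above (illustrative model, not a Prometheus API)
def blackbox_relabel(target):
    labels = {"__address__": target}
    labels["__param_target"] = labels["__address__"]   # rule 1: address -> ?target=...
    labels["instance"] = labels["__param_target"]      # rule 2: keep a readable instance label
    labels["__address__"] = "localhost:9115"           # rule 3: scrape the blackbox exporter itself
    return labels

result = blackbox_relabel("100.64.0.1")
print(result)
# {'__address__': 'localhost:9115', '__param_target': '100.64.0.1', 'instance': '100.64.0.1'}
```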

Prometheus and Grafana Dashboards

Build Grafana dashboards for mesh monitoring:

#!/usr/bin/env python3
# mesh_dashboard.py - Grafana Dashboard Generator
import json
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("dashboard")

class GrafanaDashboardBuilder:
    def __init__(self):
        self.panels = []
    
    def mesh_overview_queries(self):
        return {
            "total_nodes": 'count(tailscale_peer_online)',
            "online_nodes": 'count(tailscale_peer_online == 1)',
            "offline_nodes": 'count(tailscale_peer_online == 0)',
            "direct_connections_pct": 'count(tailscale_peer_direct == 1) / count(tailscale_peer_direct) * 100',
            "avg_latency_ms": 'avg(probe_duration_seconds{job="mesh_latency"}) * 1000',
            # probe_duration_seconds is a gauge, so use quantile() rather than histogram_quantile()
            "p95_latency_ms": 'quantile(0.95, probe_duration_seconds{job="mesh_latency"}) * 1000',
            "packet_loss_pct": '(1 - avg(probe_success{job="mesh_latency"})) * 100',
        }
    
    def node_health_queries(self):
        return {
            "cpu_usage": '100 - (avg by(hostname) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)',
            "memory_usage": '(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100',
            "disk_usage": '100 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100)',
            "network_rx_rate": 'rate(node_network_receive_bytes_total{device!="lo"}[5m])',
            "network_tx_rate": 'rate(node_network_transmit_bytes_total{device!="lo"}[5m])',
            "load_average": 'node_load1',
            "open_connections": 'node_netstat_Tcp_CurrEstab',
        }
    
    def alert_rules(self):
        return {
            "groups": [
                {
                    "name": "mesh_alerts",
                    "rules": [
                        {
                            "alert": "NodeOffline",
                            "expr": 'tailscale_peer_online == 0',
                            "for": "5m",
                            "labels": {"severity": "critical"},
                            "annotations": {"summary": "Node {{ $labels.hostname }} is offline"},
                        },
                        {
                            "alert": "HighLatency",
                            "expr": 'probe_duration_seconds{job="mesh_latency"} > 0.1',
                            "for": "10m",
                            "labels": {"severity": "warning"},
                            "annotations": {"summary": "High latency to {{ $labels.instance }}: {{ $value }}s"},
                        },
                        {
                            "alert": "RelayConnection",
                            "expr": 'tailscale_peer_direct == 0 and tailscale_peer_online == 1',
                            "for": "15m",
                            "labels": {"severity": "warning"},
                            "annotations": {"summary": "{{ $labels.hostname }} using relay (not direct)"},
                        },
                        {
                            "alert": "HighCPU",
                            "expr": '100 - (avg by(hostname) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85',
                            "for": "10m",
                            "labels": {"severity": "warning"},
                            "annotations": {"summary": "High CPU on {{ $labels.hostname }}: {{ $value }}%"},
                        },
                        {
                            "alert": "DiskFull",
                            "expr": 'node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.1',
                            "for": "5m",
                            "labels": {"severity": "critical"},
                            "annotations": {"summary": "Disk almost full on {{ $labels.hostname }}"},
                        },
                    ],
                },
            ],
        }

builder = GrafanaDashboardBuilder()
overview = builder.mesh_overview_queries()
print("Overview Queries:")
for name, query in overview.items():
    print(f"  {name}: {query}")

alerts = builder.alert_rules()
print(f"\nAlert Rules: {len(alerts['groups'][0]['rules'])} rules defined")
for rule in alerts["groups"][0]["rules"]:
    print(f"  - {rule['alert']}: {rule['labels']['severity']}")
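The PromQL overview queries can be cross-checked by hand against raw peer values. A sketch that mirrors `online_nodes` and `direct_connections_pct` on hypothetical sample data:

```python
# Mirrors the overview PromQL queries on sample data (illustrative values)
peers = {
    "cloud-api-01": {"online": 1, "direct": 1},
    "edge-bkk-01": {"online": 1, "direct": 0},
    "edge-cnx-01": {"online": 0, "direct": 0},
}
total = len(peers)                                        # count(tailscale_peer_online)
online = sum(p["online"] for p in peers.values())         # count(tailscale_peer_online == 1)
direct_pct = sum(p["direct"] for p in peers.values()) / total * 100
print(f"online {online}/{total}, direct {direct_pct:.1f}%")  # online 2/3, direct 33.3%
```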

Alerting and Anomaly Detection

Set up alerting for the mesh network:

# === Alertmanager Configuration ===

cat > /etc/alertmanager/alertmanager.yml << 'EOF'
global:
  resolve_timeout: 5m
  slack_api_url: "https://hooks.slack.com/services/xxx/yyy/zzz"

route:
  receiver: "default"
  group_by: ["alertname", "severity"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: "critical-alerts"
      repeat_interval: 1h
    - match:
        severity: warning
      receiver: "warning-alerts"
      repeat_interval: 4h

receivers:
  - name: "default"
    slack_configs:
      - channel: "#monitoring"
        title: "{{ .GroupLabels.alertname }}"
        text: "{{ range .Alerts }}{{ .Annotations.summary }}\n{{ end }}"

  - name: "critical-alerts"
    slack_configs:
      - channel: "#critical-alerts"
        title: "CRITICAL: {{ .GroupLabels.alertname }}"
        text: "{{ range .Alerts }}{{ .Annotations.summary }}\n{{ end }}"
        color: "danger"
    webhook_configs:
      - url: "https://pagerduty.example.com/webhook"

  - name: "warning-alerts"
    slack_configs:
      - channel: "#monitoring"
        title: "WARNING: {{ .GroupLabels.alertname }}"
        text: "{{ range .Alerts }}{{ .Annotations.summary }}\n{{ end }}"
        color: "warning"

inhibit_rules:
  - source_match:
      severity: "critical"
    target_match:
      severity: "warning"
    equal: ["alertname", "hostname"]
EOF

# Recording Rules (pre-compute expensive queries)
cat > /etc/prometheus/alerts/recording.yml << 'EOF'
groups:
  - name: mesh_recording
    interval: 30s
    rules:
      - record: mesh:health_score
        expr: |
          (count(tailscale_peer_online == 1) / count(tailscale_peer_online)) * 100
      
      - record: mesh:avg_latency_ms
        expr: |
          avg(probe_duration_seconds{job="mesh_latency"}) * 1000
      
      - record: node:cpu_usage_pct
        expr: |
          100 - (avg by(hostname) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
      
      - record: node:memory_usage_pct
        expr: |
          (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
EOF

echo "Alerting configured"
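The inhibit rule above means a firing critical alert silences warnings that share the same `alertname` and `hostname`, so one incident does not page twice. A simplified Python model of that behavior (not the actual Alertmanager implementation):

```python
# Sketch of the inhibit rule: a firing critical suppresses matching warnings
def apply_inhibition(alerts, equal=("alertname", "hostname")):
    criticals = [a for a in alerts if a["severity"] == "critical"]
    kept = []
    for a in alerts:
        suppressed = a["severity"] == "warning" and any(
            all(a.get(k) == c.get(k) for k in equal) for c in criticals
        )
        if not suppressed:
            kept.append(a)
    return kept

alerts = [
    {"alertname": "NodeOffline", "hostname": "edge-bkk-01", "severity": "critical"},
    {"alertname": "NodeOffline", "hostname": "edge-bkk-01", "severity": "warning"},
    {"alertname": "HighCPU", "hostname": "cloud-api-01", "severity": "warning"},
]
for a in apply_inhibition(alerts):
    print(a["alertname"], a["severity"])
# NodeOffline critical
# HighCPU warning
```

The warning that duplicates the firing critical is dropped; the unrelated HighCPU warning still goes out.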

Advanced Monitoring Strategies

Advanced monitoring strategies for the mesh:

#!/usr/bin/env python3
# advanced_monitoring.py - Advanced Mesh Monitoring
import json
import logging
from datetime import datetime
from typing import Dict, List

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("advanced")

class MeshHealthAnalyzer:
    def __init__(self):
        self.history = []
    
    def full_mesh_test(self, nodes):
        """Test connectivity between every node pair (returns simulated results for demonstration)"""
        results = []
        for i, src in enumerate(nodes):
            for j, dst in enumerate(nodes):
                if i >= j:
                    continue
                results.append({
                    "source": src,
                    "destination": dst,
                    "latency_ms": 5 + (i + j) * 2,
                    "direct": True,
                    "packet_loss_pct": 0,
                })
        return results
    
    def topology_analysis(self):
        return {
            "total_nodes": 8,
            "total_connections": 28,
            "direct_connections": 24,
            "relay_connections": 4,
            "average_hops": 1.1,
            "network_diameter": 2,
            "bottleneck_nodes": ["gateway-01"],
            "isolated_risk": [],
            "recommendations": [
                "gateway-01 handles 60% of traffic - consider adding second gateway",
                "edge-hdy-01 has 4 relay connections - check NAT configuration",
                "Consider subnet routing for edge-cnx cluster",
            ],
        }
    
    def capacity_forecast(self):
        return {
            "current_usage": {
                "bandwidth_avg_mbps": 45,
                "bandwidth_peak_mbps": 120,
                "connections_avg": 850,
                "connections_peak": 1200,
            },
            "forecast_30d": {
                "bandwidth_avg_mbps": 52,
                "bandwidth_peak_mbps": 140,
                "connections_avg": 980,
                "connections_peak": 1400,
            },
            "capacity_limits": {
                "bandwidth_max_mbps": 500,
                "connections_max": 5000,
            },
            "time_to_capacity": {
                "bandwidth": "8 months at current growth",
                "connections": "12 months at current growth",
            },
        }

analyzer = MeshHealthAnalyzer()
nodes = ["cloud-01", "cloud-02", "edge-bkk", "edge-cnx", "edge-hdy"]
mesh_test = analyzer.full_mesh_test(nodes)
print(f"Mesh test: {len(mesh_test)} paths tested")

topology = analyzer.topology_analysis()
print("\nTopology:", json.dumps(topology["recommendations"], indent=2))

forecast = analyzer.capacity_forecast()
print("\nForecast:", json.dumps(forecast["time_to_capacity"], indent=2))
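The time-to-capacity figures in the forecast come from extrapolating recent growth against a hard limit. A minimal linear version of that calculation (the numbers below are illustrative, not the article's measurements):

```python
# Linear time-to-capacity estimate (illustrative numbers only)
def months_to_capacity(current, monthly_growth, limit):
    """Months until `current` reaches `limit`, assuming constant linear growth."""
    if monthly_growth <= 0:
        return None  # no growth: the limit is never reached
    return (limit - current) / monthly_growth

# e.g. peak bandwidth 120 Mbps today, growing 20 Mbps/month, capped at 500 Mbps
print(round(months_to_capacity(current=120, monthly_growth=20, limit=500), 1))  # 19.0
```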

Frequently Asked Questions (FAQ)

Q: Is it safe for Prometheus to scrape metrics over Tailscale?

A: It is safe. Tailscale encrypts all traffic inside the mesh network with WireGuard, and Prometheus scrapes over Tailscale IPs (100.x.x.x), so none of the exporter ports need to be exposed to the internet; only nodes inside the mesh can reach them. For additional safety, use Tailscale ACLs so that only the monitoring node is allowed to scrape, and other nodes cannot access port 9100 at all.
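One way to scope scrape access is an ACL rule like the following, sketched as a Tailscale policy fragment (the `tag:monitoring` name is a hypothetical example, and it assumes the default allow-all rule has been removed so the policy is default-deny; adapt to your tailnet):

```json
{
  "tagOwners": { "tag:monitoring": ["autogroup:admin"] },
  "acls": [
    { "action": "accept", "src": ["tag:monitoring"], "dst": ["*:9100", "*:9191", "*:9115"] }
  ]
}
```

With this in place, only devices tagged `tag:monitoring` can reach the exporter ports on any node.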

Q: How long should metrics be retained?

A: It depends on the use case: 15 days at full (15-30 second) resolution for real-time monitoring, 90 days at 1-minute resolution for trend analysis, and 1 year at 5-minute resolution for capacity planning. Use recording rules to downsample data for long-term storage, or add Thanos/Cortex for long-term storage beyond a single Prometheus. Prometheus disk usage is roughly 1-2 bytes per sample; as a rule of thumb, budget about ~5GB per month for 1,000 metrics scraped every 15 seconds.

Q: What is the difference between a direct connection and a relay connection?

A: A direct connection means two nodes talk to each other over a peer-to-peer WireGuard tunnel: latency is low (1-5ms on a LAN, 10-50ms across regions) and throughput is high; this is Tailscale's default whenever it can be established. A relay connection goes through Tailscale's DERP (Designated Encrypted Relay for Packets) servers: latency is higher (50-200ms) and throughput is lower. Relays are used when NAT traversal fails, for example behind symmetric NAT. To encourage direct connections, open UDP port 41641, use an exit node, or set up a subnet router.

Q: How do I prevent alert fatigue?

A: Alert fatigue happens when so many alerts fire that people start ignoring them. To prevent it: tune thresholds so you only alert on conditions that require action; use the for clause in Prometheus so an alert fires only after the condition has persisted, not on a single spike; use inhibit rules to suppress warnings while a related critical is firing; group related alerts so one incident produces one notification; route by severity (critical to PagerDuty, warning to Slack, info to dashboards only); and hold regular alert reviews to remove alerts that never lead to action.
