Tailscale Mesh Metric Collection
Tailscale mesh metric collection is the practice of gathering metrics from the nodes in a Tailscale mesh network in order to monitor the health of that network: the state of connections, latency between nodes, and resource usage on each node. The collected data supports troubleshooting, capacity planning, and performance optimization.
The metrics worth collecting in a mesh network fall into four groups:
- Network Metrics: latency, packet loss, and throughput between nodes
- Connection Metrics: direct vs relay connections, handshake time, connection duration
- Node Metrics: CPU, memory, and disk usage on each node
- Application Metrics: request rate, error rate, and response time of the services running in the mesh
The architecture described here uses Prometheus for time-series metrics, Grafana for dashboards, and Alertmanager for alert routing, with every component communicating over the Tailscale mesh. Prometheus scrapes metrics from nodes over their Tailscale IPs, so no metric ports need to be exposed to the public internet.
Installing Tailscale and the Metric Exporters
Set up the Tailscale mesh together with metric exporters:
# === Tailscale Mesh Metric Setup ===
# 1. Install Tailscale on all nodes
curl -fsSL https://tailscale.com/install.sh | sh
tailscale up --authkey=tskey-auth-xxxxx --hostname=monitor-node
# 2. Install Node Exporter (system metrics)
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xzf node_exporter-1.7.0.linux-amd64.tar.gz
sudo mv node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/
# Create a dedicated user and a systemd service
sudo useradd --system --shell /usr/sbin/nologin node_exporter 2>/dev/null || true
sudo tee /etc/systemd/system/node_exporter.service > /dev/null << 'EOF'
[Unit]
Description=Node Exporter
After=network.target

[Service]
User=node_exporter
ExecStart=/usr/local/bin/node_exporter \
    --web.listen-address=:9100 \
    --collector.systemd \
    --collector.processes \
    --collector.tcpstat \
    --collector.netstat.fields="^(.*)"
Restart=always

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl enable --now node_exporter
# 3. Install Tailscale Exporter (custom)
sudo tee /usr/local/bin/tailscale_exporter.py > /dev/null << 'PYEOF'
#!/usr/bin/env python3
"""Tailscale metrics exporter for Prometheus"""
import json
import subprocess
from http.server import HTTPServer, BaseHTTPRequestHandler


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_response(404)
            self.end_headers()
            return
        metrics = self._collect_metrics()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(metrics.encode())

    def _collect_metrics(self):
        result = subprocess.run(
            ["tailscale", "status", "--json"],
            capture_output=True, text=True
        )
        data = json.loads(result.stdout)
        lines = []
        lines.append("# HELP tailscale_peers_total Total number of peers")
        lines.append("# TYPE tailscale_peers_total gauge")
        peers = data.get("Peer", {})
        online = sum(1 for p in peers.values() if p.get("Online"))
        total = len(peers)
        lines.append(f"tailscale_peers_total {total}")
        lines.append(f"tailscale_peers_online {online}")
        for peer in peers.values():
            hostname = peer.get("HostName", "unknown")
            is_online = 1 if peer.get("Online") else 0
            # An empty Relay field means the peer is connected directly
            is_direct = 1 if not peer.get("Relay") else 0
            lines.append(f'tailscale_peer_online{{hostname="{hostname}"}} {is_online}')
            lines.append(f'tailscale_peer_direct{{hostname="{hostname}"}} {is_direct}')
        return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Consider binding to the node's Tailscale IP instead of 0.0.0.0
    server = HTTPServer(("0.0.0.0", 9191), MetricsHandler)
    print("Tailscale exporter listening on :9191")
    server.serve_forever()
PYEOF
sudo chmod +x /usr/local/bin/tailscale_exporter.py
# Run as systemd service similarly
# 4. Install Blackbox Exporter (probe latency)
wget https://github.com/prometheus/blackbox_exporter/releases/download/v0.25.0/blackbox_exporter-0.25.0.linux-amd64.tar.gz
tar xzf blackbox_exporter-0.25.0.linux-amd64.tar.gz
sudo mv blackbox_exporter-0.25.0.linux-amd64/blackbox_exporter /usr/local/bin/
sudo tee /etc/blackbox.yml > /dev/null << 'EOF'
modules:
  icmp:
    prober: icmp
    timeout: 5s
  tcp_connect:
    prober: tcp
    timeout: 5s
  http_2xx:
    prober: http
    timeout: 10s
    http:
      valid_status_codes: [200, 301, 302]
      follow_redirects: true
EOF
echo "Metric exporters installed"
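
The exporter's parsing logic can be exercised offline against a captured snapshot of `tailscale status --json`; a minimal sketch (the snapshot below is invented, only its shape mimics real output):

```python
def peer_metrics(status: dict) -> dict:
    """Derive the same gauges the exporter emits from a status snapshot."""
    peers = status.get("Peer", {})
    online = sum(1 for p in peers.values() if p.get("Online"))
    # An empty Relay field means the peer is connected directly
    direct = sum(1 for p in peers.values() if p.get("Online") and not p.get("Relay"))
    return {"total": len(peers), "online": online, "direct": direct}

# Invented fixture mimicking the shape of `tailscale status --json`
snapshot = {"Peer": {
    "k1": {"HostName": "edge-bkk-01", "Online": True, "Relay": ""},
    "k2": {"HostName": "edge-cnx-01", "Online": True, "Relay": "sin"},
    "k3": {"HostName": "edge-hdy-01", "Online": False, "Relay": ""},
}}
print(peer_metrics(snapshot))  # → {'total': 3, 'online': 2, 'direct': 1}
```

Testing against fixtures like this catches parsing regressions without needing a live tailnet.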
Configuring the Metric Collection Pipeline
Configure Prometheus for mesh metric collection:
# === Prometheus Configuration ===
sudo tee /etc/prometheus/prometheus.yml > /dev/null << 'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  scrape_timeout: 10s

rule_files:
  - "alerts/*.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:9093"]

scrape_configs:
  # Prometheus self-monitoring
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # Node Exporter (system metrics via Tailscale IPs)
  - job_name: "node"
    static_configs:
      - targets:
          - "100.64.0.1:9100"   # cloud-api-01
          - "100.64.0.2:9100"   # cloud-api-02
          - "100.64.0.10:9100"  # edge-bkk-01
          - "100.64.0.11:9100"  # edge-cnx-01
          - "100.64.0.12:9100"  # edge-hdy-01
    relabel_configs:
      - source_labels: [__address__]
        regex: '100\.64\.0\.1:.*'
        target_label: hostname
        replacement: "cloud-api-01"
      - source_labels: [__address__]
        regex: '100\.64\.0\.2:.*'
        target_label: hostname
        replacement: "cloud-api-02"

  # Tailscale Exporter
  - job_name: "tailscale"
    static_configs:
      - targets: ["localhost:9191"]

  # Blackbox Exporter (mesh latency probes)
  - job_name: "mesh_latency"
    metrics_path: /probe
    params:
      module: [icmp]
    static_configs:
      - targets:
          - "100.64.0.1"
          - "100.64.0.2"
          - "100.64.0.10"
          - "100.64.0.11"
          - "100.64.0.12"
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: "localhost:9115"

  # Application metrics
  - job_name: "applications"
    metrics_path: /metrics
    static_configs:
      - targets:
          - "100.64.0.1:8080"  # API server metrics
          - "100.64.0.2:8080"  # API server metrics
EOF
# Start Prometheus
prometheus --config.file=/etc/prometheus/prometheus.yml \
--storage.tsdb.retention.time=90d \
--storage.tsdb.retention.size=50GB
echo "Prometheus configured"
Prometheus Queries and Grafana Dashboards
Build Grafana dashboards for mesh monitoring:
#!/usr/bin/env python3
# mesh_dashboard.py: Grafana Dashboard Generator
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("dashboard")


class GrafanaDashboardBuilder:
    def __init__(self):
        self.panels = []

    def mesh_overview_queries(self):
        return {
            "total_nodes": 'count(tailscale_peer_online)',
            "online_nodes": 'count(tailscale_peer_online == 1)',
            "offline_nodes": 'count(tailscale_peer_online == 0)',
            "direct_connections_pct": 'count(tailscale_peer_direct == 1) / count(tailscale_peer_direct) * 100',
            "avg_latency_ms": 'avg(probe_duration_seconds{job="mesh_latency"}) * 1000',
            # probe_duration_seconds is a gauge, so aggregate it with
            # quantile() rather than histogram_quantile(), which needs buckets
            "p95_latency_ms": 'quantile(0.95, probe_duration_seconds{job="mesh_latency"}) * 1000',
            "packet_loss_pct": '(1 - avg(probe_success{job="mesh_latency"})) * 100',
        }

    def node_health_queries(self):
        return {
            "cpu_usage": '100 - (avg by(hostname) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)',
            "memory_usage": '(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100',
            "disk_usage": '100 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100)',
            "network_rx_rate": 'rate(node_network_receive_bytes_total{device!="lo"}[5m])',
            "network_tx_rate": 'rate(node_network_transmit_bytes_total{device!="lo"}[5m])',
            "load_average": 'node_load1',
            "open_connections": 'node_netstat_Tcp_CurrEstab',
        }

    def alert_rules(self):
        return {
            "groups": [
                {
                    "name": "mesh_alerts",
                    "rules": [
                        {
                            "alert": "NodeOffline",
                            "expr": 'tailscale_peer_online == 0',
                            "for": "5m",
                            "labels": {"severity": "critical"},
                            "annotations": {"summary": "Node {{ $labels.hostname }} is offline"},
                        },
                        {
                            "alert": "HighLatency",
                            "expr": 'probe_duration_seconds{job="mesh_latency"} > 0.1',
                            "for": "10m",
                            "labels": {"severity": "warning"},
                            "annotations": {"summary": "High latency to {{ $labels.instance }}: {{ $value }}s"},
                        },
                        {
                            "alert": "RelayConnection",
                            "expr": 'tailscale_peer_direct == 0 and tailscale_peer_online == 1',
                            "for": "15m",
                            "labels": {"severity": "warning"},
                            "annotations": {"summary": "{{ $labels.hostname }} using relay (not direct)"},
                        },
                        {
                            "alert": "HighCPU",
                            "expr": '100 - (avg by(hostname) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85',
                            "for": "10m",
                            "labels": {"severity": "warning"},
                            "annotations": {"summary": "High CPU on {{ $labels.hostname }}: {{ $value }}%"},
                        },
                        {
                            "alert": "DiskFull",
                            "expr": 'node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.1',
                            "for": "5m",
                            "labels": {"severity": "critical"},
                            "annotations": {"summary": "Disk almost full on {{ $labels.hostname }}"},
                        },
                    ],
                },
            ],
        }


builder = GrafanaDashboardBuilder()
overview = builder.mesh_overview_queries()
print("Overview Queries:")
for name, query in overview.items():
    print(f"  {name}: {query}")

alerts = builder.alert_rules()
print(f"\nAlert Rules: {len(alerts['groups'][0]['rules'])} rules defined")
for rule in alerts["groups"][0]["rules"]:
    print(f"  - {rule['alert']}: {rule['labels']['severity']}")
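
The builder above only collects PromQL strings; to make something Grafana can import you would wrap them in dashboard JSON. A simplified sketch (real dashboards also carry datasource, gridPos, and more fields, so treat this schema as illustrative):

```python
import json

def build_dashboard(title, queries):
    """Wrap PromQL queries in a minimal Grafana-style dashboard skeleton."""
    panels = [
        {"type": "timeseries", "title": name, "targets": [{"expr": expr}]}
        for name, expr in queries.items()
    ]
    return {"title": title, "panels": panels}

dash = build_dashboard("Mesh Overview", {
    "online_nodes": "count(tailscale_peer_online == 1)",
    "avg_latency_ms": 'avg(probe_duration_seconds{job="mesh_latency"}) * 1000',
})
print(json.dumps(dash, indent=2))
```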
Alerting and Anomaly Detection
Configure alerting for the mesh network:
# === Alertmanager Configuration ===
sudo tee /etc/alertmanager/alertmanager.yml > /dev/null << 'EOF'
global:
  resolve_timeout: 5m
  slack_api_url: "https://hooks.slack.com/services/xxx/yyy/zzz"

route:
  receiver: "default"
  group_by: ["alertname", "severity"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: "critical-alerts"
      repeat_interval: 1h
    - match:
        severity: warning
      receiver: "warning-alerts"
      repeat_interval: 4h

receivers:
  - name: "default"
    slack_configs:
      - channel: "#monitoring"
        title: "{{ .GroupLabels.alertname }}"
        text: "{{ range .Alerts }}{{ .Annotations.summary }}\n{{ end }}"
  - name: "critical-alerts"
    slack_configs:
      - channel: "#critical-alerts"
        title: "CRITICAL: {{ .GroupLabels.alertname }}"
        text: "{{ range .Alerts }}{{ .Annotations.summary }}\n{{ end }}"
        color: "danger"
    webhook_configs:
      - url: "https://pagerduty.example.com/webhook"
  - name: "warning-alerts"
    slack_configs:
      - channel: "#monitoring"
        title: "WARNING: {{ .GroupLabels.alertname }}"
        text: "{{ range .Alerts }}{{ .Annotations.summary }}\n{{ end }}"
        color: "warning"

inhibit_rules:
  - source_match:
      severity: "critical"
    target_match:
      severity: "warning"
    equal: ["alertname", "hostname"]
EOF
# Recording Rules (pre-compute expensive queries)
sudo tee /etc/prometheus/alerts/recording.yml > /dev/null << 'EOF'
groups:
  - name: mesh_recording
    interval: 30s
    rules:
      - record: mesh:health_score
        expr: |
          (count(tailscale_peer_online == 1) / count(tailscale_peer_online)) * 100
      - record: mesh:avg_latency_ms
        expr: |
          avg(probe_duration_seconds{job="mesh_latency"}) * 1000
      - record: node:cpu_usage_pct
        expr: |
          100 - (avg by(hostname) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
      - record: node:memory_usage_pct
        expr: |
          (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
EOF
echo "Alerting configured"
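
The inhibit rule above can be illustrated in a few lines of Python (the alert dicts are hypothetical, shaped loosely like Alertmanager label sets):

```python
def inhibited(alert, firing, equal=("alertname", "hostname")):
    """True if a warning is suppressed by a firing critical that agrees
    on all labels in `equal` — mirroring the inhibit_rules block above."""
    if alert["labels"].get("severity") != "warning":
        return False
    return any(
        other["labels"].get("severity") == "critical"
        and all(other["labels"].get(k) == alert["labels"].get(k) for k in equal)
        for other in firing
    )

crit = {"labels": {"alertname": "NodeOffline", "hostname": "edge-hdy-01", "severity": "critical"}}
warn = {"labels": {"alertname": "NodeOffline", "hostname": "edge-hdy-01", "severity": "warning"}}
other = {"labels": {"alertname": "NodeOffline", "hostname": "edge-bkk-01", "severity": "warning"}}
print(inhibited(warn, [crit]))   # → True  (same alertname and hostname)
print(inhibited(other, [crit]))  # → False (different hostname)
```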
Advanced Monitoring Strategies
Monitoring strategies beyond basic metric collection:
#!/usr/bin/env python3
# advanced_monitoring.py: Advanced Mesh Monitoring
import json
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("advanced")


class MeshHealthAnalyzer:
    def __init__(self):
        self.history = []

    def full_mesh_test(self, nodes):
        """Test connectivity between all node pairs.

        Latency figures here are simulated placeholders; a real
        implementation would probe each pair over its Tailscale IP.
        """
        results = []
        for i, src in enumerate(nodes):
            for j, dst in enumerate(nodes):
                if i >= j:
                    continue
                results.append({
                    "source": src,
                    "destination": dst,
                    "latency_ms": 5 + (i + j) * 2,  # simulated
                    "direct": True,
                    "packet_loss_pct": 0,
                })
        return results

    def topology_analysis(self):
        # Static example report; in practice derive these from live metrics
        return {
            "total_nodes": 8,
            "total_connections": 28,
            "direct_connections": 24,
            "relay_connections": 4,
            "average_hops": 1.1,
            "network_diameter": 2,
            "bottleneck_nodes": ["gateway-01"],
            "isolated_risk": [],
            "recommendations": [
                "gateway-01 handles 60% of traffic - consider adding second gateway",
                "edge-hdy-01 has 4 relay connections - check NAT configuration",
                "Consider subnet routing for edge-cnx cluster",
            ],
        }

    def capacity_forecast(self):
        # Static example report illustrating the forecast shape
        return {
            "current_usage": {
                "bandwidth_avg_mbps": 45,
                "bandwidth_peak_mbps": 120,
                "connections_avg": 850,
                "connections_peak": 1200,
            },
            "forecast_30d": {
                "bandwidth_avg_mbps": 52,
                "bandwidth_peak_mbps": 140,
                "connections_avg": 980,
                "connections_peak": 1400,
            },
            "capacity_limits": {
                "bandwidth_max_mbps": 500,
                "connections_max": 5000,
            },
            "time_to_capacity": {
                "bandwidth": "8 months at current growth",
                "connections": "12 months at current growth",
            },
        }


analyzer = MeshHealthAnalyzer()
nodes = ["cloud-01", "cloud-02", "edge-bkk", "edge-cnx", "edge-hdy"]
mesh_test = analyzer.full_mesh_test(nodes)
print(f"Mesh test: {len(mesh_test)} paths tested")
topology = analyzer.topology_analysis()
print("\nTopology:", json.dumps(topology["recommendations"], indent=2))
forecast = analyzer.capacity_forecast()
print("\nForecast:", json.dumps(forecast["time_to_capacity"], indent=2))
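
full_mesh_test probes every unordered pair exactly once, so the number of paths grows quadratically: n*(n-1)/2 for n nodes. A one-liner makes the scaling concrete:

```python
def mesh_pairs(n: int) -> int:
    """Unique node pairs probed by a full-mesh test: n*(n-1)/2."""
    return n * (n - 1) // 2

print(mesh_pairs(5))   # → 10, matching the 5-node test above
print(mesh_pairs(8))   # → 28, matching total_connections in topology_analysis
print(mesh_pairs(50))  # → 1225: full-mesh probing gets expensive quickly
```

For larger tailnets it is worth probing a sampled subset of pairs per cycle rather than the full mesh.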
FAQ (Frequently Asked Questions)
Q: Is it safe for Prometheus to scrape metrics over Tailscale?
A: Yes. Tailscale encrypts all traffic inside the mesh with WireGuard, and Prometheus scrapes over Tailscale IPs (100.x.x.x), so no ports need to be exposed publicly; only nodes inside the mesh can reach the exporters. For extra hardening, use Tailscale ACLs so that only the monitoring node may scrape, and ordinary nodes cannot reach port 9100 on each other.
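
A Tailscale ACL that restricts scraping to the monitoring node might look like the sketch below (the tag names are hypothetical; adapt them to your tailnet's tagging scheme):

```json
{
  "acls": [
    // Only the monitoring node may reach the exporter ports
    {"action": "accept", "src": ["tag:monitoring"], "dst": ["tag:node:9100,9115,9191"]},
    // Ordinary node-to-node traffic restricted to application ports
    {"action": "accept", "src": ["tag:node"], "dst": ["tag:node:8080"]}
  ]
}
```

Tailscale ACL files are HuJSON, so the comments above are legal.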
Q: How long should metric retention be?
A: It depends on the use case: roughly 15 days at raw 15-30 s resolution for real-time monitoring, 90 days at 1-minute resolution for trend analysis, and 1 year at 5-minute resolution for capacity planning, using recording rules to downsample before long-term storage. For retention beyond a single Prometheus, use Thanos or Cortex. Disk usage is roughly 1-2 bytes per sample; 1,000 series scraped every 15 s produce about 170 million samples per month, i.e. a few hundred MB.
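
The rule of thumb converts into a quick estimator (1.5 bytes/sample is taken as a midpoint assumption; real usage varies with label churn and compression):

```python
def tsdb_bytes_per_month(series: int, scrape_interval_s: float,
                         bytes_per_sample: float = 1.5) -> float:
    """Rough Prometheus TSDB growth: samples per month x bytes per sample."""
    samples_per_month = series * (30 * 86400 / scrape_interval_s)
    return samples_per_month * bytes_per_sample

print(f"{tsdb_bytes_per_month(1_000, 15) / 1e6:.0f} MB/month")    # → 259 MB/month
print(f"{tsdb_bytes_per_month(100_000, 15) / 1e9:.1f} GB/month")  # → 25.9 GB/month
```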
Q: What is the difference between a direct connection and a relay connection?
A: In a direct connection the nodes talk to each other over a peer-to-peer WireGuard tunnel: latency is low (1-5 ms on a LAN, 10-50 ms across regions) and throughput is high; this is what Tailscale attempts by default. A relay connection goes through Tailscale's DERP (Designated Encrypted Relay for Packets) servers: latency is higher (50-200 ms) and throughput lower. Relays are used when NAT traversal fails, for example behind symmetric NAT; to encourage direct connections, open UDP port 41641, or route affected networks through an exit node or a subnet router.
Q: How do I reduce alert fatigue?
A: Alert fatigue sets in when so many alerts fire that people stop paying attention. Set thresholds so that every alert demands an action; use the "for" clause in Prometheus so an alert fires only when the condition persists past the threshold; use inhibit rules to suppress warnings while a related critical is firing; group related alerts so one incident produces one notification; route by severity (critical to PagerDuty, warning to Slack, info to dashboards only); and hold regular alert reviews to delete alerts nobody acts on.
