Segment Routing Metrics
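This guide covers metric collection for Segment Routing (SR-MPLS and SRv6) networks in production: which metrics to collect, how to stream them with telemetry, and how to monitor them with Prometheus and Grafana.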
| Metric Category | Examples | Collection Method | Interval |
|---|---|---|---|
| Interface | Bandwidth, Errors, Drops | SNMP / Telemetry | 10-60s |
| SR Policy | Active Path, Segment List | Telemetry / CLI | 30-60s |
| Latency | per-Link, per-Path delay | TWAMP / PM | 10-30s |
| IGP | Adjacency, Cost, Topology | Telemetry / SNMP | Event-driven |
| TE | Tunnel Util, Reserved BW | Telemetry / SNMP | 30-60s |
| PCE | Computation Time, Failures | PCE API / Logs | Per-request |
Telemetry Pipeline
# === Telemetry Collection Pipeline ===
# Architecture:
#   Router (gNMI) → Telegraf/Pipeline → Kafka → Prometheus → Grafana
#
# Cisco IOS-XR Telemetry Config:
# telemetry model-driven
#  destination-group PROMETHEUS
#   address-family ipv4 10.0.0.100 port 57000
#    encoding self-describing-gpb
#    protocol grpc no-tls
#  sensor-group INTERFACE
#   sensor-path Cisco-IOS-XR-infra-statsd-oper:infra-statistics/interfaces/interface/latest/generic-counters
#  sensor-group SR-POLICY
#   sensor-path Cisco-IOS-XR-infra-xtc-agent-oper:xtc/policy-forwardings
#  sensor-group IGP
#   sensor-path Cisco-IOS-XR-clns-isis-oper:isis/instances/instance/neighbors
#  subscription MONITORING
#   sensor-group-id INTERFACE sample-interval 10000
#   sensor-group-id SR-POLICY sample-interval 30000
#   sensor-group-id IGP sample-interval 60000
#   destination-id PROMETHEUS
from dataclasses import dataclass

@dataclass
class TelemetrySource:
    """One telemetry feed and how it is collected."""
    source: str
    protocol: str
    yang_model: str
    interval: str
    data_points: str

sources = [
    TelemetrySource("Interface Counters",
        "gNMI / MDT (Model-Driven Telemetry)",
        "Cisco-IOS-XR-infra-statsd-oper",
        "10 seconds",
        "bytes_in/out, packets_in/out, errors, drops"),
    TelemetrySource("SR Policy Status",
        "gNMI / MDT",
        "Cisco-IOS-XR-infra-xtc-agent-oper",
        "30 seconds",
        "policy_name, color, endpoint, active_path, segment_list"),
    TelemetrySource("ISIS Neighbors",
        "gNMI / MDT",
        "Cisco-IOS-XR-clns-isis-oper",
        "60 seconds (event-driven)",
        "neighbor_id, state, interface, level, hold_time"),
    TelemetrySource("Performance Measurement",
        "gNMI / MDT",
        "Cisco-IOS-XR-perf-meas-oper",
        "10 seconds",
        "delay_min/max/avg per link, jitter, loss"),
    TelemetrySource("Traffic Flow (sFlow)",
        "sFlow v5 / NetFlow v9",
        "N/A (packet sampling)",
        "1:1000 sampling",
        "src/dst IP, port, protocol, bytes, application"),
]

# Print a short summary of every telemetry source.
print("=== Telemetry Sources ===")
for s in sources:
    print(f"  [{s.source}] Protocol: {s.protocol}")
    print(f"    YANG: {s.yang_model}")
    print(f"    Interval: {s.interval}")
    print(f"    Data: {s.data_points}")
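The `subscription` stanza in the IOS-XR config above expresses `sample-interval` in milliseconds, while the table and the source list use seconds. The following is a minimal sketch of rendering that stanza from intervals given in seconds. The function name `render_subscription` is hypothetical and the output only mirrors the lines shown in this article; it is not a complete or validated device configuration.

```python
def render_subscription(name, groups, destination):
    """Render an IOS-XR style 'subscription' stanza.

    `groups` maps a sensor-group name to its sample interval in seconds.
    The layout mirrors the config shown in this article (assumption: no
    other subscription options are needed).
    """
    lines = [f" subscription {name}"]
    for group, seconds in groups.items():
        if seconds <= 0:
            raise ValueError(f"interval for {group} must be positive")
        # sample-interval is written in milliseconds in the config above.
        millis = int(seconds * 1000)
        lines.append(f"  sensor-group-id {group} sample-interval {millis}")
    lines.append(f"  destination-id {destination}")
    return "\n".join(lines)


stanza = render_subscription(
    "MONITORING",
    {"INTERFACE": 10, "SR-POLICY": 30, "IGP": 60},
    "PROMETHEUS",
)
print(stanza)
```

Keeping the intervals in one dictionary makes it harder for the device config and the documented collection intervals to drift apart.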
Prometheus + Grafana
# === Monitoring Stack ===
# Telegraf config for gNMI input
# [[inputs.gnmi]]
#   addresses = ["router1:57400", "router2:57400"]
#   username = "admin"
#   password = "secret"
#   [[inputs.gnmi.subscription]]
#     name = "interface"
#     origin = "Cisco-IOS-XR-infra-statsd-oper"
#     path = "/infra-statistics/interfaces/interface/latest/generic-counters"
#     subscription_mode = "sample"
#     sample_interval = "10s"
#   [[inputs.gnmi.subscription]]
#     name = "sr_policy"
#     origin = "Cisco-IOS-XR-infra-xtc-agent-oper"
#     path = "/xtc/policy-forwardings"
#     subscription_mode = "sample"
#     sample_interval = "30s"
#
# [[outputs.prometheus_client]]
#   listen = ":9273"
#   metric_version = 2
#
# Prometheus scrape config
# scrape_configs:
#   - job_name: 'telegraf'
#     static_configs:
#       - targets: ['telegraf:9273']
#   - job_name: 'snmp'
#     static_configs:
#       - targets: ['snmp-exporter:9116']
#
# Prometheus alerting rule (example)
# - alert: HighLinkUtilization
#   expr: rate(interface_bytes_total[5m]) * 8 / interface_speed > 0.8
#   for: 5m
#   labels:
#     severity: warning
#   annotations:
#     summary: "Link {{ $labels.interface }} utilization > 80%"
from dataclasses import dataclass

@dataclass
class GrafanaPanel:
    """One dashboard panel: its PromQL query, visualization, and alert rule."""
    panel: str
    query: str
    visualization: str
    alert: str

panels = [
    GrafanaPanel("Link Utilization (%)",
        "rate(interface_bytes_total[5m]) * 8 / interface_speed * 100",
        "Time Series + Threshold lines at 70% 80%",
        "> 80% for 5m → Warning, > 90% → Critical"),
    GrafanaPanel("SR Policy Status",
        "sr_policy_state{state='active'} == 1",
        "Stat Panel: Green=Active, Red=Down",
        "state != active → Critical Alert"),
    GrafanaPanel("Per-Link Latency",
        "perf_meas_delay_avg{direction='forward'}",
        "Heatmap: Links × Time → Color = Delay",
        "> 10ms → Warning, > 50ms → Critical"),
    GrafanaPanel("Top 10 Utilized Interfaces",
        "topk(10, rate(interface_bytes_total[5m]) * 8)",
        "Bar Chart: sorted by utilization",
        "N/A (informational)"),
    GrafanaPanel("IGP Topology Changes",
        "increase(isis_adjacency_changes_total[1h])",
        "Time Series: spikes = topology flaps",
        "> 5 changes/hr → Warning"),
]

# Print a short summary of every panel.
print("=== Grafana Panels ===")
for p in panels:
    print(f"  [{p.panel}]")
    print(f"    Query: {p.query}")
    print(f"    Viz: {p.visualization}")
    print(f"    Alert: {p.alert}")
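The link-utilization query above, `rate(interface_bytes_total[5m]) * 8 / interface_speed`, reduces to simple arithmetic on two counter samples. The sketch below works through that arithmetic with made-up numbers. The function name and the sample values are assumptions for illustration, and unlike Prometheus `rate()` it does not handle counter resets or extrapolation.

```python
def link_utilization(bytes_t0, bytes_t1, seconds, speed_bps):
    """Fraction of link capacity used between two byte-counter samples.

    Simplified stand-in for rate(counter[window]) * 8 / speed:
    it assumes the counter did not reset or wrap inside the window.
    """
    if seconds <= 0 or speed_bps <= 0:
        raise ValueError("seconds and speed_bps must be positive")
    if bytes_t1 < bytes_t0:
        raise ValueError("counter went backwards (reset or wrap)")
    byte_rate = (bytes_t1 - bytes_t0) / seconds  # bytes per second
    return byte_rate * 8 / speed_bps             # bits per second / capacity


# Hypothetical 5-minute window on a 1 Gbit/s link.
util = link_utilization(
    bytes_t0=0,
    bytes_t1=31_875_000_000,
    seconds=300,
    speed_bps=1_000_000_000,
)
print(f"utilization: {util:.0%}")
```

With these numbers the link carried 106.25 MB/s, which is 850 Mbit/s or 85% of capacity, so the 80% warning condition in the alert rule would be true for this window.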
Automation
# === Network Automation for SR ===
# Python script to collect SR policy info via SSH
# from netmiko import ConnectHandler
#
# device = {
#     'device_type': 'cisco_xr',
#     'host': '10.0.0.1',
#     'username': 'admin',
#     'password': 'secret',
# }
#
# conn = ConnectHandler(**device)
# output = conn.send_command('show segment-routing traffic-eng policy color 100')
# sr_policies = conn.send_command('show segment-routing traffic-eng policy all')
# perf_data = conn.send_command('show performance-measurement interfaces')
# conn.disconnect()
from dataclasses import dataclass

@dataclass
class AutomationTask:
    """One automation job: the tool, what triggers it, and what it does."""
    task: str
    tool: str
    trigger: str
    action: str

tasks = [
    AutomationTask("Link Down Response",
        "Ansible + Event-driven",
        "IGP Adjacency Down event",
        "Verify that the SR Policy rerouted successfully; notify the NOC if it did not"),
    AutomationTask("Capacity Planning",
        "Python + Prometheus API",
        "Weekly scheduled job",
        "Pull utilization data, analyze the trend, and alert when nearing capacity"),
    AutomationTask("SR Policy Validation",
        "Python + gNMI",
        "After a config change",
        "Check that the SR Policy is active and correct and that latency is unchanged"),
    AutomationTask("Topology Backup",
        "Python + Netconf",
        "Daily at 02:00",
        "Back up the configuration of every router and store it in Git"),
    AutomationTask("SLA Report",
        "Python + Grafana API",
        "Monthly, on the 1st",
        "Generate a PDF report showing uptime, latency, and utilization"),
]

# Print a short summary of every automation task.
print("=== Automation Tasks ===")
for t in tasks:
    print(f"  [{t.task}] Tool: {t.tool}")
    print(f"    Trigger: {t.trigger}")
    print(f"    Action: {t.action}")
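The Capacity Planning task above analyzes the utilization trend and warns before a link fills up. One simple way to do that, shown here only as a sketch, is a least-squares line through weekly peak utilization, projected forward to a threshold. The function names and the sample history are hypothetical; in practice the input would come from the Prometheus API, and a straight line may be too crude for seasonal traffic.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def weeks_until(threshold, weekly_util):
    """Weeks from the latest sample until the trend line reaches threshold.

    Returns None when utilization is flat or falling.
    """
    xs = list(range(len(weekly_util)))
    slope, intercept = linear_fit(xs, weekly_util)
    if slope <= 0:
        return None
    crossing_week = (threshold - intercept) / slope
    return max(0.0, crossing_week - xs[-1])


# Hypothetical weekly peak utilization (%) for one link.
history = [50.0, 52.0, 54.0, 56.0, 58.0]
weeks = weeks_until(80.0, history)
print(f"projected weeks until 80%: {weeks}")
```

For this made-up history the trend grows by 2 points per week, so the 80% threshold is projected 11 weeks after the latest sample.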
Tips
- Telemetry: use streaming telemetry instead of SNMP; the data is more granular.
- gNMI: use gNMI on newer routers; it is more efficient.
- Kafka: use Kafka as a buffer when there are thousands of routers.
- Alert: set alerts for link utilization > 80% and for latency above a threshold.
- Automation: use Ansible and Python to validate after config changes.
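An alert such as "utilization > 80% for 5m" should fire only when the breach is sustained, not on a single spike. The sketch below approximates that idea over a list of recent samples. The function name and the sample values are assumptions for illustration, and this is not how Prometheus implements its `for` clause internally.

```python
def sustained_breach(samples, threshold, min_consecutive):
    """True if the most recent `min_consecutive` samples all exceed threshold."""
    if min_consecutive <= 0:
        raise ValueError("min_consecutive must be positive")
    if len(samples) < min_consecutive:
        return False  # not enough history to decide yet
    return all(value > threshold for value in samples[-min_consecutive:])


# Hypothetical utilization samples, one per minute, newest last.
spiky = [0.85, 0.90, 0.70, 0.82, 0.81]
steady = [0.70, 0.85, 0.90, 0.82, 0.81, 0.88]

print(sustained_breach(spiky, 0.8, 5))   # one dip below 80% inside the window
print(sustained_breach(steady, 0.8, 5))  # last five samples are all above 80%
```

The same check applies to latency by swapping in delay samples and the latency threshold.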
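What is Segment Routing?
Segment Routing (SR) is a source-routing architecture: the ingress node encodes the path as an ordered list of segments, each identified by a Segment ID (SID). In SR-MPLS a SID is an MPLS label; in SRv6 it is an IPv6 address. SIDs are advertised by the IGP (OSPF or IS-IS). SR supports traffic engineering and fast reroute, optionally with an SR-PCE controller computing paths, and is deployed in ISP, data center, and WAN networks.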
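To make the idea of a segment list concrete, here is a small sketch of an SR-MPLS policy as a data structure. In SR-MPLS a prefix-SID label is the SRGB base plus the SID index. The SRGB base of 16000 is an assumption (a common default, but configurable), and the policy name, endpoint, and indexes are made up.

```python
from dataclasses import dataclass

SRGB_BASE = 16000  # assumption: a common default SRGB start, configurable per network


@dataclass(frozen=True)
class SRPolicy:
    """Minimal model of an SR-MPLS policy (hypothetical, for illustration)."""
    name: str
    color: int
    endpoint: str
    sid_indexes: tuple  # ordered prefix-SID indexes along the path

    def label_stack(self, srgb_base=SRGB_BASE):
        """Translate SID indexes to MPLS labels: label = SRGB base + index."""
        return [srgb_base + index for index in self.sid_indexes]


policy = SRPolicy(
    name="LOW-LATENCY",
    color=100,
    endpoint="10.0.0.5",
    sid_indexes=(3, 7, 5),
)
stack = policy.label_stack()
print(stack)
```

The ingress router pushes this stack onto the packet, and each label steers it to the next segment, so transit routers hold no per-policy state.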
Which metrics should be collected?
Interface bandwidth and errors, SR Policy active path, per-link latency (measured with TWAMP), IGP adjacency and cost, TE tunnel utilization, PCE computation time, and traffic flows.
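How are metrics collected?
SNMP polling typically runs every 5 minutes, while streaming telemetry over gNMI/gRPC delivers samples every 10-30 seconds. Other sources are NETCONF/YANG, sFlow/NetFlow, and Python scripts. The data lands in Prometheus or InfluxDB, with Kafka in between where buffering is needed, and is visualized in Grafana.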
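The gap between 5-minute SNMP polling and 10-second streaming telemetry is easy to quantify. This short sketch counts samples per day for each interval; the helper name is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def samples_per_day(interval_seconds):
    """Number of samples one metric produces per day at a fixed interval."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return SECONDS_PER_DAY // interval_seconds


snmp = samples_per_day(300)      # SNMP polling every 5 minutes
telemetry = samples_per_day(10)  # streaming telemetry every 10 seconds
print(f"SNMP: {snmp}/day, telemetry: {telemetry}/day, ratio: {telemetry // snmp}x")
```

A 10-second interval yields 30 times as many samples per metric, which is why a burst lasting under a minute can be visible in telemetry but averaged away in a 5-minute poll. It also means storage and ingest must be sized accordingly.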
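How is the dashboard built?
Store the time series in Prometheus and build Grafana dashboards on top: link utilization, SR Policy status, and a latency heatmap, with an alert at 80% utilization. InfluxDB, Kafka, and Elasticsearch (for syslog) can sit alongside, and a topology view completes the picture.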
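Grafana dashboards are stored as JSON, so panels like the ones described here can be generated from code. The sketch below builds a simplified time-series panel with warning and critical thresholds. It covers only a small subset of the dashboard JSON model, the helper name is hypothetical, and the field names should be checked against the Grafana version in use.

```python
import json


def timeseries_panel(title, expr, warn, crit):
    """Build a simplified Grafana time-series panel as a dict.

    Assumption: only the fields shown here are needed; a real dashboard
    also carries datasource, grid position, and other settings.
    """
    return {
        "type": "timeseries",
        "title": title,
        "targets": [{"refId": "A", "expr": expr}],
        "fieldConfig": {
            "defaults": {
                "unit": "percent",
                "thresholds": {
                    "mode": "absolute",
                    "steps": [
                        {"color": "green", "value": None},  # base step
                        {"color": "orange", "value": warn},
                        {"color": "red", "value": crit},
                    ],
                },
            }
        },
    }


panel = timeseries_panel(
    "Link Utilization (%)",
    "rate(interface_bytes_total[5m]) * 8 / interface_speed * 100",
    warn=80,
    crit=90,
)
print(json.dumps(panel, indent=2))
```

Generating panels this way keeps thresholds consistent between the dashboard and the alert rules, since both can read the same values.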
Summary
Monitoring Segment Routing (SR-MPLS and SRv6) in production comes down to streaming telemetry over gNMI into Prometheus and Grafana, covering interfaces, latency, SR Policy, and the IGP, with automation on top.
