SigNoz Edge Observability
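This guide covers running SigNoz observability for edge deployments: OpenTelemetry Collectors on edge nodes (for example IoT sites) forward metrics, traces, and logs to a central SigNoz and ClickHouse backend, with alerting and production sizing.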
| Component | Location | Purpose | Resources |
|---|---|---|---|
| OTel Collector | Edge node | Collect, buffer, and export telemetry | 100 MB RAM, 1 vCPU |
| Application | Edge node | Business logic + OTel SDK | Depends on the app |
| SigNoz Server | Central cloud | Query, dashboards, alerts | 4 vCPU, 8 GB RAM |
| ClickHouse | Central cloud | Long-term storage | 8 vCPU, 32 GB RAM, SSD |
| Kafka (optional) | Central cloud | Buffer between collectors and SigNoz | 3 brokers, 4 GB each |
Edge OTel Collector Config
# === OpenTelemetry Collector for Edge ===
# otel-collector-edge.yaml
# extensions:
#   file_storage:               # disk-backed storage used by the sending queue
#     directory: /var/otel/buffer
#     timeout: 10s
#
# receivers:
#   otlp:
#     protocols:
#       grpc: { endpoint: "0.0.0.0:4317" }
#       http: { endpoint: "0.0.0.0:4318" }
#
# processors:
#   batch:
#     send_batch_size: 1000
#     timeout: 30s
#   memory_limiter:
#     check_interval: 5s
#     limit_mib: 100
#   filter/edge:
#     traces:
#       span:
#         - 'attributes["http.status_code"] == 200 and kind == SPAN_KIND_CLIENT'
#   resource:
#     attributes:
#       - key: edge.location
#         value: "factory-bangkok-01"
#         action: upsert
#
# exporters:
#   otlp/signoz:
#     endpoint: "signoz-central.example.com:4317"
#     tls: { insecure: false }
#     retry_on_failure:
#       enabled: true
#       initial_interval: 5s
#       max_interval: 300s
#     sending_queue:
#       enabled: true
#       num_consumers: 2
#       queue_size: 5000
#       storage: file_storage   # refers to the extension declared above
#
# service:
#   extensions: [file_storage]
#   pipelines:
#     traces:
#       receivers: [otlp]
#       processors: [memory_limiter, filter/edge, resource, batch]
#       exporters: [otlp/signoz]
from dataclasses import dataclass


@dataclass
class CollectorConfig:
    component: str   # collector component being configured
    config_key: str  # the relevant setting(s)
    value: str       # what the setting does
    purpose: str     # why it matters on an edge node


configs = [
    CollectorConfig("Batch Processor",
                    "batch.send_batch_size: 1000, timeout: 30s",
                    "Groups telemetry into batches to reduce network calls",
                    "Cuts bandwidth by 60-80% compared with sending in real time"),
    CollectorConfig("Memory Limiter",
                    "memory_limiter.limit_mib: 100",
                    "Caps the memory the collector may use",
                    "Prevents OOM on resource-constrained edge nodes"),
    CollectorConfig("Filter Processor",
                    "filter: drop healthy spans",
                    "Drops unneeded spans (client spans with 200 OK)",
                    "Cuts volume by 30-50%, keeping only error/slow spans"),
    CollectorConfig("Resource Attribute",
                    "resource.attributes: edge.location",
                    "Adds an edge location label to all telemetry",
                    "Enables filtering queries by location in SigNoz"),
    CollectorConfig("Retry + Queue",
                    "retry_on_failure + sending_queue + file_storage",
                    "Buffers to disk while offline and retries once back online",
                    "No telemetry is lost when the network drops"),
]

print("=== Edge Collector Config ===")
for c in configs:
    print(f"  [{c.component}]")
    print(f"    Config: {c.config_key}")
    print(f"    Value: {c.value}")
    print(f"    Purpose: {c.purpose}")
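To get a feel for how long the file-backed queue can absorb an outage, the arithmetic can be sketched as below. This is a rough estimate, not the collector's own accounting: it assumes `queue_size` counts batches (the default behavior, as far as this sketch is concerned), that each batch holds at most `send_batch_size` spans, and that the disk has room for the whole queue. The function name and the 200 spans/s workload are hypothetical.

```python
def buffer_seconds(queue_size: int, batch_size: int,
                   batch_timeout_s: float, spans_per_second: float) -> float:
    """Estimate how many seconds of outage the sending queue can absorb."""
    if spans_per_second <= 0:
        raise ValueError("spans_per_second must be positive")
    # A batch is flushed when it reaches batch_size or when the timeout fires,
    # whichever comes first, so slow workloads produce smaller batches.
    spans_per_batch = min(batch_size, spans_per_second * batch_timeout_s)
    return queue_size * spans_per_batch / spans_per_second


# Settings from the config above; 200 spans/s is a hypothetical edge workload.
seconds = buffer_seconds(queue_size=5000, batch_size=1000,
                         batch_timeout_s=30, spans_per_second=200)
print(f"Queue covers roughly {seconds / 3600:.1f} hours offline")
```

At low traffic the timeout, not the batch size, decides how full each batch is, so a quiet node can ride out longer outages with the same `queue_size`.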
SigNoz Dashboard
# === SigNoz Dashboard for Edge Monitoring ===
from dataclasses import dataclass


@dataclass
class DashPanel:
    panel: str  # panel title
    query: str  # query sketch (pseudo-syntax, not a literal SigNoz query)
    viz: str    # visualization type
    alert: str  # alert threshold and severity


panels = [
    DashPanel("Edge Node Status Map",
              "count by (edge.location) where last_seen > now()-5m",
              "Map/table showing online/offline status per location",
              "Offline > 5m → P1 alert"),
    DashPanel("Request Latency per Edge",
              "P99(duration) group by edge.location",
              "Heatmap of latency per location per hour",
              "P99 > 2s → P2 warning"),
    DashPanel("Error Rate per Edge",
              "count(status=ERROR) / count(*) group by edge.location",
              "Bar chart of error % per location",
              "> 5% → P2 warning; > 10% → P1"),
    DashPanel("Throughput per Edge",
              "rate(span_count) group by edge.location",
              "Time series of req/min per location",
              "< 1 req/min → check edge health"),
    DashPanel("Collector Buffer Usage",
              "otelcol_exporter_queue_size / queue_capacity",
              "Gauge of buffer fill % per edge",
              "> 80% → P2 network issue"),
    DashPanel("Resource Usage per Edge",
              "system.cpu.utilization, system.memory.usage",
              "Multi-line chart of CPU and RAM per edge node",
              "CPU > 90% or RAM > 85% → P2"),
]

print("=== SigNoz Dashboard Panels ===")
for p in panels:
    print(f"  [{p.panel}]")
    print(f"    Query: {p.query}")
    print(f"    Viz: {p.viz}")
    print(f"    Alert: {p.alert}")
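The error-rate thresholds in the panel list (above 5% is P2, above 10% is P1) can be expressed as a small classification function. This is only an illustrative sketch of the rule, not how SigNoz evaluates alerts; the function name, the per-location counts, and the second location name are hypothetical.

```python
from typing import Optional


def error_rate_severity(error_spans: int, total_spans: int) -> Optional[str]:
    """Map an error ratio to a severity: > 10% is P1, > 5% is P2, else None."""
    if total_spans == 0:
        return None  # no traffic in the window, so there is no ratio to judge
    rate = error_spans / total_spans
    if rate > 0.10:
        return "P1"
    if rate > 0.05:
        return "P2"
    return None


# Hypothetical (errors, total) counts per location for one evaluation window.
counts = {"factory-bangkok-01": (12, 100), "factory-bangkok-02": (3, 100)}
for location, (errors, total) in counts.items():
    print(location, error_rate_severity(errors, total))
```

A window with no spans returns no severity here; silence is better caught by the offline and throughput panels.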
Scaling & Production
# === Production Scaling ===
from dataclasses import dataclass


@dataclass
class ScaleTier:
    tier: str
    edge_nodes: str
    central_sizing: str
    storage: str
    features: str


tiers = [
    ScaleTier("Small",
              "1-10 edge nodes",
              "SigNoz: 2 vCPU 4GB | ClickHouse: 4 vCPU 16GB",
              "100GB SSD (30 days retention)",
              "Basic dashboards, email alerts"),
    ScaleTier("Medium",
              "10-50 edge nodes",
              "SigNoz: 4 vCPU 8GB | ClickHouse: 8 vCPU 32GB",
              "500GB SSD (60 days retention)",
              "Multi-location dashboards, sampling, PagerDuty"),
    ScaleTier("Large",
              "50-200 edge nodes",
              "SigNoz: 8 vCPU 16GB HA | ClickHouse cluster, 3 nodes",
              "2TB SSD (90 days retention)",
              "Kafka buffer, tail sampling, multi-team RBAC"),
    ScaleTier("Enterprise",
              "200+ edge nodes",
              "SigNoz HA + LB | ClickHouse sharded cluster",
              "10TB+ SSD (1 year retention)",
              "Multi-region, custom retention, compliance audit"),
]

print("=== Scaling Tiers ===")
for t in tiers:
    print(f"  [{t.tier}] Edge nodes: {t.edge_nodes}")
    print(f"    Central: {t.central_sizing}")
    print(f"    Storage: {t.storage}")
    print(f"    Features: {t.features}")
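Picking a tier from a node count is a simple range lookup, sketched below. The tier ranges in the list overlap at 10, 50, and 200 nodes; this sketch assumes those boundary values fall into the smaller tier, which is a choice made here rather than something the tier list states. The names `pick_tier`, `TIER_UPPER_BOUNDS`, and `TIER_NAMES` are hypothetical.

```python
from bisect import bisect_left

# Upper bounds of the Small, Medium, and Large tiers. Boundary values
# (10, 50, 200) go to the smaller tier, which is an assumption of this sketch.
TIER_UPPER_BOUNDS = [10, 50, 200]
TIER_NAMES = ["Small", "Medium", "Large", "Enterprise"]


def pick_tier(edge_nodes: int) -> str:
    """Return the sizing tier name for a given number of edge nodes."""
    if edge_nodes < 1:
        raise ValueError("edge_nodes must be at least 1")
    # bisect_left finds the first bound that is >= edge_nodes.
    return TIER_NAMES[bisect_left(TIER_UPPER_BOUNDS, edge_nodes)]


for n in (5, 50, 120, 500):
    print(n, "->", pick_tier(n))
```

Node count is only a proxy: a few very chatty nodes can need a larger tier than many quiet ones, so ingest volume should confirm the choice.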
Tips
- Buffer: use a file-storage-backed queue so telemetry is not lost while the node is offline
- Filter: drop healthy spans to cut bandwidth by 30-50%
- Batch: set a batch size of 1000 with a 30s timeout to reduce network calls
- Location: add an edge.location label to all telemetry so it can be filtered by site
- Sampling: use tail sampling to keep only error and slow traces
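The tail-sampling tip comes down to one decision made after a whole trace has arrived: keep it if anything in it failed or was slow. The sketch below illustrates only that decision logic. In a real deployment it would normally be handled by the collector's `tail_sampling` processor from the contrib distribution, not by application code. The `Span` type, `keep_trace`, and the 2-second threshold are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    name: str
    duration_s: float
    is_error: bool


def keep_trace(spans: List[Span], slow_threshold_s: float = 2.0) -> bool:
    """Tail-sampling decision: keep a trace if any span errored or was slow."""
    return any(s.is_error or s.duration_s > slow_threshold_s for s in spans)


healthy = [Span("GET /status", 0.05, False), Span("db.query", 0.01, False)]
failed = [Span("POST /reading", 0.20, True)]
slow = [Span("GET /report", 3.50, False)]

for label, trace in (("healthy", healthy), ("failed", failed), ("slow", slow)):
    print(label, "keep" if keep_trace(trace) else "drop")
```

Because the decision needs every span of a trace, the spans have to be held until the trace completes, which costs memory on a constrained edge node.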
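What is SigNoz?
SigNoz is an open-source observability platform for metrics, traces, and logs. It is built on OpenTelemetry and ClickHouse, with a Go backend and a React frontend, and provides dashboards and alerting. It can be self-hosted for free as an alternative to Datadog.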
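What is edge deployment?
Edge deployment means running workloads close to users or devices, as in IoT, CDN, retail, and telecom. Edge nodes may go offline, have limited resources, and sit on unstable networks, so telemetry has to be buffered and delivered store-and-forward.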
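Store-and-forward can be illustrated with a small disk-backed spool: records that cannot be sent are appended to a file and replayed in order once the link returns. This is a simplified sketch of the idea and not how the OTel Collector implements its queue; the `StoreAndForward` class, the `send` callback, and the JSON-lines spool file are all hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, List


class StoreAndForward:
    """Spool records to disk while offline and replay them in order later."""

    def __init__(self, spool_file: Path, send: Callable[[dict], bool]):
        self.spool_file = spool_file
        self.send = send  # returns True when the central endpoint accepted it

    def submit(self, record: dict) -> None:
        # Replay older records first so ordering is preserved, then try this one.
        if not (self.flush() and self.send(record)):
            with self.spool_file.open("a", encoding="utf-8") as f:
                f.write(json.dumps(record) + "\n")

    def flush(self) -> bool:
        """Replay spooled records; return True if the spool is now empty."""
        if not self.spool_file.exists():
            return True
        lines = self.spool_file.read_text(encoding="utf-8").splitlines()
        pending = [json.loads(line) for line in lines]
        while pending and self.send(pending[0]):
            pending.pop(0)
        remaining = "".join(json.dumps(r) + "\n" for r in pending)
        self.spool_file.write_text(remaining, encoding="utf-8")
        return not pending


link: Dict[str, bool] = {"online": False}  # simulated network state
delivered: List[dict] = []


def fake_send(record: dict) -> bool:
    if link["online"]:
        delivered.append(record)
    return link["online"]


with tempfile.TemporaryDirectory() as tmp:
    buf = StoreAndForward(Path(tmp) / "spool.jsonl", fake_send)
    buf.submit({"span": "a"})  # offline: written to the spool file
    link["online"] = True
    buf.submit({"span": "b"})  # online: "a" is replayed first, then "b"
    print(delivered)
```

A production spool would also need a size cap and protection against partial writes, which this sketch leaves out.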
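How should the architecture be designed?
An OTel Collector on each edge node buffers, filters, and batches telemetry, then exports it to the central SigNoz and ClickHouse stack, optionally through Kafka. Retries, a file-storage-backed queue, sampling, and compression keep delivery reliable and bandwidth low.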
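How should alerting be set up?
Alert on edge nodes going offline, latency, error rate, resource usage (CPU, RAM, network), and collector buffer fill. Route notifications to Slack or PagerDuty by severity (P1, P2, P3), attach a runbook to each alert, and use per-location dashboards to investigate.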
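The edge-offline alert reduces to comparing each location's last-seen timestamp with a silence window. A minimal sketch of that check follows, using the 5-minute window from the status panel. It is not a SigNoz alert definition; the function name, the second location, and the fixed clock are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def offline_locations(last_seen: Dict[str, datetime], now: datetime,
                      max_silence: timedelta = timedelta(minutes=5)) -> List[str]:
    """Return locations whose latest telemetry is older than max_silence."""
    return sorted(loc for loc, seen in last_seen.items()
                  if now - seen > max_silence)


# Fixed clock so the example is reproducible.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
last_seen = {
    "factory-bangkok-01": now - timedelta(minutes=1),
    "factory-bangkok-02": now - timedelta(minutes=9),  # hypothetical second site
}
print(offline_locations(last_seen, now))
```

With a 30s batch timeout and on-disk buffering, a node can look silent while it is merely disconnected, so the window should stay comfortably above the batch timeout.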
Summary
SigNoz gives edge deployments one place for metrics, traces, and logs: OpenTelemetry Collectors buffer telemetry on each edge node, ClickHouse stores it centrally, and dashboards and alerts cover production operation.
