SigNoz Observability Edge Deployment — Monitor
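This guide covers deploying SigNoz observability at the edge: OpenTelemetry Collectors on edge or IoT nodes send metrics, traces, and logs to a central SigNoz and ClickHouse backend, which provides dashboards and alerting for production.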
| Component | Location | Purpose | Resources |
|---|---|---|---|
| OTel Collector | Edge node | Collect, buffer, and export telemetry | 1 vCPU, 100 MB RAM |
| Application | Edge node | Business logic + OTel SDK | Depends on the app |
| SigNoz Server | Central cloud | Query, dashboards, alerts | 4 vCPU, 8 GB RAM |
| ClickHouse | Central cloud | Long-term storage | 8 vCPU, 32 GB RAM, SSD |
| Kafka (optional) | Central cloud | Buffer between Collector and SigNoz | 3 brokers, 4 GB each |
Edge OTel Collector Config
# === OpenTelemetry Collector for Edge ===
# otel-collector-edge.yaml
# extensions:
#   # Disk-backed storage used by the exporter's sending queue.
#   file_storage:
#     directory: /var/otel/buffer
#     timeout: 10s
#
# receivers:
#   otlp:
#     protocols:
#       grpc: { endpoint: "0.0.0.0:4317" }
#       http: { endpoint: "0.0.0.0:4318" }
#
# processors:
#   batch:
#     send_batch_size: 1000
#     timeout: 30s
#   memory_limiter:
#     check_interval: 5s
#     limit_mib: 100
#   filter/edge:
#     traces:
#       span:
#         # Spans matching this condition are dropped.
#         - 'attributes["http.status_code"] == 200 and kind == SPAN_KIND_CLIENT'
#   resource:
#     attributes:
#       - key: edge.location
#         value: "factory-bangkok-01"
#         action: upsert
#
# exporters:
#   otlp/signoz:
#     endpoint: "signoz-central.example.com:4317"
#     tls: { insecure: false }
#     retry_on_failure:
#       enabled: true
#       initial_interval: 5s
#       max_interval: 300s
#     sending_queue:
#       enabled: true
#       num_consumers: 2
#       queue_size: 5000
#       storage: file_storage   # refers to the extension defined above
#
# service:
#   extensions: [file_storage]
#   pipelines:
#     traces:
#       receivers: [otlp]
#       processors: [memory_limiter, filter/edge, resource, batch]
#       exporters: [otlp/signoz]
from dataclasses import dataclass


@dataclass
class CollectorConfig:
    component: str
    config_key: str
    value: str    # what the setting does
    purpose: str  # why it matters on an edge node


configs = [
    CollectorConfig("Batch Processor",
        "batch.send_batch_size: 1000, timeout: 30s",
        "Groups telemetry into batches to reduce network calls",
        "Cuts bandwidth 60-80% compared with real-time export"),
    CollectorConfig("Memory Limiter",
        "memory_limiter.limit_mib: 100",
        "Caps the memory the Collector may use",
        "Prevents OOM on resource-constrained edge nodes"),
    CollectorConfig("Filter Processor",
        "filter: drop healthy spans",
        "Drops unneeded spans (200 OK client spans)",
        "Cuts volume 30-50%, keeping only error/slow spans"),
    CollectorConfig("Resource Attribute",
        "resource.attributes: edge.location",
        "Adds an edge location label to all telemetry",
        "Lets you filter queries by location in SigNoz"),
    CollectorConfig("Retry + Queue",
        "retry_on_failure + sending_queue + file_storage",
        "Buffers to disk while offline, retries once back online",
        "No telemetry is lost when the network drops"),
]

# Print a summary of each Collector setting.
print("=== Edge Collector Config ===")
for c in configs:
    print(f"  [{c.component}]")
    print(f"    Config: {c.config_key}")
    print(f"    Value: {c.value}")
    print(f"    Purpose: {c.purpose}")
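The `sending_queue` and `batch` settings in the Collector config above determine roughly how long an edge node can stay offline before the buffer fills. The sketch below is a back-of-the-envelope estimate, not Collector behaviour taken from the source: it assumes `queue_size` counts batches (requests), that the batch processor flushes at `send_batch_size` spans or at `timeout`, whichever comes first, and that the span rates are hypothetical. The function name `outage_coverage_hours` is made up for illustration.

```python
def outage_coverage_hours(spans_per_sec: float,
                          queue_size: int = 5000,
                          send_batch_size: int = 1000,
                          batch_timeout_s: float = 30.0) -> float:
    """Rough hours of outage the sending queue can absorb before it fills.

    Assumption: queue_size counts batches, and a batch is emitted when it
    reaches send_batch_size spans or when batch_timeout_s elapses.
    """
    if spans_per_sec <= 0:
        return float("inf")  # nothing is produced, so the queue never fills
    # Batches per second: size-triggered at high rates, timeout-triggered at low rates.
    batches_per_sec = max(spans_per_sec / send_batch_size, 1.0 / batch_timeout_s)
    return queue_size / batches_per_sec / 3600.0


# Hypothetical span rates for a quiet, typical, and busy edge node.
for rate in (10, 50, 500):
    print(f"{rate:>4} spans/s -> ~{outage_coverage_hours(rate):.1f} h of buffer")
```

The estimate ignores disk space: the directory used by `file_storage` must also be large enough to hold the queued batches, which depends on span size.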
SigNoz Dashboard
# === SigNoz Dashboard for Edge Monitoring ===
from dataclasses import dataclass


@dataclass
class DashPanel:
    panel: str
    query: str  # shorthand description of the query, not exact SigNoz syntax
    viz: str
    alert: str


panels = [
    DashPanel("Edge Node Status Map",
        "count by (edge.location) where last_seen > now()-5m",
        "Map/table showing online/offline status per location",
        "Offline > 5m → P1 Alert"),
    DashPanel("Request Latency per Edge",
        "P99(duration) group by edge.location",
        "Heatmap of latency per location per hour",
        "P99 > 2s → P2 Warning"),
    DashPanel("Error Rate per Edge",
        "count(status=ERROR) / count(*) group by edge.location",
        "Bar chart of % errors per location",
        "> 5% → P2 Warning; > 10% → P1"),
    DashPanel("Throughput per Edge",
        "rate(span_count) group by edge.location",
        "Time series of req/min per location",
        "< 1 req/min → Check Edge Health"),
    DashPanel("Collector Buffer Usage",
        "otelcol_exporter_queue_size / otelcol_exporter_queue_capacity",
        "Gauge of % buffer full per edge",
        "> 80% → P2 Network Issue"),
    DashPanel("Resource Usage per Edge",
        "system.cpu.utilization, system.memory.usage",
        "Multi-line chart of CPU and RAM per edge node",
        "CPU > 90% or RAM > 85% → P2"),
]

# Print a summary of each dashboard panel and its alert rule.
print("=== SigNoz Dashboard Panels ===")
for p in panels:
    print(f"  [{p.panel}]")
    print(f"    Query: {p.query}")
    print(f"    Viz: {p.viz}")
    print(f"    Alert: {p.alert}")
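To make the alert thresholds in the panels above concrete, here is a minimal sketch that applies them to one snapshot of an edge location. It is plain Python for illustration, not SigNoz alert-rule syntax; the `EdgeSample` type, its field names, and the `evaluate` function are hypothetical, and only the thresholds (5 minutes offline, P99 over 2s, error rate over 5% and 10%, buffer over 80%) come from the panel list.

```python
from dataclasses import dataclass


@dataclass
class EdgeSample:
    """One hypothetical snapshot of an edge location's health metrics."""
    location: str
    seconds_since_last_seen: float
    p99_latency_s: float
    error_rate: float   # fraction of spans with status=ERROR, 0.0-1.0
    queue_fill: float   # exporter queue size / capacity, 0.0-1.0


def evaluate(sample: EdgeSample) -> list[str]:
    """Return alert strings using the thresholds from the panel list."""
    if sample.seconds_since_last_seen > 5 * 60:
        # An offline node reports nothing else worth evaluating.
        return [f"P1 {sample.location}: offline > 5m"]

    alerts = []
    if sample.p99_latency_s > 2.0:
        alerts.append(f"P2 {sample.location}: P99 latency > 2s")
    if sample.error_rate > 0.10:
        alerts.append(f"P1 {sample.location}: error rate > 10%")
    elif sample.error_rate > 0.05:
        alerts.append(f"P2 {sample.location}: error rate > 5%")
    if sample.queue_fill > 0.80:
        alerts.append(f"P2 {sample.location}: collector buffer > 80%")
    return alerts


sample = EdgeSample("factory-bangkok-01", seconds_since_last_seen=12,
                    p99_latency_s=2.4, error_rate=0.07, queue_fill=0.91)
for alert in evaluate(sample):
    print(alert)
```

Checking the offline condition first avoids firing latency or error-rate alerts on stale data from a node that has stopped reporting.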
Scaling & Production
# === Production Scaling ===
from dataclasses import dataclass


@dataclass
class ScaleTier:
    tier: str
    edge_nodes: str
    central_sizing: str
    storage: str
    features: str


tiers = [
    ScaleTier("Small",
        "1-10 Edge Nodes",
        "SigNoz: 2 vCPU 4GB | ClickHouse: 4 vCPU 16GB",
        "100GB SSD (30 days retention)",
        "Basic dashboards, email alerts"),
    ScaleTier("Medium",
        "10-50 Edge Nodes",
        "SigNoz: 4 vCPU 8GB | ClickHouse: 8 vCPU 32GB",
        "500GB SSD (60 days retention)",
        "Multi-location dashboards, sampling, PagerDuty"),
    ScaleTier("Large",
        "50-200 Edge Nodes",
        "SigNoz: 8 vCPU 16GB HA | ClickHouse Cluster 3 nodes",
        "2TB SSD (90 days retention)",
        "Kafka buffer, tail sampling, multi-team RBAC"),
    ScaleTier("Enterprise",
        "200+ Edge Nodes",
        "SigNoz HA + LB | ClickHouse Sharded Cluster",
        "10TB+ SSD (1 year retention)",
        "Multi-region, custom retention, compliance audit"),
]

# Print a summary of each scaling tier.
print("=== Scaling Tiers ===")
for t in tiers:
    print(f"  [{t.tier}] Edge Nodes: {t.edge_nodes}")
    print(f"    Central: {t.central_sizing}")
    print(f"    Storage: {t.storage}")
    print(f"    Features: {t.features}")
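The tiers above can also be read as a lookup plus a storage budget. The sketch below picks a tier from the edge node count and derives the average daily ingest each tier's disk can hold over its retention window. The tier ranges overlap at their boundaries in the list (10, 50, and 200 nodes), so assigning a boundary value to the smaller tier is an assumption, as is treating 10TB+ as 10,000 GB and 1 year as 365 days. The names `TIERS`, `pick_tier`, and `daily_budget_gb` are hypothetical.

```python
# (name, max edge nodes or None for unbounded, storage in GB, retention in days)
TIERS = [
    ("Small", 10, 100, 30),
    ("Medium", 50, 500, 60),
    ("Large", 200, 2000, 90),
    ("Enterprise", None, 10000, 365),  # "10TB+" taken at its lower bound
]


def pick_tier(edge_nodes: int) -> str:
    """Return the smallest tier whose node limit covers edge_nodes."""
    if edge_nodes < 1:
        raise ValueError("edge_nodes must be at least 1")
    for name, max_nodes, _storage, _days in TIERS:
        if max_nodes is None or edge_nodes <= max_nodes:
            return name
    raise AssertionError("unreachable: the last tier is unbounded")


def daily_budget_gb(storage_gb: float, retention_days: int) -> float:
    """Average GB/day of stored data that fits the disk over the retention window."""
    return storage_gb / retention_days


for name, _max_nodes, storage_gb, days in TIERS:
    print(f"{name:<10} ~{daily_budget_gb(storage_gb, days):.1f} GB/day stored")
print(pick_tier(35))
```

The daily figure is the stored (compressed) volume across all edge nodes combined, so it is a ceiling to compare against measured ingest rather than a sizing rule.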
Tips
- Buffer: use the file storage buffer so telemetry is not lost while the node is offline
- Filter: drop healthy spans to cut bandwidth by 30-50%
- Batch: set a batch size of 1000 with a 30s timeout to reduce network calls
- Location: add an edge.location label to all telemetry so it can be filtered
- Sampling: use tail sampling to keep only error/slow traces
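The sampling tip relies on deciding per trace, after all of its spans have arrived, whether the trace is worth keeping. The sketch below illustrates that decision in plain Python; in a real deployment the Collector's tail sampling processor would do this, and nothing here is its configuration or API. The `Span` type, the function names, and the 2-second slow threshold (borrowed from the latency alert) are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Minimal hypothetical span: just what the sampling decision needs."""
    trace_id: str
    duration_s: float
    is_error: bool


def keep_trace(spans: list[Span], slow_threshold_s: float = 2.0) -> bool:
    """Keep a trace if any of its spans errored or ran longer than the threshold."""
    return any(s.is_error or s.duration_s > slow_threshold_s for s in spans)


def tail_sample(spans: list[Span]) -> dict[str, list[Span]]:
    """Group spans by trace, then keep only the traces that pass keep_trace."""
    by_trace: dict[str, list[Span]] = {}
    for span in spans:
        by_trace.setdefault(span.trace_id, []).append(span)
    return {tid: group for tid, group in by_trace.items() if keep_trace(group)}


spans = [
    Span("t1", 0.12, False), Span("t1", 0.30, False),   # healthy: dropped
    Span("t2", 0.20, False), Span("t2", 0.05, True),    # has an error: kept
    Span("t3", 3.50, False),                            # slow: kept
]
print(sorted(tail_sample(spans)))
```

Because the decision needs the whole trace, spans have to be held in memory until the trace is considered complete, which is worth weighing against the 100 MB memory limit on an edge Collector.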
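What is SigNoz?
SigNoz is a free, open-source observability platform for metrics, traces, and logs. It is built on OpenTelemetry, stores data in ClickHouse, and is written in Go with a React frontend. It provides dashboards and alerts, and it can be self-hosted as an alternative to Datadog.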