SigNoz Observability Edge Deployment — Monitor

Published 28 May 2026 (B.E. 2569)

SigNoz Edge Observability

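This article covers deploying SigNoz observability at the edge: OpenTelemetry Collectors on edge nodes, ClickHouse storage in a central cloud, and metrics, traces, logs, and alerting for IoT and production workloads.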

Component | Location | Purpose | Resources
--- | --- | --- | ---
OTel Collector | Edge node | Collects, buffers, and exports telemetry | 100 MB RAM, 1 vCPU
Application | Edge node | Business logic + OTel SDK | Depends on the app
SigNoz Server | Central cloud | Queries, dashboards, alerts | 4 vCPU, 8 GB RAM
ClickHouse | Central cloud | Long-term storage | 8 vCPU, 32 GB RAM, SSD
Kafka (optional) | Central cloud | Buffer between Collector and SigNoz | 3 brokers, 4 GB each

Edge OTel Collector Config

# === OpenTelemetry Collector for Edge ===

# otel-collector-edge.yaml
# receivers:
#   otlp:
#     protocols:
#       grpc: { endpoint: "0.0.0.0:4317" }
#       http: { endpoint: "0.0.0.0:4318" }
#
# processors:
#   batch:
#     send_batch_size: 1000
#     timeout: 30s
#   memory_limiter:
#     check_interval: 5s
#     limit_mib: 100
#   filter/edge:
#     traces:
#       span:
#         - 'attributes["http.status_code"] == 200 and kind == SPAN_KIND_CLIENT'
#   resource:
#     attributes:
#       - key: edge.location
#         value: "factory-bangkok-01"
#         action: upsert
#
# extensions:
#   file_storage:              # file_storage is an extension, not an exporter
#     directory: /var/otel/buffer
#     timeout: 10s
#
# exporters:
#   otlp/signoz:
#     endpoint: "signoz-central.example.com:4317"
#     tls: { insecure: false }
#     retry_on_failure:
#       enabled: true
#       initial_interval: 5s
#       max_interval: 300s
#       max_elapsed_time: 0    # keep retrying through long outages (the default gives up after 5m)
#     sending_queue:
#       enabled: true
#       num_consumers: 2
#       queue_size: 5000
#       storage: file_storage  # persist the queue through the file_storage extension
#
# service:
#   extensions: [file_storage]
#   pipelines:
#     traces:
#       receivers: [otlp]
#       processors: [memory_limiter, filter/edge, resource, batch]
#       exporters: [otlp/signoz]

from dataclasses import dataclass

@dataclass
class CollectorConfig:
    component: str   # Collector feature being configured
    config_key: str  # relevant keys in otel-collector-edge.yaml
    purpose: str     # what the setting does
    benefit: str     # why it matters on an edge node

configs = [
    CollectorConfig("Batch Processor",
        "batch.send_batch_size: 1000, timeout: 30s",
        "Groups telemetry into batches to reduce network calls",
        "Cuts bandwidth by 60-80% compared with real-time export"),
    CollectorConfig("Memory Limiter",
        "memory_limiter.limit_mib: 100",
        "Caps the memory the Collector may use",
        "Prevents OOM on resource-constrained edge nodes"),
    CollectorConfig("Filter Processor",
        "filter/edge: drop 200 OK client spans",
        "Drops spans that are not needed (200 OK client spans)",
        "Reduces volume by 30-50% while keeping error spans"),
    CollectorConfig("Resource Attribute",
        "resource.attributes: edge.location",
        "Adds an edge.location label to all telemetry",
        "Enables filtering queries by location in SigNoz"),
    CollectorConfig("Retry + Queue",
        "retry_on_failure + sending_queue + file_storage",
        "Buffers to disk while offline and retries once back online",
        "Avoids losing telemetry during network outages (up to the queue capacity)"),
]

print("=== Edge Collector Config ===")
for c in configs:
    print(f"  [{c.component}]")
    print(f"    Config: {c.config_key}")
    print(f"    Purpose: {c.purpose}")
    print(f"    Benefit: {c.benefit}")
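
To get a feel for how the retry settings in the Collector config behave during an outage, here is a minimal sketch that computes the wait before each retry from initial_interval: 5s and max_interval: 300s. The multiplier of 1.5 is an assumption for illustration, the sketch ignores the random jitter a real Collector applies, and the function name backoff_schedule is hypothetical.

```python
def backoff_schedule(initial_s=5.0, max_s=300.0, multiplier=1.5, attempts=15):
    """Return the wait in seconds before each retry, capped at max_s.

    multiplier=1.5 is an assumed value for illustration; jitter is ignored.
    """
    delays = []
    delay = initial_s
    for _ in range(attempts):
        delays.append(min(delay, max_s))  # never wait longer than max_interval
        delay *= multiplier               # grow the wait exponentially
    return delays


schedule = backoff_schedule()
for attempt, wait in enumerate(schedule, start=1):
    print(f"retry {attempt:2d}: wait {wait:6.1f}s")
print(f"total wait over {len(schedule)} retries: {sum(schedule) / 60:.1f} min")
```

Under these assumptions the wait starts at 5 seconds and levels off at the 300-second cap, so a long outage costs only a handful of connection attempts per hour while the queue holds the data.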

SigNoz Dashboard

# === SigNoz Dashboard for Edge Monitoring ===

from dataclasses import dataclass

@dataclass
class DashPanel:
    panel: str  # panel title
    query: str  # pseudo-query describing intent, not exact SigNoz query syntax
    viz: str    # visualization type
    alert: str  # alert threshold and severity

panels = [
    DashPanel("Edge Node Status Map",
        "count by (edge.location) where last_seen > now()-5m",
        "Map/table showing online/offline status per location",
        "Offline > 5m → P1 Alert"),
    DashPanel("Request Latency per Edge",
        "P99(duration) group by edge.location",
        "Heatmap of latency per location per hour",
        "P99 > 2s → P2 Warning"),
    DashPanel("Error Rate per Edge",
        "count(status=ERROR) / count(*) group by edge.location",
        "Bar chart of % errors per location",
        "> 5% → P2 Warning, > 10% → P1"),
    DashPanel("Throughput per Edge",
        "rate(span_count) group by edge.location",
        "Time series of req/min per location",
        "< 1 req/min → Check Edge Health"),
    DashPanel("Collector Buffer Usage",
        "otelcol_exporter_queue_size / otelcol_exporter_queue_capacity",
        "Gauge of % buffer full per edge",
        "> 80% → P2 Network Issue"),
    DashPanel("Resource Usage per Edge",
        "system.cpu.utilization, system.memory.usage",
        "Multi-line chart of CPU and RAM per edge node",
        "CPU > 90% or RAM > 85% → P2"),
]

print("=== SigNoz Dashboard Panels ===")
for p in panels:
    print(f"  [{p.panel}]")
    print(f"    Query: {p.query}")
    print(f"    Viz: {p.viz}")
    print(f"    Alert: {p.alert}")
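
The error-rate thresholds in the panel list (above 5% is P2, above 10% is P1) can be sketched as a small classifier. This is only an illustration of the rule, not how SigNoz evaluates alerts; the function name error_rate_severity, the NO_DATA label, and the sample counts are hypothetical.

```python
def error_rate_severity(error_spans, total_spans):
    """Map an edge location's error rate to the alert level from the panel list."""
    if total_spans == 0:
        return "NO_DATA"  # assumed label: no traffic is handled by the throughput alert
    rate = error_spans / total_spans
    if rate > 0.10:
        return "P1"
    if rate > 0.05:
        return "P2"
    return "OK"


# Hypothetical per-location (error, total) span counts, for illustration only.
samples = {
    "factory-bangkok-01": (3, 200),   # 1.5% errors
    "factory-bangkok-02": (14, 200),  # 7% errors
    "factory-bangkok-03": (30, 200),  # 15% errors
}
for location, (errors, total) in samples.items():
    print(location, error_rate_severity(errors, total))
```

Checking the higher threshold first keeps the two rules from overlapping, so a location at 15% raises P1 only.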

Scaling & Production

# === Production Scaling ===

from dataclasses import dataclass

@dataclass
class ScaleTier:
    tier: str
    edge_nodes: str      # number of edge nodes the tier covers
    central_sizing: str  # SigNoz and ClickHouse sizing in the central cloud
    storage: str         # disk size and retention
    features: str

tiers = [
    ScaleTier("Small",
        "1-10 Edge Nodes",
        "SigNoz: 2 vCPU 4GB | ClickHouse: 4 vCPU 16GB",
        "100GB SSD (30 days retention)",
        "Basic dashboards, email alerts"),
    ScaleTier("Medium",
        "11-50 Edge Nodes",
        "SigNoz: 4 vCPU 8GB | ClickHouse: 8 vCPU 32GB",
        "500GB SSD (60 days retention)",
        "Multi-location dashboards, sampling, PagerDuty"),
    ScaleTier("Large",
        "51-200 Edge Nodes",
        "SigNoz: 8 vCPU 16GB HA | ClickHouse Cluster 3 nodes",
        "2TB SSD (90 days retention)",
        "Kafka buffer, tail sampling, multi-team RBAC"),
    ScaleTier("Enterprise",
        "201+ Edge Nodes",
        "SigNoz HA + LB | ClickHouse Sharded Cluster",
        "10TB+ SSD (1 year retention)",
        "Multi-region, custom retention, compliance audit"),
]

print("=== Scaling Tiers ===")
for t in tiers:
    print(f"  [{t.tier}] Edge Nodes: {t.edge_nodes}")
    print(f"    Central: {t.central_sizing}")
    print(f"    Storage: {t.storage}")
    print(f"    Features: {t.features}")
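
As a rough aid for reading the tiers, the sketch below maps an edge-node count to a tier name. It assumes that a count sitting exactly on a boundary (10, 50, or 200) belongs to the smaller tier; the names TIER_UPPER_BOUNDS and pick_tier are hypothetical.

```python
# Inclusive upper bound of edge nodes for each tier (assumption: boundary
# counts fall into the smaller tier).
TIER_UPPER_BOUNDS = [
    (10, "Small"),
    (50, "Medium"),
    (200, "Large"),
]


def pick_tier(edge_nodes):
    """Return the scaling tier name for a given number of edge nodes."""
    if edge_nodes < 1:
        raise ValueError("edge_nodes must be at least 1")
    for upper_bound, tier_name in TIER_UPPER_BOUNDS:
        if edge_nodes <= upper_bound:
            return tier_name
    return "Enterprise"


for count in (5, 10, 11, 120, 500):
    print(f"{count:4d} edge nodes -> {pick_tier(count)}")
```

Node count is only one input; telemetry volume per node and retention can push a deployment into a larger tier sooner.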

Tips

  • Buffer: Use a file-storage buffer so telemetry is not lost while the node is offline
  • Filter: Drop healthy spans to cut bandwidth by 30-50%
  • Batch: Set batch size 1000 and timeout 30s to reduce network calls
  • Location: Add an edge.location label to all telemetry so it can be filtered
  • Sampling: Use tail sampling to keep only error/slow traces
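
The sampling tip can be illustrated with a minimal sketch of the tail-sampling decision: wait until a trace's spans are available, then keep the whole trace if any span is an error or is slow. In a Collector this is typically handled by the tail_sampling processor from the contrib distribution; the code below only mimics the decision. The Span class, the tail_sample function, and the 2000 ms threshold (borrowed from the P99 > 2s alert) are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str
    duration_ms: float
    is_error: bool


def tail_sample(spans, slow_threshold_ms=2000.0):
    """Keep every span of any trace that contains an error or a slow span."""
    keep_ids = {
        s.trace_id
        for s in spans
        if s.is_error or s.duration_ms > slow_threshold_ms
    }
    return [s for s in spans if s.trace_id in keep_ids]


# Hypothetical spans from three traces.
spans = [
    Span("t1", 120.0, False),
    Span("t1", 80.0, False),    # t1: healthy and fast, dropped
    Span("t2", 95.0, False),
    Span("t2", 40.0, True),     # t2: contains an error, kept whole
    Span("t3", 2600.0, False),  # t3: slow, kept
]
kept = tail_sample(spans)
print(sorted({s.trace_id for s in kept}))
print(f"kept {len(kept)} of {len(spans)} spans")
```

Deciding per trace rather than per span keeps the healthy spans of a failing trace, which is what makes the trace readable later.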

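What is SigNoz?

SigNoz is an open-source observability platform for metrics, traces, and logs. It is built on OpenTelemetry and ClickHouse, with a Go backend and a React frontend, and provides dashboards and alerts. It can be self-hosted as a free alternative to Datadog.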