Technology

Distributed Tracing in Distributed Systems: Tracking Requests Across Microservices

2025-12-11 · อ. บอม — SiamCafe.net · 1,339 words

What Is Distributed Tracing?

Distributed tracing is a technique for following a single request as it travels through multiple services in a distributed system. It gives you an end-to-end view of the entire request lifecycle, making it possible to pinpoint bottlenecks, errors, and latency in each individual service.

Core concepts: a Trace is the full journey of one request across services; a Span is a unit of work inside a single service (with start time, duration, and status); Context Propagation passes the trace context (trace ID, span ID) across service boundaries; Baggage is a set of key-value pairs carried along with the trace context.
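The propagated context is usually encoded as a W3C `traceparent` header (`version-trace_id-span_id-flags`). A minimal dependency-free sketch of building and parsing that wire format (this is just an illustration of the header layout, not the OTel SDK API):

```python
import re
import secrets

def make_traceparent(trace_id=None, span_id=None, sampled=True):
    """Build a W3C traceparent header: version-trace_id-span_id-flags."""
    trace_id = trace_id or secrets.token_hex(16)   # 32 hex chars
    span_id = span_id or secrets.token_hex(8)      # 16 hex chars
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"

def parse_traceparent(header):
    """Split a traceparent header back into its fields."""
    m = re.fullmatch(r"(\d{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})", header)
    if not m:
        raise ValueError(f"invalid traceparent: {header}")
    version, trace_id, span_id, flags = m.groups()
    return {"version": version, "trace_id": trace_id,
            "span_id": span_id, "sampled": flags == "01"}

header = make_traceparent(trace_id="a" * 32, span_id="b" * 16)
print(header)                                  # "00-" + "a"*32 + "-" + "b"*16 + "-01"
print(parse_traceparent(header)["sampled"])    # True
```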

Why distributed tracing? In a monolithic app, logging plus a profiler is usually enough; in microservices, a request crosses many services, and tracing becomes essential for: debugging errors that occur in a specific service, measuring latency per service (and spotting the slow one), visualizing dependencies between services, capacity planning (identifying which service is the bottleneck), and root cause analysis (finding the root cause within about 10 minutes).

Using OpenTelemetry for Distributed Tracing

Set up the OpenTelemetry Collector and instrumentation

# === OpenTelemetry Setup ===

# 1. OpenTelemetry Collector Configuration
cat > otel-collector-config.yaml << 'EOF'
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
  
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow-traces
        type: latency
        latency: {threshold_ms: 1000}
      - name: sample-rest
        type: probabilistic
        probabilistic: {sampling_percentage: 10}

  resource:
    attributes:
      - key: environment
        value: production
        action: upsert

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
  
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true

  prometheus:
    endpoint: 0.0.0.0:8889

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling, resource, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
EOF

# 2. Docker Compose
cat > docker-compose.yaml << 'EOF'
version: '3.8'
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.96.0
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
      - "8889:8889"   # Prometheus metrics

  tempo:
    image: grafana/tempo:2.4.0
    command: ["-config.file=/etc/tempo.yaml"]
    volumes:
      - ./tempo.yaml:/etc/tempo.yaml
      - tempo-data:/var/tempo
    ports:
      - "3200:3200"   # Tempo API

  grafana:
    image: grafana/grafana:10.4.0
    ports:
      - "3000:3000"
    environment:
      - GF_AUTH_ANONYMOUS_ENABLED=true
    volumes:
      - grafana-data:/var/lib/grafana

volumes:
  tempo-data:
  grafana-data:
EOF

# 3. Tempo configuration
cat > tempo.yaml << 'EOF'
server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        grpc:

storage:
  trace:
    backend: local
    local:
      path: /var/tempo/traces
    wal:
      path: /var/tempo/wal

metrics_generator:
  registry:
    external_labels:
      source: tempo
  storage:
    path: /var/tempo/generator/wal
EOF

echo "OpenTelemetry infrastructure configured"

Implementing Tracing in Microservices

Adding tracing to Python and Node.js services

#!/usr/bin/env python3
# tracing_service.py - Python Service with OpenTelemetry Tracing
import logging
from typing import Dict

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("tracing")

# OpenTelemetry setup
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
# Optional auto-instrumentation imports; uncomment together with the
# instrument() calls below once the matching packages are installed:
# from opentelemetry.instrumentation.flask import FlaskInstrumentor
# from opentelemetry.instrumentation.requests import RequestsInstrumentor
# from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from opentelemetry.propagate import set_global_textmap
from opentelemetry.propagators.b3 import B3MultiFormat

# Configure tracer
resource = Resource.create({
    "service.name": "order-service",
    "service.version": "1.2.0",
    "deployment.environment": "production",
})

provider = TracerProvider(resource=resource)
exporter = OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True)
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

# Set propagator
set_global_textmap(B3MultiFormat())

tracer = trace.get_tracer(__name__)

# Auto-instrument libraries
# FlaskInstrumentor().instrument()
# RequestsInstrumentor().instrument()
# SQLAlchemyInstrumentor().instrument()

class OrderService:
    """Order service with distributed tracing"""
    
    def __init__(self):
        self.orders = {}
    
    def create_order(self, order_data: Dict):
        """Create order with tracing spans"""
        with tracer.start_as_current_span("create_order") as span:
            span.set_attribute("order.customer_id", order_data.get("customer_id", ""))
            span.set_attribute("order.total", order_data.get("total", 0))
            
            # Step 1: Validate order
            with tracer.start_as_current_span("validate_order"):
                self._validate(order_data)
            
            # Step 2: Check inventory
            with tracer.start_as_current_span("check_inventory") as inv_span:
                inv_span.set_attribute("inventory.items_count", len(order_data.get("items", [])))
                available = self._check_inventory(order_data.get("items", []))
                inv_span.set_attribute("inventory.all_available", available)
            
            # Step 3: Process payment
            with tracer.start_as_current_span("process_payment") as pay_span:
                pay_span.set_attribute("payment.method", order_data.get("payment_method", "credit_card"))
                pay_span.set_attribute("payment.amount", order_data.get("total", 0))
                payment_result = self._process_payment(order_data)
                pay_span.set_attribute("payment.status", payment_result["status"])
            
            # Step 4: Save to database
            with tracer.start_as_current_span("save_to_db") as db_span:
                db_span.set_attribute("db.system", "postgresql")
                db_span.set_attribute("db.operation", "INSERT")
                order_id = self._save_order(order_data)
                db_span.set_attribute("order.id", order_id)
            
            # Step 5: Send notification
            with tracer.start_as_current_span("send_notification") as notif_span:
                notif_span.set_attribute("notification.type", "email")
                self._send_notification(order_id, order_data)
            
            span.set_attribute("order.id", order_id)
            span.set_status(trace.StatusCode.OK)
            
            return {"order_id": order_id, "status": "created"}
    
    def _validate(self, data):
        if not data.get("customer_id"):
            raise ValueError("Missing customer_id")
    
    def _check_inventory(self, items):
        return True
    
    def _process_payment(self, data):
        return {"status": "success", "transaction_id": "txn_123"}
    
    def _save_order(self, data):
        order_id = f"ORD-{len(self.orders) + 1:06d}"
        self.orders[order_id] = data
        return order_id
    
    def _send_notification(self, order_id, data):
        logger.info(f"Notification sent for {order_id}")

# Demo
service = OrderService()
result = service.create_order({
    "customer_id": "C001",
    "items": [{"product": "Widget", "qty": 2}],
    "total": 1500.0,
    "payment_method": "credit_card",
})
print(f"Order created: {result}")
print("Trace exported to OpenTelemetry Collector")
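Crossing a service boundary means the caller injects the trace context into outgoing headers and the callee extracts it (the real SDK does this via `opentelemetry.propagate.inject`/`extract`). A dependency-free sketch of the idea, with hypothetical `inject_context`/`extract_context` helpers:

```python
import secrets

def inject_context(trace_id, parent_span_id, headers):
    """Caller side: put the current trace context into outgoing HTTP headers."""
    headers["traceparent"] = f"00-{trace_id}-{parent_span_id}-01"
    return headers

def extract_context(headers):
    """Callee side: recover trace_id and the parent span_id from incoming headers."""
    _, trace_id, parent_span_id, _ = headers["traceparent"].split("-")
    return trace_id, parent_span_id

# order-service calls payment-service: same trace_id, new span parented on caller's span
trace_id = secrets.token_hex(16)
caller_span = secrets.token_hex(8)
headers = inject_context(trace_id, caller_span, {"content-type": "application/json"})

downstream_trace, parent = extract_context(headers)
child_span = secrets.token_hex(8)  # payment-service starts its own span
assert downstream_trace == trace_id and parent == caller_span
print(f"trace {downstream_trace[:8]} continues: {parent[:8]} -> {child_span[:8]}")
```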

Jaeger and Tempo for Trace Storage

Comparing and deploying trace backends

# === Trace Backend Comparison and Setup ===

# 1. Jaeger on Kubernetes
cat > jaeger-k8s.yaml << 'EOF'
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: production-jaeger
  namespace: observability
spec:
  strategy: production
  
  collector:
    replicas: 2
    resources:
      requests:
        cpu: "500m"
        memory: "512Mi"
    options:
      collector:
        num-workers: 50
        queue-size: 2000
  
  query:
    replicas: 2
    resources:
      requests:
        cpu: "250m"
        memory: "256Mi"
  
  storage:
    type: elasticsearch
    elasticsearch:
      nodeCount: 3
      resources:
        requests:
          cpu: "1"
          memory: "2Gi"
      storage:
        size: 100Gi
    
    esIndexCleaner:
      enabled: true
      numberOfDays: 14
      schedule: "55 23 * * *"
---
# Grafana Tempo on Kubernetes (alternative)
apiVersion: v1
kind: ConfigMap
metadata:
  name: tempo-config
  namespace: observability
data:
  tempo.yaml: |
    server:
      http_listen_port: 3200
    distributor:
      receivers:
        otlp:
          protocols:
            grpc:
    storage:
      trace:
        backend: s3
        s3:
          bucket: tempo-traces
          endpoint: s3.amazonaws.com
          region: ap-southeast-1
        wal:
          path: /var/tempo/wal
        block:
          bloom_filter_false_positive: 0.05
    compactor:
      compaction:
        block_retention: 336h  # 14 days
    metrics_generator:
      registry:
        external_labels:
          source: tempo
          cluster: production
EOF

# 2. Comparison script
cat > compare_backends.py << 'PYTHON'
#!/usr/bin/env python3
import json

backends = {
    "jaeger": {
        "storage": ["Elasticsearch", "Cassandra", "Kafka", "Badger (local)"],
        "query_language": "Custom (Jaeger UI)",
        "cost": "Medium-High (Elasticsearch cluster)",
        "scale": "Good (with ES cluster)",
        "pros": ["Mature ecosystem", "Rich UI", "Dependency graph", "Service map"],
        "cons": ["ES cluster overhead", "Complex to operate"],
        "best_for": "Teams already using Elasticsearch",
    },
    "grafana_tempo": {
        "storage": ["S3", "GCS", "Azure Blob", "Local"],
        "query_language": "TraceQL",
        "cost": "Low (object storage)",
        "scale": "Excellent (object storage scales infinitely)",
        "pros": ["Cheapest storage (S3)", "TraceQL powerful", "Grafana integration", "Simple to operate"],
        "cons": ["No indexing (search by trace ID)", "Newer project"],
        "best_for": "Cost-conscious teams using Grafana stack",
    },
    "zipkin": {
        "storage": ["Elasticsearch", "Cassandra", "MySQL", "In-memory"],
        "query_language": "Custom (Zipkin UI)",
        "cost": "Low-Medium",
        "scale": "Good",
        "pros": ["Simple", "Lightweight", "Wide language support"],
        "cons": ["Less features than Jaeger", "Smaller community"],
        "best_for": "Simple setups, small teams",
    },
}

print("Trace Backend Comparison:")
for name, info in backends.items():
    print(f"\n  {name}:")
    print(f"    Storage: {', '.join(info['storage'][:3])}")
    print(f"    Cost: {info['cost']}")
    print(f"    Best for: {info['best_for']}")
PYTHON

python3 compare_backends.py
echo "Backend comparison complete"

Advanced Tracing Patterns

Advanced patterns for distributed tracing

#!/usr/bin/env python3
# advanced_tracing.py - Advanced Distributed Tracing Patterns
import json
import logging
from typing import Dict, List

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("advanced")

class AdvancedTracingPatterns:
    def __init__(self):
        pass
    
    def patterns(self):
        return {
            "context_propagation": {
                "description": "Pass trace context across service boundaries",
                "protocols": ["W3C TraceContext (standard)", "B3 (Zipkin)", "Jaeger propagation"],
                "headers": {
                    "w3c": "traceparent: 00-{trace_id}-{span_id}-{flags}",
                    "b3": "X-B3-TraceId, X-B3-SpanId, X-B3-Sampled",
                },
                "transport": ["HTTP headers", "gRPC metadata", "Kafka headers", "AMQP properties"],
            },
            "tail_sampling": {
                "description": "Make the sampling decision after the trace is complete",
                "benefit": "Keeps 100% of error traces and slow traces",
                "strategy": [
                    "Always sample errors (status_code = ERROR)",
                    "Always sample slow traces (> 1s)",
                    "Sample 10% of normal traces",
                    "Always sample specific endpoints (/checkout, /payment)",
                ],
                "implementation": "OpenTelemetry Collector tail_sampling processor",
            },
            "span_links": {
                "description": "Link spans that have no parent-child relationship",
                "use_cases": [
                    "Batch processing (link producer span to consumer span)",
                    "Fan-out/fan-in patterns",
                    "Async workflows (queue-based)",
                ],
            },
            "trace_context_in_logs": {
                "description": "Add trace_id/span_id to application logs",
                "benefit": "Correlate logs with traces (click from a trace to its logs)",
                "implementation": "OpenTelemetry log bridge or manual injection",
                "format": '{"message": "Order created", "trace_id": "abc123", "span_id": "def456"}',
            },
            "service_mesh_integration": {
                "description": "Let Istio/Linkerd generate traces without code changes",
                "benefit": "Auto-instruments every service in the mesh",
                "limitation": "Only network-level spans; no business-logic visibility",
                "recommendation": "Combine with application-level instrumentation",
            },
        }
    
    def best_practices(self):
        return {
            "naming": {
                "span_names": "Use low-cardinality names, e.g. 'GET /api/orders' not 'GET /api/orders/12345'",
                "service_names": "Keep names consistent across deployments (order-service, not OrderSvc)",
            },
            "attributes": {
                "required": ["service.name", "service.version", "deployment.environment"],
                "recommended": ["http.method", "http.url", "http.status_code", "db.system", "db.operation"],
                "custom": "Add business attributes such as order.id, customer.tier, payment.method",
            },
            "sampling": {
                "development": "100% sampling",
                "staging": "100% sampling",
                "production": "Tail sampling (100% errors + 10% normal)",
            },
            "retention": {
                "hot": "7 days (fast query)",
                "warm": "30 days (slower query)",
                "archive": "90 days (compliance)",
            },
        }

patterns = AdvancedTracingPatterns()
all_patterns = patterns.patterns()
print("Advanced Tracing Patterns:")
for name, info in all_patterns.items():
    print(f"\n  {name}: {info['description']}")

bp = patterns.best_practices()
print(f"\nBest Practices:")
print(f"  Sampling: {bp['sampling']}")
print(f"  Retention: {bp['retention']}")
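The tail-sampling policies configured in the collector earlier (errors, slow traces, 10% of the rest) boil down to a decision function evaluated once a trace is complete. A simplified sketch, with `Span` as a hypothetical minimal record rather than the OTel type:

```python
import random
from dataclasses import dataclass

@dataclass
class Span:
    duration_ms: float
    is_error: bool

def keep_trace(spans, slow_ms=1000.0, base_rate=0.10, rng=random.random):
    """Decide after the whole trace is buffered; return the matching policy or None."""
    if any(s.is_error for s in spans):
        return "errors"                      # always keep error traces
    if max(s.duration_ms for s in spans) > slow_ms:
        return "slow-traces"                 # always keep slow traces
    if rng() < base_rate:
        return "sample-rest"                 # probabilistic 10% for the rest
    return None                              # drop

print(keep_trace([Span(50, False), Span(30, True)]))    # errors
print(keep_trace([Span(1500, False)]))                  # slow-traces
print(keep_trace([Span(50, False)], rng=lambda: 0.99))  # None (dropped)
```

Note the trade-off this illustrates: the collector must hold all of a trace's spans in memory until `decision_wait` expires, which is exactly why tail sampling costs more than head sampling.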

Monitoring and Alerting from Traces

Building alerts from trace data

#!/usr/bin/env python3
# trace_monitor.py - Trace-Based Monitoring
import json
import logging
from typing import Dict, List

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("monitor")

class TraceMonitor:
    def __init__(self):
        pass
    
    def dashboard(self):
        return {
            "service_overview": {
                "order-service": {"rps": 250, "p50_ms": 45, "p95_ms": 180, "p99_ms": 450, "error_rate": "0.5%"},
                "payment-service": {"rps": 200, "p50_ms": 120, "p95_ms": 350, "p99_ms": 800, "error_rate": "0.8%"},
                "inventory-service": {"rps": 300, "p50_ms": 15, "p95_ms": 50, "p99_ms": 120, "error_rate": "0.1%"},
                "notification-service": {"rps": 180, "p50_ms": 30, "p95_ms": 100, "p99_ms": 250, "error_rate": "0.3%"},
                "api-gateway": {"rps": 500, "p50_ms": 80, "p95_ms": 300, "p99_ms": 900, "error_rate": "0.4%"},
            },
            "top_slow_traces": [
                {"trace_id": "abc123", "duration_ms": 2500, "services": 5, "root": "POST /api/orders", "bottleneck": "payment-service (1800ms)"},
                {"trace_id": "def456", "duration_ms": 1800, "services": 4, "root": "GET /api/products", "bottleneck": "inventory-service (1200ms, DB query)"},
                {"trace_id": "ghi789", "duration_ms": 1500, "services": 3, "root": "POST /api/checkout", "bottleneck": "payment-service (900ms, external API)"},
            ],
            "error_traces": [
                {"trace_id": "err001", "error": "PaymentDeclined", "service": "payment-service", "status": 402},
                {"trace_id": "err002", "error": "InventoryUnavailable", "service": "inventory-service", "status": 409},
                {"trace_id": "err003", "error": "Timeout", "service": "notification-service", "status": 504},
            ],
            "alerts": [
                {"severity": "WARNING", "message": "payment-service P99 latency > 800ms (threshold: 500ms)"},
                {"severity": "INFO", "message": "order-service error rate 0.5% (normal)"},
            ],
            "dependency_map": {
                "api-gateway": ["order-service", "product-service"],
                "order-service": ["payment-service", "inventory-service", "notification-service"],
                "payment-service": ["external-payment-api"],
                "inventory-service": ["postgresql"],
                "notification-service": ["redis", "smtp-server"],
            },
        }

monitor = TraceMonitor()
dash = monitor.dashboard()
print("Distributed Tracing Dashboard:")
for svc, info in dash["service_overview"].items():
    print(f"  {svc}: {info['rps']} rps, P95={info['p95_ms']}ms, Errors={info['error_rate']}")

print(f"\nTop Slow Traces:")
for t in dash["top_slow_traces"][:3]:
    print(f"  {t['trace_id']}: {t['duration_ms']}ms -> {t['bottleneck']}")

print(f"\nRecent Errors:")
for e in dash["error_traces"]:
    print(f"  {e['trace_id']}: {e['error']} ({e['service']})")

print(f"\nAlerts:")
for a in dash["alerts"]:
    print(f"  [{a['severity']}] {a['message']}")
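Percentile columns like P50/P95/P99 in the dashboard above are computed from collected span durations; a quick stdlib sketch:

```python
import statistics

def latency_percentiles(durations_ms):
    """P50/P95/P99 from a list of span durations, via 100 quantile cut points."""
    qs = statistics.quantiles(durations_ms, n=100)  # qs[i] is the (i+1)th percentile
    return {"p50": qs[49], "p95": qs[94], "p99": qs[98]}

durations = [20.0 + i for i in range(200)]  # synthetic durations, 20..219 ms
print(latency_percentiles(durations))
```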

FAQ: Frequently Asked Questions

Q: How does Distributed Tracing differ from Logging?

A: Logging records events inside a single service; it shines at debugging business logic, supports free-text search, and carries detailed context. Distributed tracing follows a request across services; it shines at showing the request flow, measuring per-service latency, finding bottlenecks, and mapping dependencies. In short: logs give depth (detail within one service), traces give breadth (across services). Best practice: correlate the two by injecting the trace_id into logs, so every log line carries a trace_id and you can click from a trace straight to its logs.
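Correlating logs with traces just means stamping every log line with the current trace_id/span_id. A stdlib-only sketch using a `logging.Filter` with hard-coded ids (in a real service the OpenTelemetry log bridge fills these in from the active span):

```python
import logging

class TraceContextFilter(logging.Filter):
    """Attach trace/span ids to every LogRecord passing through the logger."""
    def __init__(self, trace_id, span_id):
        super().__init__()
        self.trace_id, self.span_id = trace_id, span_id

    def filter(self, record):
        record.trace_id = self.trace_id
        record.span_id = self.span_id
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    '{"message": "%(message)s", "trace_id": "%(trace_id)s", "span_id": "%(span_id)s"}'))
log = logging.getLogger("order-service")
log.addHandler(handler)
log.addFilter(TraceContextFilter(trace_id="abc123", span_id="def456"))
log.warning("Order created")
# emits: {"message": "Order created", "trace_id": "abc123", "span_id": "def456"}
```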

Q: What sampling strategy should I use?

A: Head-based sampling decides at the start of a trace (e.g. sample 10% of requests). Pros: cheap, no buffering required. Cons: it misses error traces (90% of errors never get sampled). Tail-based sampling decides after the trace is complete. Pros: it keeps 100% of errors and slow traces. Cons: the collector must buffer traces in memory, which costs resources. For production, use tail-based sampling with 100% of errors + 100% of slow traces (> 1s) + 10% of normal traffic, via the OpenTelemetry Collector's tail_sampling processor. In development and staging, use 100% sampling so every trace is captured.
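Head-based probabilistic sampling is usually made deterministic in the trace id, so every service reaches the same decision for the same trace. A simplified sketch in the spirit of the SDK's TraceIdRatioBased sampler (the exact comparison the SDK uses is an implementation detail; this shows the principle):

```python
MAX_ID = 2 ** 64  # compare against the lower 8 bytes of the trace id

def head_sampled(trace_id_hex, ratio):
    """Sample iff the lower 64 bits of the trace id fall below ratio * 2^64."""
    lower64 = int(trace_id_hex[-16:], 16)
    return lower64 < ratio * MAX_ID

# the decision is a pure function of the id, so it is consistent across services
tid = "0af7651916cd43dd8448eb211c80319c"
print(head_sampled(tid, 1.0))   # True: 100% sampling keeps everything
print(head_sampled(tid, 0.0))   # False: 0% keeps nothing
```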

Q: How do OpenTelemetry, Jaeger, and Zipkin relate?

A: OpenTelemetry is the instrumentation standard (it creates traces), not a backend; Jaeger/Zipkin/Tempo are backends (they store traces and let you query them). Use OpenTelemetry for instrumentation (vendor-neutral), then pair it with any backend. Jaeger suits teams already running Elasticsearch and offers a rich UI; Grafana Tempo fits the Grafana stack and is the cheapest to run (S3 storage); Zipkin fits simple setups; cloud services (AWS X-Ray, Google Cloud Trace, Datadog APM) suit teams that don't want to manage a backend at all. Recommendation: start with OpenTelemetry + Grafana Tempo (cheap and simple), and move to Jaeger or a commercial product only if you need advanced features.

Q: Auto-instrumentation or manual instrumentation?

A: Auto-instrumentation requires no code changes; an agent/SDK instruments common libraries automatically (HTTP, DB, gRPC). Pros: instant to deploy. Cons: you only get library-level spans, with no business-logic visibility. Manual instrumentation adds custom spans in code. Pros: rich business context (order_id, payment_status). Cons: code changes and ongoing maintenance. In practice, use both: auto-instrumentation delivers about 80% of the value for free, then add manual spans on critical business paths (checkout, payment, order creation). Don't overdo it by instrumenting every function.
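Manual instrumentation is commonly wrapped in a decorator so business code stays readable. A dependency-free sketch with a hypothetical in-memory `SPANS` recorder standing in for the exporter (with the real SDK the wrapper would call `tracer.start_as_current_span` instead):

```python
import functools
import time

SPANS = []  # stand-in for a span exporter

def traced(name):
    """Record a span (name, duration, status) around the wrapped function."""
    def deco(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                result = fn(*args, **kwargs)
                SPANS.append({"name": name, "status": "OK",
                              "duration_ms": (time.perf_counter() - start) * 1000})
                return result
            except Exception:
                SPANS.append({"name": name, "status": "ERROR",
                              "duration_ms": (time.perf_counter() - start) * 1000})
                raise
        return wrapper
    return deco

@traced("process_payment")
def process_payment(amount):
    return "txn_123"

process_payment(1500.0)
print(SPANS[-1]["name"], SPANS[-1]["status"])  # process_payment OK
```

This keeps the business function free of tracing boilerplate while still capturing status and duration for both success and failure paths.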
