Distributed Tracing Interview Preparation

Distributed tracing interview preparation: OpenTelemetry, Jaeger, spans, trace context propagation, sampling, and observability for microservices in production.


Concept	Description	Example	Interview Tip
Trace	The full path of one request across services	User → API → DB → Cache	Explain the tree structure
Span	A unit of work within one service	HTTP handler, DB query	Parent-child relationship
Context Propagation	Passing the trace ID across services	W3C traceparent header	Explain the header format
Sampling	Choosing which traces to keep	Head-based vs tail-based	Cost vs coverage trade-off
Instrumentation	Adding tracing to code	Auto vs manual	OTel SDK auto-instrumentation
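
To make the "explain the header format" tip concrete, here is a minimal sketch of parsing a version 00 W3C traceparent header by hand. The names `TraceParent` and `parse_traceparent` are illustrative and not part of any library; in practice the OpenTelemetry propagator does this for you, and this sketch assumes only the four-field version 00 layout.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Version 00 layout: 2 hex (version) - 32 hex (trace-id) - 16 hex (parent-id) - 2 hex (flags)
_TRACEPARENT_RE = re.compile(
    r"(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})"
    r"-(?P<parent_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})"
)


@dataclass(frozen=True)
class TraceParent:
    """Hypothetical container for the four traceparent fields."""
    version: str
    trace_id: str
    parent_id: str
    flags: str

    @property
    def sampled(self) -> bool:
        # The lowest bit of the flags byte is the "sampled" flag.
        return bool(int(self.flags, 16) & 0x01)

    def to_header(self) -> str:
        return f"{self.version}-{self.trace_id}-{self.parent_id}-{self.flags}"


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed."""
    match = _TRACEPARENT_RE.fullmatch(header.strip())
    if match is None:
        return None
    tp = TraceParent(**match.groupdict())
    # Version ff and all-zero IDs are invalid.
    if tp.version == "ff" or set(tp.trace_id) == {"0"} or set(tp.parent_id) == {"0"}:
        return None
    return tp


if __name__ == "__main__":
    incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    print(parse_traceparent(incoming))
```

In an interview it is usually enough to name the four fields and to note that the callee reuses the trace ID while creating a new span ID for its own work.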

OpenTelemetry Implementation

# === OpenTelemetry Python Setup ===

# pip install opentelemetry-api opentelemetry-sdk
# pip install opentelemetry-instrumentation-flask
# pip install opentelemetry-instrumentation-requests
# pip install opentelemetry-exporter-otlp

# from opentelemetry import trace
# from opentelemetry.sdk.trace import TracerProvider
# from opentelemetry.sdk.trace.export import BatchSpanProcessor
# from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
# from opentelemetry.sdk.resources import Resource
# from opentelemetry.instrumentation.flask import FlaskInstrumentor
# from opentelemetry.instrumentation.requests import RequestsInstrumentor
#
# # Setup
# resource = Resource.create({
#     "service.name": "order-service",
#     "service.version": "1.2.0",
#     "deployment.environment": "production",
# })
#
# provider = TracerProvider(resource=resource)
# exporter = OTLPSpanExporter(endpoint="http://otel-collector:4317")
# provider.add_span_processor(BatchSpanProcessor(exporter))
# trace.set_tracer_provider(provider)
#
# # Auto-instrumentation
# FlaskInstrumentor().instrument()
# RequestsInstrumentor().instrument()
#
# # Manual Span (process() and order are placeholders for application code)
# tracer = trace.get_tracer(__name__)
# with tracer.start_as_current_span("process_order") as span:
#     span.set_attribute("order.id", "12345")
#     span.set_attribute("order.total", 99.99)
#     result = process(order)
#     if result.error:
#         span.set_status(trace.StatusCode.ERROR, result.error)



from dataclasses import dataclass

@dataclass
class OTelComponent:
    component: str
    role: str
    config: str
    interview_point: str

components = [
    OTelComponent("SDK (TracerProvider)",
        "Creates and manages traces inside the application",
        "Resource + SpanProcessor + Exporter",
        "Explain the pipeline: SDK → Processor → Exporter"),
    OTelComponent("Auto-instrumentation",
        "Adds tracing automatically, without modifying handler code",
        "FlaskInstrumentor, RequestsInstrumentor",
        "Reduces developer effort; covers HTTP, DB, gRPC"),
    OTelComponent("Collector",
        "Gateway that receives, processes, and forwards telemetry",
        "Receivers → Processors → Exporters pipeline",
        "Explain the Collector pipeline architecture"),
    OTelComponent("Exporter",
        "Sends data to a backend (Jaeger/Tempo)",
        "OTLP gRPC/HTTP, Jaeger, Zipkin",
        "OTLP is the standard protocol; no vendor lock-in"),
    OTelComponent("Propagator",
        "Carries context across services",
        "W3C TraceContext, B3 (Zipkin)",
        "traceparent: 00-traceid-spanid-flags"),
]

print("=== OTel Components ===")
for c in components:
    print(f"  [{c.component}] {c.role}")
    print(f"    Config: {c.config}")
    print(f"    Interview: {c.interview_point}")
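
The "SDK → Processor → Exporter" pipeline can be sketched without any OpenTelemetry dependency. The classes below (`FinishedSpan`, `InMemoryExporter`, `SimpleBatchProcessor`) are hypothetical stand-ins, not the real SDK classes, and the batching is synchronous, whereas the real BatchSpanProcessor also flushes on a timer from a background thread.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FinishedSpan:
    """Hypothetical stand-in for a span that has ended."""
    name: str
    trace_id: str
    duration_ms: float


class InMemoryExporter:
    """Collects exported batches in a list instead of sending them over the network."""

    def __init__(self) -> None:
        self.batches: List[List[FinishedSpan]] = []

    def export(self, batch: List[FinishedSpan]) -> None:
        self.batches.append(list(batch))


class SimpleBatchProcessor:
    """Buffers finished spans and hands them to the exporter in batches."""

    def __init__(self, exporter: InMemoryExporter, max_batch_size: int = 3) -> None:
        self.exporter = exporter
        self.max_batch_size = max_batch_size
        self._buffer: List[FinishedSpan] = []

    def on_end(self, span: FinishedSpan) -> None:
        # Called when a span finishes; export once the buffer is full.
        self._buffer.append(span)
        if len(self._buffer) >= self.max_batch_size:
            self.flush()

    def flush(self) -> None:
        # Export whatever is buffered, for example at shutdown.
        if self._buffer:
            self.exporter.export(self._buffer)
            self._buffer = []


if __name__ == "__main__":
    exporter = InMemoryExporter()
    processor = SimpleBatchProcessor(exporter, max_batch_size=3)
    for i in range(4):
        processor.on_end(FinishedSpan(f"span-{i}", "trace-1", 10.0 * i))
    processor.flush()
    print([len(batch) for batch in exporter.batches])
```

Batching is the design point worth mentioning: the application thread only appends to a buffer, and the cost of serialization and network I/O is paid once per batch rather than once per span.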

Interview Questions & Answers

# === Interview Q&A ===

from dataclasses import dataclass

@dataclass
class InterviewQA:
    question: str
    short_answer: str
    deep_answer: str
    follow_up: str

questions = [
    InterviewQA("How do a trace and a span differ?",
        "Trace = the whole request; span = a unit of work within one service",
        "A trace has a single trace ID and consists of many spans. "
        "Each span has a span ID and a parent span ID, which together form a tree. "
        "The root span is the entry point; child spans are downstream calls.",
        "How do span events and span links differ?"),
    InterviewQA("How does context propagation work?",
        "The trace ID and span ID travel in an HTTP header",
        "W3C traceparent: 00-{trace-id}-{parent-span-id}-{flags}. "
        "The caller injects it when sending a request and the callee extracts it on receipt. "
        "The SDK handles this automatically, but every service must be configured "
        "with a compatible propagator.",
        "What if a service does not support W3C? (B3 fallback)"),
    InterviewQA("Head-based vs tail-based sampling",
        "Head-based decides at the start of a trace; tail-based decides after it finishes",
        "Head-based: simple, cheap, and cuts traffic at the source, but can miss error or slow traces. "
        "Tail-based: buffers every trace and decides after it completes, so error and slow traces "
        "can be kept, but it needs more resources and a robustly provisioned Collector.",
        "Probability sampling vs rate limiting vs always-on"),
    InterviewQA("How much overhead does tracing add?",
        "Often quoted as roughly 1-5% CPU/latency; it depends on sampling rate and span count",
        "Overhead comes from context creation, attribute setting, serialization, and network I/O. "
        "Reduce it by sampling, creating fewer spans, using BatchSpanProcessor for asynchronous export, "
        "and moving processing out of the app into a Collector.",
        "How would you measure real tracing overhead? (benchmark)"),
    InterviewQA("How are traces and logs linked?",
        "Put the trace ID in each log entry to tie it to its trace",
        "The OTel SDK can inject the trace ID and span ID into the log context, "
        "so log correlation shows all logs for one trace. "
        "Use structured JSON logging with a trace_id field, then query trace_id=xxx "
        "to see every log line for that trace.",
        "What are exemplars? (metric → trace link)"),
]

print("=== Interview Q&A ===")
for q in questions:
    print(f"\n  Q: {q.question}")
    print(f"  A (Short): {q.short_answer}")
    print(f"  A (Deep): {q.deep_answer}")
    print(f"  Follow-up: {q.follow_up}")
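
The head-based vs tail-based answer is easier to defend with a small sketch. The functions below are illustrative only: `head_sample` makes a deterministic decision from the trace ID alone (similar in spirit to ratio-based samplers, though the exact bits used here are an assumption), and `tail_sample` decides after seeing all spans of a trace. The 500 ms threshold and 10% baseline are example values.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SpanRecord:
    """Hypothetical finished-span summary used for the sampling decision."""
    trace_id: str          # 32 lowercase hex characters
    duration_ms: float
    is_error: bool = False


def head_sample(trace_id_hex: str, ratio: float) -> bool:
    """Decide at the start of a trace, using only the trace ID.

    Every service that sees the same trace ID reaches the same decision,
    so a trace is kept or dropped as a whole.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    low64 = int(trace_id_hex, 16) & ((1 << 64) - 1)
    return low64 < int(ratio * (1 << 64))


def tail_sample(spans: List[SpanRecord], slow_ms: float = 500.0,
                baseline_ratio: float = 0.1) -> bool:
    """Decide after the trace has finished, with all spans available."""
    if not spans:
        return False
    if any(s.is_error for s in spans):
        return True                      # always keep traces with errors
    if max(s.duration_ms for s in spans) > slow_ms:
        return True                      # always keep slow traces
    # Otherwise keep a baseline share of healthy traces.
    return head_sample(spans[0].trace_id, baseline_ratio)


if __name__ == "__main__":
    healthy = [SpanRecord("f" * 32, 40.0)]
    failed = healthy + [SpanRecord("f" * 32, 15.0, is_error=True)]
    print(tail_sample(healthy), tail_sample(failed))
```

The trade-off shows up directly: `head_sample` needs no buffering but cannot know whether the trace will fail, while `tail_sample` needs every span of the trace in one place before it can decide.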

Architecture & Production

# === Production Tracing Architecture ===

from dataclasses import dataclass

@dataclass
class ArchComponent:
    layer: str
    tool: str
    config: str
    scaling: str

architecture = [
    ArchComponent("Application Layer",
        "OTel SDK + Auto-instrumentation",
        "Resource attributes + sampling + BatchSpanProcessor",
        "Every service uses the OTel SDK with the same configuration"),
    ArchComponent("Collection Layer",
        "OTel Collector (Gateway Mode)",
        "Receivers: OTLP → Processors: Batch/Filter → Exporters: OTLP",
        "Scale Collectors horizontally with traffic"),
    ArchComponent("Storage Layer",
        "Jaeger + Elasticsearch / Grafana Tempo + S3",
        "Retention 7-14 days, index optimization",
        "Tempo + S3 can be roughly 5-10x cheaper than Jaeger + ES"),
    ArchComponent("Query Layer",
        "Jaeger UI / Grafana",
        "Search by trace ID, service name, duration",
        "Grafana links traces, metrics, and logs in one UI"),
    ArchComponent("Alerting Layer",
        "Grafana Alerting / PagerDuty",
        "P99 latency > 500 ms, error rate > 1%",
        "Alert on trace-derived metrics, not individual traces"),
]

print("=== Production Architecture ===")
for a in architecture:
    print(f"  [{a.layer}] {a.tool}")
    print(f"    Config: {a.config}")
    print(f"    Scaling: {a.scaling}")
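
The alerting layer's rule ("P99 latency > 500 ms, error rate > 1%") operates on metrics aggregated from many traces. A minimal sketch of that evaluation follows; `RootSpanSample`, `percentile`, and `evaluate_alerts` are hypothetical names, and the nearest-rank percentile is one simple choice among several. A real setup would normally compute these in the metrics backend rather than in application code.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class RootSpanSample:
    """Hypothetical summary of one request, taken from its root span."""
    duration_ms: float
    is_error: bool


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def evaluate_alerts(samples: List[RootSpanSample],
                    p99_threshold_ms: float = 500.0,
                    error_rate_threshold: float = 0.01) -> List[str]:
    """Return a message for each threshold the window of samples exceeds."""
    if not samples:
        return []
    alerts: List[str] = []
    p99 = percentile([s.duration_ms for s in samples], 99)
    error_rate = sum(s.is_error for s in samples) / len(samples)
    if p99 > p99_threshold_ms:
        alerts.append(f"p99 latency {p99:.0f} ms exceeds {p99_threshold_ms:.0f} ms")
    if error_rate > error_rate_threshold:
        alerts.append(f"error rate {error_rate:.1%} exceeds {error_rate_threshold:.1%}")
    return alerts


if __name__ == "__main__":
    window = [RootSpanSample(100.0, False)] * 98 + [RootSpanSample(900.0, True)] * 2
    for message in evaluate_alerts(window):
        print(message)
```

Alerting on the aggregate avoids paging on a single slow request, which is the point of the "not individual traces" guidance.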

Tips

OTel: Standardize on OpenTelemetry to avoid vendor lock-in.
Auto: Start with auto-instrumentation, then add manual spans later.
Sampling: Start at 10% for high-traffic services, and sample errors at a higher rate.
Tempo: Grafana Tempo + S3 can cut storage cost by roughly 5-10x compared with Jaeger + Elasticsearch.
Correlate: Link traces, metrics, and logs through the trace ID.

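What is distributed tracing?

Distributed tracing follows a single request as it crosses services. Each trace has one trace ID and is made up of spans arranged in a parent-child tree, which makes it possible to locate bottlenecks, errors, and service dependencies. Common tools and standards include Jaeger, Zipkin, Grafana Tempo, OpenTelemetry, and W3C Trace Context.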
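
As a rough illustration of how the parent-child tree helps find a bottleneck, the sketch below rebuilds the tree from flat span records and picks the span with the largest self time (its duration minus its direct children's durations). The `Span` class and helper names are hypothetical, and the self-time calculation assumes child spans run sequentially without overlapping.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    """Hypothetical flat span record, as a backend might return it."""
    span_id: str
    parent_id: Optional[str]   # None marks the root span
    name: str
    duration_ms: float


def build_children(spans: List[Span]) -> Dict[Optional[str], List[Span]]:
    """Group spans by their parent ID."""
    children: Dict[Optional[str], List[Span]] = defaultdict(list)
    for span in spans:
        children[span.parent_id].append(span)
    return children


def render_tree(spans: List[Span]) -> List[str]:
    """Return the trace as indented lines, root first."""
    children = build_children(spans)
    lines: List[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        for span in children.get(parent_id, []):
            lines.append(f"{'  ' * depth}{span.name} ({span.duration_ms:.0f} ms)")
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


def bottleneck(spans: List[Span]) -> Span:
    """Return the span with the largest self time."""
    children = build_children(spans)

    def self_time(span: Span) -> float:
        return span.duration_ms - sum(c.duration_ms for c in children.get(span.span_id, []))

    return max(spans, key=self_time)


if __name__ == "__main__":
    trace_spans = [
        Span("a1", None, "api-handler", 120.0),
        Span("b1", "a1", "db-query", 80.0),
        Span("c1", "a1", "cache-get", 10.0),
    ]
    print("\n".join(render_tree(trace_spans)))
    print("bottleneck:", bottleneck(trace_spans).name)
```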
