Technology

Delta Lake Interview Preparation

2026-04-24 · อ. บอม — SiamCafe.net · 9,555 words


| Feature          | Delta Lake              | Apache Iceberg        | Apache Hudi        |
|------------------|-------------------------|-----------------------|--------------------|
| Origin           | Databricks              | Netflix               | Uber               |
| ACID             | Yes                     | Yes                   | Yes                |
| Time Travel      | Yes (version/timestamp) | Yes (snapshot)        | Yes (timeline)     |
| Schema Evolution | Yes                     | Yes (better)          | Yes                |
| Best Engine      | Spark / Databricks      | Spark / Trino / Flink | Spark / Flink      |
| CDC Support      | Good (CDF)              | Good                  | Excellent (native) |
| Partitioning     | Manual + Z-order        | Hidden Partitioning   | Manual             |

Core Concepts

# === Delta Lake Core Concepts ===

from dataclasses import dataclass

# Delta Lake vs Parquet
# Parquet: Just a file format (columnar, compressed)
# Delta Lake: Parquet + Transaction Log + ACID + Time Travel + Schema

# Transaction Log (_delta_log/)
# Each transaction creates a JSON file:
# _delta_log/
# ├── 00000000000000000000.json  # Version 0
# ├── 00000000000000000001.json  # Version 1
# ├── ...
# └── 00000000000000000010.checkpoint.parquet  # Checkpoint every 10 commits (default)

# ACID Properties
# Atomicity: Write succeeds completely or not at all
# Consistency: Schema enforcement prevents bad data
# Isolation: Optimistic concurrency control
# Durability: Data stored in cloud object storage

# Key Operations
# spark.read.format("delta").load("s3://bucket/table")
# df.write.format("delta").mode("overwrite").save("s3://bucket/table")
# spark.sql("MERGE INTO target USING source ON target.id = source.id ...")
# spark.sql("DELETE FROM delta_table WHERE date < '2023-01-01'")
# spark.sql("UPDATE delta_table SET status = 'active' WHERE id = 1")

# Time Travel
# spark.read.format("delta").option("versionAsOf", 5).load(path)
# spark.read.format("delta").option("timestampAsOf", "2024-01-15").load(path)
# spark.sql("SELECT * FROM table VERSION AS OF 5")
# spark.sql("DESCRIBE HISTORY table")

@dataclass
class InterviewQuestion:
    question: str
    key_points: str
    difficulty: str
    follow_up: str

questions = [
    InterviewQuestion("How does Delta Lake differ from Parquet?",
        "Parquet = file format only, Delta = Parquet + Transaction Log + ACID + Time Travel + Schema Enforcement",
        "Basic",
        "How does the Transaction Log work?"),
    InterviewQuestion("Explain ACID Transactions",
        "Atomicity (all or nothing), Consistency (schema enforcement), Isolation (optimistic concurrency), Durability (cloud storage)",
        "Basic",
        "How does Optimistic Concurrency Control resolve two conflicting writers?"),
    InterviewQuestion("What is Time Travel used for?",
        "Audit trail, rollback bad writes, reproduce ML experiments, debug data issues, compare versions",
        "Intermediate",
        "How does VACUUM affect Time Travel?"),
    InterviewQuestion("What is MERGE INTO used for?",
        "Upsert: INSERT new rows + UPDATE existing rows in one atomic operation, used for CDC, SCD Type 2",
        "Intermediate",
        "Write a MERGE statement for SCD Type 2"),
    InterviewQuestion("How do OPTIMIZE and Z-ORDER work?",
        "OPTIMIZE compacts small files → larger files (1GB target), Z-ORDER co-locates related data for faster filtered queries",
        "Intermediate",
        "How do you choose columns for Z-ORDER?"),
]

print("=== Interview Questions ===")
for q in questions:
    print(f"  [{q.difficulty}] Q: {q.question}")
    print(f"    Key Points: {q.key_points}")
    print(f"    Follow-up: {q.follow_up}")
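The _delta_log replay sketched in the comments above can be modeled with plain dictionaries. This is a toy illustration of how versionAsOf works, not the real Delta protocol (action shapes are simplified and the file names are invented):

```python
# Toy commit log: each version holds the JSON actions found in
# _delta_log/<version>.json (simplified to add/remove only).
commits = {
    0: [{"add": {"path": "part-000.parquet"}}],
    1: [{"add": {"path": "part-001.parquet"}}],
    2: [{"remove": {"path": "part-000.parquet"}},
        {"add": {"path": "part-002.parquet"}}],
}

def live_files(as_of_version):
    """Replay the log up to a version -- this is what versionAsOf does."""
    files = set()
    for version in sorted(commits):
        if version > as_of_version:
            break
        for action in commits[version]:
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return sorted(files)

print(live_files(1))  # ['part-000.parquet', 'part-001.parquet']
print(live_files(2))  # ['part-001.parquet', 'part-002.parquet']
```

The same replay explains why VACUUM breaks Time Travel: once removed files are physically deleted, older versions can no longer be reconstructed.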

Optimization Deep Dive

# === Delta Lake Optimization ===

# OPTIMIZE — Compact small files
# OPTIMIZE delta_table
# OPTIMIZE delta_table WHERE date = '2024-01-15'
# OPTIMIZE delta_table ZORDER BY (user_id, date)

# Z-ORDER — Co-locate data for faster queries
# Best for columns used in WHERE clause frequently
# High cardinality columns (user_id, product_id)
# Max 4-5 columns, diminishing returns after that

# VACUUM — Remove old files
# VACUUM delta_table RETAIN 168 HOURS  -- 7 days
# WARNING: After VACUUM, Time Travel before retention is gone

# Partition Strategy
# Good: date, region (low cardinality)
# Bad: user_id (too many partitions = small files)
# Rule: Each partition should have >= 1GB of data

@dataclass
class OptTechnique:
    technique: str
    command: str
    when: str
    impact: str
    caution: str

techniques = [
    OptTechnique("OPTIMIZE", "OPTIMIZE table [WHERE ...]",
        "After many small writes (streaming, frequent batch)",
        "Read performance +50-90% (fewer files to scan)",
        "Costs compute, schedule during off-peak"),
    OptTechnique("Z-ORDER", "OPTIMIZE table ZORDER BY (col1, col2)",
        "Columns frequently in WHERE clause",
        "Filtered query +30-70% faster",
        "Max 4-5 columns, rewrite entire partition"),
    OptTechnique("VACUUM", "VACUUM table RETAIN 168 HOURS",
        "After OPTIMIZE, periodically for storage cleanup",
        "Storage cost reduction, cleanup old versions",
        "Breaks Time Travel before retention period"),
    OptTechnique("Auto Compaction", "SET spark.databricks.delta.autoCompact.enabled=true",
        "Streaming or frequent small writes",
        "Auto merge small files after each write",
        "Adds latency to write operations"),
    OptTechnique("Liquid Clustering", "CREATE TABLE ... CLUSTER BY (col1, col2)",
        "Delta Lake 3.0+, replaces Partition + Z-order",
        "Incremental clustering, no full rewrite",
        "Requires Delta Lake 3.0+ / Databricks Runtime 13.3+"),
]

print("=== Optimization Techniques ===")
for o in techniques:
    print(f"  [{o.technique}] Command: {o.command}")
    print(f"    When: {o.when}")
    print(f"    Impact: {o.impact}")
    print(f"    Caution: {o.caution}")
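OPTIMIZE's behaviour above (many small files compacted toward a ~1 GB target) can be approximated with a greedy first-fit sketch. The file sizes are made up for illustration; real OPTIMIZE also rewrites data, this only plans the grouping:

```python
TARGET = 1024  # target output size in MB (Delta's default is ~1 GB)

def plan_compaction(file_sizes_mb):
    """Greedy first-fit: group small files into bins of at most TARGET MB."""
    bins = []
    for size in sorted(file_sizes_mb, reverse=True):
        for b in bins:
            if sum(b) + size <= TARGET:
                b.append(size)
                break
        else:
            bins.append([size])
    return bins

# 10 small files from streaming writes -> far fewer, larger files
small_files = [120, 80, 300, 64, 512, 256, 90, 48, 700, 32]
plan = plan_compaction(small_files)
print(len(small_files), "files ->", len(plan), "files")
```

Fewer files means fewer object-store requests and less footer/metadata overhead per scan, which is where the read-performance gains in the table above come from.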

Advanced Topics

# === Advanced Interview Topics ===

@dataclass
class AdvancedTopic:
    topic: str
    explanation: str
    use_case: str
    interview_tip: str

advanced = [
    AdvancedTopic("Change Data Feed (CDF)",
        "Track row-level changes (insert, update, delete) as a stream",
        "CDC pipelines, audit trails, incremental downstream processing",
        "Explain how CDF differs from generic Change Data Capture"),
    AdvancedTopic("Deletion Vectors",
        "Soft delete by marking rows as deleted instead of rewriting files",
        "Faster DELETE/MERGE operations, less I/O",
        "Explain how Deletion Vectors reduce write amplification"),
    AdvancedTopic("UniForm",
        "Delta table readable as Iceberg or Hudi without conversion",
        "Multi-engine access (Trino reads Iceberg, Spark reads Delta)",
        "Explain how UniForm solves format lock-in"),
    AdvancedTopic("Structured Streaming + Delta",
        "Delta table as both streaming source and sink",
        "Real-time ETL, exactly-once processing, CDC",
        "Explain how the exactly-once guarantee works"),
    AdvancedTopic("Photon Engine",
        "Native C++ vectorized query engine for Delta Lake on Databricks",
        "Significantly faster queries than stock Spark SQL (vendor benchmarks cite multi-x speedups)",
        "Explain what Photon optimizes"),
]

print("=== Advanced Topics ===")
for a in advanced:
    print(f"  [{a.topic}]")
    print(f"    Explanation: {a.explanation}")
    print(f"    Use Case: {a.use_case}")
    print(f"    Interview Tip: {a.interview_tip}")
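The Deletion Vectors idea listed above (mark rows dead instead of rewriting Parquet files) can be sketched as a per-file bitmap consulted at read time. A toy model with invented data, not the compressed bitmap format Delta actually stores:

```python
# One file's rows; a deletion vector is a per-file set of dead row positions.
rows = ["alice", "bob", "carol", "dave"]
deletion_vector = set()

def delete_where(predicate):
    """DELETE marks matching row positions instead of rewriting the file."""
    for i, row in enumerate(rows):
        if predicate(row):
            deletion_vector.add(i)

def scan():
    """Readers skip positions present in the deletion vector."""
    return [r for i, r in enumerate(rows) if i not in deletion_vector]

delete_where(lambda r: r == "bob")
print(scan())  # ['alice', 'carol', 'dave']
```

Deleting one row out of millions no longer rewrites the whole file, which is the write-amplification reduction the interview tip asks about.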

Tips

Professional Database Management

Good database management starts with a well-designed schema: use normalization to reduce data redundancy, create indexes on frequently queried columns, analyze query plans to optimize performance, and run regular maintenance such as VACUUM for PostgreSQL or OPTIMIZE TABLE for MySQL.

For high availability, set up replication with at least one replica for read scaling and disaster recovery, use connection pooling such as PgBouncer or ProxySQL to reduce the load of concurrently open connections, and configure automated failover so the system switches to a replica when the primary goes down.
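The connection-pooling idea above can be sketched with a bounded queue that hands out a fixed set of connections instead of opening a new one per request. A toy stand-in for PgBouncer/ProxySQL; `FakeConnection` is invented for illustration:

```python
import queue

class FakeConnection:
    """Stand-in for a real DB connection (illustration only)."""
    def __init__(self, conn_id):
        self.conn_id = conn_id

class ConnectionPool:
    """Reuse a fixed set of connections instead of opening new ones."""
    def __init__(self, size):
        self._pool = queue.Queue(maxsize=size)
        for i in range(size):
            self._pool.put(FakeConnection(i))

    def acquire(self, timeout=1.0):
        # Blocks (up to timeout) when all connections are checked out,
        # which is how a pool caps concurrent load on the database.
        return self._pool.get(timeout=timeout)

    def release(self, conn):
        self._pool.put(conn)

pool = ConnectionPool(size=2)
c1 = pool.acquire()
c2 = pool.acquire()
pool.release(c1)  # returned connection is reused, not reopened
c3 = pool.acquire()
print(c3.conn_id == c1.conn_id)  # True
```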

Backups should include daily full backups plus incremental backups every 1-4 hours; keep binary logs or WAL for point-in-time recovery, test restores regularly, and always keep an off-site copy.

What is Delta Lake?

Delta Lake is an open-source storage layer that adds ACID transactions to Parquet files via a transaction log, along with Time Travel, schema enforcement and evolution, and MERGE/UPDATE/DELETE support. Created by Databricks, it serves both batch and streaming workloads.

What are the most common interview questions?

Delta vs Parquet, how ACID and the transaction log work, Time Travel, schema enforcement, MERGE and CDC patterns, OPTIMIZE / Z-ORDER / VACUUM, Delta vs Iceberg vs Hudi, and streaming integration.
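Since MERGE comes up in almost every round, it helps to be able to state its upsert semantics in plain terms: matched keys update, unmatched keys insert, all in one atomic commit. A minimal sketch with toy dicts (not the Delta API):

```python
# target keyed by id, as in: MERGE INTO target USING source ON target.id = source.id
target = {1: {"id": 1, "status": "old"}, 2: {"id": 2, "status": "old"}}
source = [{"id": 2, "status": "active"}, {"id": 3, "status": "new"}]

for row in source:
    if row["id"] in target:
        target[row["id"]].update(row)   # WHEN MATCHED THEN UPDATE
    else:
        target[row["id"]] = dict(row)   # WHEN NOT MATCHED THEN INSERT

print(sorted(target))        # [1, 2, 3]
print(target[2]["status"])   # active
```

In real Delta Lake the whole loop is one transaction: either every update and insert lands, or none do.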

How do you optimize?

Run OPTIMIZE to compact small files, Z-ORDER on frequently filtered columns, partition by low-cardinality columns such as date or region, VACUUM to reclaim storage, and consider Auto Compaction, Liquid Clustering, and Deletion Vectors.
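Z-ORDER's co-location can be illustrated with the bit-interleaving (Morton) curve it is based on. A minimal two-column sketch, assuming small non-negative integer column values:

```python
def morton_key(x, y, bits=8):
    """Interleave the bits of two column values into one sort key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # x contributes even bit positions
        key |= ((y >> i) & 1) << (2 * i + 1)  # y contributes odd bit positions
    return key

# Sorting rows by the interleaved key keeps values close on BOTH columns,
# so a filter on either column can skip more files via min/max statistics.
points = [(0, 0), (1, 1), (7, 7), (0, 1), (1, 0)]
for p in sorted(points, key=lambda p: morton_key(*p)):
    print(p)
```

This is why Z-ORDER suits a handful of high-cardinality filter columns: each added column dilutes how tightly any single column is clustered.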

How does it compare to Iceberg and Hudi?

Delta (Databricks) is strongest with Spark and offers ACID plus Z-ORDER; Iceberg (Netflix) excels at multi-engine access (Trino, Flink) with hidden partitioning; Hudi (Uber) leads on CDC and upserts. All three provide ACID transactions and Time Travel.

Summary

Preparing for a Delta Lake interview means covering ACID transactions, Time Travel, schema evolution, optimization (Z-ORDER, VACUUM), MERGE, CDC, streaming, and how Delta compares to Iceberg and Hudi in and beyond the Databricks ecosystem.
