Technology

Delta Lake Interview Preparation

2026-04-24 · อ. บอม — SiamCafe.net · 9,555 words


| Feature          | Delta Lake              | Apache Iceberg        | Apache Hudi        |
|------------------|-------------------------|-----------------------|--------------------|
| Origin           | Databricks              | Netflix               | Uber               |
| ACID             | Yes                     | Yes                   | Yes                |
| Time Travel      | Yes (version/timestamp) | Yes (snapshot)        | Yes (timeline)     |
| Schema Evolution | Yes                     | Yes (better)          | Yes                |
| Best Engine      | Spark / Databricks      | Spark / Trino / Flink | Spark / Flink      |
| CDC Support      | Good (CDF)              | Good                  | Excellent (native) |
| Partitioning     | Manual + Z-order        | Hidden Partitioning   | Manual             |

Core Concepts

# === Delta Lake Core Concepts ===

from dataclasses import dataclass

# Delta Lake vs Parquet
# Parquet: Just a file format (columnar, compressed)
# Delta Lake: Parquet + Transaction Log + ACID + Time Travel + Schema

# Transaction Log (_delta_log/)
# Each transaction creates a JSON file:
# _delta_log/
# ├── 00000000000000000000.json  # Version 0
# ├── 00000000000000000001.json  # Version 1
# ├── ...
# └── 00000000000000000010.checkpoint.parquet  # Checkpoint every 10 commits (default)

# ACID Properties
# Atomicity: Write succeeds completely or not at all
# Consistency: Schema enforcement prevents bad data
# Isolation: Optimistic concurrency control
# Durability: Data stored in cloud object storage

# Key Operations
# spark.read.format("delta").load("s3://bucket/table")
# df.write.format("delta").mode("overwrite").save("s3://bucket/table")
# spark.sql("MERGE INTO target USING source ON target.id = source.id ...")
# spark.sql("DELETE FROM delta_table WHERE date < '2023-01-01'")
# spark.sql("UPDATE delta_table SET status = 'active' WHERE id = 1")

# Time Travel
# spark.read.format("delta").option("versionAsOf", 5).load(path)
# spark.read.format("delta").option("timestampAsOf", "2024-01-15").load(path)
# spark.sql("SELECT * FROM table VERSION AS OF 5")
# spark.sql("DESCRIBE HISTORY table")

@dataclass
class InterviewQuestion:
    question: str
    key_points: str
    difficulty: str
    follow_up: str

questions = [
    InterviewQuestion("How does Delta Lake differ from Parquet?",
        "Parquet = file format only, Delta = Parquet + Transaction Log + ACID + Time Travel + Schema Enforcement",
        "Basic",
        "How does the Transaction Log work?"),
    InterviewQuestion("Explain ACID Transactions",
        "Atomicity (all or nothing), Consistency (schema enforcement), Isolation (optimistic concurrency), Durability (cloud storage)",
        "Basic",
        "How does Optimistic Concurrency Control resolve two conflicting writers?"),
    InterviewQuestion("What is Time Travel used for?",
        "Audit trail, rollback bad writes, reproduce ML experiments, debug data issues, compare versions",
        "Intermediate",
        "How does VACUUM affect Time Travel?"),
    InterviewQuestion("What is MERGE INTO used for?",
        "Upsert: INSERT new rows + UPDATE existing rows in one atomic operation, used for CDC, SCD Type 2",
        "Intermediate",
        "Write a MERGE statement for SCD Type 2"),
    InterviewQuestion("How do OPTIMIZE and Z-ORDER work?",
        "OPTIMIZE compacts small files → larger files (1GB target), Z-ORDER co-locates related data for faster filtered queries",
        "Intermediate",
        "How do you choose columns for Z-ORDER?"),
]

print("=== Interview Questions ===")
for q in questions:
    print(f"  [{q.difficulty}] Q: {q.question}")
    print(f"    Key Points: {q.key_points}")
    print(f"    Follow-up: {q.follow_up}")
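The _delta_log replay sketched in the comments above can be modeled with plain dictionaries. This is a toy illustration of how versionAsOf works, not the real Delta protocol (action shapes are simplified and the file names are invented):

```python
# Toy commit log: each version holds the JSON actions found in
# _delta_log/<version>.json (simplified to add/remove only).
commits = {
    0: [{"add": {"path": "part-000.parquet"}}],
    1: [{"add": {"path": "part-001.parquet"}}],
    2: [{"remove": {"path": "part-000.parquet"}},
        {"add": {"path": "part-002.parquet"}}],
}

def live_files(as_of_version):
    """Replay the log up to a version -- this is what versionAsOf does."""
    files = set()
    for version in sorted(commits):
        if version > as_of_version:
            break
        for action in commits[version]:
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return sorted(files)

print(live_files(1))  # ['part-000.parquet', 'part-001.parquet']
print(live_files(2))  # ['part-001.parquet', 'part-002.parquet']
```

The same replay explains why VACUUM breaks Time Travel: once removed files are physically deleted, older versions can no longer be reconstructed.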

Optimization Deep Dive

# === Delta Lake Optimization ===

# OPTIMIZE — Compact small files
# OPTIMIZE delta_table
# OPTIMIZE delta_table WHERE date = '2024-01-15'
# OPTIMIZE delta_table ZORDER BY (user_id, date)

# Z-ORDER — Co-locate data for faster queries
# Best for columns used in WHERE clause frequently
# High cardinality columns (user_id, product_id)
# Max 4-5 columns, diminishing returns after that

# VACUUM — Remove old files
# VACUUM delta_table RETAIN 168 HOURS  -- 7 days
# WARNING: After VACUUM, Time Travel before retention is gone

# Partition Strategy
# Good: date, region (low cardinality)
# Bad: user_id (too many partitions = small files)
# Rule: Each partition should have >= 1GB of data

@dataclass
class OptTechnique:
    technique: str
    command: str
    when: str
    impact: str
    caution: str

techniques = [
    OptTechnique("OPTIMIZE", "OPTIMIZE table [WHERE ...]",
        "After many small writes (streaming, frequent batch)",
        "Read performance +50-90% (fewer files to scan)",
        "Costs compute, schedule during off-peak"),
    OptTechnique("Z-ORDER", "OPTIMIZE table ZORDER BY (col1, col2)",
        "Columns frequently in WHERE clause",
        "Filtered query +30-70% faster",
        "Max 4-5 columns, rewrite entire partition"),
    OptTechnique("VACUUM", "VACUUM table RETAIN 168 HOURS",
        "After OPTIMIZE, periodically for storage cleanup",
        "Storage cost reduction, cleanup old versions",
        "Breaks Time Travel before retention period"),
    OptTechnique("Auto Compaction", "SET spark.databricks.delta.autoCompact.enabled=true",
        "Streaming or frequent small writes",
        "Auto merge small files after each write",
        "Adds latency to write operations"),
    OptTechnique("Liquid Clustering", "CREATE TABLE ... CLUSTER BY (col1, col2)",
        "Delta Lake 3.0+, replaces Partition + Z-order",
        "Incremental clustering, no full rewrite",
        "Requires Delta Lake 3.0+ / Databricks Runtime 13.3+"),
]

print("=== Optimization Techniques ===")
for o in techniques:
    print(f"  [{o.technique}] Command: {o.command}")
    print(f"    When: {o.when}")
    print(f"    Impact: {o.impact}")
    print(f"    Caution: {o.caution}")
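OPTIMIZE's behaviour above (many small files compacted toward a ~1 GB target) can be approximated with a greedy first-fit sketch. The file sizes are made up for illustration; real OPTIMIZE also rewrites data, this only plans the grouping:

```python
TARGET = 1024  # target output size in MB (Delta's default is ~1 GB)

def plan_compaction(file_sizes_mb):
    """Greedy first-fit: group small files into bins of at most TARGET MB."""
    bins = []
    for size in sorted(file_sizes_mb, reverse=True):
        for b in bins:
            if sum(b) + size <= TARGET:
                b.append(size)
                break
        else:
            bins.append([size])
    return bins

# 10 small files from streaming writes -> far fewer, larger files
small_files = [120, 80, 300, 64, 512, 256, 90, 48, 700, 32]
plan = plan_compaction(small_files)
print(len(small_files), "files ->", len(plan), "files")
```

Fewer files means fewer object-store requests and less footer/metadata overhead per scan, which is where the read-performance gains in the table above come from.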

Advanced Topics

# === Advanced Interview Topics ===

@dataclass
class AdvancedTopic:
    topic: str
    explanation: str
    use_case: str
    interview_tip: str

advanced = [
    AdvancedTopic("Change Data Feed (CDF)",
        "Track row-level changes (insert, update, delete) as a stream",
        "CDC pipelines, audit trails, incremental downstream processing",
        "Explain how CDF differs from generic Change Data Capture"),
    AdvancedTopic("Deletion Vectors",
        "Soft delete by marking rows as deleted instead of rewriting files",
        "Faster DELETE/MERGE operations, less I/O",
        "Explain how Deletion Vectors reduce write amplification"),
    AdvancedTopic("UniForm",
        "Delta table readable as Iceberg or Hudi without conversion",
        "Multi-engine access (Trino reads Iceberg, Spark reads Delta)",
        "Explain how UniForm solves format lock-in"),
    AdvancedTopic("Structured Streaming + Delta",
        "Delta table as both streaming source and sink",
        "Real-time ETL, exactly-once processing, CDC",
        "Explain how the exactly-once guarantee works"),
    AdvancedTopic("Photon Engine",
        "Native C++ vectorized query engine for Delta Lake on Databricks",
        "Significantly faster queries than stock Spark SQL (vendor benchmarks cite multi-x speedups)",
        "Explain what Photon optimizes"),
]

print("=== Advanced Topics ===")
for a in advanced:
    print(f"  [{a.topic}]")
    print(f"    Explanation: {a.explanation}")
    print(f"    Use Case: {a.use_case}")
    print(f"    Interview Tip: {a.interview_tip}")
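The Deletion Vectors idea listed above (mark rows dead instead of rewriting Parquet files) can be sketched as a per-file bitmap consulted at read time. A toy model with invented data, not the compressed bitmap format Delta actually stores:

```python
# One file's rows; a deletion vector is a per-file set of dead row positions.
rows = ["alice", "bob", "carol", "dave"]
deletion_vector = set()

def delete_where(predicate):
    """DELETE marks matching row positions instead of rewriting the file."""
    for i, row in enumerate(rows):
        if predicate(row):
            deletion_vector.add(i)

def scan():
    """Readers skip positions present in the deletion vector."""
    return [r for i, r in enumerate(rows) if i not in deletion_vector]

delete_where(lambda r: r == "bob")
print(scan())  # ['alice', 'carol', 'dave']
```

Deleting one row out of millions no longer rewrites the whole file, which is the write-amplification reduction the interview tip asks about.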

Tips

Professional Database Management

Good database management starts with a well-designed schema: use normalization to reduce data redundancy, create indexes on frequently queried columns, analyze query plans to optimize performance, and run regular maintenance such as VACUUM for PostgreSQL or OPTIMIZE TABLE for MySQL.

For high availability, set up replication with at least one replica for read scaling and disaster recovery, use connection pooling such as PgBouncer or ProxySQL to reduce the load of concurrently open connections, and configure automated failover so the system switches to a replica when the primary goes down.
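The connection-pooling idea above can be sketched with a bounded queue that hands out a fixed set of connections instead of opening a new one per request. A toy stand-in for PgBouncer/ProxySQL; `FakeConnection` is invented for illustration:

```python
import queue

class FakeConnection:
    """Stand-in for a real DB connection (illustration only)."""
    def __init__(self, conn_id):
        self.conn_id = conn_id

class ConnectionPool:
    """Reuse a fixed set of connections instead of opening new ones."""
    def __init__(self, size):
        self._pool = queue.Queue(maxsize=size)
        for i in range(size):
            self._pool.put(FakeConnection(i))

    def acquire(self, timeout=1.0):
        # Blocks (up to timeout) when all connections are checked out,
        # which is how a pool caps concurrent load on the database.
        return self._pool.get(timeout=timeout)

    def release(self, conn):
        self._pool.put(conn)

pool = ConnectionPool(size=2)
c1 = pool.acquire()
c2 = pool.acquire()
pool.release(c1)  # returned connection is reused, not reopened
c3 = pool.acquire()
print(c3.conn_id == c1.conn_id)  # True
```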

Backups should include daily full backups plus incremental backups every 1-4 hours; keep binary logs or WAL for point-in-time recovery, test restores regularly, and always keep an off-site copy.

What is Delta Lake?

Delta Lake is an open-source storage layer that adds ACID transactions to Parquet files via a transaction log, along with Time Travel, schema enforcement and evolution, and MERGE/UPDATE/DELETE support. Created by Databricks, it serves both batch and streaming workloads.

What are the most common interview questions?

Delta vs Parquet, how ACID and the transaction log work, Time Travel, schema enforcement, MERGE and CDC patterns, OPTIMIZE / Z-ORDER / VACUUM, Delta vs Iceberg vs Hudi, and streaming integration.
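Since MERGE comes up in almost every round, it helps to be able to state its upsert semantics in plain terms: matched keys update, unmatched keys insert, all in one atomic commit. A minimal sketch with toy dicts (not the Delta API):

```python
# target keyed by id, as in: MERGE INTO target USING source ON target.id = source.id
target = {1: {"id": 1, "status": "old"}, 2: {"id": 2, "status": "old"}}
source = [{"id": 2, "status": "active"}, {"id": 3, "status": "new"}]

for row in source:
    if row["id"] in target:
        target[row["id"]].update(row)   # WHEN MATCHED THEN UPDATE
    else:
        target[row["id"]] = dict(row)   # WHEN NOT MATCHED THEN INSERT

print(sorted(target))        # [1, 2, 3]
print(target[2]["status"])   # active
```

In real Delta Lake the whole loop is one transaction: either every update and insert lands, or none do.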

How do you optimize?

Run OPTIMIZE to compact small files, Z-ORDER on frequently filtered columns, partition by low-cardinality columns such as date or region, VACUUM to reclaim storage, and consider Auto Compaction, Liquid Clustering, and Deletion Vectors.
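Z-ORDER's co-location can be illustrated with the bit-interleaving (Morton) curve it is based on. A minimal two-column sketch, assuming small non-negative integer column values:

```python
def morton_key(x, y, bits=8):
    """Interleave the bits of two column values into one sort key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # x contributes even bit positions
        key |= ((y >> i) & 1) << (2 * i + 1)  # y contributes odd bit positions
    return key

# Sorting rows by the interleaved key keeps values close on BOTH columns,
# so a filter on either column can skip more files via min/max statistics.
points = [(0, 0), (1, 1), (7, 7), (0, 1), (1, 0)]
for p in sorted(points, key=lambda p: morton_key(*p)):
    print(p)
```

This is why Z-ORDER suits a handful of high-cardinality filter columns: each added column dilutes how tightly any single column is clustered.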

How does it compare to Iceberg and Hudi?

Delta (Databricks) is strongest with Spark and offers ACID plus Z-ORDER; Iceberg (Netflix) excels at multi-engine access (Trino, Flink) with hidden partitioning; Hudi (Uber) leads on CDC and upserts. All three provide ACID transactions and Time Travel.

Summary

Preparing for a Delta Lake interview means covering ACID transactions, Time Travel, schema evolution, optimization (Z-ORDER, VACUUM), MERGE, CDC, streaming, and how Delta compares to Iceberg and Hudi in and beyond the Databricks ecosystem.
