Delta Lake Interview
| Feature | Delta Lake | Apache Iceberg | Apache Hudi |
|---|---|---|---|
| Origin | Databricks | Netflix | Uber |
| ACID | Yes | Yes | Yes |
| Time Travel | Yes (version/timestamp) | Yes (snapshot) | Yes (timeline) |
| Schema Evolution | Yes | Yes (most flexible: metadata-only add/drop/rename/reorder) | Yes |
| Best Engine | Spark / Databricks | Spark / Trino / Flink | Spark / Flink |
| CDC Support | Good (CDF) | Good | Excellent (native) |
| Partitioning | Manual + Z-order | Hidden Partitioning | Manual |
Core Concepts
# === Delta Lake Core Concepts ===
from dataclasses import dataclass
# Delta Lake vs Parquet
# Parquet: Just a file format (columnar, compressed)
# Delta Lake: Parquet + Transaction Log + ACID + Time Travel + Schema
# Transaction Log (_delta_log/)
# Each transaction creates a JSON file:
# _delta_log/
# ├── 00000000000000000000.json # Version 0
# ├── 00000000000000000001.json # Version 1
# ├── ...
# └── 00000000000000000010.checkpoint.parquet # Checkpoint every 10 commits
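The table's current state is what you get by replaying these log entries in order. A toy sketch in plain Python (hypothetical file names, simplified action format, not the real JSON schema) of how `add`/`remove` actions resolve to the live file set:

```python
# Each commit is a list of actions: "add" registers a data file,
# "remove" tombstones one. Replaying all commits in order yields
# the set of files that make up the current table version.
commits = [
    [{"add": {"path": "part-0001.parquet"}}],    # version 0
    [{"add": {"path": "part-0002.parquet"}}],    # version 1
    [{"remove": {"path": "part-0001.parquet"}},  # version 2: compaction
     {"add": {"path": "part-0003.parquet"}}],
]

def live_files(commits):
    files = set()
    for actions in commits:
        for action in actions:
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files

# part-0001 was removed in version 2, so only 0002 and 0003 remain
print(sorted(live_files(commits)))
```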
# ACID Properties
# Atomicity: Write succeeds completely or not at all
# Consistency: Schema enforcement prevents bad data
# Isolation: Optimistic concurrency control
# Durability: Data stored in cloud object storage
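The isolation property can be pictured with a toy model of optimistic concurrency control (not the real Delta protocol, which also does logical conflict resolution): each writer remembers the table version it read, and a commit only succeeds if nobody else committed in between.

```python
# Toy optimistic concurrency: commit succeeds only if the log has
# not advanced past the version the writer originally read.
class DeltaLog:
    def __init__(self):
        self.version = 0

    def try_commit(self, read_version):
        if read_version == self.version:  # no concurrent commit landed
            self.version += 1
            return True
        return False                      # conflict: caller must retry

log = DeltaLog()
a_read = log.version                 # writer A reads at version 0
b_read = log.version                 # writer B reads at version 0

assert log.try_commit(a_read)        # A commits -> version 1
assert not log.try_commit(b_read)    # B conflicts: version moved
b_read = log.version                 # B re-reads at version 1
assert log.try_commit(b_read)        # retry succeeds -> version 2
print(log.version)
```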
# Key Operations
# spark.read.format("delta").load("s3://bucket/table")
# df.write.format("delta").mode("overwrite").save("s3://bucket/table")
# spark.sql("MERGE INTO target USING source ON target.id = source.id ...")
# spark.sql("DELETE FROM delta_table WHERE date < '2023-01-01'")
# spark.sql("UPDATE delta_table SET status = 'active' WHERE id = 1")
# Time Travel
# spark.read.format("delta").option("versionAsOf", 5).load(path)
# spark.read.format("delta").option("timestampAsOf", "2024-01-15").load(path)
# spark.sql("SELECT * FROM table VERSION AS OF 5")
# spark.sql("DESCRIBE HISTORY table")
@dataclass
class InterviewQuestion:
    question: str
    key_points: str
    difficulty: str
    follow_up: str
questions = [
    InterviewQuestion("How does Delta Lake differ from Parquet?",
                      "Parquet = file format only; Delta = Parquet + transaction log + ACID + Time Travel + schema enforcement",
                      "Basic",
                      "How does the transaction log work?"),
    InterviewQuestion("Explain ACID transactions",
                      "Atomicity (all or nothing), Consistency (schema enforcement), Isolation (optimistic concurrency), Durability (cloud storage)",
                      "Basic",
                      "How does optimistic concurrency control behave when two writers conflict?"),
    InterviewQuestion("What is Time Travel used for?",
                      "Audit trail, rollback bad writes, reproduce ML experiments, debug data issues, compare versions",
                      "Intermediate",
                      "How does VACUUM affect Time Travel?"),
    InterviewQuestion("What is MERGE INTO used for?",
                      "Upsert: INSERT new rows + UPDATE existing rows in one atomic operation; used for CDC, SCD Type 2",
                      "Intermediate",
                      "Write a MERGE statement for SCD Type 2"),
    InterviewQuestion("How do OPTIMIZE and Z-order work?",
                      "OPTIMIZE compacts small files into larger ones (~1GB target); Z-ORDER co-locates related data for faster filtered queries",
                      "Intermediate",
                      "How do you choose columns for Z-order?"),
]
print("=== Interview Questions ===")
for q in questions:
    print(f"  [{q.difficulty}] Q: {q.question}")
    print(f"    Key Points: {q.key_points}")
    print(f"    Follow-up: {q.follow_up}")
Optimization Deep Dive
# === Delta Lake Optimization ===
# OPTIMIZE — Compact small files
# OPTIMIZE delta_table
# OPTIMIZE delta_table WHERE date = '2024-01-15'
# OPTIMIZE delta_table ZORDER BY (user_id, date)
# Z-ORDER — Co-locate data for faster queries
# Best for columns used in WHERE clause frequently
# High cardinality columns (user_id, product_id)
# Max 4-5 columns, diminishing returns after that
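Conceptually, Z-ordering interleaves the bits of the clustering columns into a single sort key (a Morton/Z-value), so rows that are close in any of those columns end up near each other in the file layout. A small illustration for two integer columns:

```python
# Morton (Z-order) code for two integer columns: interleave the
# bits of x and y so that sorting by the code co-locates rows
# that are close in either dimension.
def morton2(x, y, bits=16):
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)       # x bit -> even position
        code |= ((y >> i) & 1) << (2 * i + 1)   # y bit -> odd position
    return code

# Nearby (x, y) points get nearby codes, so a filter on either
# column touches a narrow range of data files.
points = [(2, 2), (1, 0), (0, 1), (1, 1), (0, 0)]
print(sorted(points, key=lambda p: morton2(*p)))
```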
# VACUUM — Remove old files
# VACUUM delta_table RETAIN 168 HOURS -- 7 days
# WARNING: After VACUUM, Time Travel before retention is gone
# Partition Strategy
# Good: date, region (low cardinality)
# Bad: user_id (too many partitions = small files)
# Rule: Each partition should have >= 1GB of data
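The >= 1 GB rule is easy to sanity-check with back-of-the-envelope arithmetic (the table size and cardinalities below are made-up numbers):

```python
# Estimate GB per partition for a hypothetical 500 GB table under
# candidate partition columns: low cardinality passes the rule,
# user_id collapses into millions of tiny files.
table_gb = 500
for column, cardinality in [("date", 365), ("region", 20), ("user_id", 10_000_000)]:
    gb_per_partition = table_gb / cardinality
    verdict = "OK" if gb_per_partition >= 1 else "too small"
    print(f"{column:>8}: ~{gb_per_partition:.5f} GB/partition -> {verdict}")
```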
@dataclass
class OptTechnique:
    technique: str
    command: str
    when: str
    impact: str
    caution: str
techniques = [
    OptTechnique("OPTIMIZE", "OPTIMIZE table [WHERE ...]",
                 "After many small writes (streaming, frequent batch)",
                 "Read performance +50-90% (fewer files to scan)",
                 "Costs compute; schedule during off-peak hours"),
    OptTechnique("Z-ORDER", "OPTIMIZE table ZORDER BY (col1, col2)",
                 "Columns frequently used in WHERE clauses",
                 "Filtered queries 30-70% faster",
                 "Max 4-5 columns; rewrites the files it touches"),
    OptTechnique("VACUUM", "VACUUM table RETAIN 168 HOURS",
                 "After OPTIMIZE, and periodically for storage cleanup",
                 "Reduces storage cost by removing old file versions",
                 "Breaks Time Travel beyond the retention period"),
    OptTechnique("Auto Compaction", "SET spark.databricks.delta.autoCompact.enabled = true",
                 "Streaming or frequent small writes",
                 "Automatically merges small files after each write",
                 "Adds latency to write operations"),
    OptTechnique("Liquid Clustering", "CREATE TABLE ... CLUSTER BY (col1, col2)",
                 "Delta Lake 3.0+; replaces partitioning + Z-order",
                 "Incremental clustering, no full rewrite",
                 "Requires Delta Lake 3.0+ / Databricks Runtime 13.3+"),
]
print("=== Optimization Techniques ===")
for o in techniques:
    print(f"  [{o.technique}] Command: {o.command}")
    print(f"    When: {o.when}")
    print(f"    Impact: {o.impact}")
    print(f"    Caution: {o.caution}")
Advanced Topics
# === Advanced Interview Topics ===
@dataclass
class AdvancedTopic:
    topic: str
    explanation: str
    use_case: str
    interview_tip: str
advanced = [
    AdvancedTopic("Change Data Feed (CDF)",
                  "Track row-level changes (insert, update, delete) as a stream",
                  "CDC pipelines, audit trails, incremental downstream processing",
                  "Explain how CDF differs from generic Change Data Capture"),
    AdvancedTopic("Deletion Vectors",
                  "Soft delete by marking rows as deleted instead of rewriting files",
                  "Faster DELETE/MERGE operations, less I/O",
                  "Explain how Deletion Vectors reduce write amplification"),
    AdvancedTopic("UniForm",
                  "Delta table readable as Iceberg and Hudi without conversion",
                  "Multi-engine access (Trino reads Iceberg, Spark reads Delta)",
                  "Explain how UniForm solves the format lock-in problem"),
    AdvancedTopic("Structured Streaming + Delta",
                  "Delta table as both streaming source and sink",
                  "Real-time ETL, exactly-once processing, CDC",
                  "Explain how the exactly-once guarantee works"),
    AdvancedTopic("Photon Engine",
                  "Native C++ vectorized query engine for Delta Lake on Databricks",
                  "Significantly faster SQL/DataFrame queries than standard Spark",
                  "Explain what Photon optimizes"),
]
print("=== Advanced Topics ===")
for a in advanced:
    print(f"  [{a.topic}]")
    print(f"    Explanation: {a.explanation}")
    print(f"    Use Case: {a.use_case}")
    print(f"    Interview Tip: {a.interview_tip}")
Tips
- Hands-on: try Delta Lake for real on the free Databricks Community Edition
- Compare: be ready to compare Delta vs Iceberg vs Hudi; it comes up very often
- MERGE: practice writing MERGE statements; asked in almost every interview
- Optimize: be able to explain OPTIMIZE, Z-ORDER, and VACUUM clearly
- Use Case: prepare an example from your own real-world experience with Delta Lake
Professional Database Management
Good database management starts with an appropriate schema design: apply normalization to reduce data redundancy, create indexes on frequently queried columns, analyze query plans to optimize performance, and run regular maintenance such as VACUUM on PostgreSQL or OPTIMIZE TABLE on MySQL.
For high availability, set up replication with at least one replica for read scaling and disaster recovery, use connection pooling (e.g., PgBouncer or ProxySQL) to reduce the load of concurrently open connections, and configure automated failover so the system switches to a replica when the primary goes down.
Backups should combine daily full backups with incremental backups every 1-4 hours; retain binary logs or WAL for point-in-time recovery, test restores regularly, and always keep an off-site copy.
What is Delta Lake?
An open-source storage layer that adds ACID transactions on top of Parquet files via a transaction log, with Time Travel, schema enforcement and evolution, and MERGE/UPDATE/DELETE support. Originated at Databricks, and works for both batch and streaming workloads.
What are the most common interview questions?
Delta vs Parquet, ACID and the transaction log, Time Travel, schema enforcement and evolution, MERGE and CDC, OPTIMIZE/Z-order/VACUUM, Delta vs Iceberg vs Hudi, and Structured Streaming integration.
How do you optimize?
Run OPTIMIZE to compact small files, Z-ORDER on frequently filtered columns, partition by low-cardinality columns such as date or region, VACUUM to reclaim storage, and consider Auto Compaction, Liquid Clustering, and Deletion Vectors.
How does it compare with Iceberg and Hudi?
Delta (Databricks) is strongest on Spark and offers Z-order; Iceberg (Netflix) is multi-engine (Trino, Flink) with hidden partitioning; Hudi (Uber) excels at CDC and upserts. All three provide ACID transactions and Time Travel.
Summary
Preparing for a Delta Lake interview comes down to ACID transactions, Time Travel, schema evolution, optimization (OPTIMIZE, Z-order, VACUUM), MERGE, CDC, streaming, and the comparison with Iceberg and Hudi in the Databricks ecosystem.
