Parquet Format Multi-tenant Design

Q: What is Parquet?

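Apache Parquet is a columnar storage format for big data. It stores data by column rather than by row, so a query reads only the columns it needs. For analytical queries this is often cited as 10-100x faster than CSV or JSON, although the actual gain depends on the data and the query.

Features
Column-oriented: reads only the columns a query references.
Compression: typically 5-20x, which saves storage.
Predicate Pushdown: filters data before reading it, which reduces I/O.
Schema Evolution: new columns can be added without affecting existing data.
Nested Data: supports Struct, Array, and Map types.
Metadata: stores statistics (min, max, count) per column for each Row Group.
Engine support: Spark, Pandas, DuckDB, Trino, Athena, BigQuery.

Comparison
vs CSV: CSV has no schema and no compression, and is slow to scan.
vs JSON: JSON supports nesting but has no column pruning, and files are large.
vs ORC: similar design; Parquet is more widely used and has the broader ecosystem.
vs Avro: Avro is row-based and suits write-heavy workloads; Parquet suits reads and analytics.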
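The column-oriented idea can be illustrated without any Parquet library. The sketch below is a toy model in plain Python, not the Parquet API: the sample rows and the helper names (`to_columns`, `values_read_row_layout`, `values_read_column_layout`) are made up for illustration. It only counts how many values each layout has to touch to sum one column.

```python
# Toy model of row-oriented vs column-oriented layout (not the Parquet API).
rows = [
    {"order_id": 1, "tenant_id": "A", "amount": 1200, "note": "first order"},
    {"order_id": 2, "tenant_id": "A", "amount": 300, "note": "repeat"},
    {"order_id": 3, "tenant_id": "B", "amount": 4500, "note": "bulk"},
]

def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns

def values_read_row_layout(records):
    """A row layout has to touch every field of every row."""
    return sum(len(record) for record in records)

def values_read_column_layout(columns, wanted):
    """A column layout touches only the requested columns."""
    return sum(len(columns[name]) for name in wanted)

columns = to_columns(rows)
row_cost = values_read_row_layout(rows)
col_cost = values_read_column_layout(columns, ["amount"])
total_amount = sum(columns["amount"])
print(row_cost, col_cost, total_amount)
```

With four columns and one of them requested, the column layout touches a quarter of the values. A real Parquet reader gets the same effect by reading only the requested column chunks from each Row Group.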

Q: How do you design for multi-tenancy?

Separate Path (tenant_id as the top-level partition)
Path: s3://data-lake/{tenant_id}/orders/year=2025/month=01/
Pros: each tenant is clearly separated; access control is simple because an S3 bucket policy can be scoped per tenant; a query scans only the tenant it needs.

Shared Table + Partition Column
All tenants live in one table, partitioned by tenant_id + date.
Path: s3://data-lake/orders/tenant_id=A/year=2025/
Pros: cross-tenant queries are possible; one schema for everyone.
Cons: access control is more complex.

Catalog per Tenant
Use a separate Unity Catalog or Glue Catalog per tenant, so each tenant has its own database/schema.
Pros: the strongest isolation; access control is explicit.
Cons: many catalogs to manage.

Choose according to the security requirement: if isolation matters most, use Separate Path; if cross-tenant analytics matters most, use Shared Table.
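The two path layouts differ only in where the tenant appears. A minimal sketch, assuming the `s3://data-lake` bucket from the examples; the helper names `separate_path`, `shared_table_path`, and `prune` are hypothetical, and `prune` only imitates what a query engine does when it matches `key=value` path segments against a filter.

```python
def separate_path(tenant_id, table, year, month):
    """Layout 1: the tenant is the top-level prefix."""
    return f"s3://data-lake/{tenant_id}/{table}/year={year}/month={month:02d}/"

def shared_table_path(tenant_id, table, year, month):
    """Layout 2: one table, tenant_id is a Hive-style partition column."""
    return f"s3://data-lake/{table}/tenant_id={tenant_id}/year={year}/month={month:02d}/"

def prune(paths, **filters):
    """Keep only paths whose key=value segments match every filter."""
    wanted = {f"{key}={value}" for key, value in filters.items()}
    return [p for p in paths if wanted <= set(p.strip("/").split("/"))]

# Six partitions: three tenants, two months each.
paths = [
    shared_table_path(tenant, "orders", 2025, month)
    for tenant in ("A", "B", "C")
    for month in (1, 2)
]
selected = prune(paths, tenant_id="A", month="01")
print(selected)
```

Only one of the six partitions survives the filter, which is the effect partition pruning has on a shared table. With the separate-path layout the same narrowing happens simply by pointing the reader at the tenant prefix.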

Q: How do you implement access control?

S3 Level
A bucket policy restricts each tenant prefix to that tenant's IAM role.
s3://data-lake/tenant-a/ → role tenant-a-role only
s3://data-lake/tenant-b/ → role tenant-b-role only

Catalog Level
Unity Catalog: GRANT SELECT ON TABLE orders TO tenant_a_group. GRANT takes no WHERE clause, so restricting rows to tenant_id = 'A' is done with a row filter function attached through ALTER TABLE orders SET ROW FILTER.
Glue Catalog: resource policy plus AWS Lake Formation, which provides column-level and row-level security.

Query Engine Level
Trino/Presto: row-level security is configured as a row filter in the engine's access control rules (in Trino, a table rule can carry a filter expression such as tenant_id = 'A').
Spark: apply a row filter in the job, for example spark.sql("SELECT * FROM orders").filter(col("tenant_id") == current_tenant). This is an application-level filter and only holds if every job applies it.

Encryption
Encrypt data per tenant with one KMS key per tenant. With SSE-KMS each tenant has its own key. Use client-side encryption for sensitive data.

Audit Logging
Record every access and query per tenant using CloudTrail (AWS) or audit logs (Databricks), and check them for unauthorized access.
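The per-tenant S3 rules are repetitive enough to generate. A minimal sketch, assuming roles are named `<tenant>-role` as in the examples above; the function names and the account ID `123456789012` are placeholders, not values from a real environment. Object reads and bucket listing are separate statements because the `s3:prefix` condition key applies to `s3:ListBucket` requests only.

```python
import json

def tenant_statements(bucket, tenant, account_id):
    """Two Allow statements for one tenant: object reads and prefixed listing."""
    role_arn = f"arn:aws:iam::{account_id}:role/{tenant}-role"
    return [
        {   # Object reads are limited to the tenant prefix.
            "Effect": "Allow",
            "Principal": {"AWS": role_arn},
            "Action": "s3:GetObject",
            "Resource": f"arn:aws:s3:::{bucket}/{tenant}/*",
        },
        {   # Listing is a bucket-level action, so it is restricted by prefix.
            "Effect": "Allow",
            "Principal": {"AWS": role_arn},
            "Action": "s3:ListBucket",
            "Resource": f"arn:aws:s3:::{bucket}",
            "Condition": {"StringLike": {"s3:prefix": [f"{tenant}/*"]}},
        },
    ]

def bucket_policy(bucket, tenants, account_id):
    """Assemble one bucket policy document covering every tenant."""
    statements = []
    for tenant in tenants:
        statements.extend(tenant_statements(bucket, tenant, account_id))
    return {"Version": "2012-10-17", "Statement": statements}

# "123456789012" is a placeholder account ID.
policy = bucket_policy("data-lake", ["tenant-a", "tenant-b"], "123456789012")
print(json.dumps(policy, indent=2))
```

Allow statements alone do not stop other principals that hold their own IAM permissions on the bucket, so a setup that must guarantee "this role only" would likely add explicit Deny statements as well.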

Q: How do you optimize queries?

Partition Pruning: partition by tenant_id + date. A query with WHERE tenant_id='A' AND date='2025-01-15' scans only the matching partition, which can cut I/O by 90% or more.
Predicate Pushdown: Parquet stores min/max per Row Group. A query with WHERE amount > 1000 skips every Row Group whose max is 1000 or less.
Column Pruning: SELECT name, amount on a 100-column table reads only 2 columns, roughly a 98% I/O reduction when columns are of similar size.
File Size: aim for 128MB-1GB per file. Files that are too small mean too many files and high metadata overhead; files that are too large parallelize poorly.
Compaction: merge small files into larger ones. Run a compaction job daily or weekly.
Z-ordering: cluster data by the columns that are filtered most often, for example OPTIMIZE orders ZORDER BY (tenant_id, date, customer_id), to reduce the data scanned by multi-column filters. Note that in Delta Lake a column already used for partitioning cannot also be a Z-order column.
Caching: cache hot data in memory or on local disk (Alluxio, Delta Cache) and cache metadata to reduce catalog queries.
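Row Group skipping can be shown with a few lines of plain Python. This is a toy model, not a Parquet reader: `build_row_groups` and `scan_greater_than` are hypothetical helpers, and the amounts are made-up sample values. It records min/max per group and skips any group whose max cannot satisfy `amount > threshold`.

```python
def build_row_groups(values, size):
    """Split values into fixed-size groups and record min/max per group."""
    groups = []
    for start in range(0, len(values), size):
        chunk = values[start:start + size]
        groups.append({"min": min(chunk), "max": max(chunk), "values": chunk})
    return groups

def scan_greater_than(groups, threshold):
    """Return matching values and the number of groups actually read."""
    matches, groups_read = [], 0
    for group in groups:
        if group["max"] <= threshold:
            continue  # no value in this group can exceed the threshold
        groups_read += 1
        matches.extend(v for v in group["values"] if v > threshold)
    return matches, groups_read

amounts = [120, 300, 950, 1000, 1500, 2000, 40, 80, 990, 5000, 700, 1001]
groups = build_row_groups(amounts, size=4)
matches, groups_read = scan_greater_than(groups, 1000)
print(matches, groups_read)
```

The first group has a max of exactly 1000, so it is skipped for `amount > 1000`. How many groups can be skipped in practice depends on how the data is sorted: if large and small amounts are mixed evenly across groups, the statistics rule out very little.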

Parquet Multi-tenant Design

Design	Isolation	Cross-tenant Query	Access Control	Complexity
Separate Path	High	Hard (Union)	S3 Policy (simple)	Low
Shared Table + Partition	Medium	Easy (Filter)	Row-level Security	Medium
Catalog per Tenant	Highest	Hard (Cross-catalog)	Catalog GRANT	High

Partition & File Layout

# === Multi-tenant Parquet Layout ===

# S3 Path Structure
# Separate Path per Tenant:
# s3://data-lake/
# ├── tenant-a/
# │   ├── orders/year=2025/month=01/part-00000.parquet
# │   ├── customers/year=2025/part-00000.parquet
# │   └── products/part-00000.parquet
# ├── tenant-b/
# │   ├── orders/year=2025/month=01/part-00000.parquet
# │   └── ...
#
# Shared Table with Partition:
# s3://data-lake/orders/
# ├── tenant_id=A/year=2025/month=01/part-00000.parquet
# ├── tenant_id=B/year=2025/month=01/part-00000.parquet
# └── tenant_id=C/year=2025/month=01/part-00000.parquet

from dataclasses import dataclass

@dataclass
class PartitionStrategy:
    strategy: str
    path_pattern: str
    partition_columns: str
    file_size: str
    compaction: str

strategies = [
    PartitionStrategy("Time-based (Daily)",
        "tenant_id={tid}/year={y}/month={m}/day={d}/",
        "tenant_id, year, month, day",
        "128MB-256MB per file",
        "Daily: merge small files per partition"),
    PartitionStrategy("Time-based (Monthly)",
        "tenant_id={tid}/year={y}/month={m}/",
        "tenant_id, year, month",
        "256MB-1GB per file",
        "Weekly: merge files per partition"),
    PartitionStrategy("Bucketed (High Cardinality)",
        "tenant_id={tid}/bucket={hash(id) % 100}/",
        "tenant_id, bucket",
        "256MB-512MB per file",
        "Monthly: rebalance buckets"),
]

print("=== Partition Strategies ===")
for s in strategies:
    print(f"  [{s.strategy}]")
    print(f"    Path: {s.path_pattern}")
    print(f"    Columns: {s.partition_columns}")
    print(f"    File Size: {s.file_size}")
    print(f"    Compaction: {s.compaction}")
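The bucketed pattern above writes the bucket as `hash(id) % 100`. If that is implemented in Python, the built-in `hash()` is a poor fit for string IDs because it is salted per process, so the same ID could land in different buckets on different runs. A minimal sketch using CRC32 instead; `bucket_for`, `bucketed_path`, and the sample ID are hypothetical, and any stable hash would serve equally well.

```python
import zlib

NUM_BUCKETS = 100  # matches the "% 100" in the bucketed path pattern

def bucket_for(record_id: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Stable bucket number: CRC32 returns the same value in every process."""
    return zlib.crc32(record_id.encode("utf-8")) % num_buckets

def bucketed_path(tenant_id: str, record_id: str) -> str:
    """Relative partition path for one record under the bucketed strategy."""
    return f"tenant_id={tenant_id}/bucket={bucket_for(record_id)}/"

path = bucketed_path("A", "customer-0001")
print(path)
```

Writers and readers must agree on both the hash function and the bucket count; changing either one means rewriting the existing buckets.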

Access Control

# === Multi-tenant Access Control ===

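# AWS S3 Bucket Policy (per Tenant); <account-id> is a placeholder
# GetObject and ListBucket are separate statements because the
# s3:prefix condition key only applies to ListBucket requests.
# {
#   "Version": "2012-10-17",
#   "Statement": [
#     {
#       "Effect": "Allow",
#       "Principal": {"AWS": "arn:aws:iam::<account-id>:role/tenant-a-role"},
#       "Action": "s3:GetObject",
#       "Resource": "arn:aws:s3:::data-lake/tenant-a/*"
#     },
#     {
#       "Effect": "Allow",
#       "Principal": {"AWS": "arn:aws:iam::<account-id>:role/tenant-a-role"},
#       "Action": "s3:ListBucket",
#       "Resource": "arn:aws:s3:::data-lake",
#       "Condition": {"StringLike": {"s3:prefix": ["tenant-a/*"]}}
#     }
#   ]
# }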

# Unity Catalog Row-level Security
# current_user_tenant() is a placeholder for a user-defined lookup, not a built-in
# CREATE FUNCTION tenant_filter(tenant STRING)
#   RETURN IF(is_member('admin_group'), true, tenant = current_user_tenant());
# ALTER TABLE orders SET ROW FILTER tenant_filter ON (tenant_id);

# Lake Formation (AWS)
# Grant SELECT on database tenant_a_db to role tenant-a-role
# Grant DataCellsFilter on table orders where tenant_id='A'

@dataclass
class AccessLayer:
    layer: str
    mechanism: str
    granularity: str
    example: str

access_layers = [
    AccessLayer("Storage (S3/GCS)",
        "Bucket Policy + IAM Role",
        "Path-level (tenant prefix)",
        "tenant-a-role → s3://lake/tenant-a/* only"),
    AccessLayer("Catalog (Unity/Glue)",
        "GRANT + Row Filter",
        "Table/Column/Row level",
        "GRANT SELECT ON TABLE orders + SET ROW FILTER on tenant_id"),
    AccessLayer("Query Engine (Trino)",
        "Row filter in access control rules",
        "Row-level per query",
        "table rule with filter: tenant_id = 'A'"),
    AccessLayer("Encryption (KMS)",
        "Per-tenant KMS Key",
        "File-level encryption",
        "SSE-KMS key-a for tenant-a files"),
    AccessLayer("Audit (CloudTrail)",
        "Access Logging per Tenant",
        "Every S3 GetObject logged",
        "Alert unauthorized cross-tenant access"),
]

print("=== Access Control Layers ===")
for a in access_layers:
    print(f"  [{a.layer}] {a.mechanism}")
    print(f"    Granularity: {a.granularity}")
    print(f"    Example: {a.example}")
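The audit layer can be sketched as a simple check over access records. The record shape below (`role`, `key`) is a simplified stand-in and not the actual CloudTrail event schema. The helpers assume the conventions used in the examples above: roles named `<tenant>-role` and object keys that start with the tenant prefix.

```python
def tenant_of_role(role_name):
    """Assumes roles are named '<tenant>-role'; returns None otherwise."""
    suffix = "-role"
    if role_name.endswith(suffix):
        return role_name[:-len(suffix)]
    return None

def tenant_of_key(object_key):
    """Assumes the first path segment of the key is the tenant prefix."""
    return object_key.split("/", 1)[0]

def cross_tenant_events(events):
    """Flag every access where the caller's tenant differs from the data's tenant."""
    return [
        event for event in events
        if tenant_of_role(event["role"]) != tenant_of_key(event["key"])
    ]

events = [
    {"role": "tenant-a-role", "key": "tenant-a/orders/year=2025/month=01/part-00000.parquet"},
    {"role": "tenant-b-role", "key": "tenant-a/orders/year=2025/month=01/part-00000.parquet"},
    {"role": "tenant-b-role", "key": "tenant-b/customers/year=2025/part-00000.parquet"},
]
flagged = cross_tenant_events(events)
for event in flagged:
    print("cross-tenant access:", event["role"], "->", event["key"])
```

A role that does not follow the naming convention maps to `None` and is therefore flagged as well, which is usually the safer default for an alerting job.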

Query Optimization

# === Parquet Query Optimization ===

@dataclass
class OptTechnique:
    technique: str
    how: str
    io_reduction: str
    when_to_use: str

optimizations = [
    OptTechnique("Partition Pruning",
        "WHERE tenant_id='A' AND date='2025-01-15' → scan 1 partition only",
        "90-99% I/O reduction",
        "Any query that filters on a partition column"),
    OptTechnique("Column Pruning",
        "SELECT col1, col2 FROM table (100 columns) → read 2 columns",
        "95-99% I/O reduction",
        "Any query that does not need every column (avoid SELECT *)"),
    OptTechnique("Predicate Pushdown",
        "WHERE amount > 1000 → skip Row Groups where max(amount) <= 1000",
        "30-80% I/O reduction",
        "Queries that filter on a non-partition column"),
    OptTechnique("Z-ordering",
        "OPTIMIZE table ZORDER BY (tenant_id, customer_id)",
        "50-90% I/O reduction for multi-column filter",
        "Queries that filter on several columns at once"),
    OptTechnique("Compaction",
        "Merge small files → optimal 128MB-1GB files",
        "50-80% metadata overhead reduction",
        "After streaming ingestion that produces small files"),
    OptTechnique("Caching",
        "Cache hot data in memory or on local disk (Alluxio, Delta Cache)",
        "10-100x faster repeated queries",
        "Dashboard queries that are rerun frequently"),
]

print("=== Query Optimization ===")
for o in optimizations:
    print(f"  [{o.technique}]")
    print(f"    How: {o.how}")
    print(f"    I/O: {o.io_reduction}")
    print(f"    When: {o.when_to_use}")
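The idea behind Z-ordering can be shown with a Morton code, which interleaves the bits of two integer keys so that rows close in both dimensions sort near each other. This is a sketch of the concept only; how an engine implements OPTIMIZE ... ZORDER BY internally is not claimed here. The function name `interleave_bits` and the 4x4 grid are illustrative, with `x` and `y` standing in for two filter columns mapped to small integers.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Morton (Z-order) value: bit i of x goes to position 2i, bit i of y to 2i+1."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

# A 4x4 grid of (x, y) keys, sorted along the Z-order curve.
points = [(x, y) for x in range(4) for y in range(4)]
z_sorted = sorted(points, key=lambda p: interleave_bits(p[0], p[1]))

# Pretend each "file" holds 4 consecutive rows and look at its min/max per column.
files = [z_sorted[i:i + 4] for i in range(0, len(z_sorted), 4)]
for rows in files:
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    print(f"x in [{min(xs)}, {max(xs)}], y in [{min(ys)}, {max(ys)}]")
```

Each group of four rows covers a 2x2 block, so the min/max range is narrow for both columns at once. Sorting by `x` alone would give narrow ranges for `x` but the full range of `y` in every file, which is why a plain sort helps only the first filter column.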

Tips

Partition: always use tenant_id as the top-level partition.
File Size: keep files at 128MB-1GB with compaction.
Column Pruning: avoid SELECT *; select only the columns you need.
Encryption: use a separate KMS key per tenant for compliance.
Audit: enable access logging and check for cross-tenant access.
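The file-size tip implies some planning step that decides which small files to merge. A minimal sketch of a greedy planner, assuming a 256MB target (an arbitrary choice inside the 128MB-1GB range) and made-up file names and sizes. It only groups file names; the actual rewrite would be done by the processing engine.

```python
MB = 1024 * 1024
TARGET_BYTES = 256 * MB  # assumed target, inside the 128MB-1GB range

def plan_compaction(file_sizes, target=TARGET_BYTES):
    """Greedy grouping: fill each output file up to the target size."""
    batches, current, current_size = [], [], 0
    for name, size in sorted(file_sizes.items()):
        if current and current_size + size > target:
            batches.append(current)  # close the batch before it overflows
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches

# Ten 40MB files, as a streaming job might leave behind in one partition.
small_files = {f"part-{i:05d}.parquet": 40 * MB for i in range(10)}
batches = plan_compaction(small_files)
for batch in batches:
    print(len(batch), "files ->", sum(small_files[n] for n in batch) // MB, "MB")
```

Planning per partition keeps tenants separate: files from `tenant_id=A` and `tenant_id=B` should never end up in the same batch, so the planner would be run once for each partition directory.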

Professional Database Management

Good database management starts with an appropriate schema design. Use normalization to reduce data redundancy, create indexes on frequently queried columns, analyze query plans to optimize performance, and run regular maintenance such as VACUUM for PostgreSQL or OPTIMIZE TABLE for MySQL.

For high availability, set up replication with at least one replica for read scaling and disaster recovery. Use connection pooling such as PgBouncer or ProxySQL to reduce the load from concurrently open connections, and configure automated failover so the system switches to a replica when the primary goes down.

Backups should include a daily full backup and an incremental backup every 1-4 hours. Keep the binary log or WAL for point-in-time recovery, test restores regularly, and always keep a copy of the backups off-site.

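Summary

A production multi-tenant data lake on Parquet combines the columnar format with tenant-aware partitioning, layered access control (S3 policies, Unity Catalog, per-tenant KMS keys), and query optimizations such as predicate pushdown, column pruning, compaction, and Z-ordering.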