
Parquet Format Multi-tenant Design

2025-11-22 · อ. บอม — SiamCafe.net · 11,735 words



Design                   | Isolation | Cross-tenant Query   | Access Control     | Complexity
Separate Path            | High      | Hard (Union)         | S3 Policy (easy)   | Low
Shared Table + Partition | Medium    | Easy (Filter)        | Row-level Security | Medium
Catalog per Tenant       | Highest   | Hard (Cross-catalog) | Catalog GRANT      | High

Partition & File Layout

# === Multi-tenant Parquet Layout ===

# S3 Path Structure
# Separate Path per Tenant:
# s3://data-lake/
# ├── tenant-a/
# │   ├── orders/year=2025/month=01/part-00000.parquet
# │   ├── customers/year=2025/part-00000.parquet
# │   └── products/part-00000.parquet
# ├── tenant-b/
# │   ├── orders/year=2025/month=01/part-00000.parquet
# │   └── ...
#
# Shared Table with Partition:
# s3://data-lake/orders/
# ├── tenant_id=A/year=2025/month=01/part-00000.parquet
# ├── tenant_id=B/year=2025/month=01/part-00000.parquet
# └── tenant_id=C/year=2025/month=01/part-00000.parquet

from dataclasses import dataclass

@dataclass
class PartitionStrategy:
    strategy: str
    path_pattern: str
    partition_columns: str
    file_size: str
    compaction: str

strategies = [
    PartitionStrategy("Time-based (Daily)",
        "tenant_id={tid}/year={y}/month={m}/day={d}/",
        "tenant_id, year, month, day",
        "128MB-256MB per file",
        "Daily: merge small files per partition"),
    PartitionStrategy("Time-based (Monthly)",
        "tenant_id={tid}/year={y}/month={m}/",
        "tenant_id, year, month",
        "256MB-1GB per file",
        "Weekly: merge files per partition"),
    PartitionStrategy("Bucketed (High Cardinality)",
        "tenant_id={tid}/bucket={hash(id) % 100}/",
        "tenant_id, bucket",
        "256MB-512MB per file",
        "Monthly: rebalance buckets"),
]

print("=== Partition Strategies ===")
for s in strategies:
    print(f"  [{s.strategy}]")
    print(f"    Path: {s.path_pattern}")
    print(f"    Columns: {s.partition_columns}")
    print(f"    File Size: {s.file_size}")
    print(f"    Compaction: {s.compaction}")
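The three strategies above can be sketched as a small path builder. This is a minimal sketch: the function name, record fields, and the 100-bucket count are illustrative assumptions, not part of any library.

```python
# Hypothetical helper mapping a record to its partition path under each
# strategy listed above. Field names and bucket count are assumptions.
import zlib

def partition_path(record: dict, strategy: str = "daily", buckets: int = 100) -> str:
    tid = record["tenant_id"]
    if strategy == "daily":
        return (f"tenant_id={tid}/year={record['year']}"
                f"/month={record['month']:02d}/day={record['day']:02d}/")
    if strategy == "monthly":
        return f"tenant_id={tid}/year={record['year']}/month={record['month']:02d}/"
    if strategy == "bucketed":
        # Use a stable hash (zlib.crc32) so the same id always lands in the
        # same bucket; Python's built-in hash() is salted per process.
        bucket = zlib.crc32(str(record["id"]).encode()) % buckets
        return f"tenant_id={tid}/bucket={bucket:02d}/"
    raise ValueError(f"unknown strategy: {strategy}")

print(partition_path({"tenant_id": "A", "year": 2025, "month": 1, "day": 15}))
# tenant_id=A/year=2025/month=01/day=15/
```

Keeping tenant_id as the first path segment is what lets both S3 prefix policies and engine-side partition pruning work off the same layout.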

Access Control

# === Multi-tenant Access Control ===

# AWS S3 Bucket Policy (per Tenant)
# {
#   "Effect": "Allow",
#   "Principal": {"AWS": "arn:aws:iam::role/tenant-a-role"},
#   "Action": ["s3:GetObject", "s3:ListBucket"],
#   "Resource": [
#     "arn:aws:s3:::data-lake/tenant-a/*",
#     "arn:aws:s3:::data-lake"
#   ],
#   "Condition": {
#     "StringLike": {"s3:prefix": ["tenant-a/*"]}
#   }
# }

# Unity Catalog Row-level Security
# CREATE FUNCTION tenant_filter(tenant STRING)
#   RETURN IF(is_member('admin_group'), true, tenant = current_user_tenant());
# ALTER TABLE orders SET ROW FILTER tenant_filter ON (tenant_id);

# Lake Formation (AWS)
# Grant SELECT on database tenant_a_db to role tenant-a-role
# Grant DataCellsFilter on table orders where tenant_id='A'

@dataclass
class AccessLayer:
    layer: str
    mechanism: str
    granularity: str
    example: str

access_layers = [
    AccessLayer("Storage (S3/GCS)",
        "Bucket Policy + IAM Role",
        "Path-level (tenant prefix)",
        "tenant-a-role → s3://lake/tenant-a/* only"),
    AccessLayer("Catalog (Unity/Glue)",
        "GRANT + Row Filter",
        "Table/Column/Row level",
        "GRANT SELECT WHERE tenant_id='A'"),
    AccessLayer("Query Engine (Trino)",
        "Row Level Security Policy",
        "Row-level per query",
        "CREATE POLICY tenant_filter USING(...)"),
    AccessLayer("Encryption (KMS)",
        "Per-tenant KMS Key",
        "File-level encryption",
        "SSE-KMS key-a for tenant-a files"),
    AccessLayer("Audit (CloudTrail)",
        "Access Logging per Tenant",
        "Every S3 GetObject logged",
        "Alert unauthorized cross-tenant access"),
]

print("=== Access Control Layers ===")
for a in access_layers:
    print(f"  [{a.layer}] {a.mechanism}")
    print(f"    Granularity: {a.granularity}")
    print(f"    Example: {a.example}")
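Because every tenant needs the same shape of S3 policy, it is natural to render the statement shown above from a template. A minimal sketch, assuming an illustrative bucket name, account id, and role-naming convention:

```python
# Sketch: render the per-tenant S3 policy statement from a template.
# Bucket name, account id, and role naming are illustrative assumptions.
import json

def tenant_policy_statement(tenant: str, bucket: str = "data-lake",
                            account: str = "123456789012") -> dict:
    role_arn = f"arn:aws:iam::{account}:role/{tenant}-role"
    return {
        "Effect": "Allow",
        "Principal": {"AWS": role_arn},
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": [
            f"arn:aws:s3:::{bucket}/{tenant}/*",   # objects under the tenant prefix
            f"arn:aws:s3:::{bucket}",              # bucket itself, for ListBucket
        ],
        # Restrict ListBucket so a tenant cannot enumerate other prefixes.
        "Condition": {"StringLike": {"s3:prefix": [f"{tenant}/*"]}},
    }

print(json.dumps(tenant_policy_statement("tenant-a"), indent=2))
```

Generating policies from one template keeps tenants consistent and makes a review diff trivial when the template changes.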

Query Optimization

# === Parquet Query Optimization ===

@dataclass
class OptTechnique:
    technique: str
    how: str
    io_reduction: str
    when_to_use: str

optimizations = [
    OptTechnique("Partition Pruning",
        "WHERE tenant_id='A' AND date='2025-01-15' → scan 1 partition only",
        "90-99% I/O reduction",
        "Every query that filters on a partition column"),
    OptTechnique("Column Pruning",
        "SELECT col1, col2 FROM table (100 columns) → read 2 columns",
        "95-99% I/O reduction",
        "Every query that does not need all columns (avoid SELECT *)"),
    OptTechnique("Predicate Pushdown",
        "WHERE amount > 1000 → skip Row Groups where max(amount) <= 1000",
        "30-80% I/O reduction",
        "Queries that filter on non-partition columns"),
    OptTechnique("Z-ordering",
        "OPTIMIZE table ZORDER BY (tenant_id, customer_id)",
        "50-90% I/O reduction for multi-column filters",
        "Queries that filter on several columns at once"),
    OptTechnique("Compaction",
        "Merge small files → optimal 128MB-1GB files",
        "50-80% metadata overhead reduction",
        "After streaming ingestion that creates small files"),
    OptTechnique("Caching",
        "Cache hot data in memory (Alluxio, Delta Cache)",
        "10-100x faster repeated queries",
        "Dashboard queries that run repeatedly"),
]

print("=== Query Optimization ===")
for o in optimizations:
    print(f"  [{o.technique}]")
    print(f"    How: {o.how}")
    print(f"    I/O: {o.io_reduction}")
    print(f"    When: {o.when_to_use}")
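Predicate pushdown is easiest to see with the min/max statistics Parquet keeps per row group: a reader can discard any group whose stats cannot satisfy the filter. A toy illustration with made-up row groups and stats:

```python
# Toy model of predicate pushdown. Parquet stores min/max per row group;
# a reader skips groups whose range cannot match the predicate.
# File names and statistics below are invented for the example.
row_groups = [
    {"file": "part-00000", "min_amount": 5,    "max_amount": 800},
    {"file": "part-00001", "min_amount": 120,  "max_amount": 4500},
    {"file": "part-00002", "min_amount": 2000, "max_amount": 9000},
]

threshold = 1000  # WHERE amount > 1000

# A group can only contain matches if its max exceeds the threshold.
scanned = [g["file"] for g in row_groups if g["max_amount"] > threshold]
skipped = [g["file"] for g in row_groups if g["max_amount"] <= threshold]

print("scan:", scanned)   # ['part-00001', 'part-00002']
print("skip:", skipped)   # ['part-00000'] — max 800 can never match
```

Real engines apply the same logic per row group and, with page indexes, per page; the I/O saved depends on how well the data is sorted relative to the filter column, which is exactly what Z-ordering improves.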

Tips

Professional Database Management

Good database management starts with an appropriate schema design: use normalization to reduce data redundancy, create indexes on frequently queried columns, analyze query plans to optimize performance, and perform regular maintenance such as VACUUM for PostgreSQL or OPTIMIZE TABLE for MySQL.

For high availability, set up replication with at least one replica for read scaling and disaster recovery, use connection pooling such as PgBouncer or ProxySQL to reduce the load of concurrently open connections, and configure automated failover so the system switches to a replica automatically when the primary goes down.

Backups should combine daily full backups with incremental backups every 1-4 hours. Keep binary logs or WAL for point-in-time recovery, test restores regularly, and always keep a copy of the backups off-site.
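The full/incremental/WAL combination above determines what a point-in-time restore actually uses: the newest full backup at or before the target, every incremental after it, then log replay to the exact moment. A minimal sketch with illustrative timestamps:

```python
# Sketch of point-in-time recovery selection. Backup timestamps are invented;
# a real tool would read them from the backup catalog.
from datetime import datetime

fulls = [datetime(2025, 1, 14), datetime(2025, 1, 15)]                 # daily fulls
incrementals = [datetime(2025, 1, 15, h) for h in (4, 8, 12, 16, 20)]  # every 4h
target = datetime(2025, 1, 15, 13, 30)                                 # restore point

# Newest full backup taken at or before the target time.
base = max(t for t in fulls if t <= target)
# All incrementals between that full backup and the target.
chain = [t for t in incrementals if base < t <= target]

print("restore:", base, "+", len(chain), "incrementals,",
      "then replay WAL/binlog to", target)
```

This also shows why the incremental interval matters: with 4-hour incrementals, up to 4 hours of changes must come from WAL or binary-log replay alone.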

What is Parquet?

Parquet is a columnar storage format for big data. It typically compresses data 5-20x, supports predicate pushdown, schema evolution, and nested types, and is supported by Spark, Pandas, DuckDB, Athena, BigQuery, and most query engines.

How do you design for multi-tenancy?

Choose between a Separate Path per tenant, a Shared Table with a tenant_id partition, or a Catalog per tenant. Put tenant_id at the top level of the S3 layout, then trade off isolation, cross-tenant query needs, and access control complexity.

How is access control enforced?

Layer the controls: S3 bucket policies with IAM roles, Unity Catalog GRANT and row filters, Lake Formation data cell filters, per-tenant KMS encryption keys, CloudTrail auditing, and Trino row-level security.

How do you optimize queries?

Combine partition pruning, column pruning, predicate pushdown, Z-ordering, compaction into 128MB-1GB files, and caching (Alluxio, Delta Cache); together these can cut I/O by 90% or more.

Summary

Parquet's columnar format, tenant_id partitioning, and layered access control (S3 policies, Unity Catalog, KMS) combine with predicate pushdown, column pruning, compaction, and Z-ordering to make a production-ready multi-tenant data lake.

📖 Related Articles

- Parquet Format Automation Script
- Parquet Format Hexagonal Architecture
- Parquet Format Progressive Delivery
- Parquet Format Database Migration
- Parquet Format Cost Optimization (Reducing Costs)
