SiamCafe.net Blog
Technology

RAG Architecture Pod Scheduling

2025-06-18 · อ. บอม — SiamCafe.net · 9,726 words


| Component | Resource | Node Type | Scale Strategy |
| --- | --- | --- | --- |
| API Gateway | CPU 0.5-2, RAM 512M-2G | CPU Node | HPA (Request Rate) |
| Embedding Service | CPU 2-4 or GPU 1 | CPU/GPU Node | HPA (CPU/GPU) |
| Vector DB (Qdrant) | CPU 4-8, RAM 16-64G | High-memory Node | StatefulSet (Manual) |
| Retriever | CPU 1-2, RAM 2-4G | CPU Node | HPA (Latency) |
| LLM Service | GPU 1-2, RAM 16-32G | GPU Node | HPA (Queue Depth) |
| Queue (Redis) | CPU 1, RAM 2-4G | CPU Node | Fixed (HA) |
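A quick sanity check on the sizing table — a minimal bin-packing sketch, assuming a hypothetical 16-vCPU / 64 Gi CPU node with ~10% reserved for system daemons (both the node size and headroom are illustrative assumptions):

```python
def pods_per_node(node_cpu: float, node_mem_gi: float,
                  pod_cpu: float, pod_mem_gi: float,
                  headroom: float = 0.10) -> int:
    """How many identical pods fit on one node, whichever resource runs out first."""
    usable_cpu = node_cpu * (1 - headroom)
    usable_mem = node_mem_gi * (1 - headroom)
    return int(min(usable_cpu // pod_cpu, usable_mem // pod_mem_gi))

# Retriever pods request CPU 2 / RAM 4G (upper bound from the table)
print(pods_per_node(16, 64, 2, 4))  # -> 7
```

Memory would allow 14 such pods, but CPU runs out first, so CPU is the binding resource here.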

Kubernetes Scheduling

# === RAG Pod Scheduling Configuration ===

# LLM Service - GPU Node Scheduling
# apiVersion: apps/v1
# kind: Deployment
# metadata:
#   name: llm-service
# spec:
#   replicas: 2
#   template:
#     spec:
#       nodeSelector:
#         accelerator: nvidia-a100
#       tolerations:
#         - key: nvidia.com/gpu
#           operator: Exists
#           effect: NoSchedule
#       containers:
#         - name: llm
#           image: llm-service:latest
#           resources:
#             requests:
#               cpu: "4"
#               memory: "16Gi"
#               nvidia.com/gpu: "1"
#             limits:
#               cpu: "8"
#               memory: "32Gi"
#               nvidia.com/gpu: "1"
#       topologySpreadConstraints:
#         - maxSkew: 1
#           topologyKey: topology.kubernetes.io/zone
#           whenUnsatisfiable: ScheduleAnyway
#           labelSelector:
#             matchLabels:
#               app: llm-service

from dataclasses import dataclass

@dataclass
class SchedulingRule:
    component: str
    rule_type: str
    config: str
    reason: str

rules = [
    SchedulingRule("LLM Service",
        "nodeSelector + toleration",
        "accelerator: nvidia-a100 + gpu toleration",
        "Must run on GPU nodes only"),
    SchedulingRule("LLM Service",
        "topologySpreadConstraints",
        "maxSkew: 1 across zones",
        "Spread pods evenly across availability zones"),
    SchedulingRule("Embedding Service",
        "affinity (preferred)",
        "preferredDuringScheduling: gpu-node",
        "Prefers GPU, but CPU works when GPUs are full"),
    SchedulingRule("Vector DB",
        "podAntiAffinity (required)",
        "Never schedule replicas on the same node",
        "HA: avoids a single point of failure"),
    SchedulingRule("Retriever",
        "affinity (preferred) + anti-affinity",
        "Prefer nodes near the Vector DB; never co-locate with the LLM",
        "Reduces latency to the Vector DB"),
    SchedulingRule("API Gateway",
        "topologySpreadConstraints",
        "Spread across every zone",
        "HA: serves traffic from all zones"),
]

print("=== Scheduling Rules ===")
for r in rules:
    print(f"  [{r.component}] {r.rule_type}")
    print(f"    Config: {r.config}")
    print(f"    Reason: {r.reason}")
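The "podAntiAffinity (required)" rule for the Vector DB can be sketched as a manifest fragment built in Python. Field names follow the Kubernetes core/v1 API; the `app: vector-db` label is an assumed example, not taken from the configs above:

```python
import json

# Required pod anti-affinity: no two Vector DB replicas on the same node.
vector_db_affinity = {
    "affinity": {
        "podAntiAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    "labelSelector": {"matchLabels": {"app": "vector-db"}},
                    # hostname topology => "same node" is the forbidden domain
                    "topologyKey": "kubernetes.io/hostname",
                }
            ]
        }
    }
}

print(json.dumps(vector_db_affinity, indent=2))
```

Using `topology.kubernetes.io/zone` as the `topologyKey` instead would forbid two replicas in the same zone rather than on the same node, which is the stricter variant of the same HA idea.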

GPU Management

# === GPU Allocation & Management ===

# NVIDIA Device Plugin DaemonSet
# kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml
#
# GPU Time-slicing (share GPU between pods)
# apiVersion: v1
# kind: ConfigMap
# metadata:
#   name: nvidia-device-plugin
# data:
#   config: |
#     version: v1
#     sharing:
#       timeSlicing:
#         resources:
#           - name: nvidia.com/gpu
#             replicas: 4  # 4 pods share 1 GPU
#
# Resource Quota per Namespace
# apiVersion: v1
# kind: ResourceQuota
# metadata:
#   name: gpu-quota
#   namespace: rag-production
# spec:
#   hard:
#     requests.nvidia.com/gpu: "4"
#     limits.nvidia.com/gpu: "4"

@dataclass
class GPUStrategy:
    strategy: str
    method: str
    use_case: str
    cost_saving: str

gpu_strategies = [
    GPUStrategy("Dedicated GPU",
        "1 GPU per Pod (nvidia.com/gpu: 1)",
        "LLM inference that needs the full VRAM",
        "No saving, but maximum performance"),
    GPUStrategy("Time-slicing",
        "NVIDIA GPU Operator replicas: 4",
        "Multiple Embedding Service pods sharing one GPU",
        "Cuts GPU cost 75% (4 pods/GPU)"),
    GPUStrategy("MIG (Multi-Instance GPU)",
        "Partition an A100 into up to 7 instances",
        "Mixed workloads: LLM + Embedding",
        "Cuts GPU cost, with better isolation than time-slicing"),
    GPUStrategy("Spot/Preemptible GPU",
        "Use spot instances for GPU nodes",
        "Non-critical batch-processing workloads",
        "Cuts GPU cost 60-70%"),
    GPUStrategy("CPU Fallback",
        "Embedding Service can also run on CPU",
        "When GPUs are full, fall back to CPU (slower)",
        "Cuts GPU cost; CPU nodes are cheaper"),
]

print("=== GPU Strategies ===")
for g in gpu_strategies:
    print(f"  [{g.strategy}] {g.method}")
    print(f"    Use: {g.use_case}")
    print(f"    Saving: {g.cost_saving}")
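The saving figures above follow from simple division. A rough cost model — the A100 hourly price here is an illustrative assumption, not a quote:

```python
def cost_per_pod_hour(gpu_price_per_hour: float, pods_per_gpu: int) -> float:
    """GPU cost attributed to a single pod when pods_per_gpu share one card."""
    return gpu_price_per_hour / pods_per_gpu

A100_PRICE = 3.00  # assumed $/hour, illustrative only

dedicated = cost_per_pod_hour(A100_PRICE, 1)    # full price, full VRAM
time_sliced = cost_per_pod_hour(A100_PRICE, 4)  # 4 pods share one GPU
mig = cost_per_pod_hour(A100_PRICE, 7)          # A100 split into 7 MIG slices

saving = 1 - time_sliced / dedicated
print(f"time-slicing saving: {saving:.0%}")  # -> 75%
```

The same arithmetic explains the spot figure: a 60-70% spot discount on the node price maps directly to a 60-70% per-pod saving, with preemption risk as the trade-off.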

Auto-scaling

# === RAG Auto-scaling Configuration ===

# HPA for API Gateway
# apiVersion: autoscaling/v2
# kind: HorizontalPodAutoscaler
# metadata:
#   name: api-gateway-hpa
# spec:
#   scaleTargetRef:
#     apiVersion: apps/v1
#     kind: Deployment
#     name: api-gateway
#   minReplicas: 2
#   maxReplicas: 20
#   metrics:
#     - type: Resource
#       resource:
#         name: cpu
#         target:
#           type: Utilization
#           averageUtilization: 70

@dataclass
class AutoScaleConfig:
    component: str
    scaler: str
    metric: str
    min_max: str
    note: str

autoscale = [
    AutoScaleConfig("API Gateway",
        "HPA",
        "CPU > 70% or Request Rate > 500/s",
        "Min: 2, Max: 20",
        "Scales fast; stateless"),
    AutoScaleConfig("Embedding Service",
        "HPA",
        "GPU Utilization > 80% or CPU > 70%",
        "Min: 2, Max: 10",
        "Batch embeddings to reduce overhead"),
    AutoScaleConfig("LLM Service",
        "KEDA (Queue-based)",
        "Queue Depth > 5 pending requests",
        "Min: 1, Max: 8",
        "GPU pods scale slowly; use the queue as a buffer"),
    AutoScaleConfig("Retriever",
        "HPA",
        "Request latency p99 > 200ms",
        "Min: 2, Max: 10",
        "Scales on latency, not CPU"),
    AutoScaleConfig("GPU Nodes",
        "Cluster Autoscaler",
        "Pending pods that request a GPU",
        "Min: 1, Max: 4 GPU nodes",
        "Node provisioning takes 3-5 minutes"),
]

print("=== Auto-scale Configs ===")
for a in autoscale:
    print(f"  [{a.component}] {a.scaler}")
    print(f"    Metric: {a.metric}")
    print(f"    Replicas: {a.min_max}")
    print(f"    Note: {a.note}")
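The HPA replica counts above come from Kubernetes' documented scaling formula, `desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)`, clamped to the min/max bounds:

```python
import math

def desired_replicas(current: int, current_metric: float,
                     target_metric: float, lo: int, hi: int) -> int:
    """HPA scaling formula with min/max clamping."""
    raw = math.ceil(current * current_metric / target_metric)
    return max(lo, min(hi, raw))

# API Gateway: 4 replicas running at 90% average CPU against the 70% target
print(desired_replicas(4, 90, 70, lo=2, hi=20))  # -> 6
```

Note the asymmetry in practice: scaling up a stateless gateway takes seconds, while a new GPU pod may wait 3-5 minutes for the Cluster Autoscaler to provision a node — which is exactly why the LLM tier scales on queue depth instead.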

Tips

What is a RAG Architecture?

Retrieval-Augmented Generation (RAG) splits the pipeline into microservices — an Embedding service, a Vector DB, a Retriever, and an LLM — each deployed as Kubernetes pods that scale independently on CPU or GPU nodes. Retrieved context is injected into the prompt so the LLM can ground its answer.

How do you schedule the pods?

Use nodeSelector and tolerations to pin GPU workloads to GPU nodes, affinity and anti-affinity to co-locate or separate components, and topologySpreadConstraints for zone-level HA. Set resource requests/limits on every pod, and use a PriorityClass so critical pods are scheduled first.

How do you allocate GPUs?

Install the NVIDIA Device Plugin so nodes expose `nvidia.com/gpu`, then choose between a dedicated GPU per pod (full VRAM), time-slicing, or MIG partitions on A100-class cards for sharing. Cap usage per namespace with a ResourceQuota, run non-critical work on spot instances, and keep a CPU fallback for the Embedding service.

How do you auto-scale?

HPA scales on CPU/GPU utilization or request rate, KEDA scales on queue depth, VPA right-sizes requests, and the Cluster Autoscaler adds GPU nodes for pending pods; Knative adds scale-to-zero. Prefer latency or request-rate metrics over raw CPU where they track user experience better, and budget for the slow provisioning of GPU nodes.

Summary

Scheduling a RAG architecture on Kubernetes comes down to placing each component correctly: pin the LLM and Embedding services to GPU nodes with nodeSelector and tolerations, spread the stateful Vector DB with required anti-affinity, scale the stateless tiers with HPA and the GPU tier with KEDA, and cut GPU cost with time-slicing or MIG before going to production.
