RAG Architecture Pod Scheduling: Designing RAG on
RAG Pod Scheduling
This article covers pod scheduling for a RAG architecture on Kubernetes: GPU nodes for the LLM, the Embedding Service, the Vector Database, and the Retriever, using affinity, tolerations, and auto-scaling with HPA and KEDA.
| Component | Resource | Node Type | Scale Strategy |
|---|---|---|---|
| API Gateway | CPU 0.5-2, RAM 512M-2G | CPU Node | HPA (Request Rate) |
| Embedding Service | CPU 2-4 or GPU 1 | CPU/GPU Node | HPA (CPU/GPU) |
| Vector DB (Qdrant) | CPU 4-8, RAM 16-64G | High-memory Node | StatefulSet (Manual) |
| Retriever | CPU 1-2, RAM 2-4G | CPU Node | HPA (Latency) |
| LLM Service | GPU 1-2, RAM 16-32G | GPU Node | KEDA (Queue Depth) |
| Queue (Redis) | CPU 1, RAM 2-4G | CPU Node | Fixed (HA) |
Kubernetes Scheduling
# === RAG Pod Scheduling Configuration ===
# LLM Service - GPU Node Scheduling
# apiVersion: apps/v1
# kind: Deployment
# metadata:
#   name: llm-service
# spec:
#   replicas: 2
#   selector:                      # required by apps/v1; must match the template labels
#     matchLabels:
#       app: llm-service
#   template:
#     metadata:
#       labels:
#         app: llm-service
#     spec:
#       nodeSelector:
#         accelerator: nvidia-a100
#       tolerations:
#       - key: nvidia.com/gpu
#         operator: Exists
#         effect: NoSchedule
#       containers:
#       - name: llm
#         image: llm-service:latest
#         resources:
#           requests:
#             cpu: "4"
#             memory: "16Gi"
#             nvidia.com/gpu: "1"
#           limits:
#             cpu: "8"
#             memory: "32Gi"
#             nvidia.com/gpu: "1"
#       topologySpreadConstraints:   # pod-level field, sibling of containers
#       - maxSkew: 1
#         topologyKey: topology.kubernetes.io/zone
#         whenUnsatisfiable: ScheduleAnyway
#         labelSelector:
#           matchLabels:
#             app: llm-service
from dataclasses import dataclass


@dataclass
class SchedulingRule:
    component: str
    rule_type: str
    config: str
    reason: str


rules = [
    SchedulingRule("LLM Service",
                   "nodeSelector + toleration",
                   "accelerator: nvidia-a100 + gpu toleration",
                   "Must run on GPU nodes only"),
    SchedulingRule("LLM Service",
                   "topologySpreadConstraints",
                   "maxSkew: 1 across zones",
                   "Spread pods evenly across AZs"),
    SchedulingRule("Embedding Service",
                   "affinity (preferred)",
                   "preferredDuringSchedulingIgnoredDuringExecution: gpu-node",
                   "Prefers GPU nodes but falls back to CPU when GPUs are full"),
    SchedulingRule("Vector DB",
                   "podAntiAffinity (required)",
                   "Replicas must not share a node",
                   "HA: avoids a single point of failure"),
    SchedulingRule("Retriever",
                   "affinity (preferred) + anti-affinity",
                   "Prefer placement near the Vector DB; never co-locate with the LLM",
                   "Lower latency to the Vector DB"),
    SchedulingRule("API Gateway",
                   "topologySpreadConstraints",
                   "Spread across every zone",
                   "HA: serves traffic from every zone"),
]

print("=== Scheduling Rules ===")
for r in rules:
    print(f"  [{r.component}] {r.rule_type}")
    print(f"    Config: {r.config}")
    print(f"    Reason: {r.reason}")
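The rules for the Vector DB and the Embedding Service can be written out as the `affinity` stanzas a pod spec would carry. The following is a minimal sketch, not the author's manifests: the label `app: vector-db`, the weight of 80, and the helper names are assumptions made for illustration, while the field names come from the Kubernetes pod spec.

```python
import json


def vector_db_anti_affinity(app_label: str = "vector-db") -> dict:
    """Required anti-affinity: no two pods carrying this label share a node."""
    return {
        "podAntiAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    "labelSelector": {"matchLabels": {"app": app_label}},
                    # One pod per node: the hostname label identifies a node.
                    "topologyKey": "kubernetes.io/hostname",
                }
            ]
        }
    }


def embedding_gpu_preference(label_key: str = "accelerator",
                             label_value: str = "nvidia-a100",
                             weight: int = 80) -> dict:
    """Preferred node affinity: favor GPU nodes, still allow CPU nodes."""
    return {
        "nodeAffinity": {
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {
                    "weight": weight,  # 1-100; higher means a stronger preference
                    "preference": {
                        "matchExpressions": [
                            {"key": label_key,
                             "operator": "In",
                             "values": [label_value]}
                        ]
                    },
                }
            ]
        }
    }


if __name__ == "__main__":
    # kubectl accepts JSON as well as YAML, so the output can be pasted
    # under spec.template.spec of a Deployment or StatefulSet.
    print(json.dumps({"affinity": vector_db_anti_affinity()}, indent=2))
    print(json.dumps({"affinity": embedding_gpu_preference()}, indent=2))
```

A preference alone does not let the Embedding Service land on a tainted GPU node; it would also need the same `nvidia.com/gpu` toleration as the LLM Service.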
GPU Management
# === GPU Allocation & Management ===
# NVIDIA Device Plugin DaemonSet
# kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml
#
# GPU Time-slicing (share GPU between pods)
# apiVersion: v1
# kind: ConfigMap
# metadata:
#   name: nvidia-device-plugin
# data:
#   config: |
#     version: v1
#     sharing:
#       timeSlicing:
#         resources:
#         - name: nvidia.com/gpu
#           replicas: 4  # 4 pods share 1 GPU
#
# Resource Quota per Namespace
# apiVersion: v1
# kind: ResourceQuota
# metadata:
#   name: gpu-quota
#   namespace: rag-production
# spec:
#   hard:
#     # Extended resources such as GPUs accept only the "requests." prefix
#     # in a quota; a "limits.nvidia.com/gpu" entry is not allowed.
#     requests.nvidia.com/gpu: "4"
from dataclasses import dataclass


@dataclass
class GPUStrategy:
    strategy: str
    method: str
    use_case: str
    cost_saving: str


gpu_strategies = [
    GPUStrategy("Dedicated GPU",
                "1 GPU per Pod (nvidia.com/gpu: 1)",
                "LLM inference that needs the full VRAM",
                "No saving, but maximum performance"),
    GPUStrategy("Time-slicing",
                "NVIDIA GPU Operator replicas: 4",
                "Several Embedding Service pods sharing one GPU",
                "Up to 75% lower GPU cost (4 pods/GPU)"),
    GPUStrategy("MIG (Multi-Instance GPU)",
                "Partition an A100 into up to 7 instances",
                "Mixed workload: LLM + Embedding",
                "Lower GPU cost, with better isolation than time-slicing"),
    GPUStrategy("Spot/Preemptible GPU",
                "Run GPU nodes on Spot instances",
                "Non-critical workloads such as batch processing",
                "Roughly 60-70% lower GPU cost"),
    GPUStrategy("CPU Fallback",
                "Embedding Service can run on CPU",
                "When GPUs are full, use CPU instead (slower)",
                "Lower GPU cost by using cheaper CPU"),
]

print("=== GPU Strategies ===")
for g in gpu_strategies:
    print(f"  [{g.strategy}] {g.method}")
    print(f"    Use: {g.use_case}")
    print(f"    Saving: {g.cost_saving}")
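The 75% figure for time-slicing follows from simple arithmetic: four pods that would each hold a dedicated GPU fit on one shared GPU. A small sketch of that calculation is shown here; the function names are hypothetical, and it counts GPUs only, ignoring the throughput each pod loses when it shares a device.

```python
import math


def gpus_needed(pods: int, pods_per_gpu: int = 1) -> int:
    """Number of physical GPUs required when each GPU hosts pods_per_gpu pods."""
    if pods < 0 or pods_per_gpu < 1:
        raise ValueError("pods must be >= 0 and pods_per_gpu must be >= 1")
    return math.ceil(pods / pods_per_gpu)


def gpu_saving(pods: int, pods_per_gpu: int) -> float:
    """Fraction of GPUs saved compared with one dedicated GPU per pod."""
    dedicated = gpus_needed(pods, 1)
    if dedicated == 0:
        return 0.0
    shared = gpus_needed(pods, pods_per_gpu)
    return 1 - shared / dedicated


if __name__ == "__main__":
    # 4 pods on a GPU advertised with replicas: 4 -> 1 GPU instead of 4.
    print(f"{gpu_saving(4, 4):.0%}")  # 75%
    # 5 pods need a second GPU, so the saving is lower.
    print(gpus_needed(5, 4), f"{gpu_saving(5, 4):.0%}")
```

The saving reaches 75% only when the pod count is a multiple of the `replicas` value; a partly filled GPU lowers it.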
Auto-scaling
# === RAG Auto-scaling Configuration ===
# HPA for API Gateway
# apiVersion: autoscaling/v2
# kind: HorizontalPodAutoscaler
# metadata:
#   name: api-gateway-hpa
# spec:
#   scaleTargetRef:
#     apiVersion: apps/v1
#     kind: Deployment
#     name: api-gateway
#   minReplicas: 2
#   maxReplicas: 20
#   metrics:
#   - type: Resource
#     resource:
#       name: cpu
#       target:
#         type: Utilization
#         averageUtilization: 70
from dataclasses import dataclass


@dataclass
class AutoScaleConfig:
    component: str
    scaler: str
    metric: str
    min_max: str
    note: str


autoscale = [
    AutoScaleConfig("API Gateway",
                    "HPA",
                    "CPU > 70% or request rate > 500/s",
                    "Min: 2, Max: 20",
                    "Scales quickly; stateless"),
    AutoScaleConfig("Embedding Service",
                    "HPA",
                    "GPU utilization > 80% or CPU > 70%",
                    "Min: 2, Max: 10",
                    "Batch embedding requests to reduce overhead"),
    AutoScaleConfig("LLM Service",
                    "KEDA (Queue-based)",
                    "Queue Depth > 5 pending requests",
                    "Min: 1, Max: 8",
                    "GPU pods scale slowly; buffer with a queue"),
    AutoScaleConfig("Retriever",
                    "HPA",
                    "Request Latency p99 > 200ms",
                    "Min: 2, Max: 10",
                    "Scale on latency, not CPU"),
    AutoScaleConfig("GPU Nodes",
                    "Cluster Autoscaler",
                    "Pending pods that request a GPU",
                    "Min: 1, Max: 4 GPU Nodes",
                    "Node provisioning takes 3-5 minutes"),
]

print("=== Auto-scale Configs ===")
for a in autoscale:
    print(f"  [{a.component}] {a.scaler}")
    print(f"    Metric: {a.metric}")
    print(f"    Replicas: {a.min_max}")
    print(f"    Note: {a.note}")
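To see how these thresholds turn into replica counts, the sketch below applies the replica formula documented for the Horizontal Pod Autoscaler, `ceil(currentReplicas * currentMetric / targetMetric)`, plus a simplified queue-depth rule of the kind a KEDA queue trigger produces. The function names are hypothetical, and the real controllers add stabilization windows and scaling policies that are left out here.

```python
import math


def hpa_desired_replicas(current_replicas: int,
                         current_metric: float,
                         target_metric: float,
                         min_replicas: int,
                         max_replicas: int,
                         tolerance: float = 0.1) -> int:
    """HPA rule: scale in proportion to current/target, clamped to min/max."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas  # close enough to target: no change
    else:
        desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


def queue_desired_replicas(queue_depth: int,
                           target_per_replica: int,
                           min_replicas: int,
                           max_replicas: int) -> int:
    """Queue-based rule: one replica per target_per_replica pending requests."""
    desired = math.ceil(queue_depth / target_per_replica)
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # API Gateway: 4 replicas at 90% CPU against a 70% target.
    print(hpa_desired_replicas(4, 90, 70, min_replicas=2, max_replicas=20))
    # LLM Service: 23 pending requests, 5 per replica, bounded by 1..8.
    print(queue_desired_replicas(23, 5, min_replicas=1, max_replicas=8))
    # A burst of 100 pending requests is capped at the maximum of 8 replicas.
    print(queue_desired_replicas(100, 5, min_replicas=1, max_replicas=8))
```

The cap in the last case is why the queue matters for the LLM Service: once the replica limit is reached, extra requests wait in the queue instead of overloading the GPU pods.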
Tips
- Queue: put a queue buffer between the API and the LLM to prevent overload
- GPU: use time-slicing so Embedding pods share a GPU and cut cost
- Spot: use Spot GPU instances for non-critical workloads
- Affinity: place the Retriever near the Vector DB to reduce network latency
- Priority: set a PriorityClass so LLM pods take precedence
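The queue tip can be illustrated with a bounded buffer that refuses new work once it is full, so the API layer can shed load instead of piling requests onto the LLM. This is a minimal in-process sketch using Python's standard `queue` module as a stand-in for the Redis queue in the component table; the class name and the `max_pending` parameter are hypothetical.

```python
import queue
from typing import Optional


class LLMRequestBuffer:
    """Bounded buffer between the API layer and the LLM workers."""

    def __init__(self, max_pending: int = 100) -> None:
        self._pending: "queue.Queue[str]" = queue.Queue(maxsize=max_pending)

    def submit(self, prompt: str) -> bool:
        """Enqueue a prompt; return False when the buffer is full."""
        try:
            self._pending.put_nowait(prompt)
            return True
        except queue.Full:
            # The caller would typically answer with HTTP 429 here.
            return False

    def next_request(self) -> Optional[str]:
        """Hand the oldest pending prompt to a worker, or None if idle."""
        try:
            return self._pending.get_nowait()
        except queue.Empty:
            return None

    def depth(self) -> int:
        """Current queue depth, the metric a queue-based scaler would watch."""
        return self._pending.qsize()


if __name__ == "__main__":
    buffer = LLMRequestBuffer(max_pending=2)
    accepted = [buffer.submit(p) for p in ("q1", "q2", "q3")]
    print(accepted, buffer.depth())  # [True, True, False] 2
```

The value returned by `depth()` is the same kind of signal as the "Queue Depth" metric used to scale the LLM Service.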
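What Is RAG Architecture?
Retrieval-Augmented Generation (RAG) pairs retrieval with generation: an Embedding Service turns text into vectors, a Vector DB stores them, a Retriever fetches the passages closest to the query, and an LLM uses that context to produce the answer. On Kubernetes, each component runs as its own Pod (a microservice), so each one scales independently on GPU or CPU nodes.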
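The flow through Embedding, Retriever, and LLM can be sketched end to end with toy stand-ins. Nothing here comes from the article's services: the bag-of-words `embed`, the in-memory document list, and the `generate` stub that only assembles a prompt are all assumptions chosen to keep the example self-contained.

```python
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy embedding: word counts (a real service would return dense vectors)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, docs: List[str], k: int = 2) -> List[str]:
    """Retriever: return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


def generate(query: str, context: List[str]) -> str:
    """Stand-in for the LLM call: builds the prompt the LLM would receive."""
    joined = "\n".join(context)
    return f"Context:\n{joined}\n\nQuestion: {query}"


if __name__ == "__main__":
    docs = [
        "HPA scales pods on CPU utilization",
        "Qdrant stores vectors",
        "GPU nodes run the LLM service",
    ]
    question = "which nodes run the LLM"
    print(generate(question, retrieve(question, docs, k=1)))
```

In the architecture described above, each of these functions corresponds to a separately scheduled service, which is why their resource needs and scaling rules differ.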