RAG Pod Scheduling
| Component | Resource | Node Type | Scale Strategy |
|---|---|---|---|
| API Gateway | CPU 0.5-2, RAM 512M-2G | CPU Node | HPA (Request Rate) |
| Embedding Service | CPU 2-4 or GPU 1 | CPU/GPU Node | HPA (CPU/GPU) |
| Vector DB (Qdrant) | CPU 4-8, RAM 16-64G | High-memory Node | StatefulSet (Manual) |
| Retriever | CPU 1-2, RAM 2-4G | CPU Node | HPA (Latency) |
| LLM Service | GPU 1-2, RAM 16-32G | GPU Node | HPA (Queue Depth) |
| Queue (Redis) | CPU 1, RAM 2-4G | CPU Node | Fixed (HA) |
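As a rough capacity check, the table's minimum requests can be summed per component. A minimal sketch: the CPU/RAM/GPU numbers are the table's lower bounds, while the replica counts and the Embedding Service's RAM figure are illustrative assumptions, not part of the table.

```python
# Rough cluster capacity estimate from the table's minimum requests.
# component: (cpu_cores, ram_gib, gpus, replicas) -- replica counts and
# the embedding RAM value are assumed for illustration.
requests = {
    "api-gateway": (0.5, 0.5, 0, 2),
    "embedding":   (2,   4,   0, 2),
    "vector-db":   (4,   16,  0, 3),
    "retriever":   (1,   2,   0, 2),
    "llm-service": (4,   16,  1, 2),
    "redis":       (1,   2,   0, 1),
}

total_cpu = sum(cpu * n for cpu, _, _, n in requests.values())
total_ram = sum(ram * n for _, ram, _, n in requests.values())
total_gpu = sum(gpu * n for _, _, gpu, n in requests.values())
print(f"CPU: {total_cpu} cores, RAM: {total_ram} GiB, GPU: {total_gpu}")
```

Summing requests (not limits) matches what the scheduler actually reserves per node, so this is the floor the cluster must provide.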
Kubernetes Scheduling
# === RAG Pod Scheduling Configuration ===
# LLM Service - GPU Node Scheduling
# apiVersion: apps/v1
# kind: Deployment
# metadata:
#   name: llm-service
# spec:
#   replicas: 2
#   selector:
#     matchLabels:
#       app: llm-service
#   template:
#     metadata:
#       labels:
#         app: llm-service
#     spec:
#       nodeSelector:
#         accelerator: nvidia-a100
#       tolerations:
#       - key: nvidia.com/gpu
#         operator: Exists
#         effect: NoSchedule
#       containers:
#       - name: llm
#         image: llm-service:latest
#         resources:
#           requests:
#             cpu: "4"
#             memory: "16Gi"
#             nvidia.com/gpu: "1"
#           limits:
#             cpu: "8"
#             memory: "32Gi"
#             nvidia.com/gpu: "1"
#       topologySpreadConstraints:
#       - maxSkew: 1
#         topologyKey: topology.kubernetes.io/zone
#         whenUnsatisfiable: ScheduleAnyway
#         labelSelector:
#           matchLabels:
#             app: llm-service
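The nodeSelector + toleration pair above is what lets this pod land on a tainted GPU node. The matching rule can be sanity-checked in plain Python; this is a simplified sketch of Kubernetes' toleration semantics (operator `Exists` vs `Equal`), not the real scheduler code.

```python
# Simplified Kubernetes toleration matching:
# - "Exists" matches any taint with the same key (or any key if omitted);
# - "Equal" additionally requires the values to match;
# - an empty effect on the toleration tolerates every effect.
def tolerates(toleration: dict, taint: dict) -> bool:
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        return toleration.get("key") in (None, taint["key"])
    return (toleration.get("key") == taint["key"]
            and toleration.get("value") == taint.get("value"))

gpu_taint = {"key": "nvidia.com/gpu", "effect": "NoSchedule"}
llm_toleration = {"key": "nvidia.com/gpu", "operator": "Exists",
                  "effect": "NoSchedule"}
print(tolerates(llm_toleration, gpu_taint))   # True: LLM pod fits GPU node
print(tolerates({}, gpu_taint))               # False: untolerating pod is kept off
```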
from dataclasses import dataclass

@dataclass
class SchedulingRule:
    component: str
    rule_type: str
    config: str
    reason: str
rules = [
    SchedulingRule("LLM Service",
                   "nodeSelector + toleration",
                   "accelerator: nvidia-a100 + gpu toleration",
                   "Must run on GPU nodes only"),
    SchedulingRule("LLM Service",
                   "topologySpreadConstraints",
                   "maxSkew: 1 across zones",
                   "Spread pods evenly across AZs"),
    SchedulingRule("Embedding Service",
                   "affinity (preferred)",
                   "preferredDuringScheduling: gpu-node",
                   "Prefers GPU, but CPU is acceptable when GPUs are full"),
    SchedulingRule("Vector DB",
                   "podAntiAffinity (required)",
                   "Never two replicas on the same node",
                   "HA: avoids a single point of failure"),
    SchedulingRule("Retriever",
                   "affinity (preferred) + anti-affinity",
                   "Prefer co-location with Vector DB, never with LLM",
                   "Reduces latency to the Vector DB"),
    SchedulingRule("API Gateway",
                   "topologySpreadConstraints",
                   "Spread across every zone",
                   "HA: serves traffic from every zone"),
]
print("=== Scheduling Rules ===")
for r in rules:
    print(f" [{r.component}] {r.rule_type}")
    print(f"   Config: {r.config}")
    print(f"   Reason: {r.reason}")
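The Vector DB's required podAntiAffinity rule can be written out concretely. A sketch of the manifest fragment as a Python dict, assuming the pods carry an `app: vector-db` label (the label name is illustrative):

```python
# Required pod anti-affinity: no two vector-db replicas on the same node.
# topologyKey "kubernetes.io/hostname" makes the rule per-node; a zone key
# would instead spread replicas per availability zone.
vector_db_affinity = {
    "podAntiAffinity": {
        "requiredDuringSchedulingIgnoredDuringExecution": [
            {
                "labelSelector": {"matchLabels": {"app": "vector-db"}},
                "topologyKey": "kubernetes.io/hostname",
            }
        ]
    }
}

rule = vector_db_affinity["podAntiAffinity"][
    "requiredDuringSchedulingIgnoredDuringExecution"][0]
print(rule["topologyKey"])  # kubernetes.io/hostname
```

Using the `required` (hard) form means a replica stays Pending rather than co-locating, which is the right trade-off for an HA datastore.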
GPU Management
# === GPU Allocation & Management ===
# NVIDIA Device Plugin DaemonSet
# kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml
#
# GPU Time-slicing (share GPU between pods)
# apiVersion: v1
# kind: ConfigMap
# metadata:
#   name: nvidia-device-plugin
# data:
#   config: |
#     version: v1
#     sharing:
#       timeSlicing:
#         resources:
#         - name: nvidia.com/gpu
#           replicas: 4  # 4 pods share 1 GPU
#
# Resource Quota per Namespace
# apiVersion: v1
# kind: ResourceQuota
# metadata:
#   name: gpu-quota
#   namespace: rag-production
# spec:
#   hard:
#     requests.nvidia.com/gpu: "4"
#     limits.nvidia.com/gpu: "4"
@dataclass
class GPUStrategy:
    strategy: str
    method: str
    use_case: str
    cost_saving: str
gpu_strategies = [
    GPUStrategy("Dedicated GPU",
                "1 GPU per Pod (nvidia.com/gpu: 1)",
                "LLM inference that needs the full VRAM",
                "No saving, but maximum performance"),
    GPUStrategy("Time-slicing",
                "NVIDIA GPU Operator replicas: 4",
                "Multiple Embedding Service pods sharing one GPU",
                "Cuts GPU cost 75% (4 pods/GPU)"),
    GPUStrategy("MIG (Multi-Instance GPU)",
                "Split an A100 into up to 7 instances",
                "Mixed workload: LLM + Embedding",
                "Cuts GPU cost, with better isolation than time-slicing"),
    GPUStrategy("Spot/Preemptible GPU",
                "Use spot instances for GPU nodes",
                "Non-critical workloads, batch processing",
                "Cuts GPU cost 60-70%"),
    GPUStrategy("CPU Fallback",
                "Embedding Service can run on CPU",
                "When GPUs are full, fall back to CPU (slower)",
                "Cuts GPU cost; CPU is cheaper"),
]
print("=== GPU Strategies ===")
for g in gpu_strategies:
    print(f" [{g.strategy}] {g.method}")
    print(f"   Use: {g.use_case}")
    print(f"   Saving: {g.cost_saving}")
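The "75% saving" figure for time-slicing follows directly from the replica count in the ConfigMap above. A quick check, assuming an illustrative GPU hourly price (the $2.00/h figure is made up for the example):

```python
# Per-pod cost when N pods share one GPU via time-slicing.
gpu_price_per_hour = 2.0   # assumed illustrative price, not a real quote
pods_per_gpu = 4           # replicas: 4 in the time-slicing ConfigMap

dedicated_cost = gpu_price_per_hour            # 1 pod : 1 GPU
shared_cost = gpu_price_per_hour / pods_per_gpu
saving = 1 - shared_cost / dedicated_cost
print(f"per-pod cost: ${shared_cost:.2f}/h, saving: {saving:.0%}")  # saving: 75%
```

The saving is purely proportional; the real trade-off is that the four pods contend for the same VRAM and compute, so it suits bursty embedding traffic rather than a VRAM-bound LLM.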
Auto-scaling
# === RAG Auto-scaling Configuration ===
# HPA for API Gateway
# apiVersion: autoscaling/v2
# kind: HorizontalPodAutoscaler
# metadata:
#   name: api-gateway-hpa
# spec:
#   scaleTargetRef:
#     apiVersion: apps/v1
#     kind: Deployment
#     name: api-gateway
#   minReplicas: 2
#   maxReplicas: 20
#   metrics:
#   - type: Resource
#     resource:
#       name: cpu
#       target:
#         type: Utilization
#         averageUtilization: 70
@dataclass
class AutoScaleConfig:
    component: str
    scaler: str
    metric: str
    min_max: str
    note: str
autoscale = [
    AutoScaleConfig("API Gateway",
                    "HPA",
                    "CPU > 70% or request rate > 500/s",
                    "Min: 2, Max: 20",
                    "Scales fast; stateless"),
    AutoScaleConfig("Embedding Service",
                    "HPA",
                    "GPU utilization > 80% or CPU > 70%",
                    "Min: 2, Max: 10",
                    "Batch embeddings to reduce overhead"),
    AutoScaleConfig("LLM Service",
                    "KEDA (Queue-based)",
                    "Queue depth > 5 pending requests",
                    "Min: 1, Max: 8",
                    "GPU pods scale slowly; use the queue as a buffer"),
    AutoScaleConfig("Retriever",
                    "HPA",
                    "Request latency p99 > 200ms",
                    "Min: 2, Max: 10",
                    "Scale on latency, not CPU"),
    AutoScaleConfig("GPU Nodes",
                    "Cluster Autoscaler",
                    "Pending pods that request a GPU",
                    "Min: 1, Max: 4 GPU nodes",
                    "Node provisioning takes 3-5 minutes"),
]
print("=== Auto-scale Configs ===")
for a in autoscale:
    print(f" [{a.component}] {a.scaler}")
    print(f"   Metric: {a.metric}")
    print(f"   Replicas: {a.min_max}")
    print(f"   Note: {a.note}")
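For queue-based scaling, KEDA's `AverageValue` semantics reduce to a simple proportional rule: desired replicas = ceil(total metric / per-replica target), clamped to the min/max bounds. A sketch applied to the LLM Service numbers above (target 5 pending requests, min 1, max 8):

```python
import math

def desired_replicas(queue_depth: float, target_per_replica: float,
                     min_r: int, max_r: int) -> int:
    # KEDA-style AverageValue rule: total metric divided by the
    # per-replica target, rounded up, then clamped to the bounds.
    want = math.ceil(queue_depth / target_per_replica)
    return max(min_r, min(max_r, want))

# 24 pending requests, 5 per replica -> 5 replicas
print(desired_replicas(24, 5, 1, 8))   # 5
# A burst of 100 pending requests is capped at maxReplicas
print(desired_replicas(100, 5, 1, 8))  # 8
```

The clamp at `max_r` is what keeps a traffic spike from requesting more GPU pods than the cluster can provision; excess requests simply wait in the queue buffer.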
Tips
- Queue: put a queue buffer between the API and the LLM to prevent overload
- GPU: use time-slicing so Embedding pods share a GPU and cut cost
- Spot: use spot GPU instances for non-critical workloads
- Affinity: place the Retriever near the Vector DB to cut network latency
- Priority: give LLM pods a higher PriorityClass
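The PriorityClass tip can be sketched as a manifest fragment. The class name `llm-critical` and the value are illustrative assumptions, expressed here as a Python dict in keeping with the rest of the examples:

```python
# PriorityClass manifest as a dict: higher value = scheduled first, and the
# scheduler may preempt lower-priority pods to make room on a full GPU node.
llm_priority = {
    "apiVersion": "scheduling.k8s.io/v1",
    "kind": "PriorityClass",
    "metadata": {"name": "llm-critical"},  # name is an assumption
    "value": 100000,
    "preemptionPolicy": "PreemptLowerPriority",
    "description": "LLM inference pods outrank batch/embedding workloads",
}

# Pods opt in via spec.priorityClassName: llm-critical
print(llm_priority["metadata"]["name"])
```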
