Managing Text Generation WebUI Pod Scheduling
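This guide covers running Text Generation WebUI (Oobabooga) for LLM inference on Kubernetes: pod scheduling on NVIDIA GPUs, models such as Llama and Mistral, the vLLM and TGI serving engines, GPTQ, AWQ, and GGUF quantization, and autoscaling with HPA and a cluster autoscaler.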
| Inference Engine | Throughput | Latency | GPU Memory | Best suited for |
|---|---|---|---|---|
| vLLM | Very high | Low | Medium | Production API |
| TGI (HuggingFace) | High | Low | Medium | Production API |
| Oobabooga WebUI | Medium | Medium | High | Dev/Testing |
| llama.cpp (GGUF) | Medium | Medium | Low | CPU/Edge |
Kubernetes GPU Setup
# === GPU Pod Scheduling ===
# Install NVIDIA GPU Operator
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace
# Pod with GPU: Text Generation WebUI
apiVersion: apps/v1
kind: Deployment
metadata:
  name: text-gen-webui
spec:
  replicas: 2
  selector:
    matchLabels:
      app: text-gen-webui
  template:
    metadata:
      labels:
        app: text-gen-webui
    spec:
      nodeSelector:
        accelerator: nvidia-a100
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: webui
          image: atinoda/text-generation-webui:latest
          resources:
            requests:
              cpu: "4"
              memory: "16Gi"
              nvidia.com/gpu: "1"
            limits:
              cpu: "8"
              memory: "32Gi"
              nvidia.com/gpu: "1"
          ports:
            - containerPort: 7860
          volumeMounts:
            - name: models
              mountPath: /app/models
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: llm-models-pvc
# Pod anti-affinity: prefer to keep GPU pods off the same node
# (this fragment belongs under spec.template.spec of the Deployment above)
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: text-gen-webui
          topologyKey: kubernetes.io/hostname
from dataclasses import dataclass

@dataclass
class GPUNode:
    name: str
    gpu_type: str
    gpu_count: int
    vram_gb: int
    pods_running: int
    gpu_util_pct: float
    status: str

nodes = [
    GPUNode("gpu-node-01", "A100 80GB", 4, 320, 3, 72, "Ready"),
    GPUNode("gpu-node-02", "A100 80GB", 4, 320, 4, 85, "Ready"),
    GPUNode("gpu-node-03", "A10G 24GB", 4, 96, 2, 55, "Ready"),
    GPUNode("gpu-node-04", "T4 16GB", 4, 64, 4, 90, "Ready"),
    GPUNode("gpu-node-05", "A100 80GB", 4, 320, 0, 0, "Scaling Up"),
]

print("=== GPU Nodes ===")
total_gpus = 0
total_used = 0
for n in nodes:
    total_gpus += n.gpu_count
    # Counts pods as used GPUs, which holds only if every pod requests one GPU
    total_used += n.pods_running
    print(f" [{n.status}] {n.name}")
    print(f" GPU: {n.gpu_type} x{n.gpu_count} | VRAM: {n.vram_gb}GB")
    print(f" Pods: {n.pods_running} | Util: {n.gpu_util_pct}%")
print(f"\n Total GPUs: {total_gpus} | Used: {total_used}")
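The Deployment's `nodeSelector`, GPU request, and anti-affinity preference boil down to a filter-then-rank decision. The following is a minimal sketch of that idea, not the real kube-scheduler: `Node` and `pick_node` are hypothetical names, it assumes each running pod holds exactly one GPU, and it approximates the anti-affinity preference by favoring the node with the fewest pods.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    gpu_type: str
    gpu_count: int
    pods_running: int  # assumption: one GPU per running pod
    status: str


def pick_node(nodes: List[Node], gpu_type: str, gpus_needed: int = 1) -> Optional[str]:
    """Filter nodes that can host the pod, then prefer the least-loaded one."""
    candidates = [
        n for n in nodes
        if n.status == "Ready"
        and n.gpu_type == gpu_type                       # plays the role of nodeSelector
        and n.gpu_count - n.pods_running >= gpus_needed  # enough free GPUs
    ]
    if not candidates:
        return None  # the pod would stay Pending until capacity appears
    # Fewest running pods first: a rough stand-in for the anti-affinity preference
    return min(candidates, key=lambda n: n.pods_running).name


nodes = [
    Node("gpu-node-01", "A100 80GB", 4, 3, "Ready"),
    Node("gpu-node-02", "A100 80GB", 4, 4, "Ready"),
    Node("gpu-node-03", "A10G 24GB", 4, 2, "Ready"),
    Node("gpu-node-04", "T4 16GB", 4, 4, "Ready"),
    Node("gpu-node-05", "A100 80GB", 4, 0, "Scaling Up"),
]

print(pick_node(nodes, "A100 80GB"))  # gpu-node-01 (the only Ready A100 node with a free GPU)
print(pick_node(nodes, "T4 16GB"))    # None (gpu-node-04 is full)
```

With the node data from the listing above, a new A100 pod fits only on gpu-node-01, because gpu-node-02 is full and gpu-node-05 is still scaling up.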
Inference Optimization
# === LLM Inference Optimization ===
# vLLM — High-throughput Serving
# pip install vllm
# python -m vllm.entrypoints.openai.api_server \
# --model mistralai/Mistral-7B-Instruct-v0.2 \
# --gpu-memory-utilization 0.9 \
# --max-model-len 8192 \
# --tensor-parallel-size 1
# Docker — vLLM Server
# docker run --gpus all -p 8000:8000 \
# vllm/vllm-openai:latest \
# --model TheBloke/Mistral-7B-Instruct-v0.2-GPTQ \
# --quantization gptq \
# --max-model-len 4096
# Quantization Comparison
# Model: Llama-2-13B
# FP16: 26GB VRAM, 100% quality
# GPTQ-4: 8GB VRAM, 97% quality, 2x faster
# AWQ-4: 8GB VRAM, 97% quality, 2x faster
# GGUF-Q4: 8GB VRAM, 95% quality, CPU possible
from dataclasses import dataclass

@dataclass
class ModelConfig:
    model: str
    quantization: str
    vram_gb: float
    tokens_per_sec: int
    quality_pct: float
    max_context: int

configs = [
    ModelConfig("Llama-2-70B", "FP16", 140, 25, 100, 4096),
    ModelConfig("Llama-2-70B", "GPTQ-4bit", 40, 45, 97, 4096),
    ModelConfig("Mistral-7B", "FP16", 14, 80, 100, 32768),
    ModelConfig("Mistral-7B", "GPTQ-4bit", 5, 120, 97, 32768),
    ModelConfig("Mistral-7B", "GGUF-Q4", 5, 40, 95, 32768),
    ModelConfig("Phi-3-mini", "FP16", 8, 100, 100, 128000),
]

print("\n=== Model Configurations ===")
for c in configs:
    print(f" [{c.model}] {c.quantization}")
    print(f" VRAM: {c.vram_gb}GB | Speed: {c.tokens_per_sec} tok/s | Quality: {c.quality_pct}%")
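The FP16 VRAM figures above follow from simple arithmetic: parameter count times bytes per weight. A rough sketch of that estimate is below. The helper names `weight_vram_gb` and `savings_pct` are illustrative, and the result covers weights only; KV cache, activations, and quantization metadata come on top, which is presumably why the 4-bit figures quoted above are higher than the raw numbers this prints.

```python
def weight_vram_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate VRAM for model weights only (no KV cache, no activations)."""
    bytes_per_weight = bits_per_weight / 8
    # (params_billion * 1e9 weights) * bytes_per_weight / 1e9 bytes per GB
    return params_billion * bytes_per_weight


def savings_pct(from_bits: int, to_bits: int) -> float:
    """Percentage of weight memory saved by lowering the bit width."""
    return 100.0 * (1 - to_bits / from_bits)


for name, size in [("Llama-2-13B", 13), ("Llama-2-70B", 70), ("Mistral-7B", 7)]:
    fp16 = weight_vram_gb(size, 16)
    int4 = weight_vram_gb(size, 4)
    print(f"{name}: FP16 ~{fp16:.1f} GB, 4-bit ~{int4:.1f} GB")

print(f"16-bit -> 8-bit saves {savings_pct(16, 8):.0f}%")  # 50%
print(f"16-bit -> 4-bit saves {savings_pct(16, 4):.0f}%")  # 75%
```

The 50% and 75% savings line up with the 50-75% VRAM reduction the article attributes to GPTQ/AWQ.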
Autoscaling
# === HPA + Cluster Autoscaler ===
# HPA: scale on GPU utilization
# gpu_utilization and request_queue_length are custom metrics, so a custom
# metrics adapter must expose them to the HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: text-gen-webui
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: gpu_utilization
        target:
          type: AverageValue
          averageValue: "70"
    - type: Pods
      pods:
        metric:
          name: request_queue_length
        target:
          type: AverageValue
          averageValue: "5"
# Karpenter: GPU node provisioner
# Note: karpenter.sh/v1alpha5 Provisioner is the legacy API; newer Karpenter
# releases replace it with NodePool
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: gpu-provisioner
spec:
  requirements:
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["p3.2xlarge", "g5.xlarge", "g5.2xlarge"]
    - key: nvidia.com/gpu
      operator: Exists
  limits:
    resources:
      nvidia.com/gpu: "20"
  ttlSecondsAfterEmpty: 300
scaling_metrics = {
    "Active Pods": "6 / 10 max",
    "GPU Utilization (avg)": "72%",
    "Request Queue": "3 pending",
    "Tokens/sec (total)": "480",
    "Concurrent Users": "25",
    "Avg Latency (TTFT)": "450ms",
    "GPU Nodes": "4 active, 1 scaling",
    "Monthly GPU Cost": "$8,500",
}

print("Autoscaling Dashboard:")
for k, v in scaling_metrics.items():
    print(f" {k}: {v}")

tips = [
    "Quantization: use GPTQ/AWQ to cut VRAM by 50-75%",
    "vLLM: PagedAttention raises throughput 3-5x",
    "Batching: continuous batching merges requests",
    "Karpenter: auto-provision GPU nodes on demand",
    "Spot Instance: use Spot for non-critical inference",
    "Model Cache: keep models on a PVC to avoid re-downloading",
    "DCGM: monitor GPU health, temperature, and memory",
]

print("\n\nOptimization Tips:")
for i, t in enumerate(tips, 1):
    print(f" {i}. {t}")
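To see how the HPA above would react to the dashboard numbers, the scaling rule from the Kubernetes documentation can be sketched in a few lines: desired replicas are `ceil(current * metric / target)`, changes within a 10% tolerance are skipped, the largest proposal across metrics wins, and the result is clamped to the min and max replica counts. The function names here are illustrative, and the per-pod queue values are assumed inputs rather than figures from the article.

```python
import math
from typing import List, Tuple


def desired_replicas(current: int, metric_value: float, target: float,
                     tolerance: float = 0.1) -> int:
    """Replica count one metric proposes, using the documented HPA formula."""
    ratio = metric_value / target
    if abs(ratio - 1.0) <= tolerance:
        return current  # close enough to target: no scaling
    return math.ceil(current * ratio)


def hpa_decision(current: int, metrics: List[Tuple[float, float]],
                 min_replicas: int, max_replicas: int) -> int:
    """metrics is a list of (current_average, target_average) pairs."""
    proposals = [desired_replicas(current, value, target) for value, target in metrics]
    # The largest proposal wins, then the result is clamped to the allowed range
    return max(min_replicas, min(max_replicas, max(proposals)))


# 6 pods at 72% average GPU utilization against a target of 70: inside tolerance
print(desired_replicas(6, 72, 70))                      # 6
# 85% utilization and an assumed 2 queued requests per pod (targets 70 and 5)
print(hpa_decision(6, [(85, 70), (2, 5)], 2, 10))       # 8
# A utilization metric at twice its target is capped by maxReplicas
print(hpa_decision(6, [(140, 70), (2, 5)], 2, 10))      # 10
```

At the dashboard's 72% average utilization the HPA would hold at 6 pods, since the ratio 72/70 is within the default tolerance.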
Tips
- vLLM: use vLLM for production inference; it is the fastest of the engines compared here
- Quantize: GPTQ/AWQ cut VRAM by 50-75% while keeping about 97% quality
- Taint: taint GPU nodes to keep non-GPU pods off them
- PVC: store models on a PVC so they are not downloaded every time
- HPA: scale on GPU utilization plus queue length
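For the vLLM tip, a client talks to the server through its OpenAI-compatible HTTP API. The sketch below only builds the request, so it runs without a server. It assumes the vLLM server from the Inference Optimization section is listening on `localhost:8000` and serving `mistralai/Mistral-7B-Instruct-v0.2`; `build_completion_request` is a hypothetical helper name.

```python
import json
import urllib.request


def build_completion_request(prompt: str,
                             base_url: str = "http://localhost:8000",
                             model: str = "mistralai/Mistral-7B-Instruct-v0.2",
                             max_tokens: int = 128) -> urllib.request.Request:
    """Build a POST request for the OpenAI-style /v1/completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_completion_request("Explain pod anti-affinity in one sentence.")
print(req.get_method(), req.full_url)

# Sending it requires a running server (base URL and model above are assumptions):
# with urllib.request.urlopen(req, timeout=60) as resp:
#     print(json.load(resp)["choices"][0]["text"])
```

Inside the cluster, `base_url` would point at the Service in front of the inference pods rather than at localhost.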
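What is Text Generation WebUI?
An open-source web UI for LLMs (Oobabooga). It runs models such as Llama and Mistral in GGUF, GPTQ, and AWQ formats, offers Chat, Notebook, and API modes, and works on NVIDIA GPUs, AMD GPUs, or CPU. It runs privately, so no data is sent to outside services.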