Text Gen WebUI Scheduling
| Inference Engine | Throughput | Latency | GPU Memory | Best For |
|---|---|---|---|---|
| vLLM | Very high | Low | Medium | Production API |
| TGI (HuggingFace) | High | Low | Medium | Production API |
| Oobabooga WebUI | Medium | Medium | High | Dev/Testing |
| llama.cpp (GGUF) | Medium | Medium | Low | CPU/Edge |
Kubernetes GPU Setup
# === GPU Pod Scheduling ===

# Install NVIDIA GPU Operator
# helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
# helm install gpu-operator nvidia/gpu-operator \
#   --namespace gpu-operator --create-namespace

# Pod with GPU — Text Generation WebUI
# apiVersion: apps/v1
# kind: Deployment
# metadata:
#   name: text-gen-webui
# spec:
#   replicas: 2
#   selector:
#     matchLabels:
#       app: text-gen-webui
#   template:
#     metadata:
#       labels:
#         app: text-gen-webui
#     spec:
#       nodeSelector:
#         accelerator: nvidia-a100
#       tolerations:
#         - key: nvidia.com/gpu
#           operator: Exists
#           effect: NoSchedule
#       containers:
#         - name: webui
#           image: atinoda/text-generation-webui:latest
#           resources:
#             requests:
#               cpu: "4"
#               memory: "16Gi"
#               nvidia.com/gpu: "1"
#             limits:
#               cpu: "8"
#               memory: "32Gi"
#               nvidia.com/gpu: "1"
#           ports:
#             - containerPort: 7860
#           volumeMounts:
#             - name: models
#               mountPath: /app/models
#       volumes:
#         - name: models
#           persistentVolumeClaim:
#             claimName: llm-models-pvc
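The scheduler only places this Pod on a node whose allocatable resources cover every request, including the extended `nvidia.com/gpu` resource. A minimal sketch of that fit check (the node and pod figures below are illustrative, not from a real cluster):

```python
def fits(node_allocatable: dict, pod_requests: dict) -> bool:
    """True if the node can satisfy every resource the pod requests."""
    return all(node_allocatable.get(res, 0) >= qty
               for res, qty in pod_requests.items())

# Illustrative quantities: millicores, MiB, GPU count
node = {"cpu": 16000, "memory": 65536, "nvidia.com/gpu": 4}
pod = {"cpu": 4000, "memory": 16384, "nvidia.com/gpu": 1}

print(fits(node, pod))                            # True
print(fits(node, {**pod, "nvidia.com/gpu": 8}))   # False: not enough GPUs
```

The real scheduler subtracts already-scheduled Pods from capacity first; this sketch only shows the predicate itself.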
# Pod anti-affinity — spread GPU Pods so they do not land on the same Node
# (placed under spec.template.spec in the Deployment above)
#       affinity:
#         podAntiAffinity:
#           preferredDuringSchedulingIgnoredDuringExecution:
#             - weight: 100
#               podAffinityTerm:
#                 labelSelector:
#                   matchLabels:
#                     app: text-gen-webui
#                 topologyKey: kubernetes.io/hostname
from dataclasses import dataclass

@dataclass
class GPUNode:
    name: str
    gpu_type: str
    gpu_count: int
    vram_gb: int
    pods_running: int
    gpu_util_pct: float
    status: str

nodes = [
    GPUNode("gpu-node-01", "A100 80GB", 4, 320, 3, 72, "Ready"),
    GPUNode("gpu-node-02", "A100 80GB", 4, 320, 4, 85, "Ready"),
    GPUNode("gpu-node-03", "A10G 24GB", 4, 96, 2, 55, "Ready"),
    GPUNode("gpu-node-04", "T4 16GB", 4, 64, 4, 90, "Ready"),
    GPUNode("gpu-node-05", "A100 80GB", 4, 320, 0, 0, "Scaling Up"),
]

print("=== GPU Nodes ===")
total_gpus = 0
total_used = 0
for n in nodes:
    total_gpus += n.gpu_count
    total_used += n.pods_running
    print(f"  [{n.status}] {n.name}")
    print(f"    GPU: {n.gpu_type} x{n.gpu_count} | VRAM: {n.vram_gb}GB")
    print(f"    Pods: {n.pods_running} | Util: {n.gpu_util_pct}%")

# Each pod requests one GPU, so pods_running doubles as GPUs in use
print(f"\nTotal GPUs: {total_gpus} | GPUs in use: {total_used}")
Inference Optimization
# === LLM Inference Optimization ===

# vLLM — high-throughput serving
# pip install vllm
# python -m vllm.entrypoints.openai.api_server \
#   --model mistralai/Mistral-7B-Instruct-v0.2 \
#   --gpu-memory-utilization 0.9 \
#   --max-model-len 8192 \
#   --tensor-parallel-size 1
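With `--gpu-memory-utilization 0.9`, vLLM reserves 90% of VRAM; whatever remains after the model weights becomes PagedAttention KV-cache blocks. A rough back-of-envelope for how many tokens that cache can hold, using commonly cited Mistral-7B figures (32 layers, 8 KV heads via grouped-query attention, head dim 128, FP16 cache) as assumptions:

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    # Both keys and values are cached, hence the leading factor of 2
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def cache_capacity_tokens(vram_gb: float, weights_gb: float,
                          utilization: float, per_token_bytes: int) -> int:
    budget_bytes = (vram_gb * utilization - weights_gb) * 1024**3
    return int(budget_bytes // per_token_bytes)

per_tok = kv_bytes_per_token(32, 8, 128)
print(per_tok)  # 131072 bytes, i.e. 128 KiB per token

# 24 GB card, ~14 GB of FP16 weights, 0.9 utilization: roughly 62k cached tokens
print(cache_capacity_tokens(24, 14, 0.9, per_tok))
```

This ignores activation memory and block-granularity waste, so treat it as an upper-bound estimate, not vLLM's exact accounting.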
# Docker — vLLM server
# docker run --gpus all -p 8000:8000 \
#   vllm/vllm-openai:latest \
#   --model TheBloke/Mistral-7B-Instruct-v0.2-GPTQ \
#   --quantization gptq \
#   --max-model-len 4096
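Both launch modes expose an OpenAI-compatible API on port 8000, so any OpenAI-style client works by pointing it at the pod. A sketch of the chat-completions request body (the endpoint URL is assumed local; the model name must match what the server loaded):

```python
import json

payload = {
    "model": "mistralai/Mistral-7B-Instruct-v0.2",
    "messages": [
        {"role": "user", "content": "Summarize PagedAttention in one line."}
    ],
    "max_tokens": 128,
    "temperature": 0.7,
}
body = json.dumps(payload)

# Against a live server you would POST this, e.g. with requests:
# requests.post("http://localhost:8000/v1/chat/completions",
#               data=body, headers={"Content-Type": "application/json"})
print(body[:50])
```

In Kubernetes, a ClusterIP Service in front of the Deployment gives clients a stable address instead of pod IPs.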
# Quantization Comparison
# Model: Llama-2-13B
# FP16: 26GB VRAM, 100% quality
# GPTQ-4: 8GB VRAM, 97% quality, 2x faster
# AWQ-4: 8GB VRAM, 97% quality, 2x faster
# GGUF-Q4: 8GB VRAM, 95% quality, CPU possible
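The VRAM figures above follow directly from parameter count times bytes per weight, plus runtime overhead. A rough estimator (the ~20% overhead factor is an assumption for illustration, not a measured constant):

```python
def est_vram_gb(params_b: float, bits: int, overhead: float = 1.2) -> float:
    """Approximate serving VRAM: weights (params * bits/8 bytes) plus overhead."""
    return params_b * (bits / 8) * overhead

# Llama-2-13B: weights alone are 13B * 2 bytes = 26 GB at FP16
print(round(est_vram_gb(13, 16), 1))  # 31.2 with overhead
# 4-bit quantization quarters the weight footprint
print(round(est_vram_gb(13, 4), 1))   # 7.8, close to the 8 GB above
```

KV cache grows with context length on top of this, which is why long-context serving needs more headroom than the weight math suggests.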
@dataclass
class ModelConfig:
    model: str
    quantization: str
    vram_gb: float
    tokens_per_sec: int
    quality_pct: float
    max_context: int

configs = [
    ModelConfig("Llama-2-70B", "FP16", 140, 25, 100, 4096),
    ModelConfig("Llama-2-70B", "GPTQ-4bit", 40, 45, 97, 4096),
    ModelConfig("Mistral-7B", "FP16", 14, 80, 100, 32768),
    ModelConfig("Mistral-7B", "GPTQ-4bit", 5, 120, 97, 32768),
    ModelConfig("Mistral-7B", "GGUF-Q4", 5, 40, 95, 32768),
    ModelConfig("Phi-3-mini", "FP16", 8, 100, 100, 128000),
]

print("\n=== Model Configurations ===")
for c in configs:
    print(f"  [{c.model}] {c.quantization}")
    print(f"    VRAM: {c.vram_gb}GB | Speed: {c.tokens_per_sec} tok/s | Quality: {c.quality_pct}%")
Autoscaling
# === HPA + Cluster Autoscaler ===

# HPA — scale on GPU utilization and request queue length
# (Pods-type custom metrics like gpu_utilization require a custom-metrics
#  adapter, e.g. Prometheus Adapter, feeding the custom metrics API)
# apiVersion: autoscaling/v2
# kind: HorizontalPodAutoscaler
# metadata:
#   name: llm-hpa
# spec:
#   scaleTargetRef:
#     apiVersion: apps/v1
#     kind: Deployment
#     name: text-gen-webui
#   minReplicas: 2
#   maxReplicas: 10
#   metrics:
#     - type: Pods
#       pods:
#         metric:
#           name: gpu_utilization
#         target:
#           type: AverageValue
#           averageValue: "70"
#     - type: Pods
#       pods:
#         metric:
#           name: request_queue_length
#         target:
#           type: AverageValue
#           averageValue: "5"
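The HPA derives its replica count from the ratio of the observed metric to its target, ceil(currentReplicas * current / target), then clamps to the min/max bounds. A sketch using the 70% gpu_utilization target above:

```python
import math

def desired_replicas(current_replicas: int, current_value: float,
                     target_value: float,
                     min_replicas: int = 2, max_replicas: int = 10) -> int:
    """HPA core formula: scale replicas by metric ratio, clamped to bounds."""
    raw = math.ceil(current_replicas * current_value / target_value)
    return max(min_replicas, min(max_replicas, raw))

# 4 pods averaging 90% GPU utilization against a 70% target -> scale up
print(desired_replicas(4, 90, 70))  # 6
# Load collapses to 20% -> scale down, but never below minReplicas
print(desired_replicas(6, 20, 70))  # 2
```

With multiple metrics (utilization plus queue length), the HPA evaluates each and takes the largest desired count, so whichever signal is hottest wins.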
# Karpenter — GPU Node Provisioner
# (the v1alpha5 Provisioner API shown here is legacy; newer Karpenter
#  releases use the NodePool API instead)
# apiVersion: karpenter.sh/v1alpha5
# kind: Provisioner
# metadata:
#   name: gpu-provisioner
# spec:
#   requirements:
#     - key: node.kubernetes.io/instance-type
#       operator: In
#       values: ["p3.2xlarge", "g5.xlarge", "g5.2xlarge"]
#     - key: nvidia.com/gpu
#       operator: Exists
#   limits:
#     resources:
#       nvidia.com/gpu: "20"
#   ttlSecondsAfterEmpty: 300
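At its core, this style of provisioning counts the GPUs requested by unschedulable Pods and launches enough instances to cover them, without exceeding the 20-GPU limit. A sketch (assuming single-GPU instances such as g5.xlarge, which carries one A10G):

```python
import math

def nodes_needed(pending_gpu_requests: list, gpus_per_node: int,
                 current_gpus: int, gpu_limit: int = 20) -> int:
    """Nodes to launch so pending GPU requests fit, honoring the GPU limit."""
    demand = sum(pending_gpu_requests)
    headroom = max(0, gpu_limit - current_gpus)  # GPUs the limit still allows
    grantable = min(demand, headroom)
    return math.ceil(grantable / gpus_per_node)

# 5 pending pods, 1 GPU each, 12 GPUs already provisioned -> 5 new nodes
print(nodes_needed([1] * 5, gpus_per_node=1, current_gpus=12))  # 5
# Near the limit: only 2 more GPUs are allowed, the rest stays pending
print(nodes_needed([1] * 5, gpus_per_node=1, current_gpus=18))  # 2
```

Real Karpenter also bin-packs mixed pod shapes and picks the cheapest instance type that satisfies the requirements; this shows only the capacity arithmetic.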
scaling_metrics = {
    "Active Pods": "6 / 10 max",
    "GPU Utilization (avg)": "72%",
    "Request Queue": "3 pending",
    "Tokens/sec (total)": "480",
    "Concurrent Users": "25",
    "Avg Latency (TTFT)": "450ms",
    "GPU Nodes": "4 active, 1 scaling",
    "Monthly GPU Cost": "$8,500",
}

print("Autoscaling Dashboard:")
for k, v in scaling_metrics.items():
    print(f"  {k}: {v}")
tips = [
    "Quantization: use GPTQ/AWQ to cut VRAM by 50-75%",
    "vLLM: PagedAttention raises throughput 3-5x",
    "Batching: continuous batching merges concurrent requests",
    "Karpenter: auto-provision GPU nodes on demand",
    "Spot instances: use Spot for non-critical inference",
    "Model cache: keep models on a PVC to avoid re-downloading",
    "DCGM: monitor GPU health, temperature, and memory",
]

print("\nOptimization Tips:")
for i, t in enumerate(tips, 1):
    print(f"  {i}. {t}")
Tips
- vLLM: use vLLM for production inference; it is the fastest option compared here
- Quantize: GPTQ/AWQ cut VRAM by 50-75% while retaining ~97% quality
- Taint: taint GPU nodes to keep non-GPU Pods off them
- PVC: store models on a PVC so they are not downloaded on every start
- HPA: scale on GPU utilization plus request queue length
What is Text Generation WebUI?
An open-source web UI (Oobabooga) for running LLMs such as Llama and Mistral locally, with GGUF/GPTQ/AWQ model support, chat/notebook/API modes, and NVIDIA, AMD, or CPU backends. Everything runs privately; no data leaves your infrastructure.
What is Pod scheduling on Kubernetes?
The process of choosing a Node for each Pod based on its resource requests and limits (CPU, memory, GPU), steered by node selectors, affinity and anti-affinity rules, and taints/tolerations.
How do you manage GPUs on Kubernetes?
Install the NVIDIA GPU Operator (its device plugin exposes the nvidia.com/gpu resource), taint GPU nodes, share GPUs via MIG or time-slicing, autoscale nodes with Karpenter, and monitor GPU health with DCGM.
How do you scale LLM inference?
Use an HPA driven by GPU utilization and queue length; serve with high-throughput engines (vLLM, TGI) that do continuous batching and KV caching; apply quantization (GPTQ/AWQ/GGUF) to cut VRAM by 50-75%; and use tensor parallelism across multiple GPUs for large models.
Summary
Running Text Generation WebUI on Kubernetes comes down to GPU-aware Pod scheduling (NVIDIA GPU Operator, taints, anti-affinity), an efficient inference engine (vLLM or TGI), quantization (GPTQ/AWQ) to fit models into VRAM, and autoscaling with HPA plus Karpenter for production LLM serving.
