
Text Generation WebUI Pod Scheduling

2025-09-25 · อ. บอม — SiamCafe.net · 8,713 words



Inference Engine     Throughput   Latency   GPU Memory   Best For
vLLM                 Very high    Low       Medium       Production API
TGI (HuggingFace)    High         Low       Medium       Production API
Oobabooga WebUI      Medium       Medium    High         Dev/Testing
llama.cpp (GGUF)     Medium       Medium    Low          CPU/Edge
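As a rough decision aid, the trade-offs in the table can be encoded as a small lookup. This is my own simplification of the table, not an official recommendation from any of these projects:

```python
# Hypothetical helper: pick an inference engine from the trade-offs above.
# The rules compress the comparison table into two yes/no questions.
def pick_engine(production: bool, gpu_available: bool) -> str:
    if not gpu_available:
        return "llama.cpp (GGUF)"   # only row with a CPU/edge story
    if production:
        return "vLLM"               # highest throughput for a production API
    return "Oobabooga WebUI"        # full-featured UI for dev/testing

print(pick_engine(production=True, gpu_available=True))    # vLLM
print(pick_engine(production=False, gpu_available=True))   # Oobabooga WebUI
print(pick_engine(production=True, gpu_available=False))   # llama.cpp (GGUF)
```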

Kubernetes GPU Setup

# === GPU Pod Scheduling ===

# Install NVIDIA GPU Operator
# helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
# helm install gpu-operator nvidia/gpu-operator \
#   --namespace gpu-operator --create-namespace

# Pod with GPU — Text Generation WebUI
# apiVersion: apps/v1
# kind: Deployment
# metadata:
#   name: text-gen-webui
# spec:
#   replicas: 2
#   selector:
#     matchLabels:
#       app: text-gen-webui
#   template:
#     metadata:
#       labels:
#         app: text-gen-webui
#     spec:
#       nodeSelector:
#         accelerator: nvidia-a100
#       tolerations:
#         - key: nvidia.com/gpu
#           operator: Exists
#           effect: NoSchedule
#       containers:
#         - name: webui
#           image: atinoda/text-generation-webui:latest
#           resources:
#             requests:
#               cpu: "4"
#               memory: "16Gi"
#               nvidia.com/gpu: "1"
#             limits:
#               cpu: "8"
#               memory: "32Gi"
#               nvidia.com/gpu: "1"
#           ports:
#             - containerPort: 7860
#           volumeMounts:
#             - name: models
#               mountPath: /app/models
#       volumes:
#         - name: models
#           persistentVolumeClaim:
#             claimName: llm-models-pvc

# Pod Anti-affinity — spread GPU Pods across different Nodes
# affinity:
#   podAntiAffinity:
#     preferredDuringSchedulingIgnoredDuringExecution:
#       - weight: 100
#         podAffinityTerm:
#           labelSelector:
#             matchLabels:
#               app: text-gen-webui
#           topologyKey: kubernetes.io/hostname

from dataclasses import dataclass

@dataclass
class GPUNode:
    name: str
    gpu_type: str
    gpu_count: int
    vram_gb: int
    pods_running: int
    gpu_util_pct: float
    status: str

nodes = [
    GPUNode("gpu-node-01", "A100 80GB", 4, 320, 3, 72, "Ready"),
    GPUNode("gpu-node-02", "A100 80GB", 4, 320, 4, 85, "Ready"),
    GPUNode("gpu-node-03", "A10G 24GB", 4, 96, 2, 55, "Ready"),
    GPUNode("gpu-node-04", "T4 16GB", 4, 64, 4, 90, "Ready"),
    GPUNode("gpu-node-05", "A100 80GB", 4, 320, 0, 0, "Scaling Up"),
]

print("=== GPU Nodes ===")
total_gpus = 0
total_used = 0
for n in nodes:
    total_gpus += n.gpu_count
    total_used += n.pods_running
    print(f"  [{n.status}] {n.name}")
    print(f"    GPU: {n.gpu_type} x{n.gpu_count} | VRAM: {n.vram_gb}GB")
    print(f"    Pods: {n.pods_running} | Util: {n.gpu_util_pct}%")
print(f"\n  Total GPUs: {total_gpus} | Used: {total_used}")
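A minimal sketch of the filter-then-score logic the scheduler applies to nodes like these. It is heavily simplified (the real kube-scheduler runs many scoring plugins, and the one-GPU-per-pod assumption is mine), reusing the `GPUNode` dataclass from above:

```python
from dataclasses import dataclass

@dataclass
class GPUNode:
    name: str
    gpu_type: str
    gpu_count: int
    vram_gb: int
    pods_running: int
    gpu_util_pct: float
    status: str

def schedule_pod(nodes, gpus_needed=1, max_util=80.0):
    """Simplified scheduling: filter Ready nodes with free GPUs and
    utilization headroom, then score by lowest GPU utilization."""
    candidates = [
        n for n in nodes
        if n.status == "Ready"
        and n.gpu_count - n.pods_running >= gpus_needed  # assumes 1 GPU per pod
        and n.gpu_util_pct <= max_util
    ]
    if not candidates:
        return None  # Pod stays Pending until the autoscaler adds a node
    return min(candidates, key=lambda n: n.gpu_util_pct)

nodes = [
    GPUNode("gpu-node-01", "A100 80GB", 4, 320, 3, 72, "Ready"),
    GPUNode("gpu-node-02", "A100 80GB", 4, 320, 4, 85, "Ready"),
    GPUNode("gpu-node-03", "A10G 24GB", 4, 96, 2, 55, "Ready"),
    GPUNode("gpu-node-04", "T4 16GB", 4, 64, 4, 90, "Ready"),
    GPUNode("gpu-node-05", "A100 80GB", 4, 320, 0, 0, "Scaling Up"),
]

best = schedule_pod(nodes)
print(best.name)  # gpu-node-03: Ready, free GPU slots, lowest utilization
```

With these numbers, gpu-node-02 and gpu-node-04 are filtered out (no free GPUs), gpu-node-05 is not Ready yet, and gpu-node-03 wins on lowest utilization.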

Inference Optimization

# === LLM Inference Optimization ===

# vLLM — High-throughput Serving
# pip install vllm
# python -m vllm.entrypoints.openai.api_server \
#   --model mistralai/Mistral-7B-Instruct-v0.2 \
#   --gpu-memory-utilization 0.9 \
#   --max-model-len 8192 \
#   --tensor-parallel-size 1

# Docker — vLLM Server
# docker run --gpus all -p 8000:8000 \
#   vllm/vllm-openai:latest \
#   --model TheBloke/Mistral-7B-Instruct-v0.2-GPTQ \
#   --quantization gptq \
#   --max-model-len 4096

# Quantization Comparison
# Model: Llama-2-13B
# FP16:    26GB VRAM, 100% quality
# GPTQ-4:  8GB VRAM,  97% quality, 2x faster
# AWQ-4:   8GB VRAM,  97% quality, 2x faster
# GGUF-Q4: 8GB VRAM,  95% quality, CPU possible
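The VRAM figures above follow roughly from parameter count times bytes per weight. A back-of-the-envelope estimator (note: it ignores KV cache and activation overhead, which in practice add a few GB, closing the gap between the 6.5GB raw figure and the ~8GB observed for 4-bit quants):

```python
def estimate_vram_gb(params_b: float, bits_per_weight: int) -> float:
    """Rough weight-only VRAM estimate: parameters (in billions)
    times bytes per weight. Excludes KV cache and activations."""
    bytes_per_weight = bits_per_weight / 8
    return params_b * bytes_per_weight

# Llama-2-13B, matching the comparison above:
print(estimate_vram_gb(13, 16))  # 26.0 GB for FP16 (2 bytes/weight)
print(estimate_vram_gb(13, 4))   # 6.5 GB raw for 4-bit; ~8GB with overhead
```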

@dataclass
class ModelConfig:
    model: str
    quantization: str
    vram_gb: float
    tokens_per_sec: int
    quality_pct: float
    max_context: int

configs = [
    ModelConfig("Llama-2-70B", "FP16", 140, 25, 100, 4096),
    ModelConfig("Llama-2-70B", "GPTQ-4bit", 40, 45, 97, 4096),
    ModelConfig("Mistral-7B", "FP16", 14, 80, 100, 32768),
    ModelConfig("Mistral-7B", "GPTQ-4bit", 5, 120, 97, 32768),
    ModelConfig("Mistral-7B", "GGUF-Q4", 5, 40, 95, 32768),
    ModelConfig("Phi-3-mini", "FP16", 8, 100, 100, 128000),
]

print("\n=== Model Configurations ===")
for c in configs:
    print(f"  [{c.model}] {c.quantization}")
    print(f"    VRAM: {c.vram_gb}GB | Speed: {c.tokens_per_sec} tok/s | Quality: {c.quality_pct}%")

Autoscaling

# === HPA + Cluster Autoscaler ===

# HPA — Scale on GPU Utilization
# apiVersion: autoscaling/v2
# kind: HorizontalPodAutoscaler
# metadata:
#   name: llm-hpa
# spec:
#   scaleTargetRef:
#     apiVersion: apps/v1
#     kind: Deployment
#     name: text-gen-webui
#   minReplicas: 2
#   maxReplicas: 10
#   metrics:
#     - type: Pods
#       pods:
#         metric:
#           name: gpu_utilization
#         target:
#           type: AverageValue
#           averageValue: "70"
#     - type: Pods
#       pods:
#         metric:
#           name: request_queue_length
#         target:
#           type: AverageValue
#           averageValue: "5"
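With two metrics configured, the HPA computes a desired replica count per metric using the standard formula from the Kubernetes docs, desiredReplicas = ceil(currentReplicas x currentValue / targetValue), and acts on the largest result. A sketch (the example metric values are illustrative):

```python
import math

def hpa_desired_replicas(current_replicas, metrics, min_replicas=2, max_replicas=10):
    """Standard HPA formula per metric: ceil(current * currentValue / targetValue).
    The HPA scales to the highest proposal, clamped to min/max replicas."""
    proposals = [
        math.ceil(current_replicas * current / target)
        for current, target in metrics
    ]
    return max(min_replicas, min(max_replicas, max(proposals)))

# Per-pod averages: gpu_utilization at 90 vs target 70,
# request_queue_length at 4 vs target 5
print(hpa_desired_replicas(6, [(90, 70), (4, 5)]))  # 8 (GPU metric dominates)
```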

# Karpenter — GPU Node Provisioner
# apiVersion: karpenter.sh/v1alpha5
# kind: Provisioner
# metadata:
#   name: gpu-provisioner
# spec:
#   requirements:
#     - key: node.kubernetes.io/instance-type
#       operator: In
#       values: ["p3.2xlarge", "g5.xlarge", "g5.2xlarge"]
#     - key: nvidia.com/gpu
#       operator: Exists
#   limits:
#     resources:
#       nvidia.com/gpu: "20"
#   ttlSecondsAfterEmpty: 300

scaling_metrics = {
    "Active Pods": "6 / 10 max",
    "GPU Utilization (avg)": "72%",
    "Request Queue": "3 pending",
    "Tokens/sec (total)": "480",
    "Concurrent Users": "25",
    "Avg Latency (TTFT)": "450ms",
    "GPU Nodes": "4 active, 1 scaling",
    "Monthly GPU Cost": "$8,500",
}

print("Autoscaling Dashboard:")
for k, v in scaling_metrics.items():
    print(f"  {k}: {v}")

tips = [
    "Quantization: use GPTQ/AWQ to cut VRAM by 50-75%",
    "vLLM: PagedAttention lifts throughput 3-5x",
    "Batching: continuous batching merges concurrent requests",
    "Karpenter: auto-provision GPU nodes on demand",
    "Spot instances: run non-critical inference on Spot capacity",
    "Model cache: store models on a PVC to avoid re-downloading",
    "DCGM: monitor GPU health, temperature, and memory",
]

print("\n\nOptimization Tips:")
for i, t in enumerate(tips, 1):
    print(f"  {i}. {t}")

FAQ

What is Text Generation WebUI?

Text Generation WebUI (Oobabooga) is an open-source web interface for running LLMs such as Llama and Mistral locally. It loads GGUF, GPTQ, and AWQ models, offers chat, notebook, and API modes, and runs on NVIDIA or AMD GPUs as well as CPU. It is fully private; no data leaves your infrastructure.

What is Pod scheduling on Kubernetes?

Scheduling is how Kubernetes picks a node for each Pod: it matches resource requests and limits (CPU, memory, GPU) against node capacity while honoring node selectors, affinity and anti-affinity rules, and taints and tolerations.

How do you manage GPUs on Kubernetes?

Install the NVIDIA GPU Operator so the device plugin exposes nvidia.com/gpu as a schedulable resource, taint GPU nodes to keep ordinary workloads off them, share GPUs with MIG or time-slicing, autoscale nodes with Karpenter, and monitor them with DCGM.

How do you scale LLM inference?

Drive an HPA from GPU utilization and request-queue length, serve with vLLM or TGI for continuous batching and KV caching, apply quantization (GPTQ, AWQ, GGUF) to cut VRAM by 50-75%, and use tensor parallelism to spread large models across multiple GPUs.

Summary

Running Text Generation WebUI at scale on Kubernetes comes down to scheduling Pods onto NVIDIA GPU nodes, serving with efficient engines such as vLLM or TGI, shrinking models with GPTQ/AWQ quantization, and autoscaling with the HPA and Karpenter to keep LLM inference fast and affordable in production.
