SiamCafe · Blog
Text Generation WebUI Pod Scheduling: Management

Published 28 May 2026 (B.E. 2569)

Text Gen WebUI Scheduling

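This article covers running Text Generation WebUI (Oobabooga) for LLM inference on Kubernetes: scheduling pods onto NVIDIA GPU nodes, serving Llama and Mistral models, choosing between engines such as vLLM and TGI, quantizing with GPTQ, AWQ, or GGUF, and autoscaling with the HPA and a node autoscaler.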

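Inference Engine  | Throughput | Latency | GPU Memory | Best for
------------------|------------|---------|------------|---------------
vLLM              | Very high  | Low     | Medium     | Production API
TGI (HuggingFace) | High       | Low     | Medium     | Production API
Oobabooga WebUI   | Medium     | Medium  | High       | Dev/Testing
llama.cpp (GGUF)  | Medium     | Medium  | Low        | CPU/Edge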

Kubernetes GPU Setup

# === GPU Pod Scheduling ===

# Install NVIDIA GPU Operator
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace
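Once the operator is running, GPU nodes should advertise an `nvidia.com/gpu` allocatable resource. The sketch below is one way to check this, assuming the input is JSON shaped like the output of `kubectl get nodes -o json`. The function name `gpu_allocatable` and the sample data are hypothetical and only illustrate the idea.

```python
import json

def gpu_allocatable(nodes_json: str) -> dict:
    """Map node name -> allocatable nvidia.com/gpu count (0 when absent)."""
    doc = json.loads(nodes_json)
    result = {}
    for item in doc.get("items", []):
        name = item["metadata"]["name"]
        alloc = item.get("status", {}).get("allocatable", {})
        # Kubernetes reports extended resources as strings, e.g. "4".
        result[name] = int(alloc.get("nvidia.com/gpu", "0"))
    return result

# Hypothetical sample shaped like `kubectl get nodes -o json` output.
sample = json.dumps({"items": [
    {"metadata": {"name": "gpu-node-01"},
     "status": {"allocatable": {"cpu": "32", "nvidia.com/gpu": "4"}}},
    {"metadata": {"name": "cpu-node-01"},
     "status": {"allocatable": {"cpu": "16"}}},
]})

for node, gpus in gpu_allocatable(sample).items():
    print(f"{node}: {gpus} GPU(s) allocatable")
```

A node that reports 0 here has probably not had its device plugin registered yet, so GPU pods would stay Pending on it.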

# Pod with GPU: Text Generation WebUI
apiVersion: apps/v1
kind: Deployment
metadata:
  name: text-gen-webui
spec:
  replicas: 2
  selector:
    matchLabels:
      app: text-gen-webui
  template:
    metadata:
      labels:
        app: text-gen-webui
    spec:
      nodeSelector:
        accelerator: nvidia-a100
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: webui
          image: atinoda/text-generation-webui:latest
          resources:
            requests:
              cpu: "4"
              memory: "16Gi"
              nvidia.com/gpu: "1"
            limits:
              cpu: "8"
              memory: "32Gi"
              nvidia.com/gpu: "1"
          ports:
            - containerPort: 7860
          volumeMounts:
            - name: models
              mountPath: /app/models
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: llm-models-pvc
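The Deployment combines three placement constraints: a `nodeSelector`, a toleration for the GPU taint, and a GPU resource request. The sketch below is a deliberately simplified model of how those three checks filter nodes. It is not the real kube-scheduler logic, and the `Node`, `Pod`, `tolerates`, and `fits` names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    labels: dict
    taints: list = field(default_factory=list)  # (key, effect) tuples
    free_gpus: int = 0

@dataclass
class Pod:
    node_selector: dict
    tolerations: list = field(default_factory=list)  # (key, effect), operator Exists
    gpus: int = 1

# Taint effects that block scheduling; PreferNoSchedule is only a soft hint.
HARD_EFFECTS = {"NoSchedule", "NoExecute"}

def tolerates(pod: Pod, taint: tuple) -> bool:
    key, effect = taint
    return any(k == key and e == effect for k, e in pod.tolerations)

def fits(pod: Pod, node: Node) -> bool:
    labels_ok = all(node.labels.get(k) == v for k, v in pod.node_selector.items())
    taints_ok = all(tolerates(pod, t) for t in node.taints if t[1] in HARD_EFFECTS)
    return labels_ok and taints_ok and node.free_gpus >= pod.gpus

gpu_taint = ("nvidia.com/gpu", "NoSchedule")
webui = Pod({"accelerator": "nvidia-a100"}, [gpu_taint], gpus=1)
nodes = [
    Node("gpu-node-01", {"accelerator": "nvidia-a100"}, [gpu_taint], free_gpus=1),
    Node("gpu-node-02", {"accelerator": "nvidia-a100"}, [gpu_taint], free_gpus=0),
    Node("gpu-node-04", {"accelerator": "nvidia-t4"}, [gpu_taint], free_gpus=2),
]
for n in nodes:
    print(f"{n.name}: {'fits' if fits(webui, n) else 'filtered out'}")
```

In this model a pod without the toleration is rejected by every tainted GPU node, which is the effect the taint is meant to have on non-GPU workloads.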

# Pod anti-affinity: prefer placing GPU pods on different nodes
# (fragment; it belongs under spec.template.spec of the Deployment)
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: text-gen-webui
          topologyKey: kubernetes.io/hostname
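Because the rule is `preferred` rather than `required`, it only influences scoring: a replica can still share a node when no empty node exists. The sketch below illustrates that behaviour with a toy penalty (weight times the number of matching pods already on the node). The real scheduler combines and normalizes many scores, so treat `pick_node` as a hypothetical simplification.

```python
def pick_node(pods_by_node: dict, selector: dict, weight: int = 100) -> str:
    """Return the node with the lowest anti-affinity penalty.

    pods_by_node maps node name -> list of pod label dicts on that node.
    """
    def penalty(node: str) -> int:
        matching = sum(
            1 for labels in pods_by_node[node]
            if all(labels.get(k) == v for k, v in selector.items())
        )
        return weight * matching

    # Sorting first makes ties resolve alphabetically, for a stable result.
    return min(sorted(pods_by_node), key=penalty)

selector = {"app": "text-gen-webui"}
cluster = {
    "gpu-node-01": [{"app": "text-gen-webui"}],
    "gpu-node-02": [],
    "gpu-node-03": [{"app": "other"}],
}
print(pick_node(cluster, selector))

# Every node already runs a replica: the pod is still placed somewhere.
crowded = {
    "gpu-node-01": [{"app": "text-gen-webui"}],
    "gpu-node-02": [{"app": "text-gen-webui"}, {"app": "text-gen-webui"}],
}
print(pick_node(crowded, selector))
```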

from dataclasses import dataclass

@dataclass
class GPUNode:
    name: str
    gpu_type: str
    gpu_count: int
    vram_gb: int          # total VRAM across all GPUs on the node
    pods_running: int
    gpu_util_pct: float
    status: str

nodes = [
    GPUNode("gpu-node-01", "A100 80GB", 4, 320, 3, 72, "Ready"),
    GPUNode("gpu-node-02", "A100 80GB", 4, 320, 4, 85, "Ready"),
    GPUNode("gpu-node-03", "A10G 24GB", 4, 96, 2, 55, "Ready"),
    GPUNode("gpu-node-04", "T4 16GB", 4, 64, 4, 90, "Ready"),
    GPUNode("gpu-node-05", "A100 80GB", 4, 320, 0, 0, "Scaling Up"),
]

print("=== GPU Nodes ===")
total_gpus = 0
total_used = 0
for n in nodes:
    total_gpus += n.gpu_count
    # Each pod requests one GPU, so running pods equal GPUs in use.
    total_used += n.pods_running
    print(f"  [{n.status}] {n.name}")
    print(f"    GPU: {n.gpu_type} x{n.gpu_count} | VRAM: {n.vram_gb}GB")
    print(f"    Pods: {n.pods_running} | Util: {n.gpu_util_pct}%")

print(f"\n  Total GPUs: {total_gpus} | Used: {total_used}")

Inference Optimization

# === LLM Inference Optimization ===

# vLLM — High-throughput Serving
# pip install vllm
# python -m vllm.entrypoints.openai.api_server \
#   --model mistralai/Mistral-7B-Instruct-v0.2 \
#   --gpu-memory-utilization 0.9 \
#   --max-model-len 8192 \
#   --tensor-parallel-size 1

# Docker — vLLM Server
# docker run --gpus all -p 8000:8000 \
#   vllm/vllm-openai:latest \
#   --model TheBloke/Mistral-7B-Instruct-v0.2-GPTQ \
#   --quantization gptq \
#   --max-model-len 4096

# Quantization Comparison (approximate, illustrative figures)
# Model: Llama-2-13B
# FP16:    26GB VRAM, 100% quality (baseline)
# GPTQ-4:  8GB VRAM,  ~97% quality, ~2x faster than FP16
# AWQ-4:   8GB VRAM,  ~97% quality, ~2x faster than FP16
# GGUF-Q4: 8GB VRAM (or system RAM), ~95% quality, can run on CPU

from dataclasses import dataclass

@dataclass
class ModelConfig:
    model: str
    quantization: str
    vram_gb: float
    tokens_per_sec: int
    quality_pct: float
    max_context: int

configs = [
    ModelConfig("Llama-2-70B", "FP16", 140, 25, 100, 4096),
    ModelConfig("Llama-2-70B", "GPTQ-4bit", 40, 45, 97, 4096),
    ModelConfig("Mistral-7B", "FP16", 14, 80, 100, 32768),
    ModelConfig("Mistral-7B", "GPTQ-4bit", 5, 120, 97, 32768),
    ModelConfig("Mistral-7B", "GGUF-Q4", 5, 40, 95, 32768),
    ModelConfig("Phi-3-mini", "FP16", 8, 100, 100, 128000),
]

print("\n=== Model Configurations ===")
for c in configs:
    print(f"  [{c.model}] {c.quantization}")
    print(f"    VRAM: {c.vram_gb}GB | Speed: {c.tokens_per_sec} tok/s | Quality: {c.quality_pct}%")
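The VRAM figures above follow from simple arithmetic: weight memory is roughly the parameter count times the bits per weight, divided by 8 to get bytes. The sketch below works that out. The 20% overhead for KV cache and activations is an assumption chosen for illustration, not a measured value, and both function names are hypothetical.

```python
def weight_vram_gb(params_billion: float, bits_per_weight: int) -> float:
    """Memory for the weights alone, in GB (1B params at 8 bits = 1 GB)."""
    return params_billion * bits_per_weight / 8

def fits_on_gpu(params_billion: float, bits_per_weight: int,
                gpu_vram_gb: float, overhead: float = 1.2) -> bool:
    """Rough fit check; `overhead` is an assumed 20% margin for KV cache etc."""
    return weight_vram_gb(params_billion, bits_per_weight) * overhead <= gpu_vram_gb

for name, params, bits in [
    ("Llama-2-13B FP16", 13, 16),
    ("Llama-2-13B 4-bit", 13, 4),
    ("Llama-2-70B FP16", 70, 16),
    ("Llama-2-70B 4-bit", 70, 4),
]:
    gb = weight_vram_gb(params, bits)
    print(f"{name}: {gb:.1f} GB weights | fits one 80GB GPU: {fits_on_gpu(params, bits, 80)}")
```

Under these assumptions a 4-bit 70B model fits on a single 80GB A100, while the FP16 version needs tensor parallelism across at least two GPUs.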

Autoscaling

# === HPA + Cluster Autoscaler ===

# HPA: Scale on GPU Utilization
# Pods-type metrics come from the custom metrics API, so a metrics adapter
# must expose gpu_utilization and request_queue_length for this to work.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: text-gen-webui
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: gpu_utilization
        target:
          type: AverageValue
          averageValue: "70"
    - type: Pods
      pods:
        metric:
          name: request_queue_length
        target:
          type: AverageValue
          averageValue: "5"
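With two metrics, the HPA computes a replica count for each one and takes the largest. The sketch below follows the documented formula, desired = ceil(current * metric / target), with the default 10% tolerance and clamping to the min and max replicas. The metric readings passed in are hypothetical, and the real controller also applies stabilization windows that are not modelled here.

```python
import math

def desired_replicas(current: int, metrics: list, min_replicas: int,
                     max_replicas: int, tolerance: float = 0.1) -> int:
    """metrics is a list of (current_average, target_average) pairs."""
    proposals = []
    for value, target in metrics:
        ratio = value / target
        if abs(ratio - 1.0) <= tolerance:
            proposals.append(current)          # close enough: no change
        else:
            proposals.append(math.ceil(current * ratio))
    # The largest proposal wins, then the result is clamped to the bounds.
    return max(min_replicas, min(max_replicas, max(proposals)))

# (gpu_utilization avg, target 70), (request_queue_length avg, target 5)
print(desired_replicas(6, [(72, 70), (3, 5)], 2, 10))    # steady state
print(desired_replicas(2, [(95, 70), (12, 5)], 2, 10))   # queue drives scale-up
print(desired_replicas(8, [(140, 70), (4, 5)], 2, 10))   # capped at maxReplicas
```

Taking the maximum means a long request queue can trigger a scale-up even while GPU utilization still looks acceptable.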

# Karpenter: GPU Node Provisioner
# Note: karpenter.sh/v1alpha5 Provisioner is the legacy API; newer Karpenter
# releases replace it with the NodePool resource.
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: gpu-provisioner
spec:
  requirements:
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["p3.2xlarge", "g5.xlarge", "g5.2xlarge"]
    - key: nvidia.com/gpu
      operator: Exists
  limits:
    resources:
      nvidia.com/gpu: "20"
  ttlSecondsAfterEmpty: 300
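The `ttlSecondsAfterEmpty: 300` setting means a node becomes a removal candidate once it has had no workload pods for five minutes. The sketch below models only that timing rule. The function name `nodes_to_remove` and the timestamps are hypothetical, and the real controller tracks emptiness itself rather than receiving it as input.

```python
from typing import Dict, List, Optional

def nodes_to_remove(empty_since: Dict[str, Optional[float]], now: float,
                    ttl_seconds: int = 300) -> List[str]:
    """Return nodes that have been empty for at least ttl_seconds.

    empty_since maps node name -> time (seconds) it became empty,
    or None when the node still runs workload pods.
    """
    return sorted(
        name for name, since in empty_since.items()
        if since is not None and now - since >= ttl_seconds
    )

state = {"gpu-node-03": 0.0, "gpu-node-04": 200.0, "gpu-node-01": None}
print(nodes_to_remove(state, now=300.0))  # only the node empty for 300s
print(nodes_to_remove(state, now=500.0))  # both empty nodes have expired
```

A short TTL reduces idle GPU cost, while a longer one avoids re-provisioning nodes during bursty traffic.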

scaling_metrics = {
    "Active Pods": "6 / 10 max",
    "GPU Utilization (avg)": "72%",
    "Request Queue": "3 pending",
    "Tokens/sec (total)": "480",
    "Concurrent Users": "25",
    "Avg Latency (TTFT)": "450ms",
    "GPU Nodes": "4 active, 1 scaling",
    "Monthly GPU Cost": "$8,500",
}

print("Autoscaling Dashboard:")
for k, v in scaling_metrics.items():
    print(f"  {k}: {v}")

tips = [
    "Quantization: use GPTQ/AWQ to cut VRAM by 50-75%",
    "vLLM: PagedAttention raises throughput 3-5x",
    "Batching: continuous batching merges concurrent requests",
    "Karpenter: auto-provision GPU nodes on demand",
    "Spot instances: use Spot for non-critical inference",
    "Model cache: keep models on a PVC to avoid repeated downloads",
    "DCGM: monitor GPU health, temperature, and memory",
]

print("\n\nOptimization Tips:")
for i, t in enumerate(tips, 1):
    print(f"  {i}. {t}")
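The dashboard figures can be turned into a rough unit cost. The sketch below divides the monthly GPU cost by the tokens generated in a month, assuming a 30-day month and that the 480 tokens/sec figure is sustained for the chosen fraction of the time. Both are assumptions for illustration, and `cost_per_million_tokens` is a hypothetical helper.

```python
def cost_per_million_tokens(monthly_cost_usd: float, tokens_per_sec: float,
                            utilization: float = 1.0, days: int = 30) -> float:
    """USD per one million generated tokens under the stated assumptions."""
    tokens_per_month = tokens_per_sec * utilization * days * 24 * 3600
    return monthly_cost_usd / tokens_per_month * 1_000_000

full = cost_per_million_tokens(8500, 480)
half = cost_per_million_tokens(8500, 480, utilization=0.5)
print(f"At 100% utilization: ${full:.2f} per 1M tokens")
print(f"At  50% utilization: ${half:.2f} per 1M tokens")
```

The cost per token doubles when the cluster is busy only half the time, which is why scaling idle GPU nodes down matters as much as raw throughput.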

Tips

  • vLLM: use vLLM for production inference; it has the highest throughput of the engines compared above
  • Quantize: GPTQ/AWQ cut VRAM by 50-75% while keeping about 97% of the quality
  • Taint: taint GPU nodes to keep non-GPU pods off them
  • PVC: store models on a PVC so they are not downloaded on every start
  • HPA: scale on GPU utilization plus queue length

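What is Text Generation WebUI?

Text Generation WebUI (Oobabooga) is an open-source web UI for running LLMs such as Llama and Mistral in GGUF, GPTQ, and AWQ formats. It provides Chat and Notebook modes plus an API, runs on NVIDIA or AMD GPUs or on CPU, and keeps inference private because no data is sent to outside services.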