Text Generation WebUI Pod Scheduling: Management

Text Gen WebUI Scheduling

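This article covers scheduling Text Generation WebUI (Oobabooga) pods for LLM inference on Kubernetes: NVIDIA GPU nodes, Llama and Mistral models, the vLLM and TGI inference engines, GPTQ, AWQ, and GGUF quantization, and autoscaling with HPA and a cluster autoscaler.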

Inference Engine	Throughput	Latency	GPU Memory	Best for
vLLM	Very high	Low	Medium	Production API
TGI (HuggingFace)	High	Low	Medium	Production API
Oobabooga WebUI	Medium	Medium	High	Dev/Testing
llama.cpp (GGUF)	Medium	Medium	Low	CPU/Edge

Kubernetes GPU Setup

# === GPU Pod Scheduling ===

# Install NVIDIA GPU Operator
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace

# Pod with GPU: Text Generation WebUI
apiVersion: apps/v1
kind: Deployment
metadata:
  name:                      # missing in the source; set one before applying
spec:
  replicas: 2
  selector:
    matchLabels:
      app: text-gen-webui
  template:
    metadata:
      labels:
        app: text-gen-webui
    spec:
      nodeSelector:
        accelerator: nvidia-a100
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: webui
          image: atinoda/text-generation-webui:latest
          resources:
            requests:
              cpu: "4"
              memory: "16Gi"
              nvidia.com/gpu: "1"
            limits:
              cpu: "8"
              memory: "32Gi"
              nvidia.com/gpu: "1"
          ports:
            - containerPort: 7860
          volumeMounts:
            - name: models
              mountPath: /app/models
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: llm-models-pvc
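The nodeSelector and tolerations in the Deployment act as two filters: the node must carry the requested label, and every blocking taint on the node must be tolerated by the pod. The following is a minimal sketch of that filtering logic, not the real scheduler. The `Node`, `Taint`, and `Toleration` classes, the `nvidia-t4` label value, and the node names are assumptions made for illustration, and the special case of an empty toleration key is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Taint:
    key: str
    effect: str          # "NoSchedule", "PreferNoSchedule", or "NoExecute"
    value: str = ""


@dataclass
class Toleration:
    key: str
    operator: str = "Equal"   # "Exists" ignores the value
    value: str = ""
    effect: str = ""          # an empty effect matches any effect


@dataclass
class Node:
    name: str
    labels: dict = field(default_factory=dict)
    taints: list = field(default_factory=list)


def tolerates(tol: Toleration, taint: Taint) -> bool:
    """Return True if this toleration matches the given taint."""
    if tol.key != taint.key:
        return False
    if tol.effect and tol.effect != taint.effect:
        return False
    if tol.operator == "Exists":
        return True
    return tol.value == taint.value


def is_schedulable(node: Node, node_selector: dict, tolerations: list) -> bool:
    """Apply the nodeSelector filter, then the taint filter."""
    labels_ok = all(node.labels.get(k) == v for k, v in node_selector.items())
    # PreferNoSchedule is only a soft preference, so it never blocks here.
    blocking = [t for t in node.taints if t.effect in ("NoSchedule", "NoExecute")]
    taints_ok = all(any(tolerates(tol, t) for tol in tolerations) for t in blocking)
    return labels_ok and taints_ok


# Hypothetical nodes; the labels and taints are assumed, not taken from a cluster.
a100 = Node("gpu-node-01", {"accelerator": "nvidia-a100"},
            [Taint("nvidia.com/gpu", "NoSchedule")])
t4 = Node("gpu-node-04", {"accelerator": "nvidia-t4"},
          [Taint("nvidia.com/gpu", "NoSchedule")])
cpu = Node("cpu-node-01")

selector = {"accelerator": "nvidia-a100"}
tols = [Toleration(key="nvidia.com/gpu", operator="Exists", effect="NoSchedule")]

for node in (a100, t4, cpu):
    print(node.name, is_schedulable(node, selector, tols))
```

Only the A100 node passes both filters. Without the toleration, the same pod would be rejected by the A100 node as well, which is how a taint keeps non-GPU workloads off GPU nodes.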

# Pod anti-affinity: prefer placing GPU pods on different nodes
# (this block belongs under spec.template.spec of the Deployment)
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: text-gen-webui
          topologyKey: kubernetes.io/hostname
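Because the rule is `preferred` rather than `required`, it only lowers a node's score and never blocks placement. A rough sketch of that idea follows, assuming each node is its own topology domain (as `kubernetes.io/hostname` implies). The `cluster` mapping and both function names are hypothetical, and the real scheduler combines this score with many others.

```python
def anti_affinity_penalty(pods_on_node, match_labels, weight=100):
    """Penalty grows with the number of pods on the node that match the selector."""
    matching = sum(
        1 for labels in pods_on_node
        if all(labels.get(k) == v for k, v in match_labels.items())
    )
    return weight * matching


def pick_node(cluster, match_labels, weight=100):
    """Choose the node with the lowest penalty; ties go to the first node listed."""
    return min(
        cluster,
        key=lambda name: anti_affinity_penalty(cluster[name], match_labels, weight),
    )


# Hypothetical cluster state: node name -> label sets of the pods already running there.
cluster = {
    "gpu-node-01": [{"app": "text-gen-webui"}],
    "gpu-node-02": [{"app": "other-service"}],
    "gpu-node-03": [{"app": "text-gen-webui"}, {"app": "text-gen-webui"}],
}

print(pick_node(cluster, {"app": "text-gen-webui"}))
```

If every node already runs a matching pod, `pick_node` still returns the least crowded one, which mirrors the soft behavior of the preferred rule.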

from dataclasses import dataclass

@dataclass
class GPUNode:
    name: str             # required by the constructor calls and by n.name below
    gpu_type: str
    gpu_count: int
    vram_gb: int
    pods_running: int
    gpu_util_pct: float
    status: str

nodes = [
    GPUNode("gpu-node-01", "A100 80GB", 4, 320, 3, 72, "Ready"),
    GPUNode("gpu-node-02", "A100 80GB", 4, 320, 4, 85, "Ready"),
    GPUNode("gpu-node-03", "A10G 24GB", 4, 96, 2, 55, "Ready"),
    GPUNode("gpu-node-04", "T4 16GB", 4, 64, 4, 90, "Ready"),
    GPUNode("gpu-node-05", "A100 80GB", 4, 320, 0, 0, "Scaling Up"),
]

print("=== GPU Nodes ===")
total_gpus = 0
total_used = 0
for n in nodes:
    total_gpus += n.gpu_count
    # Each pod requests one GPU, so the pod count equals the GPUs in use.
    total_used += n.pods_running
    print(f"  [{n.status}] {n.name}")
    print(f"    GPU: {n.gpu_type} x{n.gpu_count} | VRAM: {n.vram_gb}GB")
    print(f"    Pods: {n.pods_running} | Util: {n.gpu_util_pct}%")

print(f"\n  Total GPUs: {total_gpus} | Used: {total_used}")
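The node listing above can also answer a placement question: which node should receive the next one-GPU pod? A small sketch follows, assuming one GPU per pod (as in the Deployment) and a hypothetical policy of preferring the Ready node with the lowest utilization. The `pick_node_for_pod` function and its policy are illustrative, not how the Kubernetes scheduler decides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GPUNode:
    name: str
    gpu_type: str
    gpu_count: int
    pods_running: int
    gpu_util_pct: float
    status: str

    @property
    def free_gpus(self) -> int:
        # Assumes each running pod holds exactly one GPU.
        return self.gpu_count - self.pods_running


def pick_node_for_pod(nodes, gpus_needed=1, gpu_type=None) -> Optional[GPUNode]:
    """Return the least-utilized Ready node with enough free GPUs, or None."""
    candidates = [
        n for n in nodes
        if n.status == "Ready"
        and n.free_gpus >= gpus_needed
        and (gpu_type is None or n.gpu_type == gpu_type)
    ]
    return min(candidates, key=lambda n: n.gpu_util_pct, default=None)


# Same figures as the node list above (the VRAM column is omitted here).
nodes = [
    GPUNode("gpu-node-01", "A100 80GB", 4, 3, 72, "Ready"),
    GPUNode("gpu-node-02", "A100 80GB", 4, 4, 85, "Ready"),
    GPUNode("gpu-node-03", "A10G 24GB", 4, 2, 55, "Ready"),
    GPUNode("gpu-node-04", "T4 16GB", 4, 4, 90, "Ready"),
    GPUNode("gpu-node-05", "A100 80GB", 4, 0, 0, "Scaling Up"),
]

any_gpu = pick_node_for_pod(nodes)
a100_only = pick_node_for_pod(nodes, gpu_type="A100 80GB")
print(any_gpu.name if any_gpu else "no capacity")
print(a100_only.name if a100_only else "no capacity")
```

The node that is still scaling up is skipped even though all of its GPUs are free, because it cannot accept pods yet.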


Inference Optimization

# === LLM Inference Optimization ===

# vLLM — High-throughput Serving
# pip install vllm
# python -m vllm.entrypoints.openai.api_server \
#   --model mistralai/Mistral-7B-Instruct-v0.2 \
#   --gpu-memory-utilization 0.9 \
#   --max-model-len 8192 \
#   --tensor-parallel-size 1

# Docker — vLLM Server
# docker run --gpus all -p 8000:8000 \
#   vllm/vllm-openai:latest \
#   --model TheBloke/Mistral-7B-Instruct-v0.2-GPTQ \
#   --quantization gptq \
#   --max-model-len 4096

# Quantization Comparison
# Model: Llama-2-13B
# FP16:    26GB VRAM, 100% quality
# GPTQ-4:  8GB VRAM,  97% quality, 2x faster
# AWQ-4:   8GB VRAM,  97% quality, 2x faster
# GGUF-Q4: 8GB VRAM,  95% quality, CPU possible

from dataclasses import dataclass  # needed when this block runs on its own

@dataclass
class ModelConfig:
    model: str
    quantization: str
    vram_gb: float
    tokens_per_sec: int
    quality_pct: float
    max_context: int

configs = [
    ModelConfig("Llama-2-70B", "FP16", 140, 25, 100, 4096),
    ModelConfig("Llama-2-70B", "GPTQ-4bit", 40, 45, 97, 4096),
    ModelConfig("Mistral-7B", "FP16", 14, 80, 100, 32768),
    ModelConfig("Mistral-7B", "GPTQ-4bit", 5, 120, 97, 32768),
    ModelConfig("Mistral-7B", "GGUF-Q4", 5, 40, 95, 32768),
    ModelConfig("Phi-3-mini", "FP16", 8, 100, 100, 128000),
]

print("\n=== Model Configurations ===")
for c in configs:
    print(f"  [{c.model}] {c.quantization}")
    print(f"    VRAM: {c.vram_gb}GB | Speed: {c.tokens_per_sec} tok/s | Quality: {c.quality_pct}%")
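The FP16 figures above follow from simple arithmetic: weight memory is roughly the parameter count times the bytes per weight. The sketch below reproduces that estimate. The `overhead_fraction` parameter is an assumption standing in for KV cache, activations, and runtime buffers, which vary with context length and batch size, so the 4-bit results land somewhat below the table values.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits_on_gpu(params_billion: float, bits_per_weight: float,
                gpu_vram_gb: float, overhead_fraction: float = 0.2) -> bool:
    """Rough check; overhead_fraction is an assumed allowance, not a measured value."""
    needed = weight_memory_gb(params_billion, bits_per_weight) * (1 + overhead_fraction)
    return needed <= gpu_vram_gb


for name, params, bits in [
    ("Llama-2-13B FP16", 13, 16),
    ("Llama-2-13B 4-bit", 13, 4),
    ("Llama-2-70B FP16", 70, 16),
    ("Mistral-7B FP16", 7, 16),
]:
    print(f"{name}: {weight_memory_gb(params, bits):.1f} GB weights")

print("13B 4-bit on a 24 GB GPU:", fits_on_gpu(13, 4, 24))
print("70B FP16 on an 80 GB GPU:", fits_on_gpu(70, 16, 80))
```

The 13B, 70B, and 7B FP16 results match the 26 GB, 140 GB, and 14 GB figures quoted in this section, which suggests those figures count weights only.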

Autoscaling

# === HPA + Cluster Autoscaler ===

# HPA: scale on GPU utilization
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name:                      # missing in the source
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name:                    # missing in the source; must name the target Deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name:              # missing in the source; a per-pod GPU utilization metric
        target:
          type: AverageValue
          averageValue: "70"
    - type: Pods
      pods:
        metric:
          name:              # missing in the source; the tips below suggest queue length
        target:
          type: AverageValue
          averageValue: "5"
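For each metric, the HPA computes the desired replica count as ceil(currentReplicas × currentValue / targetValue), skips the change when the ratio is within a tolerance of 1.0 (0.1 by default), and takes the largest result across metrics. Here is a minimal sketch of that calculation using the targets from the manifest (70 and 5). The observed values in the example are made up, and stabilization windows and scaling policies are ignored.

```python
import math


def desired_replicas(current: int, current_value: float, target_value: float,
                     min_replicas: int = 2, max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """Desired replica count for one metric, clamped to the HPA bounds."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current  # close enough to the target: no scaling
    return max(min_replicas, min(max_replicas, math.ceil(current * ratio)))


def desired_for_all(current: int, observations) -> int:
    """With several metrics, the HPA uses the largest desired count."""
    return max(desired_replicas(current, value, target) for value, target in observations)


# Hypothetical observations: (current per-pod average, target average).
gpu_util = (90, 70)      # average GPU utilization vs. target 70
queue_len = (8, 5)       # average queue length vs. target 5

print(desired_replicas(4, *gpu_util))
print(desired_replicas(4, *queue_len))
print(desired_for_all(4, [gpu_util, queue_len]))
print(desired_replicas(6, 72, 70))
```

With six replicas at 72% average utilization, the ratio is about 1.03, inside the tolerance, so the replica count stays at six.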

# Karpenter: GPU node provisioner
# Note: the karpenter.sh/v1alpha5 Provisioner API was replaced by NodePool in later Karpenter releases.
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name:                      # missing in the source
spec:
  requirements:
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["p3.2xlarge", "g5.xlarge", "g5.2xlarge"]
    - key: nvidia.com/gpu
      operator: Exists
  limits:
    resources:
      nvidia.com/gpu: "20"
  ttlSecondsAfterEmpty: 300

scaling_metrics = {
    "Active Pods": "6 / 10 max",
    "GPU Utilization (avg)": "72%",
    "Request Queue": "3 pending",
    "Tokens/sec (total)": "480",
    "Concurrent Users": "25",
    "Avg Latency (TTFT)": "450ms",
    "GPU Nodes": "4 active, 1 scaling",
    "Monthly GPU Cost": "$8,500",
}

print("Autoscaling Dashboard:")
for k, v in scaling_metrics.items():
    print(f"  {k}: {v}")

tips = [
    "Quantization: use GPTQ/AWQ to cut VRAM by 50-75%",
    "vLLM: PagedAttention raises throughput 3-5x",
    "Batching: continuous batching merges requests",
    "Karpenter: auto-provision GPU nodes on demand",
    "Spot Instance: use Spot for non-critical inference",
    "Model Cache: keep models on a PVC to avoid repeated downloads",
    "DCGM: monitor GPU health, temperature, and memory",
]

print("\n\nOptimization Tips:")
for i, t in enumerate(tips, 1):
    print(f"  {i}. {t}")

Tips

vLLM: Use vLLM for production inference; it has the highest throughput of the engines compared above.
Quantize: GPTQ/AWQ cut VRAM by 50-75% while keeping about 97% of the quality.
Taint: Taint GPU nodes to keep non-GPU pods off them.
PVC: Store models on a PVC so they are not downloaded every time.
HPA: Scale on GPU utilization plus queue length.
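The vLLM server from the Inference Optimization section (`vllm.entrypoints.openai.api_server`, port 8000) exposes an OpenAI-compatible HTTP API, so a client needs only the standard library. The sketch below builds a request for the `/v1/completions` endpoint. The `build_completion_request` and `complete` helpers, the `localhost` base URL, and the sampling defaults are assumptions for illustration, and the network call runs only when `complete` is invoked against a running server.

```python
import json
import urllib.request


def build_completion_request(prompt: str,
                             model: str = "mistralai/Mistral-7B-Instruct-v0.2",
                             base_url: str = "http://localhost:8000",
                             max_tokens: int = 128,
                             temperature: float = 0.7) -> urllib.request.Request:
    """Build (but do not send) a POST request for the completions endpoint."""
    payload = {
        "model": model,          # must match the --model the server was started with
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, **kwargs) -> str:
    """Send the request and return the generated text; requires a running server."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


# Inspect the request without contacting any server.
req = build_completion_request("Explain pod anti-affinity in one sentence.")
print(req.get_method(), req.full_url)
```

Inside the cluster, `base_url` would point at the Service in front of the inference pods rather than `localhost`.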

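What is Text Generation WebUI?

Text Generation WebUI (Oobabooga) is an open-source web UI for running LLMs such as Llama and Mistral. It loads GGUF, GPTQ, and AWQ models, offers Chat, Notebook, and API modes, and runs on NVIDIA or AMD GPUs or on CPU. It runs privately, so no data is sent out.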
