Text Generation WebUI Progressive Delivery

What Text Generation WebUI Is and How to Use It

Text Generation WebUI (often called oobabooga) is an open-source web interface for running Large Language Models (LLMs) on your own machine. It supports many model families, including LLaMA, Mistral, Phi, Gemma, and Qwen. The UI is easy to use and resembles ChatGPT, but because it runs locally, your data is not sent to external services.

It supports several inference backends: llama.cpp for running GGUF models on CPU/GPU, ExLlamaV2 for fast inference of GPTQ/EXL2 models on NVIDIA GPUs, Transformers as the standard Hugging Face backend, and AutoGPTQ for quantized models.

Progressive Delivery is an approach to deploying software by releasing it gradually, one group of users at a time. It starts with a canary deployment that sends 1-5% of traffic to the new model, checks the metrics, and then increases traffic step by step until it reaches 100%. Applied to LLMs, it lets you deploy a new model version without affecting every user at once.
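
The staged traffic increase described above can be sketched as a small generator. This is only an illustration: the function name `rollout_stages`, the 5% starting point, and the doubling factor are assumptions chosen for the example, not values prescribed by any tool.

```python
def rollout_stages(start_percent=5, factor=2):
    """Yield canary traffic percentages, growing geometrically and ending at 100."""
    if not 0 < start_percent <= 100:
        raise ValueError("start_percent must be in (0, 100]")
    if factor <= 1:
        raise ValueError("factor must be greater than 1")
    percent = start_percent
    while percent < 100:
        yield percent
        percent = percent * factor
    # The final stage is always a full rollout
    yield 100


if __name__ == "__main__":
    print(list(rollout_stages()))  # [5, 10, 20, 40, 80, 100]
```

Each stage would be held until the metrics look healthy before moving to the next one; a slower factor or a smaller starting percentage trades rollout speed for a smaller blast radius.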

Combining Text Generation WebUI with Progressive Delivery makes it possible to test a new model with real users in a controlled way, measuring metrics such as response quality, latency, token throughput, and user satisfaction before a full rollout.
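
One way to turn those metrics into a go/no-go decision is a simple promotion gate that compares the canary against the current model. The sketch below is hypothetical: `ModelStats`, `should_promote`, and all threshold values are assumptions made for illustration and would need tuning for a real workload.

```python
from dataclasses import dataclass


@dataclass
class ModelStats:
    success_rate: float       # percent of successful requests, 0-100
    p99_latency: float        # seconds
    tokens_per_second: float  # average generation throughput


def should_promote(baseline, canary, min_success_rate=99.0,
                   max_latency_ratio=1.2, min_throughput_ratio=0.9):
    """Return (ok, reasons); ok is True only if the canary passes every check."""
    reasons = []
    if canary.success_rate < min_success_rate:
        reasons.append("success rate below threshold")
    if canary.p99_latency > baseline.p99_latency * max_latency_ratio:
        reasons.append("p99 latency regressed")
    if canary.tokens_per_second < baseline.tokens_per_second * min_throughput_ratio:
        reasons.append("token throughput regressed")
    return (not reasons, reasons)


if __name__ == "__main__":
    baseline = ModelStats(success_rate=99.8, p99_latency=2.0, tokens_per_second=30.0)
    canary = ModelStats(success_rate=99.5, p99_latency=2.2, tokens_per_second=29.0)
    print(should_promote(baseline, canary))  # (True, [])
```

Returning the list of failed checks, rather than a bare boolean, makes it easier to log why a rollout was halted.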


Installing Text Generation WebUI on a Local Machine

Install Text Generation WebUI with API mode for production use.

# Clone repository
git clone https://github.com/oobabooga/text-generation-webui.git
cd text-generation-webui

# Install on Linux (NVIDIA GPU)
./start_linux.sh

# Install on Windows
# start_windows.bat

# Manual installation (Docker)
cat > docker-compose.yml << 'EOF'
version: '3.8'
services:
  textgen:
    image: atinoda/text-generation-webui:default-nvidia
    container_name: textgen
    environment:
      - EXTRA_LAUNCH_ARGS=--listen --api --api-port 5000 --extensions openai
    ports:
      - "7860:7860"   # WebUI
      - "5000:5000"   # API
      - "5001:5001"   # OpenAI-compatible API
    volumes:
      - ./models:/app/models
      - ./characters:/app/characters
      - ./loras:/app/loras
      - ./presets:/app/presets
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped
EOF

docker compose up -d

# Download a model (example: Mistral 7B GGUF) into the models/ directory
wget -P models "https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf"

# Or download it with huggingface-cli
pip install huggingface-hub
huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.2-GGUF \
  mistral-7b-instruct-v0.2.Q4_K_M.gguf \
  --local-dir ./models/

# Start the WebUI with the API enabled (run from the repository root)
python server.py --listen --api --api-port 5000 \
  --model mistral-7b-instruct-v0.2.Q4_K_M.gguf \
  --n-gpu-layers 35 \
  --ctx-size 4096 \
  --threads 8

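# Test the API
# Note: /api/v1 is the legacy API path; newer releases may expose only the
# OpenAI-compatible /v1 routes, so check the documentation for your version.
curl -s http://localhost:5000/api/v1/model | python3 -m json.tool
# {"result": "mistral-7b-instruct-v0.2.Q4_K_M"}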

Configuring Model Loading and Quantization

Set the model parameters for the best performance.


#!/usr/bin/env python3
# model_config.py: model configuration for Text Generation WebUI
import requests
import json

API_URL = "http://localhost:5000/api/v1"

# List the available models
def list_models():
    r = requests.get(f"{API_URL}/model")
    print(f"Current model: {r.json()['result']}")
    
    r = requests.post(f"{API_URL}/internal/model/list")
    models = r.json().get("model_names", [])
    for m in models:
        print(f"  - {m}")

# Load a model with parameters
def load_model(model_name, params=None):
    default_params = {
        "loader": "llama.cpp",
        "n_gpu_layers": 35,       # number of layers offloaded to the GPU
        "n_ctx": 4096,            # context length
        "threads": 8,             # CPU threads
        "n_batch": 512,           # batch size for prompt processing
        "rope_freq_base": 0,      # RoPE frequency base
        "no_mmap": False,         # disable memory mapping
        "mlock": False,           # lock model in RAM
        "flash_attn": True,       # Flash Attention (faster)
    }
    
    if params:
        default_params.update(params)
    
    payload = {
        "model_name": model_name,
        "args": default_params,
    }
    
    r = requests.post(f"{API_URL}/internal/model/load", json=payload)
    print(f"Load model: {r.json()}")

# Generation parameters
def generate(prompt, max_tokens=512, temperature=0.7):
    payload = {
        "prompt": prompt,
        "max_new_tokens": max_tokens,
        "temperature": temperature,
        "top_p": 0.9,
        "top_k": 40,
        "repetition_penalty": 1.15,
        "do_sample": True,
        "seed": -1,               # -1 selects a random seed
        "stop": ["\n\nUser:", "\n\nHuman:"],  # stop strings must be non-empty
    }
    
    r = requests.post(f"{API_URL}/generate", json=payload)
    r.raise_for_status()          # surface HTTP errors instead of failing on parsing
    result = r.json()
    return result["results"][0]["text"]

# OpenAI-compatible API (port 5001)
def generate_openai(messages, model="default", max_tokens=512):
    payload = {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": 0.7,
        "stream": False,
    }
    
    r = requests.post("http://localhost:5001/v1/chat/completions", json=payload)
    return r.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    list_models()
    
    # Test generation
    response = generate("What is machine learning? Explain briefly.")
    print(f"\nResponse:\n{response}")
    
    # Test the OpenAI-compatible API
    response = generate_openai([
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain Docker in 3 sentences."}
    ])
    print(f"\nOpenAI API Response:\n{response}")

Progressive Delivery for LLM Deployment

Build a Progressive Delivery system that rolls out LLM models gradually.

#!/usr/bin/env python3
# progressive_delivery.py: Progressive Delivery for LLMs
import random
import time
import json
import requests
from datetime import datetime
from collections import defaultdict

class LLMProgressiveDelivery:
    def __init__(self):
        self.models = {}
        self.traffic_weights = {}
        self.metrics = defaultdict(lambda: defaultdict(list))

    def register_model(self, name, endpoint, weight=0):
        self.models[name] = {"endpoint": endpoint, "healthy": True}
        self.traffic_weights[name] = weight

    def set_traffic(self, weights):
        total = sum(weights.values())
        if abs(total - 100) > 0.01:
            raise ValueError(f"Weights must sum to 100, got {total}")
        self.traffic_weights = weights

    def route_request(self, request_id=None):
        healthy_models = {
            k: v for k, v in self.traffic_weights.items()
            if self.models[k]["healthy"] and v > 0
        }
        
        if not healthy_models:
            raise RuntimeError("No healthy models available")
        
        # Weighted random selection
        total = sum(healthy_models.values())
        r = random.uniform(0, total)
        cumulative = 0
        for model_name, weight in healthy_models.items():
            cumulative += weight
            if r <= cumulative:
                return model_name
        
        return list(healthy_models.keys())[-1]

    def generate(self, prompt, request_id=None):
        model_name = self.route_request(request_id)
        endpoint = self.models[model_name]["endpoint"]
        
        start_time = time.time()
        try:
            r = requests.post(f"{endpoint}/v1/chat/completions", json={
                "model": "default",
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": 256,
                "temperature": 0.7,
            }, timeout=30)
            r.raise_for_status()  # treat HTTP 4xx/5xx responses as failures
            
            latency = time.time() - start_time
            response = r.json()
            text = response["choices"][0]["message"]["content"]
            # Fall back to a whitespace word count if the server omits usage data
            tokens = response.get("usage", {}).get("completion_tokens", len(text.split()))
            
            # Record metrics
            self.metrics[model_name]["latency"].append(latency)
            self.metrics[model_name]["tokens"].append(tokens)
            self.metrics[model_name]["success"].append(1)
            
            return {"model": model_name, "text": text, "latency": latency, "tokens": tokens}
            
        except (requests.RequestException, ValueError, KeyError, IndexError):
            # Network error, HTTP error, or malformed response body
            latency = time.time() - start_time
            self.metrics[model_name]["latency"].append(latency)
            self.metrics[model_name]["success"].append(0)
            self.models[model_name]["healthy"] = False
            
            # Fall back to another model; route_request raises RuntimeError
            # once no healthy model is left, which ends the recursion
            return self.generate(prompt, request_id)

    def get_metrics_summary(self):
        summary = {}
        for model_name, m in self.metrics.items():
            latencies = m["latency"]
            successes = m["success"]
            tokens = m["tokens"]
            
            summary[model_name] = {
                "requests": len(latencies),
                "success_rate": sum(successes) / max(len(successes), 1) * 100,
                "avg_latency": sum(latencies) / max(len(latencies), 1),
                "p99_latency": sorted(latencies)[int(len(latencies) * 0.99)] if latencies else 0,
                "avg_tokens": sum(tokens) / max(len(tokens), 1),
                "traffic_weight": self.traffic_weights.get(model_name, 0),
            }
        return summary

    def print_status(self):
        summary = self.get_metrics_summary()
        print(f"\n{'='*60}")
        print(f"Progressive Delivery Status — {datetime.now().strftime('%H:%M:%S')}")
        print(f"{'='*60}")
        for name, s in summary.items():
            healthy = "OK" if self.models[name]["healthy"] else "DOWN"
            print(f"\n[{healthy}] {name} (traffic: {s['traffic_weight']}%)")
            print(f"  Requests: {s['requests']} | Success: {s['success_rate']:.1f}%")
            print(f"  Avg Latency: {s['avg_latency']:.2f}s | P99: {s['p99_latency']:.2f}s")
            print(f"  Avg Tokens: {s['avg_tokens']:.0f}")

# Usage example
pd = LLMProgressiveDelivery()
pd.register_model("mistral-v1", "http://llm-v1:5001", weight=90)
pd.register_model("mistral-v2", "http://llm-v2:5001", weight=10)

# Canary: 90/10
pd.set_traffic({"mistral-v1": 90, "mistral-v2": 10})

# Once the metrics look good, increase to 50/50
# pd.set_traffic({"mistral-v1": 50, "mistral-v2": 50})

# Full rollout
# pd.set_traffic({"mistral-v1": 0, "mistral-v2": 100})
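
The `route_request` method above accepts a `request_id` but picks a model at random on every call, so the same user can bounce between versions. A possible alternative is hash-based sticky routing, sketched below. The function name `sticky_route` and the use of SHA-256 are assumptions made for this example; they are not part of the class above.

```python
import hashlib


def sticky_route(key, weights):
    """Map a stable key (for example a user ID) to a model name by weight."""
    active = {name: w for name, w in weights.items() if w > 0}
    if not active:
        raise ValueError("at least one model needs a positive weight")
    total = sum(active.values())
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use the first 8 bytes as an integer and scale it into [0, total]
    point = int.from_bytes(digest[:8], "big") / 2**64 * total
    cumulative = 0
    # Sorted order keeps bucket boundaries stable between calls
    for name in sorted(active):
        cumulative += active[name]
        if point < cumulative:
            return name
    return sorted(active)[-1]


if __name__ == "__main__":
    weights = {"mistral-v1": 90, "mistral-v2": 10}
    # The same key always lands on the same model for a given weight map
    print(sticky_route("user-42", weights) == sticky_route("user-42", weights))  # True
```

Because names are processed in sorted order and the canary sorts last here, raising its weight only extends its bucket downward, so users already on the canary stay on it as the rollout widens. That property depends on the naming in this example and would need an explicit ordering in general.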

Building an API Gateway and Load Balancing

Configure Nginx as the API gateway for the LLM services.


# nginx.conf: API gateway for LLM Progressive Delivery
upstream llm_canary {
    server llm-v1:5001 weight=9;    # 90% traffic
    server llm-v2:5001 weight=1;    # 10% traffic
}

upstream llm_v1 {
    server llm-v1:5001;
}

upstream llm_v2 {
    server llm-v2:5001;
}

# Rate limiting
limit_req_zone $binary_remote_addr zone=llm_limit:10m rate=10r/m;
limit_conn_zone $binary_remote_addr zone=llm_conn:10m;

server {
    listen 80;
    server_name llm-api.example.com;

    # Health check endpoint
    location /health {
        default_type application/json;
        return 200 '{"status": "ok"}';
    }

    # Canary routing (weighted)
    location /v1/chat/completions {
        limit_req zone=llm_limit burst=5 nodelay;
        limit_conn llm_conn 10;
        
        proxy_pass http://llm_canary;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_read_timeout 60s;
        proxy_send_timeout 60s;
        
        # Add model version header
        add_header X-Model-Version $upstream_addr always;
    }

    # Direct access to specific model version
    location /v1/model-v1/ {
        proxy_pass http://llm_v1/v1/;
    }

    location /v1/model-v2/ {
        proxy_pass http://llm_v2/v1/;
    }

    # Streaming support
    location /v1/chat/completions/stream {
        proxy_pass http://llm_canary/v1/chat/completions;
        proxy_set_header Connection '';
        proxy_http_version 1.1;
        chunked_transfer_encoding off;
        proxy_buffering off;
        proxy_cache off;
    }
}

# docker-compose.yml for a multi-model deployment
# version: '3.8'
# services:
#   llm-v1:
#     image: atinoda/text-generation-webui:default-nvidia
#     environment:
#       - EXTRA_LAUNCH_ARGS=--api --extensions openai --model mistral-v1.gguf
#     deploy:
#       resources:
#         reservations:
#           devices:
#             - driver: nvidia
#               device_ids: ['0']
#               capabilities: [gpu]
#
#   llm-v2:
#     image: atinoda/text-generation-webui:default-nvidia
#     environment:
#       - EXTRA_LAUNCH_ARGS=--api --extensions openai --model mistral-v2.gguf
#     deploy:
#       resources:
#         reservations:
#           devices:
#             - driver: nvidia
#               device_ids: ['1']
#               capabilities: [gpu]
#
#   nginx:
#     image: nginx:alpine
#     ports:
#       - "80:80"
#     volumes:
#       - ./nginx.conf:/etc/nginx/conf.d/default.conf
#     depends_on:
#       - llm-v1
#       - llm-v2

Monitoring and A/B Testing for LLM Models

Build a monitoring system to compare the performance of LLM models.

#!/usr/bin/env python3
# llm_monitor.py: LLM model monitoring and comparison
import time
import json
import requests
from datetime import datetime
from collections import defaultdict
import statistics

class LLMMonitor:
    def __init__(self):
        self.results = defaultdict(list)

    def benchmark(self, endpoint, model_name, prompts, max_tokens=256):
        print(f"\nBenchmarking: {model_name}")
        
        for prompt in prompts:
            start = time.time()
            try:
                r = requests.post(f"{endpoint}/v1/chat/completions", json={
                    "model": "default",
                    "messages": [{"role": "user", "content": prompt}],
                    "max_tokens": max_tokens,
                    "temperature": 0.7,
                }, timeout=60)
                
                r.raise_for_status()  # count HTTP errors as failures
                latency = time.time() - start
                data = r.json()
                text = data["choices"][0]["message"]["content"]
                usage = data.get("usage", {})
                
                self.results[model_name].append({
                    "prompt": prompt[:50],
                    "response_length": len(text),
                    "latency": latency,
                    "prompt_tokens": usage.get("prompt_tokens", 0),
                    "completion_tokens": usage.get("completion_tokens", 0),
                    "tokens_per_second": usage.get("completion_tokens", 0) / max(latency, 0.01),
                    "success": True,
                })
                print(f"  OK: {latency:.2f}s, {len(text)} chars")
                
            except Exception as e:
                self.results[model_name].append({
                    "prompt": prompt[:50],
                    "latency": time.time() - start,
                    "success": False,
                    "error": str(e),
                })
                print(f"  FAIL: {e}")

    def compare(self):
        print(f"\n{'='*70}")
        print(f"Model Comparison Report — {datetime.now().strftime('%Y-%m-%d %H:%M')}")
        print(f"{'='*70}")
        
        for model_name, results in self.results.items():
            successful = [r for r in results if r["success"]]
            failed = [r for r in results if not r["success"]]
            
            if not successful:
                print(f"\n{model_name}: All requests failed")
                continue
            
            latencies = [r["latency"] for r in successful]
            tps = [r["tokens_per_second"] for r in successful]
            
            print(f"\n{model_name}:")
            print(f"  Success Rate: {len(successful)}/{len(results)} ({len(successful)/len(results)*100:.0f}%)")
            print(f"  Avg Latency: {statistics.mean(latencies):.2f}s")
            print(f"  P50 Latency: {statistics.median(latencies):.2f}s")
            print(f"  P99 Latency: {sorted(latencies)[int(len(latencies)*0.99)]:.2f}s")
            print(f"  Avg Tokens/s: {statistics.mean(tps):.1f}")
            print(f"  Avg Response Length: {statistics.mean([r['response_length'] for r in successful]):.0f} chars")

TEST_PROMPTS = [
    "Explain the concept of microservices architecture.",
    "Write a Python function to calculate Fibonacci numbers.",
    "What are the best practices for database indexing?",
    "Explain the difference between TCP and UDP.",
    "How does garbage collection work in Java?",
]

if __name__ == "__main__":
    monitor = LLMMonitor()
    monitor.benchmark("http://localhost:5001", "mistral-v1", TEST_PROMPTS)
    monitor.benchmark("http://localhost:5002", "mistral-v2", TEST_PROMPTS)
    monitor.compare()
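
The P99 figure in the report above is taken by indexing into the sorted latencies. A more explicit variant is the nearest-rank percentile, sketched here with integer arithmetic to avoid floating-point rounding at the boundaries. The helper name `percentile_nearest_rank` is an assumption for this example and is not used by the monitor script.

```python
def percentile_nearest_rank(values, q):
    """Smallest value that has at least q percent of the samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not isinstance(q, int) or not 1 <= q <= 100:
        raise ValueError("q must be an integer from 1 to 100")
    ordered = sorted(values)
    rank = (q * len(ordered) + 99) // 100  # integer ceil(q * n / 100)
    return ordered[rank - 1]


if __name__ == "__main__":
    latencies = [1.2, 0.9, 3.4, 1.1, 1.0]
    # With only five samples, P99 is simply the slowest request
    print(percentile_nearest_rank(latencies, 99))  # 3.4
    print(percentile_nearest_rank(latencies, 50))  # 1.1
```

With a benchmark as small as the five test prompts, high percentiles collapse to the maximum, so tail-latency comparisons are only meaningful once many more requests have been collected.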

FAQ: Frequently Asked Questions

Q: What is the minimum GPU for running LLMs with Text Generation WebUI?

A: A 7B-parameter model with 4-bit quantization needs roughly 4-6 GB of VRAM, so a GTX 1660 or better works. A 13B model needs about 8-10 GB (RTX 3060/4060 class), and a 70B model needs 40 GB or more (A100 or RTX 6000 class). If VRAM is insufficient, llama.cpp can keep part of the model in CPU RAM and run only some layers on the GPU.
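
The VRAM figures above follow from simple arithmetic: the quantized weights take roughly parameters × bits per weight ÷ 8 bytes, plus some headroom for the KV cache and buffers. The sketch below is a rough estimate only; the function names, the flat `overhead_gb` allowance, and the equal-size-layer assumption are all simplifications introduced for this example.

```python
def estimate_vram_gb(params_billion, bits_per_weight=4.0, overhead_gb=1.5):
    """Rough VRAM need in GB: quantized weights plus a flat overhead allowance."""
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb


def gpu_layers_that_fit(total_layers, model_gb, free_vram_gb, overhead_gb=1.5):
    """Estimate how many layers fit on the GPU, assuming equally sized layers."""
    usable = max(free_vram_gb - overhead_gb, 0)
    per_layer = model_gb / total_layers
    return min(total_layers, int(usable / per_layer))


if __name__ == "__main__":
    print(estimate_vram_gb(7))                   # 5.0
    print(estimate_vram_gb(13))                  # 8.0
    print(estimate_vram_gb(70, overhead_gb=5))   # 40.0
    print(gpu_layers_that_fit(32, 4.0, 3.5))     # 16
```

The second helper gives a starting value for `--n-gpu-layers` when the model does not fit entirely in VRAM. Real usage varies with context length and quantization format, so the number should be adjusted after watching actual memory consumption.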


Q: How do GGUF, GPTQ, and EXL2 differ?

A: GGUF is the format for llama.cpp; it runs on both CPU and GPU and is very flexible. GPTQ is a quantization method that runs on GPU only, for example through AutoGPTQ. EXL2 is the format for ExLlamaV2, which is typically the fastest option on NVIDIA GPUs. For beginners, GGUF is recommended because it supports the widest range of hardware.

Q: Is Progressive Delivery necessary for LLMs?

A: It is very important for production LLMs, because a new model may have regressions that offline evaluation does not catch, such as worse answers in certain domains, higher latency, or inappropriate generated content. Progressive Delivery limits the impact by rolling out in steps of 10-20% with monitoring at each step.


Q: How do you measure the quality of LLM output in production?

A: Use several metrics together, such as user feedback (thumbs up/down), response latency, tokens per second, error rate, toxicity scores from a content filter, semantic similarity to reference answers, and engagement metrics such as follow-up questions. Do not rely on a single metric, because LLM quality is multi-faceted.
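
For the thumbs up/down signal mentioned above, raw approval rates can mislead when a canary has only received a handful of votes. One common remedy is the lower bound of the Wilson score interval, sketched below. The function name `wilson_lower_bound` and the default 95% confidence level (z = 1.96) are choices made for this example.

```python
import math


def wilson_lower_bound(positive, total, z=1.96):
    """Lower bound of the Wilson score interval for a proportion of positive votes."""
    if total == 0:
        return 0.0
    p = positive / total
    z2 = z * z
    centre = p + z2 / (2 * total)
    margin = z * math.sqrt(p * (1 - p) / total + z2 / (4 * total * total))
    # Clamp at zero to absorb tiny negative values from floating-point rounding
    return max(0.0, (centre - margin) / (1 + z2 / total))


if __name__ == "__main__":
    # Same 90% approval rate, but more votes give a tighter, higher lower bound
    print(round(wilson_lower_bound(90, 100), 3))
    print(round(wilson_lower_bound(900, 1000), 3))
```

Comparing the lower bounds of the old and new models, instead of their raw rates, makes the comparison more conservative while the canary is still at a low traffic share.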