Text Generation WebUI and SaaS Architecture

What Is Text Generation WebUI?

Text Generation WebUI is an open-source web interface, developed by oobabooga, for running Large Language Models (LLMs) on self-hosted hardware. It supports a wide range of models, including Llama 3, Mistral, Phi-3, Qwen, and Gemma, through several backends such as llama.cpp, ExLlamaV2, and Transformers. From a browser, users can chat with the model, tune generation parameters, switch models, and enable extensions.

Serving text generation to many users, whether inside an organization or as a SaaS product, requires an architecture that handles multi-tenancy, GPU resource management, authentication, rate limiting, and billing. That is considerably more complex than running a model on a personal machine.

SaaS Architecture Overview

API Gateway (Kong/Nginx): handles authentication, rate limiting, and load balancing
Auth Service: JWT tokens, API key management, tenant isolation
Request Queue (Redis/RabbitMQ): queues inference requests by priority
GPU Scheduler: allocates GPUs to each request according to the tenant's plan
Inference Workers (vLLM/TGI): run the actual LLM inference on GPU nodes
Model Registry: manages model versions and hot-swaps models
Usage Tracker: records token usage for billing
Billing Service: calculates charges from token usage
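
The Request Queue and GPU Scheduler above can be illustrated with a small in-memory sketch. It is not the article's implementation: the plan names, the `PLAN_PRIORITY` mapping, and the `PriorityRequestQueue` class are hypothetical, and a production system would back this with Redis or RabbitMQ rather than a Python heap.

```python
import heapq
import itertools
from dataclasses import dataclass, field

# Hypothetical plan-to-priority mapping; a lower number is served first.
PLAN_PRIORITY = {"enterprise": 0, "pro": 1, "free": 2}


@dataclass(order=True)
class QueuedRequest:
    priority: int
    seq: int  # tie-breaker: arrival order within the same priority
    tenant_id: str = field(compare=False)
    payload: dict = field(compare=False)


class PriorityRequestQueue:
    """In-memory stand-in for the Redis/RabbitMQ request queue."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def push(self, tenant_id, plan, payload):
        # Unknown plans fall back to the lowest priority.
        priority = PLAN_PRIORITY.get(plan, max(PLAN_PRIORITY.values()))
        item = QueuedRequest(priority, next(self._counter), tenant_id, payload)
        heapq.heappush(self._heap, item)

    def pop(self):
        """Return the next request a GPU worker should run, or None if empty."""
        return heapq.heappop(self._heap) if self._heap else None

    def __len__(self):
        return len(self._heap)


queue = PriorityRequestQueue()
queue.push("t-free", "free", {"prompt": "a"})
queue.push("t-ent", "enterprise", {"prompt": "b"})
queue.push("t-pro", "pro", {"prompt": "c"})
queue.push("t-ent2", "enterprise", {"prompt": "d"})

order = [queue.pop().tenant_id for _ in range(len(queue))]
print(order)  # ['t-ent', 't-ent2', 't-pro', 't-free']
```

Strict priority like this can starve lower plans under sustained load, so a real scheduler would likely add aging or per-plan quotas.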

Configuring vLLM as the Inference Backend

# docker-compose.yml for the Text Generation SaaS
version: "3.8"

services:
  # API Gateway
  kong:
    image: kong:3.5
    environment:
      KONG_DATABASE: "off"
      KONG_DECLARATIVE_CONFIG: /kong/kong.yml
      KONG_PROXY_LISTEN: "0.0.0.0:8000, 0.0.0.0:8443 ssl"
    volumes:
      - ./kong.yml:/kong/kong.yml
    ports:
      - "8000:8000"
      - "8443:8443"

  # Redis for the request queue and rate limiting
  redis:
    image: redis:7-alpine
    command: redis-server --maxmemory 2gb --maxmemory-policy allkeys-lru
    volumes:
      - redis_data:/data

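  # vLLM Inference Server (GPU Node 1)
  # --served-model-name matches the alias the API server forwards in the request body.
  # --api-key is interpolated from VLLM_API_KEY in the shell or a .env file.
  vllm-worker-1:
    image: vllm/vllm-openai:latest
    command: >
      --model meta-llama/Llama-3.1-8B-Instruct
      --served-model-name llama-3.1-8b
      --max-model-len 8192
      --gpu-memory-utilization 0.90
      --max-num-seqs 32
      --tensor-parallel-size 1
      --port 8080
      --api-key ${VLLM_API_KEY}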
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    environment:
      HUGGING_FACE_HUB_TOKEN: 

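  # vLLM Inference Server (GPU Node 2, a different model)
  # --served-model-name matches the alias the API server forwards in the request body.
  # --api-key is interpolated from VLLM_API_KEY in the shell or a .env file.
  vllm-worker-2:
    image: vllm/vllm-openai:latest
    command: >
      --model mistralai/Mistral-7B-Instruct-v0.3
      --served-model-name mistral-7b
      --max-model-len 8192
      --gpu-memory-utilization 0.90
      --max-num-seqs 32
      --port 8080
      --api-key ${VLLM_API_KEY}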
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    environment:
      HUGGING_FACE_HUB_TOKEN: 

  # Auth + API Service
  api-server:
    build: ./api-server
    environment:
      REDIS_URL: redis://redis:6379
      DATABASE_URL: postgresql://app:@postgres:5432/saas
      JWT_SECRET: 
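      VLLM_ENDPOINTS: "vllm-worker-1:8080, vllm-worker-2:8080"
      # Key the API server sends to the vLLM workers (the variable name is an assumption)
      VLLM_API_KEY: ${VLLM_API_KEY}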
    depends_on:
      - redis
      - postgres

  # PostgreSQL for user and billing data
  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: saas
      POSTGRES_USER: app
      POSTGRES_PASSWORD: 
    volumes:
      - postgres_data:/var/lib/postgresql/data

volumes:
  redis_data:
  postgres_data:

---
# kong.yml — API Gateway Configuration
_format_version: "3.0"

services:
  - name: llm-api
    url: http://api-server:3000
    routes:
      - name: llm-route
        paths:
          - /v1/chat/completions
          - /v1/completions
          - /v1/models
        strip_path: false

plugins:
  - name: rate-limiting
    config:
      minute: 60
      policy: redis
      redis_host: redis
      redis_port: 6379

  - name: key-auth
    config:
      key_names:
        - Authorization
        - X-API-Key

  - name: cors
    config:
      origins:
        - "*"
      methods:
        - GET
        - POST
        - OPTIONS
      headers:
        - Authorization
        - Content-Type

API Server and Multi-tenant Logic

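# api-server/main.py: FastAPI server for a multi-tenant LLM API
import asyncio
import json
import os
import time
from datetime import datetime

import httpx
import redis.asyncio as redis
from fastapi import Depends, FastAPI, Header, HTTPException
from pydantic import BaseModel

app = FastAPI(title="LLM SaaS API")

# Redis connection; REDIS_URL is set for this service in docker-compose.yml
redis_client = redis.from_url(
    os.environ.get("REDIS_URL", "redis://redis:6379"), decode_responses=True
)

# Key sent to the vLLM workers; it must match their --api-key value.
# Reading it from a VLLM_API_KEY environment variable is an assumption.
VLLM_API_KEY = os.environ.get("VLLM_API_KEY", "")

# vLLM endpoints, keyed by the model alias that clients send
VLLM_ENDPOINTS = {
    "llama-3.1-8b": "http://vllm-worker-1:8080",
    "mistral-7b": "http://vllm-worker-2:8080",
}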

class ChatRequest(BaseModel):
    model: str = "llama-3.1-8b"
    messages: list[dict]
    max_tokens: int = 1024
    temperature: float = 0.7
    stream: bool = False

class TenantInfo:
    def __init__(self, tenant_id, plan, rate_limit, token_limit):
        self.tenant_id = tenant_id
        self.plan = plan
        self.rate_limit = rate_limit
        self.token_limit = token_limit

async def authenticate(authorization: str = Header(...)) -> TenantInfo:
    """Validate the API key and load the tenant's info."""
    # Strip only a leading "Bearer " prefix, not every occurrence in the header
    api_key = authorization.removeprefix("Bearer ").strip()
    tenant_data = await redis_client.hgetall(f"apikey:{api_key}")
    if not tenant_data:
        raise HTTPException(status_code=401, detail="Invalid API key")
    return TenantInfo(
        tenant_id=tenant_data["tenant_id"],
        plan=tenant_data["plan"],
        rate_limit=int(tenant_data.get("rate_limit", 60)),
        token_limit=int(tenant_data.get("token_limit", 100000)),
    )

async def check_rate_limit(tenant: TenantInfo):
    """Enforce a per-minute, fixed-window rate limit."""
    key = f"ratelimit:{tenant.tenant_id}:{int(time.time()) // 60}"
    count = await redis_client.incr(key)
    await redis_client.expire(key, 120)
    if count > tenant.rate_limit:
        raise HTTPException(status_code=429, detail="Rate limit exceeded")

async def track_usage(tenant_id: str, model: str, prompt_tokens: int, completion_tokens: int):
    """Record token usage for billing."""
    today = datetime.now().strftime("%Y-%m-%d")
    usage_key = f"usage:{tenant_id}:{today}"
    await redis_client.hincrby(usage_key, "prompt_tokens", prompt_tokens)
    await redis_client.hincrby(usage_key, "completion_tokens", completion_tokens)
    await redis_client.hincrby(usage_key, "requests", 1)
    await redis_client.expire(usage_key, 86400 * 90)  # keep for 90 days

@app.post("/v1/chat/completions")
async def chat_completions(
    request: ChatRequest,
    tenant: TenantInfo = Depends(authenticate),
):
    await check_rate_limit(tenant)

    # Pick the vLLM endpoint for the requested model
    endpoint = VLLM_ENDPOINTS.get(request.model)
    if not endpoint:
        raise HTTPException(status_code=400, detail=f"Model {request.model} not available")

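    # A streamed (SSE) response cannot be parsed with resp.json(), so reject it here
    if request.stream:
        raise HTTPException(status_code=501, detail="Streaming is not supported")

    # Forward the request to vLLM
    async with httpx.AsyncClient(timeout=120.0) as client:
        resp = await client.post(
            f"{endpoint}/v1/chat/completions",
            json=request.model_dump(),
            headers={"Authorization": f"Bearer {VLLM_API_KEY}"},
        )

    # Surface upstream errors instead of returning them with a 200 status
    if resp.status_code != 200:
        raise HTTPException(status_code=resp.status_code, detail=resp.text)

    result = resp.json()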

    # Track Usage
    usage = result.get("usage", {})
    await track_usage(
        tenant.tenant_id,
        request.model,
        usage.get("prompt_tokens", 0),
        usage.get("completion_tokens", 0),
    )

    return result

@app.get("/v1/models")
async def list_models(tenant: TenantInfo = Depends(authenticate)):
    """List the models available to the tenant's plan."""
    models = list(VLLM_ENDPOINTS.keys())
    if tenant.plan == "free":
        models = [m for m in models if "8b" in m.lower()]
    return {"data": [{"id": m, "object": "model"} for m in models]}
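
The listing above loads a `token_limit` for each tenant but never checks it. A minimal sketch of how such a check could look follows. It assumes the limit is a daily quota (the article does not state the period), that the caller can estimate `prompt_tokens` before forwarding the request, and that `usage` has the same shape as the hash written by `track_usage`. The names `QuotaDecision` and `check_token_quota` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class QuotaDecision:
    allowed: bool
    remaining: int   # tokens left today before this request
    max_tokens: int  # completion budget to forward upstream (0 if rejected)


def check_token_quota(usage: dict, token_limit: int,
                      prompt_tokens: int, max_tokens: int) -> QuotaDecision:
    """Decide whether a request fits in the tenant's remaining daily tokens.

    `usage` mirrors the hash written by track_usage. Redis returns hash
    values as strings, so they are converted with int() here.
    """
    used = int(usage.get("prompt_tokens", 0)) + int(usage.get("completion_tokens", 0))
    remaining = max(token_limit - used, 0)
    budget = remaining - prompt_tokens
    if budget <= 0:
        return QuotaDecision(False, remaining, 0)
    # Clamp the completion budget instead of rejecting the request outright.
    return QuotaDecision(True, remaining, min(max_tokens, budget))


usage = {"prompt_tokens": "60000", "completion_tokens": "39500", "requests": "120"}

decision = check_token_quota(usage, token_limit=100000, prompt_tokens=200, max_tokens=1024)
print(decision)  # allowed, with max_tokens clamped to 300

exhausted = check_token_quota(usage, token_limit=100000, prompt_tokens=600, max_tokens=1024)
print(exhausted.allowed)  # False
```

In the handler, a rejected decision would map naturally to an HTTP 429 response, and the clamped `max_tokens` would replace the client's value before the request is forwarded.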

GPU Scheduling and Autoscaling

# kubernetes/gpu-autoscaler.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama
  namespace: llm-saas
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-llama
  template:
    metadata:
      labels:
        app: vllm-llama
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - "--model"
            - "meta-llama/Llama-3.1-8B-Instruct"
            - "--max-model-len"
            - "8192"
            - "--gpu-memory-utilization"
            - "0.90"
            - "--max-num-seqs"
            - "32"
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: 32Gi
            requests:
              nvidia.com/gpu: 1
              memory: 24Gi
          ports:
            - containerPort: 8000
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120
            periodSeconds: 10
      nodeSelector:
        gpu-type: a100
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-llama-hpa
  namespace: llm-saas
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-llama
  minReplicas: 1
  maxReplicas: 8
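  metrics:
    # Pods metrics need a custom metrics adapter (for example prometheus-adapter)
    # that exposes this per-pod series; vLLM itself exports it as vllm:num_requests_running.
    - type: Pods
      pods:
        metric:
          name: vllm_num_requests_running
        target:
          type: AverageValue
          averageValue: "20"
    # Utilization is a percentage of the container's CPU request, so the
    # Deployment must set resources.requests.cpu for this metric to work.
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80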
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 2
          periodSeconds: 120
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 300
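
To see how the HorizontalPodAutoscaler above would react to load, the sketch below reproduces the replica formula from the Kubernetes documentation: `ceil(currentReplicas * currentValue / targetValue)`, skipped when the ratio is within the default 10% tolerance. With several metrics, the largest proposal wins. The observed values are hypothetical, and the sketch ignores the `behavior` policies, which further limit how fast the count may change.

```python
import math


def desired_replicas(current_replicas: int, current_value: float,
                     target_value: float, tolerance: float = 0.1) -> int:
    """Replica count one metric asks for: ceil(current * value / target)."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to the target: no change
    return math.ceil(current_replicas * ratio)


def hpa_decision(current_replicas, metrics, min_replicas, max_replicas):
    """Take the largest proposal across metrics, then clamp to min/max."""
    proposals = [desired_replicas(current_replicas, cur, tgt) for cur, tgt in metrics]
    return max(min_replicas, min(max_replicas, max(proposals)))


# Hypothetical observation: 2 pods, 35 running requests per pod on average
# (target 20) and 40% CPU utilization (target 80%).
replicas = hpa_decision(2, [(35, 20), (40, 80)], min_replicas=1, max_replicas=8)
print(replicas)  # 4
```

Here the request metric asks for 4 replicas and the CPU metric for 1, so the autoscaler would move toward 4, which the `scaleUp` policy of 2 pods per 120 seconds permits in a single step.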

Monitoring and Cost Management

GPU Utilization: use NVIDIA DCGM Exporter with Prometheus to track GPU memory, compute utilization, and temperature
Request Metrics: track requests/sec, P50/P95/P99 latency, queue length, and error rate, broken down by model and tenant
Token Throughput: measure tokens/sec for both input and output to support capacity planning
Cost per Token: divide GPU cost plus infrastructure cost by total tokens served to get a per-token cost that informs pricing
Tenant Usage Dashboard: show token usage, request count, and cost per tenant per day
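
The cost-per-token calculation can be made concrete with a short sketch. All figures below (hourly GPU and infrastructure cost, throughput, utilization, margin) are hypothetical placeholders rather than measurements, and the function names are illustrative.

```python
def cost_per_million_tokens(gpu_cost_per_hour: float, infra_cost_per_hour: float,
                            tokens_per_second: float, utilization: float) -> float:
    """Hourly cost divided by the tokens actually served in an hour, per 1M tokens."""
    tokens_per_hour = tokens_per_second * 3600 * utilization
    if tokens_per_hour <= 0:
        raise ValueError("throughput and utilization must be positive")
    return (gpu_cost_per_hour + infra_cost_per_hour) / tokens_per_hour * 1_000_000


def price_with_margin(cost: float, margin: float) -> float:
    """Price that leaves `margin` (between 0 and 1) of revenue as gross margin."""
    return cost / (1.0 - margin)


# Hypothetical inputs: 3.00/hour GPU, 0.60/hour other infrastructure,
# 2000 tokens/sec peak throughput, GPUs busy 50% of the time.
cost = cost_per_million_tokens(3.00, 0.60, 2000, 0.5)
price = price_with_margin(cost, 0.5)
print(f"cost per 1M tokens: {cost:.2f}, price per 1M tokens: {price:.2f}")
```

Utilization matters as much as raw throughput: halving it doubles the cost per token, which is why the GPU and request metrics above feed directly into pricing.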

What Is Text Generation WebUI?

Text Generation WebUI is an open-source web interface for running LLMs locally. It supports models such as Llama, Mistral, and Phi through several backends and lets users chat with the model from a browser without sending data to the cloud, which makes it a good fit for organizations that need data privacy.
