LocalAI Production Setup
A self-hosted, OpenAI-API-compatible AI server for LLMs, speech-to-text, text-to-speech, and embeddings, with NVIDIA GPU acceleration, Docker/Kubernetes deployment, and full on-premise privacy.
| Feature | LocalAI | OpenAI API | Ollama |
|---|---|---|---|
| Cost | Free (own hardware) | Pay per token | Free (own hardware) |
| Privacy | 100% on-premise | Data sent to cloud | 100% on-premise |
| API compatibility | OpenAI format | OpenAI format | Custom + OpenAI |
| Models | Any GGUF + Stable Diffusion + Whisper | GPT-4, DALL-E, Whisper | Any GGUF |
| GPU support | NVIDIA CUDA + AMD ROCm | Cloud GPU | NVIDIA CUDA + Apple Metal |
| Production readiness | Docker, K8s, metrics | SaaS-ready | Docker, basic |
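The cost row can be made concrete with a back-of-envelope break-even calculation. The prices and hardware figures below are illustrative assumptions, not quoted rates:

```python
# Rough break-even: self-hosted GPU vs. pay-per-token API.
# All figures below are illustrative assumptions.

def breakeven_months(hw_cost_usd: float,
                     power_usd_per_month: float,
                     tokens_per_month: float,
                     api_usd_per_1m_tokens: float) -> float:
    """Months until self-hosting is cheaper than a per-token API."""
    api_monthly = tokens_per_month / 1_000_000 * api_usd_per_1m_tokens
    saving = api_monthly - power_usd_per_month
    if saving <= 0:
        return float("inf")  # API stays cheaper at this volume
    return hw_cost_usd / saving

# Example: $1500 GPU, $40/month power, 200M tokens/month, $2 per 1M tokens
months = breakeven_months(1500, 40, 200_000_000, 2.0)
print(f"break-even after ~{months:.1f} months")  # ~4.2 months
```

At low volume the result is `inf`: the API never costs more than the electricity, so self-hosting is justified by privacy rather than price.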
Installation & Configuration
# === LocalAI Docker Production Setup ===
# Docker Compose (docker-compose.yml)
# version: '3.8'
# services:
# localai:
# image: localai/localai:latest-gpu-nvidia-cuda-12
# ports:
# - "8080:8080"
# volumes:
# - ./models:/build/models
# - ./config:/build/config
# environment:
# - THREADS=8
# - CONTEXT_SIZE=4096
# - GALLERIES=[{"name":"model-gallery","url":"github:go-skynet/model-gallery/index.yaml"}]
# - API_KEY=your-secret-api-key
# deploy:
# resources:
# reservations:
# devices:
# - driver: nvidia
# count: all
# capabilities: [gpu]
# restart: always
# healthcheck:
# test: ["CMD", "curl", "-f", "http://localhost:8080/readyz"]
# interval: 30s
# timeout: 10s
# retries: 3
# Download Model
# curl -L "https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf" \
# -o models/llama2-7b-chat.gguf
# Model Config (config/llama2.yaml)
# name: llama2-chat
# backend: llama-cpp
# parameters:
# model: llama2-7b-chat.gguf
# temperature: 0.7
# top_p: 0.9
# gpu_layers: 99
# context_size: 4096
# flash_attention: true
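The config above pins `gpu_layers` and `context_size`; a rough VRAM estimate (weights + KV cache) helps check that a model fits before downloading. This is a simplified sketch assuming an fp16 KV cache and no grouped-query attention (GQA shrinks `kv_dim` on newer models), so treat the result as an upper-ballpark figure:

```python
def estimate_vram_gb(gguf_file_gb: float,
                     n_layers: int,
                     kv_dim: int,
                     context_size: int,
                     kv_bytes: int = 2,        # fp16 KV cache
                     overhead_gb: float = 0.5) -> float:
    """Weights (the GGUF file) + KV cache (K and V tensors) + fixed overhead."""
    kv_cache = 2 * n_layers * context_size * kv_dim * kv_bytes
    return gguf_file_gb + kv_cache / 1024**3 + overhead_gb

# Llama-2-7B Q4_K_M: ~4.1 GB file, 32 layers, kv_dim 4096 (no GQA), ctx 4096
print(f"~{estimate_vram_gb(4.1, 32, 4096, 4096):.1f} GB VRAM")  # ~6.6 GB VRAM
```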
from dataclasses import dataclass

@dataclass
class ModelConfig:
    model: str
    size: str
    vram: str
    speed: str
    quality: str
    use_case: str

models = [
    ModelConfig("Llama 3.1 8B Q4_K_M", "4.9GB", "5-6GB VRAM",
                "~30 tok/s (RTX 3090)",
                "Excellent (best open-source 8B)",
                "Chat, general purpose, Thai + English"),
    ModelConfig("Mistral 7B Q4_K_M", "4.4GB", "5GB VRAM",
                "~35 tok/s (RTX 3090)",
                "Good (fast with good quality)",
                "Chat, code analysis"),
    ModelConfig("Phi-3 Mini 3.8B Q4", "2.2GB", "3GB VRAM",
                "~50 tok/s (RTX 3090)",
                "Good (for its size)",
                "Edge devices, low VRAM, quick responses"),
    ModelConfig("Whisper Large V3", "3.1GB", "4GB VRAM",
                "~5x realtime",
                "Very accurate (best STT)",
                "Speech-to-text, transcription"),
    ModelConfig("nomic-embed-text", "274MB", "1GB VRAM",
                "~1000 docs/s",
                "Excellent (top embedding model)",
                "RAG, vector search, semantic search"),
]

print("=== Recommended Models ===")
for m in models:
    print(f"  [{m.model}] Size: {m.size} | VRAM: {m.vram}")
    print(f"    Speed: {m.speed}")
    print(f"    Quality: {m.quality}")
    print(f"    Use: {m.use_case}")
API Usage
# === OpenAI-compatible API Usage ===
# curl http://localhost:8080/v1/chat/completions \
# -H "Content-Type: application/json" \
# -H "Authorization: Bearer your-secret-api-key" \
# -d '{
# "model": "llama2-chat",
# "messages": [{"role": "user", "content": "Hello"}],
# "temperature": 0.7,
# "max_tokens": 500
# }'
# Python (OpenAI SDK)
# from openai import OpenAI
# client = OpenAI(base_url="http://localhost:8080/v1", api_key="your-key")
# response = client.chat.completions.create(
# model="llama2-chat",
# messages=[{"role": "user", "content": "Hello"}],
# temperature=0.7,
# max_tokens=500,
# )
# print(response.choices[0].message.content)
# Embeddings
# response = client.embeddings.create(
# model="nomic-embed-text",
# input=["Hello world", "How are you"]
# )
# STT (Whisper)
# audio = open("speech.wav", "rb")
# transcript = client.audio.transcriptions.create(
# model="whisper-large-v3", file=audio
# )
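The curl and SDK calls above send the same JSON body. A small helper that builds and validates it can catch malformed messages before they reach the server; field names follow the OpenAI chat-completions schema the examples already use:

```python
import json

def chat_payload(model: str, messages: list[dict],
                 temperature: float = 0.7, max_tokens: int = 500) -> str:
    """Build the JSON body for POST /v1/chat/completions."""
    for m in messages:
        if m.get("role") not in ("system", "user", "assistant"):
            raise ValueError(f"bad role: {m.get('role')!r}")
        if "content" not in m:
            raise ValueError("message missing 'content'")
    return json.dumps({
        "model": model,
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_tokens,
    })

body = chat_payload("llama2-chat", [{"role": "user", "content": "Hello"}])
print(body)
```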
@dataclass
class APIEndpoint:
    endpoint: str
    method: str
    model_type: str
    example: str

endpoints = [
    APIEndpoint("/v1/chat/completions", "POST",
                "LLM (Llama, Mistral, Phi)",
                "Chat, conversation, Q&A, summarization"),
    APIEndpoint("/v1/completions", "POST",
                "LLM (text completion)",
                "Text generation, code completion"),
    APIEndpoint("/v1/embeddings", "POST",
                "Embedding (nomic, all-MiniLM)",
                "RAG, vector search, semantic similarity"),
    APIEndpoint("/v1/audio/transcriptions", "POST",
                "Whisper (STT)",
                "Speech-to-text, meeting notes"),
    APIEndpoint("/v1/images/generations", "POST",
                "Stable Diffusion",
                "Image generation from a text prompt"),
    APIEndpoint("/readyz", "GET",
                "Health check",
                "Load balancer health probe"),
]

print("=== API Endpoints ===")
for e in endpoints:
    print(f"  [{e.method} {e.endpoint}] Model: {e.model_type}")
    print(f"    Use: {e.example}")
Production Monitoring
# === Production Monitoring ===
# Nginx Load Balancer
# upstream localai {
# server localai-1:8080;
# server localai-2:8080;
# server localai-3:8080;
# }
# server {
# listen 443 ssl;
# location / {
# proxy_pass http://localai;
# proxy_read_timeout 300s;
# }
# }
# Prometheus scrape
# scrape_configs:
# - job_name: 'localai'
# static_configs:
# - targets: ['localai-1:8080', 'localai-2:8080']
# metrics_path: /metrics
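Latency alerts like "P99 > 10 s" rest on a percentile computation. Below is a minimal nearest-rank sketch for raw samples; note that Prometheus's `histogram_quantile` interpolates within histogram buckets, so its results will differ slightly:

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile of raw samples (p in 0..100)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))  # 1-based rank
    return ordered[rank - 1]

latencies = [0.8, 1.2, 0.9, 11.5, 1.0, 0.7]   # seconds, illustrative
print(percentile(latencies, 99))
```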
@dataclass
class ProdMetric:
    metric: str
    source: str
    alert: str
    action: str

metrics = [
    ProdMetric("Request latency P99",
               "/metrics (localai_request_duration)",
               "> 10 seconds",
               "Scale GPU instances or reduce context size"),
    ProdMetric("GPU memory usage",
               "nvidia-smi / DCGM Exporter",
               "> 90% VRAM",
               "Use a smaller model or heavier quantization"),
    ProdMetric("Token generation speed",
               "/metrics (tokens_per_second)",
               "< 10 tok/s",
               "Check GPU load; increase gpu_layers"),
    ProdMetric("Error rate",
               "/metrics (localai_request_errors)",
               "> 1%",
               "Check model config and memory (OOM)"),
    ProdMetric("Queue length",
               "/metrics (localai_request_queue)",
               "> 10 pending",
               "Scale out instances; raise the parallel-requests limit"),
    ProdMetric("Health check",
               "/readyz endpoint",
               "Not 200 for 30s",
               "Auto-restart the container; alert the team"),
]

print("=== Production Metrics ===")
for m in metrics:
    print(f"  [{m.metric}] Source: {m.source}")
    print(f"    Alert: {m.alert}")
    print(f"    Action: {m.action}")
Tips
- Q4_K_M: use Q4_K_M quantization for the best balance between speed and quality
- Flash Attention: enable Flash Attention to reduce VRAM use and increase speed
- GPU layers: set gpu_layers: 99 to keep every layer on the GPU
- API key: always set an API key to prevent unauthorized access
- Health check: use /readyz for load balancer health probes
What is LocalAI?
An open-source, OpenAI-API-compatible AI server for LLMs, STT, TTS, embeddings, and image generation. It runs fully on-premise for privacy, loads GGUF models, deploys via Docker or Kubernetes, supports NVIDIA GPUs, and is free.
How do I install it?
Run the GPU image with Docker Compose (NVIDIA CUDA via nvidia-container-toolkit), as a binary, or on Kubernetes with Helm; download GGUF models (e.g. Q4_K_M quants) from Hugging Face and describe each one in a YAML config.
Which models are recommended?
Llama 3.1 8B and Mistral 7B for chat and code, Phi-3 Mini for low-VRAM setups, Whisper Large for STT, and nomic-embed-text for embeddings; Q4_K_M GGUF builds run in roughly 5GB VRAM at ~30 tok/s.
How do I set it up for production?
Use an RTX 3090 or A100 class GPU, put Nginx in front as a TLS load balancer, set an API key, enable Flash Attention, monitor with Prometheus and Grafana, probe /readyz with auto-restart, and scale on queue length.
Summary
LocalAI is a self-hosted, OpenAI-API-compatible server for LLMs, STT, TTS, and embeddings, running GGUF models on GPUs, deployed with Docker or Kubernetes and monitored with Prometheus: production-ready, with on-premise privacy that helps meet PDPA/GDPR requirements.
