LocalAI Production Setup
A self-hosted, OpenAI-API-compatible AI server for LLMs, speech-to-text, text-to-speech, and embeddings, with NVIDIA GPU acceleration, Docker/Kubernetes deployment, and full on-premise privacy.
| Feature | LocalAI | OpenAI API | Ollama |
|---|---|---|---|
| Cost | Free (own hardware) | Pay per token | Free (own hardware) |
| Privacy | 100% on-premise | Data sent to cloud | 100% on-premise |
| API compatibility | OpenAI format | OpenAI format | Custom + OpenAI |
| Models | Any GGUF + Stable Diffusion + Whisper | GPT-4, DALL-E, Whisper | Any GGUF |
| GPU support | NVIDIA CUDA + AMD ROCm | Cloud GPU | NVIDIA CUDA + Apple Metal |
| Production readiness | Docker, K8s, metrics | SaaS-ready | Docker, basic |
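The cost row can be made concrete with a back-of-envelope break-even calculation. The prices and hardware figures below are illustrative assumptions, not quoted rates:

```python
# Rough break-even: self-hosted GPU vs. pay-per-token API.
# All figures below are illustrative assumptions.

def breakeven_months(hw_cost_usd: float,
                     power_usd_per_month: float,
                     tokens_per_month: float,
                     api_usd_per_1m_tokens: float) -> float:
    """Months until self-hosting is cheaper than a per-token API."""
    api_monthly = tokens_per_month / 1_000_000 * api_usd_per_1m_tokens
    saving = api_monthly - power_usd_per_month
    if saving <= 0:
        return float("inf")  # API stays cheaper at this volume
    return hw_cost_usd / saving

# Example: $1500 GPU, $40/month power, 200M tokens/month, $2 per 1M tokens
months = breakeven_months(1500, 40, 200_000_000, 2.0)
print(f"break-even after ~{months:.1f} months")  # ~4.2 months
```

At low volume the result is `inf`: the API never costs more than the electricity, so self-hosting is justified by privacy rather than price.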
Installation & Configuration
# === LocalAI Docker Production Setup ===
# Docker Compose (docker-compose.yml)
# version: '3.8'
# services:
# localai:
# image: localai/localai:latest-gpu-nvidia-cuda-12
# ports:
# - "8080:8080"
# volumes:
# - ./models:/build/models
# - ./config:/build/config
# environment:
# - THREADS=8
# - CONTEXT_SIZE=4096
# - GALLERIES=[{"name":"model-gallery","url":"github:go-skynet/model-gallery/index.yaml"}]
# - API_KEY=your-secret-api-key
# deploy:
# resources:
# reservations:
# devices:
# - driver: nvidia
# count: all
# capabilities: [gpu]
# restart: always
# healthcheck:
# test: ["CMD", "curl", "-f", "http://localhost:8080/readyz"]
# interval: 30s
# timeout: 10s
# retries: 3
# Download Model
# curl -L "https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf" \
# -o models/llama2-7b-chat.gguf
# Model Config (config/llama2.yaml)
# name: llama2-chat
# backend: llama-cpp
# parameters:
# model: llama2-7b-chat.gguf
# temperature: 0.7
# top_p: 0.9
# gpu_layers: 99
# context_size: 4096
# flash_attention: true
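The config above pins `gpu_layers` and `context_size`; a rough VRAM estimate (weights + KV cache) helps check that a model fits before downloading. This is a simplified sketch assuming an fp16 KV cache and no grouped-query attention (GQA shrinks `kv_dim` on newer models), so treat the result as an upper-ballpark figure:

```python
def estimate_vram_gb(gguf_file_gb: float,
                     n_layers: int,
                     kv_dim: int,
                     context_size: int,
                     kv_bytes: int = 2,        # fp16 KV cache
                     overhead_gb: float = 0.5) -> float:
    """Weights (the GGUF file) + KV cache (K and V tensors) + fixed overhead."""
    kv_cache = 2 * n_layers * context_size * kv_dim * kv_bytes
    return gguf_file_gb + kv_cache / 1024**3 + overhead_gb

# Llama-2-7B Q4_K_M: ~4.1 GB file, 32 layers, kv_dim 4096 (no GQA), ctx 4096
print(f"~{estimate_vram_gb(4.1, 32, 4096, 4096):.1f} GB VRAM")  # ~6.6 GB VRAM
```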
from dataclasses import dataclass

@dataclass
class ModelConfig:
    model: str
    size: str
    vram: str
    speed: str
    quality: str
    use_case: str

models = [
    ModelConfig("Llama 3.1 8B Q4_K_M", "4.9GB", "5-6GB VRAM",
                "~30 tok/s (RTX 3090)",
                "Excellent (best open-source 8B)",
                "Chat, general purpose, Thai + English"),
    ModelConfig("Mistral 7B Q4_K_M", "4.4GB", "5GB VRAM",
                "~35 tok/s (RTX 3090)",
                "Good (fast with good quality)",
                "Chat, code analysis"),
    ModelConfig("Phi-3 Mini 3.8B Q4", "2.2GB", "3GB VRAM",
                "~50 tok/s (RTX 3090)",
                "Good (for its size)",
                "Edge devices, low VRAM, quick responses"),
    ModelConfig("Whisper Large V3", "3.1GB", "4GB VRAM",
                "~5x realtime",
                "Very accurate (best STT)",
                "Speech-to-text, transcription"),
    ModelConfig("nomic-embed-text", "274MB", "1GB VRAM",
                "~1000 docs/s",
                "Excellent (top embedding model)",
                "RAG, vector search, semantic search"),
]

print("=== Recommended Models ===")
for m in models:
    print(f"  [{m.model}] Size: {m.size} | VRAM: {m.vram}")
    print(f"    Speed: {m.speed}")
    print(f"    Quality: {m.quality}")
    print(f"    Use: {m.use_case}")
API Usage
# === OpenAI-compatible API Usage ===
# curl http://localhost:8080/v1/chat/completions \
# -H "Content-Type: application/json" \
# -H "Authorization: Bearer your-secret-api-key" \
# -d '{
# "model": "llama2-chat",
# "messages": [{"role": "user", "content": "Hello"}],
# "temperature": 0.7,
# "max_tokens": 500
# }'
# Python (OpenAI SDK)
# from openai import OpenAI
# client = OpenAI(base_url="http://localhost:8080/v1", api_key="your-key")
# response = client.chat.completions.create(
# model="llama2-chat",
# messages=[{"role": "user", "content": "Hello"}],
# temperature=0.7,
# max_tokens=500,
# )
# print(response.choices[0].message.content)
# Embeddings
# response = client.embeddings.create(
# model="nomic-embed-text",
# input=["Hello world", "How are you"]
# )
# STT (Whisper)
# audio = open("speech.wav", "rb")
# transcript = client.audio.transcriptions.create(
# model="whisper-large-v3", file=audio
# )
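The curl and SDK calls above send the same JSON body. A small helper that builds and validates it can catch malformed messages before they reach the server; field names follow the OpenAI chat-completions schema the examples already use:

```python
import json

def chat_payload(model: str, messages: list[dict],
                 temperature: float = 0.7, max_tokens: int = 500) -> str:
    """Build the JSON body for POST /v1/chat/completions."""
    for m in messages:
        if m.get("role") not in ("system", "user", "assistant"):
            raise ValueError(f"bad role: {m.get('role')!r}")
        if "content" not in m:
            raise ValueError("message missing 'content'")
    return json.dumps({
        "model": model,
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_tokens,
    })

body = chat_payload("llama2-chat", [{"role": "user", "content": "Hello"}])
print(body)
```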
@dataclass
class APIEndpoint:
    endpoint: str
    method: str
    model_type: str
    example: str

endpoints = [
    APIEndpoint("/v1/chat/completions", "POST",
                "LLM (Llama, Mistral, Phi)",
                "Chat, conversation, Q&A, summarization"),
    APIEndpoint("/v1/completions", "POST",
                "LLM (text completion)",
                "Text generation, code completion"),
    APIEndpoint("/v1/embeddings", "POST",
                "Embedding (nomic, all-MiniLM)",
                "RAG, vector search, semantic similarity"),
    APIEndpoint("/v1/audio/transcriptions", "POST",
                "Whisper (STT)",
                "Speech-to-text, meeting notes"),
    APIEndpoint("/v1/images/generations", "POST",
                "Stable Diffusion",
                "Image generation from a text prompt"),
    APIEndpoint("/readyz", "GET",
                "Health check",
                "Load balancer health probe"),
]

print("=== API Endpoints ===")
for e in endpoints:
    print(f"  [{e.method} {e.endpoint}] Model: {e.model_type}")
    print(f"    Use: {e.example}")
Production Monitoring
# === Production Monitoring ===
# Nginx Load Balancer
# upstream localai {
# server localai-1:8080;
# server localai-2:8080;
# server localai-3:8080;
# }
# server {
# listen 443 ssl;
# location / {
# proxy_pass http://localai;
# proxy_read_timeout 300s;
# }
# }
# Prometheus scrape
# scrape_configs:
# - job_name: 'localai'
# static_configs:
# - targets: ['localai-1:8080', 'localai-2:8080']
# metrics_path: /metrics
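Latency alerts like "P99 > 10 s" rest on a percentile computation. Below is a minimal nearest-rank sketch for raw samples; note that Prometheus's `histogram_quantile` interpolates within histogram buckets, so its results will differ slightly:

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile of raw samples (p in 0..100)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))  # 1-based rank
    return ordered[rank - 1]

latencies = [0.8, 1.2, 0.9, 11.5, 1.0, 0.7]   # seconds, illustrative
print(percentile(latencies, 99))
```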
@dataclass
class ProdMetric:
    metric: str
    source: str
    alert: str
    action: str

metrics = [
    ProdMetric("Request latency P99",
               "/metrics (localai_request_duration)",
               "> 10 seconds",
               "Scale GPU instances or reduce context size"),
    ProdMetric("GPU memory usage",
               "nvidia-smi / DCGM Exporter",
               "> 90% VRAM",
               "Use a smaller model or heavier quantization"),
    ProdMetric("Token generation speed",
               "/metrics (tokens_per_second)",
               "< 10 tok/s",
               "Check GPU load; increase gpu_layers"),
    ProdMetric("Error rate",
               "/metrics (localai_request_errors)",
               "> 1%",
               "Check model config and memory (OOM)"),
    ProdMetric("Queue length",
               "/metrics (localai_request_queue)",
               "> 10 pending",
               "Scale out instances; raise the parallel-requests limit"),
    ProdMetric("Health check",
               "/readyz endpoint",
               "Not 200 for 30s",
               "Auto-restart the container; alert the team"),
]

print("=== Production Metrics ===")
for m in metrics:
    print(f"  [{m.metric}] Source: {m.source}")
    print(f"    Alert: {m.alert}")
    print(f"    Action: {m.action}")
Tips
- Q4_K_M: use Q4_K_M quantization for the best balance between speed and quality
- Flash Attention: enable Flash Attention to reduce VRAM use and increase speed
- GPU layers: set gpu_layers: 99 to keep every layer on the GPU
- API key: always set an API key to prevent unauthorized access
- Health check: use /readyz for load balancer health probes
What is LocalAI?
An open-source, OpenAI-API-compatible AI server for LLMs, STT, TTS, embeddings, and image generation. It runs fully on-premise for privacy, loads GGUF models, deploys via Docker or Kubernetes, supports NVIDIA GPUs, and is free.
How do I install it?
Run the GPU image with Docker Compose (NVIDIA CUDA via nvidia-container-toolkit), as a binary, or on Kubernetes with Helm; download GGUF models (e.g. Q4_K_M quants) from Hugging Face and describe each one in a YAML config.
Which models are recommended?
Llama 3.1 8B and Mistral 7B for chat and code, Phi-3 Mini for low-VRAM setups, Whisper Large for STT, and nomic-embed-text for embeddings; Q4_K_M GGUF builds run in roughly 5GB VRAM at ~30 tok/s.
How do I set it up for production?
Use an RTX 3090 or A100 class GPU, put Nginx in front as a TLS load balancer, set an API key, enable Flash Attention, monitor with Prometheus and Grafana, probe /readyz with auto-restart, and scale on queue length.
Summary
LocalAI is a self-hosted, OpenAI-API-compatible server for LLMs, STT, TTS, and embeddings, running GGUF models on GPUs, deployed with Docker or Kubernetes and monitored with Prometheus: production-ready, with on-premise privacy that helps meet PDPA/GDPR requirements.
