Ollama Container Orchestration
Tags: Ollama, Local LLM, Llama, Mistral, Gemma, Docker, Kubernetes, Containers, GPU Scheduling, Auto-scaling, REST API, Privacy, Open Source, NVIDIA
| Tool | When to Use | Scale | Complexity |
|---|---|---|---|
| Ollama (Bare) | Development, Testing | Single Machine | Low |
| Docker Compose | Small Team | Single Host, Multi-container | Low |
| Kubernetes | Production | Multi-node Cluster | High |
| vLLM | High-throughput API | Enterprise | High |
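The table reads as a simple decision rule; a minimal sketch (the stage labels are illustrative, not an official taxonomy):

```python
def pick_deployment(stage: str, needs_high_throughput: bool = False) -> str:
    """Map the decision table above to code (labels are illustrative)."""
    if needs_high_throughput:
        return "vLLM"
    table = {
        "development": "Ollama (Bare)",
        "testing": "Ollama (Bare)",
        "small team": "Docker Compose",
        "production": "Kubernetes",
    }
    return table.get(stage.lower(), "Ollama (Bare)")

print(pick_deployment("production"))  # Kubernetes
```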
Docker Deployment
# === Ollama Docker Setup ===
# Basic Docker Run
# docker run -d \
#   --gpus all \
#   -v ollama:/root/.ollama \
#   -p 11434:11434 \
#   --name ollama \
#   ollama/ollama
# Pull Models
# docker exec -it ollama ollama pull llama3:8b
# docker exec -it ollama ollama pull mistral:7b
# docker exec -it ollama ollama pull codellama:13b
# Docker Compose with Multiple Models
# version: '3.8'
# services:
#   ollama:
#     image: ollama/ollama:latest
#     deploy:
#       resources:
#         reservations:
#           devices:
#             - driver: nvidia
#               count: all
#               capabilities: [gpu]
#     volumes:
#       - ollama_data:/root/.ollama
#     ports:
#       - "11434:11434"
#     restart: always
#     healthcheck:
#       # the ollama/ollama image does not ship curl; use the bundled CLI instead
#       test: ["CMD", "ollama", "list"]
#       interval: 30s
#       timeout: 10s
#       retries: 3
#
#   ollama-webui:
#     image: ghcr.io/open-webui/open-webui:main
#     ports:
#       - "3000:8080"
#     environment:
#       - OLLAMA_BASE_URL=http://ollama:11434
#     depends_on:
#       - ollama
#     volumes:
#       - webui_data:/app/backend/data
#
#   nginx:
#     image: nginx:alpine
#     ports:
#       - "443:443"
#     volumes:
#       - ./nginx.conf:/etc/nginx/nginx.conf:ro
#       - ./certs:/etc/nginx/certs:ro
#     depends_on:
#       - ollama
#
# volumes:
#   ollama_data:
#   webui_data:
# API Usage
# curl http://localhost:11434/api/generate -d '{
#   "model": "llama3:8b",
#   "prompt": "Explain Docker in 3 sentences",
#   "stream": false
# }'
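With `"stream": true` (the API's default), `/api/generate` returns newline-delimited JSON chunks, each carrying a `response` fragment and a final object with `"done": true`. A sketch of assembling them client-side (the sample lines are illustrative):

```python
import json

def assemble_stream(ndjson_lines):
    """Concatenate `response` fragments from Ollama's NDJSON stream."""
    parts = []
    for line in ndjson_lines:
        if not line.strip():
            continue
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

sample = [
    '{"model":"llama3:8b","response":"Docker ","done":false}',
    '{"model":"llama3:8b","response":"is great.","done":true}',
]
print(assemble_stream(sample))  # Docker is great.
```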
from dataclasses import dataclass

@dataclass
class OllamaModel:
    name: str
    size_gb: float
    params: str
    vram_gb: float
    speed_tokens_s: float
    use_case: str

models = [
    OllamaModel("llama3:8b", 4.7, "8B", 6, 45, "General Chat, RAG"),
    OllamaModel("mistral:7b", 4.1, "7B", 5, 50, "Fast General Purpose"),
    OllamaModel("codellama:13b", 7.4, "13B", 10, 30, "Code Generation"),
    OllamaModel("gemma2:9b", 5.4, "9B", 7, 40, "Google Model, Multilingual"),
    OllamaModel("phi3:mini", 2.3, "3.8B", 3, 65, "Small Fast Efficient"),
    OllamaModel("llama3:70b-q4", 40, "70B", 48, 10, "High Quality Large"),
]

print("=== Ollama Models ===")
for m in models:
    print(f"  [{m.name}] {m.params} | Size: {m.size_gb}GB | VRAM: {m.vram_gb}GB")
    print(f"    Speed: {m.speed_tokens_s} tok/s | Use: {m.use_case}")
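Given a catalog like the one above, picking the largest model that fits the available VRAM is straightforward; a self-contained sketch (the VRAM figures mirror the table):

```python
from dataclasses import dataclass

@dataclass
class Model:
    name: str
    vram_gb: float

catalog = [
    Model("llama3:8b", 6), Model("mistral:7b", 5),
    Model("codellama:13b", 10), Model("gemma2:9b", 7),
    Model("phi3:mini", 3), Model("llama3:70b-q4", 48),
]

def best_fit(models, vram_available_gb):
    """Largest model (by VRAM footprint) that fits the budget, else None."""
    fitting = [m for m in models if m.vram_gb <= vram_available_gb]
    return max(fitting, key=lambda m: m.vram_gb, default=None)

print(best_fit(catalog, 8).name)  # gemma2:9b
```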
Kubernetes Deployment
# === Kubernetes GPU Deployment ===
# Install NVIDIA GPU Operator
# helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
# helm install gpu-operator nvidia/gpu-operator \
#   --namespace gpu-operator --create-namespace
# Ollama Deployment
# apiVersion: apps/v1
# kind: Deployment
# metadata:
#   name: ollama
#   namespace: llm
# spec:
#   replicas: 2   # with replicas > 1, the shared PVC below must support ReadWriteMany
#   selector:
#     matchLabels:
#       app: ollama
#   template:
#     metadata:
#       labels:
#         app: ollama
#     spec:
#       containers:
#         - name: ollama
#           image: ollama/ollama:latest
#           ports:
#             - containerPort: 11434
#           resources:
#             requests:
#               memory: "8Gi"
#               cpu: "2"
#               nvidia.com/gpu: "1"
#             limits:
#               memory: "16Gi"
#               cpu: "4"
#               nvidia.com/gpu: "1"
#           volumeMounts:
#             - name: ollama-data
#               mountPath: /root/.ollama
#           readinessProbe:
#             httpGet:
#               path: /
#               port: 11434
#             initialDelaySeconds: 30
#             periodSeconds: 10
#           livenessProbe:
#             httpGet:
#               path: /
#               port: 11434
#             initialDelaySeconds: 60
#             periodSeconds: 30
#       volumes:
#         - name: ollama-data
#           persistentVolumeClaim:
#             claimName: ollama-pvc
#
# ---
# apiVersion: v1
# kind: Service
# metadata:
#   name: ollama-svc
#   namespace: llm
# spec:
#   selector:
#     app: ollama
#   ports:
#     - port: 11434
#       targetPort: 11434
#   type: ClusterIP
#
# ---
# apiVersion: autoscaling/v2
# kind: HorizontalPodAutoscaler
# metadata:
#   name: ollama-hpa
#   namespace: llm
# spec:
#   scaleTargetRef:
#     apiVersion: apps/v1
#     kind: Deployment
#     name: ollama
#   minReplicas: 1
#   maxReplicas: 4
#   metrics:
#     - type: Resource
#       resource:
#         name: cpu
#         target:
#           type: Utilization
#           averageUtilization: 70
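The HPA above scales on CPU utilization. Kubernetes computes the target replica count as `ceil(currentReplicas * currentMetric / targetMetric)`, clamped to the configured min/max; a sketch of that formula:

```python
import math

def hpa_desired_replicas(current_replicas, current_util, target_util,
                         min_replicas=1, max_replicas=4):
    """Kubernetes HPA scaling formula, clamped to the configured bounds."""
    desired = math.ceil(current_replicas * current_util / target_util)
    return max(min_replicas, min(max_replicas, desired))

# 2 replicas at 95% average CPU against a 70% target -> scale to 3
print(hpa_desired_replicas(2, current_util=95, target_util=70))  # 3
```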
@dataclass
class K8sNode:
    name: str
    gpu: str
    gpu_count: int
    vram_gb: int
    pods_running: int
    gpu_utilization: float

nodes = [
    K8sNode("gpu-node-01", "A100", 2, 80, 3, 72),
    K8sNode("gpu-node-02", "A100", 2, 80, 2, 55),
    K8sNode("gpu-node-03", "RTX 4090", 4, 24, 4, 85),
]

print("\n=== Kubernetes GPU Nodes ===")
for n in nodes:
    print(f"  [{n.name}] {n.gpu} x{n.gpu_count} ({n.vram_gb}GB each)")
    print(f"    Pods: {n.pods_running} | GPU Util: {n.gpu_utilization}%")
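A utilization-aware placement sketch over node stats like these (illustrative only, not the real kube-scheduler algorithm, which scores nodes across many criteria):

```python
from dataclasses import dataclass

@dataclass
class GPUNode:
    name: str
    gpu_utilization: float  # percent

cluster = [
    GPUNode("gpu-node-01", 72),
    GPUNode("gpu-node-02", 55),
    GPUNode("gpu-node-03", 85),
]

def pick_node(nodes, max_util=80.0):
    """Least-utilized node under the cap, or None if all are saturated."""
    candidates = [n for n in nodes if n.gpu_utilization < max_util]
    return min(candidates, key=lambda n: n.gpu_utilization, default=None)

print(pick_node(cluster).name)  # gpu-node-02
```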
Monitoring and Optimization
# === Monitoring & Performance ===
# Prometheus metrics (illustrative names; GPU metrics typically come from the DCGM exporter)
# ollama_request_duration_seconds
# ollama_tokens_generated_total
# ollama_model_load_duration_seconds
# nvidia_gpu_utilization_percentage
# nvidia_gpu_memory_used_bytes
# Python Client
# import requests
#
# class OllamaClient:
#     def __init__(self, base_url="http://localhost:11434", timeout=120):
#         self.base_url = base_url
#         self.timeout = timeout
#
#     def generate(self, model, prompt, stream=False):
#         response = requests.post(f"{self.base_url}/api/generate", json={
#             "model": model, "prompt": prompt, "stream": stream,
#         }, timeout=self.timeout)
#         response.raise_for_status()
#         return response.json()
#
#     def chat(self, model, messages, stream=False):
#         response = requests.post(f"{self.base_url}/api/chat", json={
#             "model": model, "messages": messages, "stream": stream,
#         }, timeout=self.timeout)
#         response.raise_for_status()
#         return response.json()
#
#     def list_models(self):
#         return requests.get(f"{self.base_url}/api/tags",
#                             timeout=self.timeout).json()
#
#     def pull(self, model):
#         return requests.post(f"{self.base_url}/api/pull",
#                              json={"name": model}, timeout=self.timeout)
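Non-streaming Ollama responses also carry timing fields (`eval_count` and `eval_duration`, the latter in nanoseconds), so generation speed can be computed client-side without extra instrumentation. A sketch using an illustrative response dict:

```python
def tokens_per_second(resp: dict) -> float:
    """Generation speed from Ollama's eval_count / eval_duration (ns) fields."""
    if not resp.get("eval_duration"):
        return 0.0
    return resp["eval_count"] / resp["eval_duration"] * 1e9

# illustrative response fragment: 90 tokens generated in 2 seconds
sample = {"eval_count": 90, "eval_duration": 2_000_000_000}
print(tokens_per_second(sample))  # 45.0
```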
optimization = {
    "Quantization": "Use Q4_K_M to cut VRAM by 50-60% with only a small quality loss",
    "Context Length": "Reduce context length when long contexts are not needed; saves VRAM",
    "Batch Size": "Increase batch size when VRAM allows; raises throughput",
    "Model Caching": "Keep models resident in memory instead of reloading per request",
    "GPU Affinity": "Pin pods to specific GPUs to cut scheduling overhead",
    "Preload Models": "Preload frequently used models at container start",
}

print("Performance Optimization:")
for tip, desc in optimization.items():
    print(f"  [{tip}]: {desc}")
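A back-of-the-envelope estimate behind the quantization tip: weight memory is roughly parameters times bytes per weight, plus overhead for KV cache and activations. The bytes-per-weight figures and the 20% overhead factor below are rough assumptions, not exact values:

```python
# approximate bytes per weight (assumptions, not exact GGUF figures)
BYTES_PER_WEIGHT = {"fp16": 2.0, "q8_0": 1.0, "q4_k_m": 0.56}

def vram_estimate_gb(params_billion, quant="q4_k_m", overhead=1.2):
    """Rough VRAM need: weights * assumed overhead for KV cache/activations."""
    weights_gb = params_billion * BYTES_PER_WEIGHT[quant]
    return round(weights_gb * overhead, 1)

print(vram_estimate_gb(8, "fp16"))    # 19.2
print(vram_estimate_gb(8, "q4_k_m"))  # 5.4
```

The ~5.4 GB estimate for an 8B Q4 model lines up with the ~6 GB figure in the model table above.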
# Cost Comparison (all per-token prices normalized to 1M tokens)
costs = {
    "Ollama Self-hosted (RTX 4090)": {"monthly": "$50 (electricity)", "per_1m": "$1.00"},
    "Ollama Cloud GPU (A100)": {"monthly": "$800-1500", "per_1m": "$10.00"},
    "OpenAI GPT-4o-mini": {"monthly": "Pay-per-use", "per_1m": "$0.15"},
    "OpenAI GPT-4o": {"monthly": "Pay-per-use", "per_1m": "$2.50"},
    "Anthropic Claude Sonnet": {"monthly": "Pay-per-use", "per_1m": "$3.00"},
}

print("\n\nCost Comparison:")
for provider, cost in costs.items():
    print(f"  [{provider}]")
    print(f"    Monthly: {cost['monthly']} | Per 1M tokens: {cost['per_1m']}")
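The trade-off comes down to a break-even volume: the monthly token count at which a fixed-cost GPU beats pay-per-use pricing (ignoring the self-hosted per-token cost, so this is an optimistic sketch; API prices are per 1M tokens):

```python
def breakeven_tokens_m(fixed_monthly_usd, api_price_per_1m_usd):
    """Monthly token volume (millions) above which self-hosting is cheaper."""
    return fixed_monthly_usd / api_price_per_1m_usd

# RTX 4090 at ~$50/month electricity vs GPT-4o at $2.50 per 1M input tokens
print(breakeven_tokens_m(50, 2.50))  # 20.0 -> ~20M tokens/month
```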
Tips
- Quantize: use Q4 quantization to cut VRAM while keeping quality acceptable
- Health Check: give every container a health check that verifies the API responds
- GPU Operator: use the NVIDIA GPU Operator on K8s for automatic GPU management
- Preload: preload frequently used models at container start
- Monitor: watch GPU utilization and VRAM continuously to prevent OOM
What is Ollama?
Ollama runs LLMs locally (Llama, Mistral, Gemma). It is easy to install, exposes a REST API, and supports NVIDIA GPUs and Apple Silicon. It is free, open source, and privacy-friendly, since data never leaves your machine.
Why containerize Ollama?
Containers make deployments easy and reproducible, enable scaling on Kubernetes, isolate workloads, support GPU resource management, and plug into health checks, rolling updates, and CI/CD pipelines.
How does GPU scheduling work on Kubernetes?
The NVIDIA GPU Operator labels GPU nodes; pods declare nvidia.com/gpu in their resource requests and limits, and the scheduler places them on matching nodes. Time-slicing lets multiple pods share a single GPU.
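Time-slicing is configured through the device plugin; a sketch of the ConfigMap the GPU Operator consumes (key names per NVIDIA's documentation; verify against your operator version). With this config, each physical GPU is advertised as 4 schedulable `nvidia.com/gpu` resources:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4
```

Note that time-slicing shares compute but not VRAM: all pods on one GPU still contend for the same memory, so size model VRAM budgets accordingly.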
How do Ollama and vLLM differ?
Ollama is simple and well suited to development and small-scale serving. vLLM targets production performance, using continuous batching and PagedAttention to achieve high throughput at enterprise scale. Choose based on workload size.
Summary
Ollama brings local LLMs to Docker and Kubernetes: GPU scheduling via the NVIDIA GPU Operator, auto-scaling with HPA, a REST API, quantization to fit VRAM, health checks and monitoring, and a cost profile worth comparing against hosted APIs; graduate to vLLM when throughput demands it.
