SiamCafe · Blog
Ollama Local LLM Container Orchestration: Running LLMs Locally with Docker and Kubernetes

Published 28 May 2026

Ollama Container Orchestration

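Ollama runs open-source LLMs such as Llama, Mistral, and Gemma locally and exposes them through a REST API, so prompts and data stay private on hardware you control. Running it in containers with Docker or Kubernetes adds GPU scheduling on NVIDIA hardware and auto-scaling.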

Tool           | Use when             | Scale                        | Complexity
Ollama (bare)  | Development, testing | Single machine               | Low
Docker Compose | Small team           | Single host, multi-container | Low
Kubernetes     | Production           | Multi-node cluster           | High
vLLM           | High-throughput API  | Enterprise                   | High

Docker Deployment

# === Ollama Docker Setup ===

# Basic Docker Run
docker run -d \
  --gpus all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama

# Pull and Run Model
docker exec -it ollama ollama pull llama3:8b
docker exec -it ollama ollama pull mistral:7b
docker exec -it ollama ollama pull codellama:13b

# Docker Compose with Multiple Models
version: '3.8'
services:
  ollama:
    image: ollama/ollama:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    volumes:
      - ollama_data:/root/.ollama
    ports:
      - "11434:11434"
    restart: always
    healthcheck:
      # The ollama/ollama image does not ship curl, so probe with the bundled CLI
      test: ["CMD", "ollama", "list"]
      interval: 30s
      timeout: 10s
      retries: 3
  ollama-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama
    volumes:
      - webui_data:/app/backend/data
  nginx:
    image: nginx:alpine
    ports:
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
      - ./certs:/etc/nginx/certs
    depends_on:
      - ollama
volumes:
  ollama_data:
  webui_data:
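The healthcheck above retries a probe on a fixed interval. A script that has to wait for the stack to come up can do something similar from the client side. The sketch below is only an illustration: the function names `probe_ollama` and `wait_until_ready` are hypothetical, and it assumes the Ollama root URL answers HTTP 200 once the server is up. It does not reproduce Docker's exact health-state logic.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def probe_ollama(url: str = "http://localhost:11434/", timeout: float = 10.0) -> bool:
    """Return True if the server answers with HTTP 200, False on any error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(probe: Callable[[], bool] = probe_ollama,
                     retries: int = 3, interval: float = 30.0) -> bool:
    """Call probe up to `retries` times, sleeping `interval` seconds between tries."""
    for attempt in range(retries):
        if probe():
            return True
        if attempt < retries - 1:
            time.sleep(interval)
    return False
```

Passing the probe as a parameter keeps the retry loop testable without a running server.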

# API Usage
curl http://localhost:11434/api/generate -d '{
  "model": "llama3:8b",
  "prompt": "Explain Docker in 3 sentences",
  "stream": false
}'
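A non-streaming `/api/generate` reply carries timing fields next to the generated text: `eval_count` (tokens generated) and `eval_duration` (generation time in nanoseconds). A small helper, sketched here with the hypothetical name `tokens_per_second`, turns them into a throughput figure you can compare against your own hardware. The sample values are made up for illustration and are not a measured result.

```python
def tokens_per_second(response: dict) -> float:
    """Compute generation speed from a non-streaming /api/generate response.

    eval_count is the number of generated tokens and eval_duration is the
    generation time in nanoseconds.
    """
    duration_ns = response.get("eval_duration", 0)
    if duration_ns <= 0:
        return 0.0
    return response.get("eval_count", 0) * 1e9 / duration_ns


# Sample values made up for illustration, not a measured result.
sample = {"response": "...", "eval_count": 90, "eval_duration": 2_000_000_000}
print(f"{tokens_per_second(sample):.1f} tok/s")
```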

from dataclasses import dataclass

@dataclass
class OllamaModel:
    name: str
    size_gb: float
    params: str
    vram_gb: float
    speed_tokens_s: float
    use_case: str

models = [
    OllamaModel("llama3:8b", 4.7, "8B", 6, 45, "General Chat, RAG"),
    OllamaModel("mistral:7b", 4.1, "7B", 5, 50, "Fast General Purpose"),
    OllamaModel("codellama:13b", 7.4, "13B", 10, 30, "Code Generation"),
    OllamaModel("gemma2:9b", 5.4, "9B", 7, 40, "Google Model, Multilingual"),
    OllamaModel("phi3:mini", 2.3, "3.8B", 3, 65, "Small Fast Efficient"),
    OllamaModel("llama3:70b-q4", 40, "70B", 48, 10, "High Quality Large"),
]

print("=== Ollama Models ===")
for m in models:
    print(f" [{m.name}] {m.params} | Size: {m.size_gb}GB | VRAM: {m.vram_gb}GB")
    print(f" Speed: {m.speed_tokens_s} tok/s | Use: {m.use_case}")
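The VRAM column is what decides which models a given card can serve. As a rough sketch, the filter below reuses the article's approximate VRAM figures and answers the question "what fits on this GPU?". The names `ModelSpec`, `CATALOG`, and `models_that_fit` are hypothetical, and the 1 GB headroom default is an assumption, not a measured requirement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    name: str
    vram_gb: float


# VRAM figures copied from the article's model list (approximate values).
CATALOG = [
    ModelSpec("llama3:8b", 6),
    ModelSpec("mistral:7b", 5),
    ModelSpec("codellama:13b", 10),
    ModelSpec("gemma2:9b", 7),
    ModelSpec("phi3:mini", 3),
    ModelSpec("llama3:70b-q4", 48),
]


def models_that_fit(gpu_vram_gb: float, headroom_gb: float = 1.0) -> list[str]:
    """Names of models whose VRAM need fits in the GPU, leaving some headroom."""
    budget = gpu_vram_gb - headroom_gb
    return [m.name for m in CATALOG if m.vram_gb <= budget]


print(models_that_fit(8))   # an 8 GB card
print(models_that_fit(24))  # a 24 GB card such as the RTX 4090
```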

Kubernetes Deployment

# === Kubernetes GPU Deployment ===

# Install NVIDIA GPU Operator
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace

# Ollama Deployment
# The llm namespace and the ollama-pvc claim must already exist.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: llm
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434
          resources:
            requests:
              memory: "8Gi"
              cpu: "2"
              nvidia.com/gpu: "1"
            limits:
              memory: "16Gi"
              cpu: "4"
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: ollama-data
              mountPath: /root/.ollama
          readinessProbe:
            httpGet:
              path: /
              port: 11434
            initialDelaySeconds: 30
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /
              port: 11434
            initialDelaySeconds: 60
            periodSeconds: 30
      volumes:
        # Replicas on different nodes can share this claim only if it is ReadWriteMany
        - name: ollama-data
          persistentVolumeClaim:
            claimName: ollama-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: ollama-svc
  namespace: llm
spec:
  selector:
    app: ollama
  ports:
    - port: 11434
      targetPort: 11434
  type: ClusterIP
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ollama-hpa
  namespace: llm
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ollama
  minReplicas: 1
  maxReplicas: 4
  metrics:
    # CPU utilization is only a rough proxy for load on a GPU-bound server
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
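The HorizontalPodAutoscaler picks a replica count with the formula documented for Kubernetes: desired = ceil(current replicas × current metric / target metric), clamped to the configured minimum and maximum. The sketch below applies that formula to the manifest's values (target 70%, 1 to 4 replicas). The function name `desired_replicas` is hypothetical, and the real controller also applies a tolerance band and stabilization windows that are left out here.

```python
import math


def desired_replicas(current_replicas: int, current_utilization: float,
                     target_utilization: float = 70.0,
                     min_replicas: int = 1, max_replicas: int = 4) -> int:
    """ceil(current * currentMetric / targetMetric), clamped to [min, max]."""
    raw = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, raw))


for util in (20, 70, 90, 140):
    print(f"2 replicas at {util}% CPU -> {desired_replicas(2, util)} replicas")
```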

from dataclasses import dataclass

@dataclass
class K8sNode:
    name: str
    gpu: str
    gpu_count: int
    vram_gb: int
    pods_running: int
    gpu_utilization: float

nodes = [
    K8sNode("gpu-node-01", "A100", 2, 80, 3, 72),
    K8sNode("gpu-node-02", "A100", 2, 80, 2, 55),
    K8sNode("gpu-node-03", "RTX 4090", 4, 24, 4, 85),
]

print("\n=== Kubernetes GPU Nodes ===")
for n in nodes:
    print(f" [{n.name}] {n.gpu} x{n.gpu_count} ({n.vram_gb}GB each)")
    print(f" Pods: {n.pods_running} | GPU Util: {n.gpu_utilization}%")

Monitoring and Optimization

# === Monitoring & Performance ===

# Prometheus Metrics (exact names depend on the exporter you deploy)
# ollama_request_duration_seconds
# ollama_tokens_generated_total
# ollama_model_load_duration_seconds
# nvidia_gpu_utilization_percentage
# nvidia_gpu_memory_used_bytes

# Python Client
import requests

class OllamaClient:
    """Minimal client for the Ollama REST API (non-streaming calls only)."""

    def __init__(self, base_url="http://localhost:11434", timeout=300):
        self.base_url = base_url
        self.timeout = timeout

    def _post(self, path, payload):
        # stream=False makes the server return one JSON object, which .json() can parse
        payload = {**payload, "stream": False}
        response = requests.post(f"{self.base_url}{path}", json=payload,
                                 timeout=self.timeout)
        response.raise_for_status()
        return response.json()

    def generate(self, model, prompt):
        return self._post("/api/generate", {"model": model, "prompt": prompt})

    def chat(self, model, messages):
        return self._post("/api/chat", {"model": model, "messages": messages})

    def list_models(self):
        response = requests.get(f"{self.base_url}/api/tags", timeout=self.timeout)
        response.raise_for_status()
        return response.json()

    def pull(self, model):
        return self._post("/api/pull", {"name": model})
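When `stream` is true, `/api/generate` answers with one JSON object per line, each carrying a fragment in `response`, and ends with an object whose `done` field is true. Such a reply cannot be parsed with a single `.json()` call. The sketch below shows one way to reassemble it. The function name `join_stream` is hypothetical, and the sample chunks are hand-written in the shape the API streams.

```python
import json
from typing import Iterable


def join_stream(lines: Iterable[bytes]) -> str:
    """Concatenate the "response" fragments of a streamed /api/generate reply."""
    parts = []
    for raw in lines:
        if not raw.strip():
            continue  # skip blank lines
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)


# Hand-written sample chunks in the shape the API streams, for illustration.
sample = [
    b'{"model": "llama3:8b", "response": "Docker ", "done": false}',
    b'{"model": "llama3:8b", "response": "packages apps.", "done": false}',
    b'{"model": "llama3:8b", "response": "", "done": true}',
]
print(join_stream(sample))
```

With the `requests` library, the lines would come from `response.iter_lines()` on a call made with `stream=True`.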

optimization = {
    "Quantization": "Use Q4_K_M to cut VRAM by 50-60% with a small quality loss",
    "Context Length": "Reduce the context length when long contexts are not needed, to save VRAM",
    "Batch Size": "Increase the batch size when VRAM allows, to raise throughput",
    "Model Caching": "Keep the model in memory so it is not reloaded on every request",
    "GPU Affinity": "Pin pods to specific GPUs to reduce scheduling overhead",
    "Preload Models": "Preload models when the container starts",
}

print("Performance Optimization:")
for tip, desc in optimization.items():
    print(f" [{tip}]: {desc}")
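The quantization saving follows from simple arithmetic: the weights need roughly parameters × bits per weight / 8 bytes. The sketch below uses nominal bit widths as an assumption. K-quant formats such as Q4_K_M store somewhat more than 4 bits per weight, and the KV cache and runtime overhead come on top, so the size of the real saving depends on the baseline precision. The function name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone: params * bits / 8 bytes."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9  # decimal gigabytes


for label, bits in (("FP16", 16), ("8-bit", 8), ("4-bit", 4)):
    print(f"8B model at {label}: ~{weight_memory_gb(8, bits):.1f} GB")
```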

# Cost Comparison
# Self-hosted figures are rough estimates per 1K tokens. API figures are list
# prices per 1M input tokens and change over time.
costs = {
    "Ollama Self-hosted (RTX 4090)": {"monthly": "$50 (electricity)", "price": "$0.001 / 1K tokens"},
    "Ollama Cloud GPU (A100)": {"monthly": "$800-1500", "price": "$0.01 / 1K tokens"},
    "OpenAI GPT-4o-mini": {"monthly": "Pay-per-use", "price": "$0.15 / 1M input tokens"},
    "OpenAI GPT-4o": {"monthly": "Pay-per-use", "price": "$2.50 / 1M input tokens"},
    "Anthropic Claude Sonnet": {"monthly": "Pay-per-use", "price": "$3.00 / 1M input tokens"},
}

print("\n\nCost Comparison:")
for provider, cost in costs.items():
    print(f" [{provider}]")
    print(f" Monthly: {cost['monthly']} | Token price: {cost['price']}")
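A self-hosted per-token cost depends heavily on how busy the GPU is. As a back-of-the-envelope sketch, the function below divides a monthly cost by the tokens generated in a 30-day month. The $50 figure and the 45 tok/s speed (the llama3:8b entry in the model list) come from the article. The utilization values are assumptions, and only electricity is counted, not the hardware. The function name `cost_per_1k_tokens` is hypothetical.

```python
def cost_per_1k_tokens(monthly_cost_usd: float, tokens_per_second: float,
                       utilization: float) -> float:
    """Monthly cost divided by thousands of tokens generated in a 30-day month."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    seconds_per_month = 30 * 24 * 3600
    tokens = tokens_per_second * seconds_per_month * utilization
    return monthly_cost_usd / (tokens / 1000)


# $50/month and 45 tok/s come from the article; utilization values are assumptions.
for utilization in (1.0, 0.4, 0.1):
    cost = cost_per_1k_tokens(50, 45, utilization)
    print(f"utilization {utilization:.0%}: ${cost:.5f} per 1K tokens")
```

Under these assumptions, a figure near $0.001 per 1K tokens corresponds to the GPU generating tokens a bit under half of the time.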

Tips

  • Quantize: Use Q4 quantization to reduce VRAM while keeping quality good.
  • Health Check: Configure a health check on every container that verifies the API responds.
  • GPU Operator: Use the NVIDIA GPU Operator on K8s to manage GPUs automatically.
  • Preload: Preload frequently used models when the container starts.
  • Monitor: Watch GPU utilization and VRAM continuously to prevent OOM.
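The preload tip can be done through the REST API: a `/api/generate` request that names a model but carries no prompt only loads the model, and the `keep_alive` field controls how long it stays in memory (a negative value keeps it loaded). The sketch below only builds the request body. The function name `preload_payload` is hypothetical, and the choice of models is an example.

```python
import json
from typing import Union


def preload_payload(model: str, keep_alive: Union[int, str] = -1) -> bytes:
    """JSON body that loads `model` without generating text.

    A /api/generate request with no prompt only loads the model; keep_alive=-1
    asks the server to keep it in memory instead of unloading it when idle.
    """
    return json.dumps({"model": model, "keep_alive": keep_alive}).encode("utf-8")


for name in ("llama3:8b", "mistral:7b"):
    print(preload_payload(name).decode())
```

A startup script would POST each body to `/api/generate` once the server is reachable.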

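What is Ollama?

Ollama is a free, open-source tool for running LLMs such as Llama, Mistral, and Gemma locally. It is easy to install, exposes a REST API, supports NVIDIA GPUs and Apple Silicon, and keeps your data on your own machine.

Why containerize Ollama?

Containers make deployment easy and reproducible and let you scale on Kubernetes. They also isolate GPU resources and bring resource management, health checks, rolling updates, and CI/CD integration.

How does GPU scheduling work on Kubernetes?

Install the NVIDIA GPU Operator, label the GPU nodes, and set nvidia.com/gpu in each pod's resource requests and limits. The scheduler then places pods on nodes that have free GPUs. Time-slicing lets several pods share one GPU.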
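The effect of `nvidia.com/gpu` requests can be pictured with a toy first-fit placement: each pod lands on the first node that still has enough whole GPUs free, and stays Pending otherwise. This is an illustration only. The real kube-scheduler filters and scores nodes on many more criteria, and the names `Node` and `schedule` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    name: str
    free_gpus: int


def schedule(pods: Dict[str, int], nodes: List[Node]) -> Dict[str, Optional[str]]:
    """Assign each pod to the first node with enough free GPUs (None = Pending)."""
    placement: Dict[str, Optional[str]] = {}
    for pod, gpus in pods.items():
        placement[pod] = None
        for node in nodes:
            if node.free_gpus >= gpus:
                node.free_gpus -= gpus  # without time-slicing, a GPU goes whole to one pod
                placement[pod] = node.name
                break
    return placement


cluster = [Node("gpu-node-01", 2), Node("gpu-node-02", 1)]
requests_by_pod = {"ollama-a": 1, "ollama-b": 1, "ollama-c": 1, "ollama-d": 1}
print(schedule(requests_by_pod, cluster))
```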

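How do Ollama and vLLM differ?

Ollama is simple and suits development and small-scale use. vLLM targets production performance: continuous batching and PagedAttention give it the high throughput that enterprise workloads need. Choose according to the size of the workload.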
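Why continuous batching raises throughput can be seen in a toy model that only counts decode steps. With static batching, a batch runs until its longest sequence finishes. With continuous batching, a finished sequence frees its slot at once for the next waiting request. The sketch below is a simplified illustration, not vLLM's scheduler: it ignores prefill, memory limits, and PagedAttention, and the output lengths are made up.

```python
import heapq
from typing import List


def static_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Each batch runs until its longest sequence finishes."""
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))


def continuous_batching_steps(lengths: List[int], batch_size: int) -> int:
    """A finished sequence frees its slot at once for the next waiting request."""
    slots = [0] * batch_size  # step at which each slot becomes free
    heapq.heapify(slots)
    for length in lengths:
        start = heapq.heappop(slots)  # earliest free slot takes the next request
        heapq.heappush(slots, start + length)
    return max(slots)


lengths = [2, 8, 3, 7]  # made-up output lengths in decode steps
print("static:", static_batching_steps(lengths, 2))
print("continuous:", continuous_batching_steps(lengths, 2))
```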

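Summary

Ollama runs LLMs locally, and Docker or Kubernetes turns it into a managed service: containers handle deployment, the NVIDIA GPU Operator handles GPU scheduling, and an HPA handles auto-scaling behind the REST API. Quantization, health checks, and monitoring keep a production deployment stable and its cost under control, and vLLM is the alternative when throughput matters most.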