In 2026, using AI no longer has to mean sending your data to OpenAI, Google, or Anthropic. A local LLM (Large Language Model) running on your own machine gives you AI that is free, 100% private, and works offline. Ollama is the tool that makes installing and running LLMs as easy as possible, supporting Llama 3.1, Gemma 2, Mistral, Phi-3, and many more. With a GPU in the 10,000-30,000 baht range you can comfortably run 7B-14B models at home, and with a high-end card (or two) even 70B-parameter models are within reach.
What Is Ollama, and Why Use It?
# =============================================
# Ollama:
# =============================================
# → A tool for running LLMs locally
# → Supports Windows, macOS, Linux
# → GPU acceleration (NVIDIA, AMD, Apple Silicon)
# → OpenAI-compatible API
# → 100+ models available
# → Free and open source
#
# =============================================
# Advantages of a Local LLM:
# =============================================
# ✓ 100% privacy (data never leaves your machine)
# ✓ No API costs
# ✓ Unlimited requests
# ✓ Works offline
# ✓ Custom fine-tuning possible
# ✓ No OpenAI outages to worry about
# ✓ No rate limits
#
# =============================================
# Disadvantages:
# =============================================
# ✗ Requires a GPU (RTX 3060 or better)
# ✗ Draws real power (200-400W)
# ✗ Models are not as strong as GPT-5 / Claude Opus 4.7
# ✗ Setup takes technical know-how
# ✗ Large models need 50GB+ of SSD space
Required Hardware
| Model Size | Parameters | VRAM Required | Recommended GPU |
|---|---|---|---|
| Tiny | 1-3B | 2-4GB | Any modern GPU |
| Small | 7-8B | 6-8GB | RTX 3060, RTX 4060 |
| Medium | 13-14B | 10-16GB | RTX 3080, RTX 4070 |
| Large | 30-34B | 20-24GB | RTX 3090, RTX 4080 |
| Huge | 70B | 40-48GB | RTX 4090 (x2) / A100 |
| Giant | 180B+ | 96GB+ | Multi-GPU Server |
Install Ollama
# =============================================
# Linux (Ubuntu 24.04 LTS):
# =============================================
# curl -fsSL https://ollama.com/install.sh | sh
#
# # Start the service
# sudo systemctl enable ollama
# sudo systemctl start ollama
#
# # Verify
# ollama --version
#
# =============================================
# macOS:
# =============================================
# Download: https://ollama.com/download
# or: brew install ollama
#
# =============================================
# Windows:
# =============================================
# Download: https://ollama.com/download/windows
# Run the installer (.exe)
#
# =============================================
# Docker (Cross-platform):
# =============================================
# docker pull ollama/ollama
# docker run -d \
# --gpus all \
# -v ollama:/root/.ollama \
# -p 11434:11434 \
# --name ollama \
# ollama/ollama
#
# =============================================
# Verify GPU:
# =============================================
# ollama ps # Check GPU usage
# nvidia-smi # NVIDIA GPU stats
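Once the server is up, you can sanity-check it from Python too. A minimal sketch: the server's root endpoint simply returns a plain-text liveness message.

import requests

# Ollama listens on port 11434 by default; the root endpoint
# responds with the string "Ollama is running" when healthy.
r = requests.get("http://localhost:11434")
print(r.status_code, r.text)  # expected: 200 Ollama is running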
Top Models 2026
# =============================================
# Llama 3.1 (Meta):
# =============================================
# → 8B, 70B, 405B parameters
# → Multilingual, including Thai
# → Easy to fine-tune
# → Among the fastest in its class
#
# ollama run llama3.1:8b      # 4.7GB
# ollama run llama3.1:70b     # 40GB (needs ~48GB VRAM; the default tag is already 4-bit quantized)
# ollama run llama3.1:70b-q4  # explicit quantized tag (check the model library for exact tag names)
#
# =============================================
# Gemma 2 (Google):
# =============================================
# → 2B, 9B, 27B parameters
# → Google's open-source model family
# → Strong at code and reasoning
# → Fast and lightweight
#
# ollama run gemma2:2b # 1.6GB
# ollama run gemma2:9b # 5.4GB
# ollama run gemma2:27b # 16GB
#
# =============================================
# Mistral (Mistral AI):
# =============================================
# → 7B, 8x7B (MoE)
# → Remarkably capable for its size
# → Strong at European languages
#
# ollama run mistral:7b # 4.1GB
# ollama run mixtral:8x7b # 26GB
#
# =============================================
# Phi-3 (Microsoft):
# =============================================
# → 3.8B, 14B parameters
# → Small but smart
# → Runs on edge devices
#
# ollama run phi3:mini # 2.3GB
# ollama run phi3:medium # 7.9GB
#
# =============================================
# DeepSeek Coder:
# =============================================
# → Excellent at writing code
# → 6.7B, 33B
# → Supports 100+ programming languages
#
# ollama run deepseek-coder:6.7b
# ollama run deepseek-coder:33b
#
# =============================================
# Thai-Specific Models:
# =============================================
# → SEA-LION (Singapore; strong Thai support)
# → Typhoon (Thai NLP)
# → Pantip-LLM
#
# ollama run typhoon:7b
Basic Usage
# =============================================
# Run Model (Interactive):
# =============================================
# ollama run llama3.1
# >>> Hello! How are you?
# >>> /bye (exit)
#
# =============================================
# Pull Model (Download):
# =============================================
# ollama pull llama3.1:8b
# ollama pull gemma2:9b
#
# =============================================
# List Models:
# =============================================
# ollama list
#
# NAME ID SIZE MODIFIED
# llama3.1:8b 4f822... 4.7GB 2 hours ago
# gemma2:9b a4c3b... 5.4GB 5 minutes ago
#
# =============================================
# Delete Model:
# =============================================
# ollama rm gemma2:9b
#
# =============================================
# Model Info:
# =============================================
# ollama show llama3.1:8b
#
# =============================================
# Running Models (Status):
# =============================================
# ollama ps
#
# NAME SIZE PROCESSOR UNTIL
# llama3.1:8b 4.7GB 100% GPU 4 min from now
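The same information the CLI shows is available over HTTP. A small sketch that mirrors `ollama list` using the /api/tags endpoint:

import requests

# GET /api/tags returns the locally installed models,
# equivalent to running `ollama list`. Sizes are in bytes.
models = requests.get("http://localhost:11434/api/tags").json()["models"]
for m in models:
    print(f'{m["name"]:20} {m["size"] / 1e9:.1f}GB  {m["modified_at"]}')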
REST API
# =============================================
# Ollama native API (an OpenAI-compatible /v1 endpoint is also available):
# =============================================
#
# Endpoint: http://localhost:11434/api/generate
#
# =============================================
# Basic Request (curl):
# =============================================
# curl http://localhost:11434/api/generate -d '{
# "model": "llama3.1:8b",
# "prompt": "Why is the sky blue?",
# "stream": false
# }'
#
# =============================================
# Python Example:
# =============================================
import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1:8b",
        "prompt": "Explain quantum computing",
        "stream": False,
    },
)
print(response.json()["response"])
# =============================================
# OpenAI SDK Compatible:
# =============================================
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by the SDK but not actually validated
)
completion = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
)
print(completion.choices[0].message.content)
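When "stream" is left at its default (true), Ollama returns newline-delimited JSON chunks instead of a single response. A minimal sketch that prints tokens as they arrive:

import json
import requests

# Each streamed line is a JSON object with a "response" fragment;
# the final chunk has "done": true.
with requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.1:8b", "prompt": "Why is the sky blue?"},
    stream=True,
) as resp:
    for line in resp.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        print(chunk.get("response", ""), end="", flush=True)
        if chunk.get("done"):
            break
print()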
Quantization: Running Big Models on a Small GPU
# =============================================
# What Is Quantization?
# =============================================
# → Reduces the numeric precision of model weights
# → e.g. FP16 → INT4 (Q4)
# → Uses up to ~75% less RAM/VRAM
# → Quality drops roughly 5-10%
#
# =============================================
# Quantization Levels:
# =============================================
# FP16 (full): 100% quality, 100% size
# Q8_0: 99% quality, 50% size
# Q6_K: 98% quality, 37% size
# Q5_K_M: 96% quality, 32% size
# Q4_K_M: 92% quality, 28% size (recommended)
# Q4_0: 90% quality, 25% size
# Q3_K_M: 85% quality, 22% size
# Q2_K: 75% quality, 19% size (risky)
#
# =============================================
# Example: Llama 3.1 70B
# =============================================
# Original FP16: 140GB (far more VRAM than any single GPU)
# Q8_0: 70GB
# Q5_K_M: 50GB
# Q4_K_M: 42GB (dual RTX 4090)
# Q3_K_M: 32GB (single RTX 4090 with partial CPU offload)
# Q2_K: 26GB (RTX 3090, likewise with some CPU offload)
#
# =============================================
# Usage:
# =============================================
# ollama pull llama3.1:70b-q4_K_M
# ollama run llama3.1:70b-q4_K_M
#
# → A 70B model running on a single RTX 4090!
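The sizes above follow from simple arithmetic: parameter count × bits per weight, plus some overhead for the KV cache and runtime buffers. A rough back-of-envelope helper; the bits-per-weight values and the 1.2 overhead factor are ballpark assumptions, not exact figures.

# Rough VRAM estimate: params (billions) × bits-per-weight / 8 gives GB of
# weights; multiply by a fudge factor for KV cache and buffers.
def estimate_vram_gb(params_b: float, bits_per_weight: float, overhead: float = 1.2) -> float:
    weights_gb = params_b * bits_per_weight / 8  # 1B params at 8 bits ≈ 1GB
    return weights_gb * overhead

for name, bits in [("FP16", 16), ("Q8_0", 8.5), ("Q4_K_M", 4.8), ("Q2_K", 2.6)]:
    print(f"Llama 70B {name}: ~{estimate_vram_gb(70, bits):.0f}GB")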
GPU Optimization
# =============================================
# NVIDIA GPU Settings:
# =============================================
#
# Set environment variables:
# export CUDA_VISIBLE_DEVICES=0,1   # select which GPUs Ollama may use
# export OLLAMA_SCHED_SPREAD=1      # spread one model across all visible GPUs
# # Layer offload is a model option (num_gpu), set via Modelfile or API "options"
#
# =============================================
# Benchmark:
# =============================================
# ollama run llama3.1:8b --verbose
#
# > Response:
# > eval count: 100 tokens
# > eval duration: 2.5s
# > eval rate: 40 tokens/second
#
# =============================================
# Performance by GPU:
# =============================================
# RTX 3060 (12GB): Llama 8B @ 25 t/s
# RTX 3090 (24GB): Llama 8B @ 60 t/s
# RTX 4070 (12GB): Llama 8B @ 35 t/s
# RTX 4080 (16GB): Llama 8B @ 55 t/s
# RTX 4090 (24GB): Llama 8B @ 85 t/s
# RTX 4090 (24GB): Llama 70B Q4 @ 15 t/s
# RTX 4090 x2: Llama 70B Q4 @ 30 t/s
#
# =============================================
# AMD GPU (ROCm):
# =============================================
# # Install ROCm
# apt install rocm-dkms
#
# # Ollama with AMD GPU
# HSA_OVERRIDE_GFX_VERSION=10.3.0 ollama serve
#
# =============================================
# Apple Silicon (M1/M2/M3/M4):
# =============================================
# → Unified memory doubles as VRAM
# → M2 Max with 96GB can run 70B models!
# → Performance is better than you'd expect
# → A great match for Ollama
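To benchmark programmatically instead of eyeballing the --verbose output, the non-streaming /api/generate response includes timing fields (durations are in nanoseconds). A minimal sketch:

import requests

# eval_count / eval_duration describe the generation phase;
# prompt_eval_* cover prompt processing. Durations are nanoseconds.
r = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.1:8b", "prompt": "Write a haiku about GPUs.", "stream": False},
).json()

tok_s = r["eval_count"] / (r["eval_duration"] / 1e9)
print(f"generated {r['eval_count']} tokens at {tok_s:.1f} tok/s")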
Custom Model & Fine-tuning
# =============================================
# Modelfile (Custom Model):
# =============================================
#
# # File: Modelfile
# FROM llama3.1:8b
#
# # System Prompt
# SYSTEM """You are a Thai financial advisor.
# Answer in Thai language.
# Be concise and professional."""
#
# # Parameters
# PARAMETER temperature 0.7
# PARAMETER top_p 0.9
# PARAMETER top_k 40
# PARAMETER num_ctx 8192
# PARAMETER repeat_penalty 1.1
#
# # Template
# TEMPLATE """<|start_header_id|>user<|end_header_id|>
# {{ .Prompt }}<|eot_id|>
# <|start_header_id|>assistant<|end_header_id|>
# """
#
# =============================================
# Build Custom Model:
# =============================================
# ollama create thai-finance-advisor -f Modelfile
# ollama run thai-finance-advisor
#
# =============================================
# Fine-tune with LoRA:
# =============================================
# # Use Axolotl or Unsloth
# pip install unsloth
#
# from unsloth import FastLanguageModel
# model, tokenizer = FastLanguageModel.from_pretrained(
# "unsloth/llama-3-8b",
# max_seq_length=2048,
# load_in_4bit=True,
# )
# # Train on custom data
# # Export as GGUF for Ollama
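Once built, the custom model is addressable by name like any other. A sketch calling the `thai-finance-advisor` model from above via the /api/chat endpoint:

import requests

# /api/chat takes a messages array; the SYSTEM prompt baked into the
# Modelfile applies automatically, so only the user turn is needed.
r = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "thai-finance-advisor",
        "messages": [{"role": "user", "content": "Should I diversify into gold?"}],
        "stream": False,
    },
).json()
print(r["message"]["content"])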
Web UI: Open WebUI
# =============================================
# Open WebUI (ChatGPT-like UI):
# =============================================
# → Docker-based
# → Support Ollama natively
# → Multi-user, Authentication
# → Plugins, RAG, Tools
#
# =============================================
# Install (Docker):
# =============================================
# docker run -d \
# --network=host \
# -v open-webui:/app/backend/data \
# -e OLLAMA_BASE_URL=http://127.0.0.1:11434 \
# --name open-webui \
# --restart always \
# ghcr.io/open-webui/open-webui:main
#
# Open: http://localhost:8080
#
# =============================================
# Features:
# =============================================
# ✓ Chat Interface
# ✓ Multi-conversation
# ✓ Image Upload (with Vision Models)
# ✓ Document RAG (PDF, TXT)
# ✓ Web Search Integration
# ✓ Code Execution
# ✓ User Management
# ✓ API Access
#
# =============================================
# Alternative Web UIs:
# =============================================
# → LM Studio (Desktop app)
# → Jan.ai (Open Source)
# → Continue.dev (VSCode extension)
# → Ollama Web UI
# → Chatbox
Use Cases: What a Local LLM Can Do
# =============================================
# 1. Code Assistant:
# =============================================
# → Copilot-like (offline, free)
# → DeepSeek Coder 33B
# → Continue.dev + Ollama
# → Use it inside VSCode
#
# =============================================
# 2. Chat / Assistant:
# =============================================
# → ChatGPT replacement
# → Llama 3.1 70B
# → Gemma 2 27B
# → Private conversations
#
# =============================================
# 3. Document Analysis:
# =============================================
# → RAG (Retrieval Augmented Generation)
# → LangChain + Ollama
# → Upload PDF, ask questions
# → For corporate work where privacy matters
#
# =============================================
# 4. Translation:
# =============================================
# → Thai <-> English
# → Llama 3.1 8B
# → 85-90% accuracy
#
# =============================================
# 5. Content Generation:
# =============================================
# → Blog post, Article
# → Social Media content
# → Product descriptions
# → No dependence on OpenAI
#
# =============================================
# 6. Data Extraction:
# =============================================
# → Extract JSON from text (see the sketch after this list)
# → Parse resumes
# → Classify emails
#
# =============================================
# 7. Image Description (Vision Models):
# =============================================
# → LLaVA, BakLLaVA
# → Describe images
# → OCR + understanding
#
# ollama pull llava
# ollama run llava "Describe this image: ./photo.jpg"
#
# =============================================
# 8. Voice Assistant:
# =============================================
# → Whisper (STT) + Ollama + TTS
# → Build your own Jarvis
# → Everything 100% local
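For use case 6, Ollama can constrain output to valid JSON via the API's "format" field, which makes parsing reliable. A minimal extraction sketch (the resume text and field names are illustrative):

import json
import requests

# "format": "json" asks the server to produce syntactically valid JSON.
text = "John Smith, 5 years of Python, based in Bangkok, expects 90k THB."
r = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1:8b",
        "prompt": f"Extract name, skills, city and salary from: {text}\nRespond in JSON.",
        "format": "json",
        "stream": False,
    },
).json()
print(json.loads(r["response"]))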
Performance Tuning
# =============================================
# Speed Optimization:
# =============================================
#
# 1. Use a quantized model:
#    → Q4 instead of FP16
#    → 3-4x faster
#
# 2. Flash Attention:
#    → OLLAMA_FLASH_ATTENTION=1 ollama serve
#    → Saves ~30% VRAM
#
# 3. GPU layer offloading:
#    → Set the num_gpu model option (Modelfile or API "options")
#    → Balance the CPU/GPU split
#
# 4. Context length:
#    → PARAMETER num_ctx 4096 (default 2048)
#    → Bigger context = more RAM/VRAM
#
# 5. Batch processing:
#    → Handle several requests per call where possible
#    → Use the /api/embed endpoint for bulk embeddings
#
# =============================================
# Memory Management:
# =============================================
# export OLLAMA_KEEP_ALIVE=1h
# → Keeps the model in memory longer
# → No reload on every request
#
# export OLLAMA_MAX_LOADED_MODELS=2
# → Up to 2 models loaded at the same time
#
# =============================================
# Monitor:
# =============================================
# watch -n 1 nvidia-smi
# htop
# docker stats   # if running in Docker
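Several of the knobs above can also be set per request instead of server-wide. A sketch passing num_ctx and keep_alive through the API (the specific values here are illustrative):

import requests

# "options" overrides model parameters for this request only;
# "keep_alive" controls how long the model stays loaded afterwards.
r = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1:8b",
        "prompt": "Summarize the benefits of quantization in two sentences.",
        "options": {"num_ctx": 4096, "temperature": 0.3},
        "keep_alive": "1h",
        "stream": False,
    },
).json()
print(r["response"])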
Local LLM for Trading
# =============================================
# Trading Analysis with Local LLM:
# =============================================
#
# Use Case: Analyzing Forex news
#
# import requests
#
# def analyze_news(news_text):
#     # NOTE: literal braces in an f-string must be doubled ({{ }}),
#     # otherwise Python raises an error on this prompt.
#     response = requests.post(
#         "http://localhost:11434/api/generate",
#         json={
#             "model": "llama3.1:8b",
#             "prompt": f"""
# Analyze this Forex news and predict USD impact:
# {news_text}
#
# Output JSON:
# {{"impact": "bullish/bearish",
#   "confidence": 0-100,
#   "affected_pairs": [...],
#   "reasoning": "..."}}""",
#             "stream": False
#         }
#     )
#     return response.json()["response"]
#
# =============================================
# Combine with Trading Signals:
# =============================================
# 1. Receive a signal from iCafeFX
# 2. Have the LLM analyze the news
# 3. Confirm or reject the signal
# 4. Fewer false signals (see the sketch after this section)
#
# =============================================
# Privacy Benefits:
# =============================================
# ✓ Your strategy never leaks to OpenAI
# ✓ Portfolio data never leaves your network
# ✓ Trade history stays private
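A sketch of step 3, confirming a signal against the LLM's news read. The `signal` shape, the XAUUSD direction logic, and the confidence threshold of 70 are all illustrative assumptions, not iCafeFX's actual format:

import json
import requests

# Hypothetical signal structure -- adapt to the real iCafeFX payload.
signal = {"pair": "XAUUSD", "direction": "buy"}

def analyze_news_json(news_text: str) -> dict:
    # Same idea as the block above, but with "format": "json"
    # so the reply parses reliably.
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama3.1:8b",
            "prompt": f'Analyze this Forex news for USD impact: {news_text}\n'
                      'Reply as JSON: {"impact": "bullish|bearish", "confidence": 0-100}',
            "format": "json",
            "stream": False,
        },
    ).json()
    return json.loads(r["response"])

def confirm_signal(signal: dict, news_text: str) -> bool:
    a = analyze_news_json(news_text)
    # A bearish-USD read supports buying gold (XAUUSD), and vice versa;
    # require agreement plus a minimum confidence before acting.
    agrees = (a["impact"] == "bearish") == (signal["direction"] == "buy")
    return agrees and a["confidence"] >= 70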
Comparison: Local vs Cloud LLM
| Feature | Local (Ollama) | Cloud (GPT-5/Claude) |
|---|---|---|
| Privacy | 100% | Depends on policy |
| Cost | One-time GPU | Per-token pricing |
| Speed | 20-100 tok/s | 50-200 tok/s |
| Quality | 85-95% | 100% (best) |
| Offline | Yes | No |
| Customization | Full control | Limited |
| Rate Limit | None | Yes |
| Setup | 1-2 hours | 5 minutes |
Cost Analysis
# =============================================
# Cost Comparison (1 year usage):
# =============================================
#
# Heavy User (1M tokens/day):
#
# Cloud (GPT-5):
# → Input: $5/M tokens
# → Output: $15/M tokens
# → Avg: $10/M × 365M tokens = $3,650/year
# → ~130K baht/year
#
# Cloud (Claude Opus 4.7):
# → Input: $15/M
# → Output: $75/M
# → Avg: $45/M × 365M tokens = $16,425/year!
# → ~580K baht/year
#
# Local (Ollama):
# → RTX 4090 GPU: 70K baht (one-time)
# → Electricity: 400W × 24h × 365d × 4 baht/kWh
# →   ≈ 14,000 baht/year
# → Total year 1: ~84K baht
# → Year 2+: only ~14K baht/year
#
# =============================================
# Break-even:
# =============================================
# Heavy users: the GPU pays for itself in 6-8 months!
# Medium users: break even in 12-18 months
# Light users: cloud is the better deal
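The break-even arithmetic is easy to adapt to your own usage. A small calculator, with prices and rates copied from the figures above:

# Figures from the comparison above; all costs in THB.
GPU_COST = 70_000                 # RTX 4090, one-time
ELECTRICITY_PER_YEAR = 14_000     # 400W × 24h × 365d × 4 THB/kWh
CLOUD_PER_YEAR = 130_000          # GPT-5-class usage at 1M tokens/day

monthly_savings = (CLOUD_PER_YEAR - ELECTRICITY_PER_YEAR) / 12
print(f"break-even: {GPU_COST / monthly_savings:.1f} months")  # ~7.2 months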
For developers and tech enthusiasts interested in Forex/Gold trading, using a local LLM to analyze news alongside signals from iCafeFX is a cutting-edge approach that keeps your data 100% private. In 2026, Ollama makes running AI at home far easier, and with an RTX 4090 at roughly 70,000 baht, the break-even point arrives sooner than you might think.
Checklist Self-Hosted AI 2026
# =============================================
# LOCAL LLM CHECKLIST:
# =============================================
#
# Hardware:
# □ 1. GPU with >= 8GB VRAM (24GB recommended)
# □ 2. RAM >= 32GB
# □ 3. 100GB+ free SSD space
# □ 4. Ubuntu/Windows/macOS
#
# Software:
# □ 5. Install Ollama
# □ 6. Pull Model (Llama 3.1, Gemma 2)
# □ 7. Install Open WebUI
# □ 8. Install GPU drivers (NVIDIA/AMD)
#
# Configuration:
# □ 9. Adjust quantization
# □ 10. Set context length
# □ 11. Configure keep-alive
# □ 12. Enable flash attention
#
# Integration:
# □ 13. Python SDK
# □ 14. Open WebUI
# □ 15. Continue.dev (VSCode)
# □ 16. Custom Modelfile
#
# Performance:
# □ 17. Benchmark speed
# □ 18. Monitor GPU utilization
# □ 19. Memory management
# □ 20. Security (firewall)
Conclusion: Local LLM = the Future of AI
Local LLMs with Ollama are a compelling choice in 2026 for developers who want privacy, no rate limits, and no recurring costs. With Llama 3.1 70B, Gemma 2 27B, or DeepSeek Coder 33B, you can cover roughly 85-95% of what GPT-4 / Claude handles, using just a single GPU. It is a great fit for organizations serious about privacy, developers building AI features, and anyone who enjoys experimenting with new technology.
