LLM Quantization and GGUF for the Citizen Developer

Quantization	Bits (nominal)	RAM (7B model)	Quality (approx.)	Speed
FP32	32	28GB	100%	Slow
FP16	16	14GB	99.9%	Medium
Q8_0	8	7GB	99%	Fast
Q5_K_M	5	5GB	97%	Fast
Q4_K_M	4	4GB	95%	Very fast
Q3_K_M	3	3.5GB	90%	Very fast
Q2_K	2	2.5GB	80%	Fastest
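
The RAM column in the table above follows roughly from parameter count times bits per weight. Here is a minimal sketch of that arithmetic, assuming the weights dominate memory use and ignoring the KV cache and runtime overhead. The function name `estimate_weight_gb` is illustrative and does not come from any library.

```python
def estimate_weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


# Nominal sizes for a 7B-parameter model at different precisions
for name, bits in [("FP32", 32), ("FP16", 16), ("Q8_0", 8), ("Q4 (nominal)", 4)]:
    print(f"{name:>13}: {estimate_weight_gb(7, bits):.1f} GB")
```

Real GGUF files for low-bit formats come out somewhat larger than the nominal figure, because K-quants store per-block scales and keep some tensors at higher precision. That is why the table lists about 4GB for Q4_K_M rather than the nominal 3.5GB.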

Local LLM Setup

# === Local LLM Setup Guide ===

# Method 1: Ollama (Easiest)
# Install: https://ollama.ai
# curl -fsSL https://ollama.ai/install.sh | sh
#
# Run a model:
# ollama run llama3          # Llama 3 8B (4.7GB)
# ollama run mistral         # Mistral 7B (4.1GB)
# ollama run codellama       # Code Llama 7B (3.8GB)
# ollama run phi3            # Phi-3 Mini (2.3GB)
# ollama run gemma2          # Gemma 2 9B (5.4GB)
#
# API Server (for integration):
# curl http://localhost:11434/api/generate -d '{
#   "model": "llama3",
#   "prompt": "Explain Docker in simple terms"
# }'

# Method 2: llama.cpp (Advanced, GPU acceleration)
# git clone https://github.com/ggerganov/llama.cpp
# cd llama.cpp
# # Current llama.cpp builds with CMake (the Makefile build was retired)
# cmake -B build -DGGML_CUDA=ON      # With CUDA GPU
# # or: cmake -B build               # macOS (Metal is enabled by default)
# cmake --build build --config Release -j
#
# # Download GGUF model from Hugging Face
# # https://huggingface.co/TheBloke
# ./build/bin/llama-cli -m models/llama-3-8b-Q4_K_M.gguf \
#   -p "Explain Kubernetes:" \
#   -n 512 -ngl 35  # 35 layers on GPU

# Method 3: Python with llama-cpp-python
# pip install llama-cpp-python
# from llama_cpp import Llama
# llm = Llama(model_path="models/llama-3-8b-Q4_K_M.gguf",
#             n_gpu_layers=35, n_ctx=4096)
# output = llm("Explain Docker:", max_tokens=512)
# print(output['choices'][0]['text'])

from dataclasses import dataclass

@dataclass
class SetupMethod:
    method: str
    difficulty: str
    gpu_support: str
    features: str
    best_for: str

methods = [
    SetupMethod("Ollama",
        "Very easy (one command)",
        "Auto-detects CUDA, Metal",
        "CLI + API Server + Model Library + automatic pull",
        "Citizen developers starting out who want quick results"),
    SetupMethod("LM Studio",
        "Easy (GUI)",
        "CUDA, Metal, Vulkan",
        "GUI Chat + Model Browser + API Server + Parameter Tuning",
        "Users who avoid the command line and want a GUI"),
    SetupMethod("llama.cpp",
        "Moderate (compile from source)",
        "CUDA, Metal, Vulkan, OpenCL",
        "Fastest + quantize models yourself + full control",
        "Advanced users who need maximum performance"),
    SetupMethod("GPT4All",
        "Easy (GUI)",
        "CUDA, Metal",
        "GUI + many models + local document chat",
        "Chatting with documents (RAG)"),
    SetupMethod("Python (llama-cpp-python)",
        "Moderate (code)",
        "CUDA, Metal",
        "Python API + app integration + LangChain",
        "Developers building LLM-powered apps"),
]

print("=== Setup Methods ===")
for m in methods:
    print(f"\n  [{m.method}] Difficulty: {m.difficulty}")
    print(f"    GPU: {m.gpu_support}")
    print(f"    Features: {m.features}")
    print(f"    Best for: {m.best_for}")
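
The curl call to Ollama's API shown earlier can also be made from Python. The sketch below uses only the standard library and assumes an Ollama server is already running on the default port with the model pulled. The helper names `build_request` and `generate` are illustrative. Setting `"stream": False` asks the server for one JSON object instead of a stream of chunks.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama port


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt to a running Ollama server and return the reply text."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return body["response"]
```

With the server running, `print(generate("llama3", "Explain Docker in simple terms"))` should print the model's answer. Nothing in the block touches the network until `generate` is called.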

Model Selection

# === Choosing the Right Model ===

from dataclasses import dataclass

@dataclass
class ModelRecommendation:
    model: str
    size: str
    ram_needed: str
    strength: str
    use_case: str
    ollama_cmd: str

models = [
    ModelRecommendation("Llama 3 8B",
        "4.7GB (Q4_K_M)",
        "8GB RAM",
        "Among the best general-purpose models at the 8B size",
        "Chat, Q&A, Writing, Analysis",
        "ollama run llama3"),
    ModelRecommendation("Mistral 7B",
        "4.1GB (Q4_K_M)",
        "8GB RAM",
        "Fast, good instruction following",
        "Chat, Code, Translation",
        "ollama run mistral"),
    ModelRecommendation("Code Llama 7B",
        "3.8GB (Q4_K_M)",
        "8GB RAM",
        "Very good at writing code in many languages",
        "Code Generation, Debug, Explain Code",
        "ollama run codellama"),
    ModelRecommendation("Phi-3 Mini 3.8B",
        "2.3GB (Q4_K_M)",
        "4GB RAM",
        "Very small; runs on low-RAM machines",
        "Simple Chat, Q&A (limited RAM)",
        "ollama run phi3"),
    ModelRecommendation("Gemma 2 9B",
        "5.4GB (Q4_K_M)",
        "8GB RAM",
        "From Google; very good quality",
        "Chat, Writing, Analysis",
        "ollama run gemma2"),
    ModelRecommendation("Llama 3 70B",
        "40GB (Q4_K_M)",
        "48GB RAM",
        "Quality approaching GPT-4 on many tasks",
        "Complex Reasoning, Analysis (needs a lot of RAM)",
        "ollama run llama3:70b"),
]

print("=== Model Recommendations ===")
for m in models:
    print(f"\n  [{m.model}] Size: {m.size}")
    print(f"    RAM: {m.ram_needed}")
    print(f"    Strength: {m.strength}")
    print(f"    Use: {m.use_case}")
    print(f"    Command: {m.ollama_cmd}")
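
Choosing a model often comes down to how much RAM the machine has. This small sketch filters the recommendations by available memory. The RAM figures are copied from the list above, and the `models_that_fit` helper is hypothetical, written only for illustration.

```python
# (model name, minimum RAM in GB) taken from the recommendations above
RAM_REQUIREMENTS = [
    ("Phi-3 Mini 3.8B", 4),
    ("Mistral 7B", 8),
    ("Code Llama 7B", 8),
    ("Llama 3 8B", 8),
    ("Gemma 2 9B", 8),
    ("Llama 3 70B", 48),
]


def models_that_fit(available_ram_gb: float) -> list[str]:
    """Return the models whose stated RAM requirement fits the machine."""
    return [name for name, need in RAM_REQUIREMENTS if need <= available_ram_gb]


for ram in (4, 8, 16, 64):
    print(f"{ram:>2} GB RAM -> {', '.join(models_that_fit(ram))}")
```

The stated requirement is a floor for the whole machine, so a laptop running other applications may need more headroom than these figures suggest.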

Use Cases

# === Citizen Developer Use Cases ===

from dataclasses import dataclass

@dataclass
class UseCase:
    use_case: str
    description: str
    model: str
    tools: str
    privacy: str

cases = [
    UseCase("RAG Document Chat",
        "Chat with internal documents (PDF, Word) and answer questions from company data",
        "Llama 3 8B + Embedding Model",
        "Ollama + LangChain + ChromaDB",
        "Data stays 100% local; nothing is sent to the cloud"),
    UseCase("Code Assistant",
        "Help write, debug, explain, and review code",
        "Code Llama 7B or Llama 3 8B",
        "Ollama + Continue.dev (VS Code Extension)",
        "Source code never leaves the machine"),
    UseCase("Translation",
        "Translate documents, emails, and articles across languages",
        "Llama 3 8B or Mistral 7B",
        "Ollama API + Python Script",
        "Confidential documents can be translated safely"),
    UseCase("Content Creation",
        "Draft blog posts, social media posts, and marketing copy",
        "Llama 3 8B or Gemma 2 9B",
        "LM Studio Chat / Ollama",
        "Content strategy does not leak"),
    UseCase("Data Analysis",
        "Ask the LLM to analyze CSV/Excel data and generate SQL queries",
        "Llama 3 8B + Code Llama",
        "Ollama + Open Interpreter / LangChain",
        "Business data stays local and secure"),
]

print("=== Use Cases ===")
for c in cases:
    print(f"\n  [{c.use_case}]")
    print(f"    Desc: {c.description}")
    print(f"    Model: {c.model}")
    print(f"    Tools: {c.tools}")
    print(f"    Privacy: {c.privacy}")
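
To make the RAG Document Chat use case concrete, here is a toy sketch of the retrieval step: find the document chunk most relevant to a question, then place it in the prompt. It assumes a bag-of-words cosine similarity as a crude stand-in for the embedding model and ChromaDB that a real pipeline would use. The sample documents and all function names are invented for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercase bag-of-words; a crude stand-in for a real embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: list[str], k: int = 1) -> list[str]:
    """Return the k chunks most similar to the question."""
    q = tokenize(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, tokenize(c)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, chunks: list[str]) -> str:
    """Put the retrieved context in front of the question for the LLM."""
    context = "\n".join(retrieve(question, chunks))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Made-up company snippets standing in for chunks of real PDF/Word files
docs = [
    "Annual leave is 15 days per year for full-time staff.",
    "Expense reports must be submitted within 30 days.",
    "The office VPN requires two-factor authentication.",
]
print(build_prompt("How many days of annual leave do staff get?", docs))
```

The resulting prompt would then be sent to the local model, so neither the documents nor the question leave the machine.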

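Tips

Q4_K_M: Use Q4_K_M; it offers the best balance between size and quality.
Ollama: Start with Ollama; it is the easiest option and runs a model with one command.
RAM: 8GB of RAM can run a 7B model; 16GB can run a 13B model (at 4-bit quantization).
GPU: A GPU speeds up generation by roughly 5-10x, but models still run on CPU without one.
Privacy: For sensitive data, a local LLM is safer than a cloud service.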

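What Is Quantization?

Quantization shrinks a model by storing its weights at lower precision, going from FP32 or FP16 down to INT8 or INT4. The model uses less RAM, runs faster, and fits on a laptop, at the cost of a small drop in quality. Q4_K_M offers the best balance.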
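
The basic idea can be shown with a few numbers. The sketch below applies simple symmetric (absmax) INT8 quantization to a handful of made-up weights and then restores them, which shows where the small quality loss comes from. This is an illustration only: GGUF formats such as Q4_K_M use a more elaborate block-wise scheme, and the function names here are hypothetical.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric absmax quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(q: list[int], scale: float) -> list[float]:
    """Recover approximate float weights from the stored integers."""
    return [v * scale for v in q]


weights = [0.12, -0.50, 0.33, 0.02, -0.27]  # illustrative values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, f"scale={scale:.5f}", f"max error={max_err:.5f}")
```

Each weight now needs 8 bits instead of 32, plus one shared scale, and the rounding error per weight is at most half the scale. Using fewer bits makes the rounding steps coarser, which is why quality drops faster at Q3 and Q2.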
