LLM Fine-tuning with LoRA and Clean Architecture


What Is LoRA and Why It Revolutionized LLM Fine-tuning


LoRA (Low-Rank Adaptation) is a Parameter-Efficient Fine-Tuning (PEFT) technique that makes it possible to fine-tune Large Language Models with far fewer resources. Instead of updating every weight in the model, LoRA freezes the original weights and adds trainable low-rank matrices to the targeted layers.

LoRA works as follows: for a weight matrix W of size d x k, rather than updating W directly, LoRA learns an update delta W = B x A, where B has size d x r and A has size r x k. When the rank r is small, such as 8 or 16, the number of trainable parameters drops dramatically, from d * k to r * (d + k).
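The decomposition above can be sketched in plain Python. This is a minimal illustration with tiny, made-up dimensions, not the PEFT implementation; the zero initialization of B and the small random initialization of A follow the original LoRA paper, and the helper names `matvec` and `lora_forward` are hypothetical.

```python
import random

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_forward(W, A, B, x):
    """Return W x + B (A x) without materializing the d x k product B A."""
    base = matvec(W, x)                 # frozen path: k inputs -> d outputs
    update = matvec(B, matvec(A, x))    # trainable path: k -> r -> d
    return [b + u for b, u in zip(base, update)]

random.seed(0)
d, k, r = 6, 4, 2                       # toy sizes chosen for illustration
W = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]      # frozen, d x k
A = [[random.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]   # trainable, r x k
B = [[0.0] * r for _ in range(d)]                                   # trainable, d x r, zero init
x = [1.0, -2.0, 0.5, 3.0]

y_base = matvec(W, x)
y_lora = lora_forward(W, A, B, x)       # matches y_base while B is all zeros
trainable = r * (d + k)                 # 2 * (6 + 4) = 20, versus d * k = 24 for a full update
```

Because B starts at zero, the adapted model reproduces the base model exactly at step 0, and training only moves it away from that point through A and B.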

For example, Llama 2 7B has 7 billion parameters. Full fine-tuning needs more than 60 GB of GPU RAM, whereas LoRA with rank 16 on a subset of the attention projections trains only about 4-8 million parameters (roughly 0.1% of the model; the count grows as more modules are targeted) and needs about 16-24 GB. That makes fine-tuning feasible on a consumer GPU such as the RTX 4090.

QLoRA (Quantized LoRA) combines LoRA with 4-bit quantization of the base model to cut memory use further. A 7B model can be fine-tuned with only 6-8 GB of GPU RAM, and a 70B model with 48 GB, which is out of reach for full fine-tuning.

LoRA and QLoRA Architecture in Depth

Technical details of the LoRA architecture:

LoRA Architecture Deep Dive

=== Original Weight Update ===
Full fine-tuning: W_new = W + delta_W
  delta_W has size d x k (same as W)
  Parameters: d * k (millions of parameters per matrix)

=== LoRA Decomposition ===
LoRA: W_new = W + B * A
  W: d x k (frozen, not updated)
  B: d x r (trainable)
  A: r x k (trainable)
  r << min(d, k), e.g. r = 8, 16, 32, 64
  Parameters: r * (d + k) (very few)

Example: d=4096, k=4096, r=16
  Full: 4096 * 4096 = 16,777,216
  LoRA: 16 * (4096 + 4096) = 131,072 (0.78%)
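The same arithmetic can be extended from one matrix to a whole model. The sketch below assumes the published Llama 2 7B dimensions (hidden size 4096, MLP intermediate size 11008, 32 decoder layers); the function and constant names are hypothetical.

```python
def lora_params(d_in, d_out, r):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    return r * (d_in + d_out)

HIDDEN = 4096          # Llama 2 7B hidden size
INTERMEDIATE = 11008   # Llama 2 7B MLP intermediate size
NUM_LAYERS = 32        # Llama 2 7B decoder layers

# (input size, output size) of each projection inside one decoder layer
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def total_lora_params(target_modules, r):
    """Sum LoRA parameters over the chosen modules in every layer."""
    per_layer = sum(lora_params(*MODULE_SHAPES[m], r) for m in target_modules)
    return per_layer * NUM_LAYERS

single_matrix = lora_params(4096, 4096, 16)                  # 131,072
qv_only = total_lora_params(["q_proj", "v_proj"], r=16)      # 8,388,608
all_linear = total_lora_params(list(MODULE_SHAPES), r=16)    # 39,976,960
```

Targeting only two attention projections stays in the single-digit millions, while targeting all seven projections at the same rank lands near 40 million, still well under 1% of the base model.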

=== LoRA Hyperparameters ===
r (rank): size of the low-rank matrices
  r=8: smallest and fastest; may not be enough for complex tasks
  r=16: a good balance; suits most tasks
  r=32: higher quality; uses more memory
  r=64: can approach full fine-tuning quality
alpha: scaling factor
  alpha = r: scaling = 1 (default)
  alpha = 2*r: scaling = 2 (the LoRA update has more influence)
  Formula: scaling = alpha / r
target_modules: layers that receive LoRA
  Attention layers: q_proj, k_proj, v_proj, o_proj
  MLP layers: gate_proj, up_proj, down_proj
  Recommended: start with the attention layers
dropout: LoRA dropout
  0.05-0.1: helps prevent overfitting

=== QLoRA Specifics ===
4-bit NormalFloat (NF4) quantization
Double quantization of the quantization constants
Paged optimizers for memory management

Memory comparison (Llama 2 7B, approximate):
  Full FP16: ~28 GB (weights plus gradients only; optimizer states push full fine-tuning past 60 GB)
  LoRA FP16: ~16 GB
  QLoRA 4-bit: ~6 GB
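A back-of-the-envelope calculation shows where the weight-related part of these figures comes from. This is a rough sketch that assumes exactly 7 billion parameters and ignores activations, optimizer states, adapter weights, and quantization constants, so real usage is higher; `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory needed just to store the weights, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 7e9  # assumption: a round 7 billion parameters

fp16_weights = weight_memory_gb(NUM_PARAMS, 16)   # 14.0 GB
nf4_weights = weight_memory_gb(NUM_PARAMS, 4)     # 3.5 GB
fp16_weights_and_grads = 2 * fp16_weights         # 28.0 GB, one gradient per weight
```

The 4x gap between 16-bit and 4-bit weights is the main saving QLoRA provides; the remaining memory in the table goes to activations, the small adapter, and its optimizer state.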

=== LoRA Variants ===
DoRA: Weight-Decomposed Low-Rank Adaptation
  Separates each weight into magnitude and direction
  Reported results are slightly better than LoRA
AdaLoRA: Adaptive LoRA
  Adjusts the rank of each layer automatically
  More important layers get a higher rank
LoRA+: uses different learning rates for A and B
  B uses a higher learning rate than A (about 2-4x here; the LoRA+ paper suggests a larger ratio, around 16, as a starting point)
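The LoRA+ idea of separate learning rates can be sketched as two optimizer parameter groups. The example relies on PEFT's convention of putting "lora_A" and "lora_B" in parameter names; the function name, the placeholder parameters, and the ratio of 4 are assumptions for illustration.

```python
def loraplus_param_groups(named_params, base_lr, lr_ratio):
    """Split (name, param) pairs so that lora_B weights train with a higher learning rate."""
    group_a, group_b = [], []
    for name, param in named_params:
        if "lora_B" in name:
            group_b.append(param)
        else:
            group_a.append(param)
    return [
        {"params": group_a, "lr": base_lr},
        {"params": group_b, "lr": base_lr * lr_ratio},
    ]

# Placeholder strings stand in for real tensors so the sketch runs without torch.
fake_params = [
    ("layers.0.q_proj.lora_A.default.weight", "A0"),
    ("layers.0.q_proj.lora_B.default.weight", "B0"),
    ("layers.1.q_proj.lora_A.default.weight", "A1"),
    ("layers.1.q_proj.lora_B.default.weight", "B1"),
]
groups = loraplus_param_groups(fake_params, base_lr=2e-4, lr_ratio=4)
```

With a real model, the pairs would come from iterating over the trainable entries of `model.named_parameters()`, and the resulting list can be passed to a PyTorch optimizer such as `torch.optim.AdamW`, which accepts per-group learning rates.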

Fine-tuning an LLM with LoRA and Hugging Face PEFT

The following code fine-tunes a Llama model with QLoRA.

#!/usr/bin/env python3
# finetune_lora.py — Fine-tune LLM with QLoRA
import torch
from transformers import (
    AutoModelForCausalLM, AutoTokenizer,
    TrainingArguments, BitsAndBytesConfig
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer
from datasets import load_dataset

# === Configuration ===
MODEL_NAME = "meta-llama/Llama-2-7b-hf"
DATASET_NAME = "timdettmers/openassistant-guanaco"
OUTPUT_DIR = "./lora-llama2-7b"

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Load model with quantization
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
model = prepare_model_for_kbit_training(model)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# LoRA configuration
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
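# With r=16 on all seven projections of Llama-2-7B this reports 39,976,960 trainable
# parameters (about 0.6% of the ~6.7B base parameters; the printed total varies with 4-bit packing)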

# Load dataset
dataset = load_dataset(DATASET_NAME, split="train")

# Training arguments
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    weight_decay=0.001,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    logging_steps=10,
    save_strategy="epoch",
    fp16=False,
    bf16=True,
    max_grad_norm=0.3,
    optim="paged_adamw_32bit",
    group_by_length=True,
    report_to="tensorboard",
)

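# Initialize trainer. These keyword arguments follow older TRL releases; newer releases
# move dataset_text_field, max_seq_length and packing into SFTConfig.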
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    tokenizer=tokenizer,
    args=training_args,
    dataset_text_field="text",
    max_seq_length=1024,
    packing=True,
)

# Train
trainer.train()

# Save LoRA adapter (adapter weights only; tens to hundreds of MB depending on rank and target modules)
trainer.model.save_pretrained(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)
print(f"LoRA adapter saved to {OUTPUT_DIR}")
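The batch-related settings in the script interact: the per-device batch size and gradient accumulation multiply into an effective batch size, which in turn determines how many optimizer steps and warmup steps a run has. The sketch below is an approximation under stated assumptions (a hypothetical dataset of 10,000 examples, one GPU, no packing); the Trainer's own step count can differ slightly.

```python
import math

def training_schedule(num_examples, per_device_batch, grad_accum, epochs,
                      num_devices=1, warmup_ratio=0.03):
    """Estimate effective batch size, optimizer steps, and warmup steps."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }

# Same values as the TrainingArguments above; the dataset size is made up.
plan = training_schedule(num_examples=10_000, per_device_batch=4,
                         grad_accum=4, epochs=3)
```

Knowing the total step count up front helps when choosing `logging_steps` and checkpoint intervals, and when comparing runs that trade batch size against accumulation.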

Organizing the Project with Clean Architecture

A sound project structure for LLM fine-tuning:

Clean Architecture for an LLM Fine-tuning Project

llm-finetune/
├── configs/
│   ├── base.yaml               # base configuration
│   ├── lora_llama7b.yaml       # model-specific config
│   └── lora_mistral7b.yaml
├── src/
│   ├── __init__.py
│   ├── data/
│   │   ├── __init__.py
│   │   ├── dataset.py          # dataset loading and processing
│   │   ├── formatter.py        # prompt formatting
│   │   └── collator.py         # data collation
│   ├── model/
│   │   ├── __init__.py
│   │   ├── loader.py           # model loading with quantization
│   │   ├── lora_setup.py       # LoRA configuration
│   │   └── merging.py          # merge LoRA weights
│   ├── training/
│   │   ├── __init__.py
│   │   ├── trainer.py          # custom trainer
│   │   ├── callbacks.py        # training callbacks
│   │   └── metrics.py          # evaluation metrics
│   ├── inference/
│   │   ├── __init__.py
│   │   ├── generate.py         # text generation
│   │   └── server.py           # API server
│   └── utils/
│       ├── __init__.py
│       ├── config.py           # configuration management
│       └── logging.py          # logging setup
├── scripts/
│   ├── train.py                # training entrypoint
│   ├── evaluate.py             # evaluation script
│   ├── merge_lora.py           # merge LoRA to base model
│   └── serve.py                # start inference server
├── tests/
│   ├── test_data.py
│   ├── test_model.py
│   └── test_inference.py
├── Dockerfile
├── requirements.txt
└── README.md
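The tree lists a base.yaml next to model-specific configs, which suggests layering one over the other. The article does not show how src/utils/config.py does this, so the following is only one possible sketch: a recursive merge over already-loaded dictionaries, with `deep_merge` and the sample values being hypothetical.

```python
import copy

def deep_merge(base, override):
    """Return a new dict where values in override win, merging nested dicts key by key."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged

# Stand-ins for the parsed contents of base.yaml and a model-specific file.
base_cfg = {
    "training": {"epochs": 3, "learning_rate": 2e-4, "bf16": True},
    "lora": {"r": 16, "alpha": 32},
}
llama_cfg = {
    "model": {"name": "meta-llama/Llama-2-7b-hf"},
    "lora": {"r": 32},
}
config = deep_merge(base_cfg, llama_cfg)
```

Merging key by key lets a model-specific file override a single value, such as the rank, without repeating the rest of the section, and the deep copies keep the base configuration untouched for the next experiment.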

=== configs/lora_llama7b.yaml ===

model:
  name: meta-llama/Llama-2-7b-hf
  quantization:
    enabled: true
    bits: 4
    quant_type: nf4
    double_quant: true
    compute_dtype: bfloat16

lora:
  r: 16
  alpha: 32
  dropout: 0.05
  target_modules:
    - q_proj
    - k_proj
    - v_proj
    - o_proj
    - gate_proj
    - up_proj
    - down_proj
  bias: none

training:
  epochs: 3
  batch_size: 4
  gradient_accumulation: 4
  learning_rate: 2.0e-4   # written with a decimal point: PyYAML loads "2e-4" as a string
  weight_decay: 0.001
  warmup_ratio: 0.03
  scheduler: cosine
  max_seq_length: 2048
  optimizer: paged_adamw_32bit
  bf16: true
  max_grad_norm: 0.3

data:
  dataset: timdettmers/openassistant-guanaco
  prompt_template: alpaca
  packing: true
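Values loaded from YAML arrive as loosely typed dictionaries, so it can help to validate them before they reach the training code. The sketch below is one possible approach and not part of the original project; `LoraSettings`, `parse_lora`, and `parse_learning_rate` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LoraSettings:
    r: int
    alpha: int
    dropout: float
    target_modules: tuple
    bias: str = "none"

    def __post_init__(self):
        # Fail early on values that would only surface as errors deep inside training.
        if self.r <= 0:
            raise ValueError("r must be a positive integer")
        if not 0.0 <= self.dropout < 1.0:
            raise ValueError("dropout must be in [0, 1)")
        if not self.target_modules:
            raise ValueError("target_modules must not be empty")

    @property
    def scaling(self):
        """The scaling factor alpha / r applied to the LoRA update."""
        return self.alpha / self.r

def parse_lora(section):
    """Build validated settings from the 'lora' section of the config dict."""
    return LoraSettings(
        r=int(section["r"]),
        alpha=int(section["alpha"]),
        dropout=float(section["dropout"]),
        target_modules=tuple(section["target_modules"]),
        bias=str(section.get("bias", "none")),
    )

def parse_learning_rate(value):
    """Accept a float or a string such as '2e-4' and return a positive float."""
    lr = float(value)
    if lr <= 0:
        raise ValueError("learning_rate must be positive")
    return lr

settings = parse_lora({"r": 16, "alpha": 32, "dropout": 0.05,
                       "target_modules": ["q_proj", "v_proj"], "bias": "none"})
```

Coercing the learning rate explicitly also guards against a scientific-notation value being read as a string by the YAML parser.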

=== src/model/loader.py ===

import yaml
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig


class ModelLoader:
    def __init__(self, config_path):
        with open(config_path) as f:
            self.config = yaml.safe_load(f)

    def load_model(self):
        model_cfg = self.config["model"]

        # Build a bitsandbytes config only when quantization is enabled
        bnb_config = None
        if model_cfg["quantization"]["enabled"]:
            q = model_cfg["quantization"]
            bnb_config = BitsAndBytesConfig(
                load_in_4bit=(q["bits"] == 4),
                bnb_4bit_quant_type=q["quant_type"],
                bnb_4bit_compute_dtype=getattr(torch, q["compute_dtype"]),
                bnb_4bit_use_double_quant=q["double_quant"],
            )

        model = AutoModelForCausalLM.from_pretrained(
            model_cfg["name"],
            quantization_config=bnb_config,
            device_map="auto",
            trust_remote_code=True,
        )

        # Llama tokenizers ship without a pad token, so reuse EOS for padding
        tokenizer = AutoTokenizer.from_pretrained(model_cfg["name"])
        tokenizer.pad_token = tokenizer.eos_token
        return model, tokenizer

Training Pipeline and Hyperparameter Tuning

Build a training pipeline that supports experiment tracking.

#!/usr/bin/env python3
# src/training/trainer.py — Custom Training Pipeline
import os
import yaml
import torch
from transformers import TrainingArguments, EarlyStoppingCallback
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer
from datasets import load_dataset
import wandb

class LLMTrainer:
    def __init__(self, config_path):
        with open(config_path) as f:
            self.config = yaml.safe_load(f)
        
        self.model = None
        self.tokenizer = None
        self.trainer = None
    
    def setup_model(self, model, tokenizer):
        model = prepare_model_for_kbit_training(model)
        
        lora_cfg = self.config["lora"]
        peft_config = LoraConfig(
            r=lora_cfg["r"],
            lora_alpha=lora_cfg["alpha"],
            target_modules=lora_cfg["target_modules"],
            lora_dropout=lora_cfg["dropout"],
            bias=lora_cfg["bias"],
            task_type="CAUSAL_LM",
        )
        
        self.model = get_peft_model(model, peft_config)
        self.tokenizer = tokenizer
        
        trainable = sum(p.numel() for p in self.model.parameters() if p.requires_grad)
        total = sum(p.numel() for p in self.model.parameters())
        print(f"Trainable: {trainable:,} / {total:,} ({100*trainable/total:.4f}%)")
    
    def load_data(self):
        data_cfg = self.config["data"]
        dataset = load_dataset(data_cfg["dataset"])
        
        if "train" in dataset:
            train_data = dataset["train"]
        else:
            train_data = dataset
        
        if "validation" in dataset:
            eval_data = dataset["validation"]
        else:
            split = train_data.train_test_split(test_size=0.05, seed=42)
            train_data = split["train"]
            eval_data = split["test"]
        
        return train_data, eval_data
    
    def train(self):
        train_data, eval_data = self.load_data()
        t_cfg = self.config["training"]
        
        output_dir = f"./outputs/{self.config['model']['name'].split('/')[-1]}-lora-r{self.config['lora']['r']}"
        
        training_args = TrainingArguments(
            output_dir=output_dir,
            num_train_epochs=t_cfg["epochs"],
            per_device_train_batch_size=t_cfg["batch_size"],
            gradient_accumulation_steps=t_cfg["gradient_accumulation"],
            learning_rate=t_cfg["learning_rate"],
            weight_decay=t_cfg["weight_decay"],
            warmup_ratio=t_cfg["warmup_ratio"],
            lr_scheduler_type=t_cfg["scheduler"],
            logging_steps=10,
            save_strategy="steps",
            save_steps=200,
            eval_strategy="steps",  # named evaluation_strategy in older transformers releases
            eval_steps=200,
            load_best_model_at_end=True,
            metric_for_best_model="eval_loss",
            bf16=t_cfg.get("bf16", True),
            max_grad_norm=t_cfg["max_grad_norm"],
            optim=t_cfg["optimizer"],
            group_by_length=True,
            report_to="wandb",
        )
        
        self.trainer = SFTTrainer(
            model=self.model,
            train_dataset=train_data,
            eval_dataset=eval_data,
            tokenizer=self.tokenizer,
            args=training_args,
            dataset_text_field="text",
            max_seq_length=t_cfg["max_seq_length"],
            packing=self.config["data"].get("packing", True),
            callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
        )
        
        self.trainer.train()
        self.trainer.model.save_pretrained(output_dir)
        self.tokenizer.save_pretrained(output_dir)
        print(f"Training complete. Model saved to {output_dir}")
        return output_dir

# === scripts/train.py ===
# if __name__ == "__main__":
#     import sys
#     config = sys.argv[1] if len(sys.argv) > 1 else "configs/lora_llama7b.yaml"
#     
#     from src.model.loader import ModelLoader
#     from src.training.trainer import LLMTrainer
#     loader = ModelLoader(config)
#     model, tokenizer = loader.load_model()
#     
#     trainer = LLMTrainer(config)
#     trainer.setup_model(model, tokenizer)
#     output = trainer.train()

# === Merge LoRA to base model ===
# scripts/merge_lora.py
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

def merge_lora(base_model_name, lora_path, output_path):
    model = AutoModelForCausalLM.from_pretrained(
        base_model_name, torch_dtype=torch.float16, device_map="auto"
    )
    model = PeftModel.from_pretrained(model, lora_path)
    model = model.merge_and_unload()
    model.save_pretrained(output_path)
    print(f"Merged model saved to {output_path}")
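What merging does mathematically can be checked on a toy example: folding the scaled update into the base weight gives the same output as running the base layer and the adapter side by side. This is a plain-Python sketch of the arithmetic with made-up 2 x 3 matrices, not what `merge_and_unload` executes internally; all helper names are hypothetical.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def matvec(M, v):
    """Multiply a matrix by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def merge(W, A, B, alpha, r):
    """Return W + (alpha / r) * B A as a single dense matrix."""
    scaling = alpha / r
    BA = matmul(B, A)
    return [[w + scaling * u for w, u in zip(w_row, u_row)]
            for w_row, u_row in zip(W, BA)]

W = [[1.0, 2.0, 0.0], [0.0, 1.0, -1.0]]   # frozen weight, d=2, k=3
A = [[0.5, -1.0, 2.0]]                     # r=1, k=3
B = [[2.0], [-1.0]]                        # d=2, r=1
alpha, r = 2, 1
x = [1.0, 1.0, 1.0]

# Adapter kept separate: base output plus the scaled low-rank path
adapter_out = [b + (alpha / r) * u
               for b, u in zip(matvec(W, x), matvec(B, matvec(A, x)))]
# Adapter folded into the weight: a single matrix-vector product
merged_out = matvec(merge(W, A, B, alpha, r), x)
```

Both paths produce the same vector, which is why a merged model needs no PEFT code at inference time and adds no extra latency per layer.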

Deploying and Serving the LoRA Model

Deploy the LoRA model for inference:

#!/usr/bin/env python3
# src/inference/server.py — FastAPI Inference Server
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import uvicorn

app = FastAPI(title="LLM LoRA Inference API")

class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 256
    temperature: float = 0.7
    top_p: float = 0.9
    top_k: int = 50
    repetition_penalty: float = 1.1

class GenerateResponse(BaseModel):
    generated_text: str
    tokens_generated: int

# Load model at startup
MODEL = None
TOKENIZER = None

@app.on_event("startup")
async def load_model():
    global MODEL, TOKENIZER
    
    base_model = "meta-llama/Llama-2-7b-hf"
    lora_path = "./outputs/Llama-2-7b-hf-lora-r16"
    
    TOKENIZER = AutoTokenizer.from_pretrained(base_model)
    
    MODEL = AutoModelForCausalLM.from_pretrained(
        base_model, torch_dtype=torch.float16, device_map="auto"
    )
    MODEL = PeftModel.from_pretrained(MODEL, lora_path)
    MODEL.eval()
    print("Model loaded successfully")

@app.post("/generate", response_model=GenerateResponse)
async def generate(req: GenerateRequest):
    if MODEL is None:
        raise HTTPException(503, "Model not loaded")
    
    inputs = TOKENIZER(req.prompt, return_tensors="pt").to(MODEL.device)
    
    with torch.no_grad():
        outputs = MODEL.generate(
            **inputs,
            max_new_tokens=req.max_new_tokens,
            temperature=req.temperature,
            top_p=req.top_p,
            top_k=req.top_k,
            repetition_penalty=req.repetition_penalty,
            do_sample=True,
        )
    
    generated = TOKENIZER.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    
    return GenerateResponse(
        generated_text=generated,
        tokens_generated=outputs.shape[1] - inputs["input_ids"].shape[1]
    )

@app.get("/health")
async def health():
    return {"status": "ok", "model_loaded": MODEL is not None}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

# === Dockerfile ===
# FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04
# RUN apt-get update && apt-get install -y python3 python3-pip
# WORKDIR /app
# COPY requirements.txt .
# RUN pip3 install -r requirements.txt
# COPY . .
# CMD ["python3", "scripts/serve.py"]

# === requirements.txt ===
# torch>=2.1.0
# transformers>=4.36.0
# peft>=0.7.0
# trl>=0.7.0
# bitsandbytes>=0.41.0
# datasets>=2.14.0
# accelerate>=0.25.0
# fastapi>=0.104.0
# uvicorn>=0.24.0
# wandb>=0.16.0
# pyyaml>=6.0
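A small client helps when smoke-testing the server above. The sketch uses only the standard library and assumes the server is reachable at http://localhost:8000, matching the `uvicorn.run` call; the helper names are hypothetical, and only the request construction runs without a live server.

```python
import json
import urllib.request

def build_generate_request(prompt, base_url="http://localhost:8000", **overrides):
    """Build a POST request for the /generate endpoint with optional sampling overrides."""
    payload = {"prompt": prompt, "max_new_tokens": 256, "temperature": 0.7}
    payload.update(overrides)
    return urllib.request.Request(
        url=f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def call_generate(prompt, **overrides):
    """Send the request and return the decoded JSON response (needs a running server)."""
    req = build_generate_request(prompt, **overrides)
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Building the request does not touch the network.
request = build_generate_request("Explain LoRA in one sentence.", max_new_tokens=64)
```

Calling `call_generate("Explain LoRA in one sentence.")` against a running instance should return a dictionary with the `generated_text` and `tokens_generated` fields defined by `GenerateResponse`.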

FAQ: Frequently Asked Questions

Q: What LoRA rank is best?

A: It depends on task complexity. For simple tasks such as sentiment classification, rank 8 is usually enough. For moderately complex tasks such as instruction following, rank 16 is a good fit. For complex tasks such as code generation or domain-specific knowledge, rank 32-64 may be necessary. Start at rank 16, then try higher and lower values and compare the results.

Q: How much do QLoRA and LoRA results differ?

A: Dettmers et al. showed that QLoRA gives results very close to full 16-bit fine-tuning while using about 4x less memory. The quality loss from quantization is very small (under 1% on most tasks). When GPU RAM is limited, QLoRA is the recommended default.

Q: How much data does fine-tuning need?

A: LoRA works well with less data than full fine-tuning. For instruction tuning, 1,000-10,000 examples are usually enough. For domain adaptation, 10,000-100,000 examples give good results. Quality matters more than quantity: 1,000 well-curated examples can beat 100,000 low-quality ones.
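A first pass at curation can be as simple as normalizing whitespace, dropping examples that are too short or too long, and removing exact duplicates. The sketch below assumes each example is a dict with a "text" field, as in the training script; the thresholds and the `curate` name are arbitrary choices for illustration.

```python
def curate(examples, min_chars=20, max_chars=8000):
    """Drop empty, too-short, too-long, and duplicate examples."""
    seen = set()
    kept = []
    for ex in examples:
        text = " ".join(ex.get("text", "").split())   # collapse runs of whitespace
        if not (min_chars <= len(text) <= max_chars):
            continue
        key = text.lower()                            # case-insensitive duplicate check
        if key in seen:
            continue
        seen.add(key)
        kept.append({**ex, "text": text})
    return kept

raw = [
    {"text": "Question: What is LoRA? Answer: A low-rank adapter method."},
    {"text": "Question: What is  LoRA?   Answer: A low-rank adapter method."},  # whitespace duplicate
    {"text": "ok"},                                                             # too short
    {"text": ""},                                                               # empty
]
clean = curate(raw)
```

Real curation usually goes further, for example near-duplicate detection or manual review, but even this kind of filter removes noise that a small, high-quality dataset cannot afford.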

Q: Can multiple LoRAs be combined?

A: Yes. Multiple LoRA adapters can be stacked, for example a base LoRA for instruction following plus a domain LoRA for medical knowledge. A LoRA can also be merged back into the base model and a new LoRA fine-tuned on top of the result. This makes it possible to build many specialized models from a single base model.
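Stacking works because each adapter contributes an additive update to the same frozen weight. The toy sketch below sums two weighted adapter updates onto a 2 x 2 base matrix; the adapter names, the blend weights, and the helper functions are hypothetical, and real libraries may combine adapters differently.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def adapter_delta(A, B, alpha, r):
    """Return the scaled update (alpha / r) * B A for one adapter."""
    scaling = alpha / r
    return [[scaling * v for v in row] for row in matmul(B, A)]

def stack_adapters(W, adapters, weights):
    """Return W plus the weighted sum of every adapter's update, leaving W untouched."""
    result = [row[:] for row in W]
    for adapter, weight in zip(adapters, weights):
        delta = adapter_delta(**adapter)
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                result[i][j] += weight * value
    return result

W = [[1.0, 0.0], [0.0, 1.0]]
instruct = {"A": [[1.0, 0.0]], "B": [[1.0], [0.0]], "alpha": 1, "r": 1}
medical = {"A": [[0.0, 2.0]], "B": [[0.0], [1.0]], "alpha": 2, "r": 1}

# Full strength for the instruction adapter, half strength for the domain adapter
combined = stack_adapters(W, [instruct, medical], weights=[1.0, 0.5])
```

Since the base weight is never modified, the same base model can serve different adapter combinations simply by changing which updates are added.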
