SiamCafe.net Blog
Technology

LangChain Agent Performance Tuning เพิ่มความเร็ว

langchain agent performance tuning เพิ่มความเร็ว
LangChain Agent Performance Tuning เพิ่มความเร็ว | SiamCafe Blog
2025-06-11· อ. บอม — SiamCafe.net· 11,838 คำ

LangChain Agent Performance

LangChain Agent Performance Tuning เพิ่มความเร็ว Caching Async Parallel Streaming Model Selection Prompt Optimization Production

Optimization | Latency Reduction | Cost Reduction | Complexity
LLM Cache | 90%+ (cache hit) | 90%+ (no API call) | ต่ำ (ง่าย)
Async/Parallel Tools | 50-70% | ไม่เปลี่ยน | ปานกลาง
Streaming | TTFT 80%+ ลด | ไม่เปลี่ยน | ต่ำ
Model Router | 30-50% | 40-60% | ปานกลาง
Prompt Shortening | 20-40% | 20-40% | ต่ำ
Retrieval Optimization | 30-50% | 10-20% | ปานกลาง

Caching Strategy

# === LangChain Caching ===

# from langchain.cache import InMemoryCache, RedisCache, SQLiteCache
# from langchain_community.cache import RedisSemanticCache
# from langchain.globals import set_llm_cache
# import redis
#
# # Option 1: In-Memory Cache (Dev)
# set_llm_cache(InMemoryCache())
#
# # Option 2: Redis Cache (Production)
# redis_client = redis.Redis(host="redis", port=6379)
# set_llm_cache(RedisCache(redis_client, ttl=3600))
#
# # Option 3: Semantic Cache (Similar queries)
# from langchain_openai import OpenAIEmbeddings
# set_llm_cache(RedisSemanticCache(
#     redis_url="redis://redis:6379",
#     embedding=OpenAIEmbeddings(),
#     score_threshold=0.95,  # similarity threshold
#     ttl=3600
# ))
#
# # Tool Result Cache
# from functools import lru_cache
# @lru_cache(maxsize=1000)
# def cached_search(query: str) -> str:
#     return search_tool.run(query)

from dataclasses import dataclass


@dataclass
class CacheConfig:
    """One caching option for a LangChain agent stack.

    Plain display record: every field is human-readable text used only
    by the comparison printout below.
    """

    cache_type: str
    backend: str
    ttl: str
    hit_rate: str
    use_case: str


# Rows are (cache_type, backend, ttl, hit_rate, use_case) tuples,
# expanded positionally into CacheConfig records.
caches = [
    CacheConfig(*row)
    for row in (
        ("InMemoryCache", "Python dict (RAM)", "Until restart",
         "สูง (Exact match)", "Development Single Instance"),
        ("SQLiteCache", "SQLite File", "Persistent",
         "สูง (Exact match)", "Single Server Persistent"),
        ("RedisCache", "Redis Server", "Configurable (1hr default)",
         "สูง (Exact match)", "Production Multi-instance Shared"),
        ("RedisSemanticCache", "Redis + Embeddings", "Configurable",
         "สูงมาก (Similar queries hit)", "Production Natural Language Queries"),
        ("Tool LRU Cache", "Python lru_cache", "Until eviction (maxsize)",
         "ปานกลาง-สูง", "Search DB API results"),
    )
]

print("=== Caching Strategy ===")
for cfg in caches:
    print(f"  [{cfg.cache_type}] Backend: {cfg.backend}")
    print(f"    TTL: {cfg.ttl} | Hit Rate: {cfg.hit_rate}")
    print(f"    Use: {cfg.use_case}")

Async & Streaming

# === Async Parallel Streaming ===

# Async LLM Call
# response = await llm.ainvoke("query")
#
# Parallel Tool Execution
# import asyncio
# async def run_tools_parallel(tools, queries):
#     tasks = [tool.ainvoke(q) for tool, q in zip(tools, queries)]
#     return await asyncio.gather(*tasks)
#
# # Before: Sequential (3s + 2s + 1s = 6s)
# result1 = search_tool.invoke("query1")  # 3s
# result2 = db_tool.invoke("query2")      # 2s
# result3 = calc_tool.invoke("query3")    # 1s
#
# # After: Parallel (max(3s, 2s, 1s) = 3s)
# results = await asyncio.gather(
#     search_tool.ainvoke("query1"),
#     db_tool.ainvoke("query2"),
#     calc_tool.ainvoke("query3"),
# )
#
# # Streaming
# async for chunk in llm.astream("query"):
#     print(chunk.content, end="", flush=True)
#
# # Batch
# responses = await llm.abatch([
#     "query1", "query2", "query3"
# ], config={"max_concurrency": 5})

@dataclass
class AsyncPattern:
    """Before/after summary for one async optimization pattern.

    All fields are display text consumed by the printout below.
    """

    pattern: str
    before_latency: str
    after_latency: str
    improvement: str
    code_change: str


# Rows are (pattern, before_latency, after_latency, improvement,
# code_change) tuples, expanded positionally into AsyncPattern records.
patterns = [
    AsyncPattern(*row)
    for row in (
        ("Sequential → Parallel Tools",
         "Tool1(3s) + Tool2(2s) + Tool3(1s) = 6s",
         "max(3s, 2s, 1s) = 3s",
         "50% faster",
         "asyncio.gather(*tool_tasks)"),
        ("Sync → Async LLM",
         "Block thread during LLM call (3s)",
         "Non-blocking await (3s but concurrent)",
         "Throughput 5-10x",
         "await llm.ainvoke() แทน llm.invoke()"),
        ("Full Response → Streaming",
         "Wait 3s → show all text",
         "Show first token 200ms → stream rest",
         "TTFT 90% faster",
         "async for chunk in llm.astream()"),
        ("Single → Batch Request",
         "10 calls × 1s each = 10s",
         "1 batch call = 2s",
         "80% faster",
         "await llm.abatch(queries, max_concurrency=5)"),
    )
]

print("=== Async Patterns ===")
for pat in patterns:
    print(f"  [{pat.pattern}]")
    print(f"    Before: {pat.before_latency}")
    print(f"    After: {pat.after_latency}")
    print(f"    Improve: {pat.improvement}")
    print(f"    Code: {pat.code_change}")

Production Monitoring

# === Performance Monitoring ===

@dataclass
class PerfMetric:
    """Production monitoring entry: target, measurement tool, alert rule.

    All fields are display text consumed by the printout below.
    """

    metric: str
    target: str
    tool: str
    alert: str


# Rows are (metric, target, tool, alert) tuples, expanded positionally
# into PerfMetric records.
metrics = [
    PerfMetric(*row)
    for row in (
        ("LLM Latency P99", "< 5 seconds",
         "LangSmith / Custom Prometheus",
         "> 10s → Check model Switch to faster"),
        ("Time to First Token (TTFT)", "< 500ms",
         "LangSmith / Client-side Timer",
         "> 2s → Enable Streaming Check Network"),
        ("Cache Hit Rate", "> 30%",
         "Redis INFO stats / Custom Counter",
         "< 10% → Review Cache Key Strategy"),
        ("Token Usage per Request", "< 2000 tokens avg",
         "LangSmith / OpenAI Usage API",
         "> 4000 → Shorten Prompt Trim Context"),
        ("Tool Execution Time", "< 2 seconds per tool",
         "Custom Timer / LangSmith",
         "> 5s → Cache Tool Results Parallel Exec"),
        ("Error Rate", "< 1%",
         "LangSmith / Sentry",
         "> 5% → Check API Key Rate Limit Model"),
    )
]

print("=== Performance Metrics ===")
for met in metrics:
    print(f"  [{met.metric}] Target: {met.target}")
    print(f"    Tool: {met.tool}")
    print(f"    Alert: {met.alert}")

เคล็ดลับ

LangChain Agent ช้าเพราะอะไร

LLM API Latency Multi-call Chain of Thought Tool Sequential Context ยาว Token มาก Network Cold Start No Cache

Caching ทำอย่างไร

InMemoryCache RedisCache RedisSemanticCache SQLiteCache Tool LRU Cache Embedding Cache TTL Hit Rate 30-70% Reduce LLM Call

Async & Parallel ทำอย่างไร

ainvoke astream abatch asyncio.gather Parallel Tools Streaming TTFT Batch Concurrent Semaphore Rate Limit Non-blocking

Production Optimization มีอะไร

Model Router GPT-3.5 GPT-4 Prompt Short Context Trim Summary Memory Hybrid Search top_k Re-rank Load Balancer Auto-scaling

สรุป

LangChain Agent Performance Tuning Cache Redis Async Parallel Streaming Model Router Prompt Optimization LangSmith Monitoring Production

📖 บทความที่เกี่ยวข้อง

Qwik Resumability Performance Tuning เพิ่มความเร็วอ่านบทความ → Tailwind CSS v4 Performance Tuning เพิ่มความเร็วอ่านบทความ → LlamaIndex RAG Performance Tuning เพิ่มความเร็วอ่านบทความ → CrewAI Multi-Agent Performance Tuning เพิ่มความเร็วอ่านบทความ → LangChain Agent Event Driven Designอ่านบทความ →

📚 ดูบทความทั้งหมด →