
LLM Inference vLLM Service Mesh Setup: Setting Up LLM Serving with vLLM and Istio

2026-02-14 · Ajarn Bom — SiamCafe.net · 1,062 words

What Is vLLM?

vLLM is a high-throughput LLM inference engine designed to serve Large Language Models efficiently at scale. It uses the PagedAttention algorithm to manage the KV cache in small fixed-size blocks, which cuts memory waste dramatically and lets a single GPU serve far more concurrent requests than conventional inference frameworks.

The key techniques behind vLLM's performance:
- PagedAttention: manages the KV cache with paging, like virtual memory in an OS, eliminating memory fragmentation
- Continuous Batching: new requests join the running batch immediately instead of waiting for the current batch to finish
- Tensor Parallelism: splits a model across multiple GPUs for models that do not fit on a single GPU
- OpenAI-compatible API: works as a drop-in replacement for the OpenAI API; only the base URL changes

A service mesh adds traffic routing, load balancing, circuit breaking, and observability on top of the inference instances, so the fleet can scale and fail over automatically, with mTLS securing traffic between services.
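To make the KV cache memory pressure concrete, here is a back-of-the-envelope calculation. The dimensions assumed below are the Llama-3.1-8B configuration (32 layers, 8 KV heads via GQA, head dim 128, FP16):

```python
def kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128, dtype_bytes=2):
    # One K and one V tensor per layer, each num_kv_heads * head_dim values per token
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

per_token = kv_cache_bytes_per_token()
per_request = per_token * 4096  # a full 4096-token context
print(f"KV cache per token: {per_token / 1024:.0f} KiB")                  # 128 KiB
print(f"KV cache per 4096-token request: {per_request / 2**30:.2f} GiB")  # 0.50 GiB
```

With naive contiguous allocation, every request would reserve the full 0.5 GiB up front whether it uses it or not; PagedAttention instead allocates KV cache in small blocks only as tokens are generated, which is where the concurrency gains come from.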

Installing vLLM for LLM Inference

Set up a vLLM server for production:

# === vLLM Installation ===

# 1. Install vLLM
pip install vllm

# 2. Start vLLM Server (OpenAI-compatible)
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 1 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.9 \
  --dtype auto \
  --max-num-seqs 256 \
  --max-num-batched-tokens 32768

# 3. Docker Deployment
cat > Dockerfile.vllm << 'EOF'
FROM vllm/vllm-openai:latest

ENV MODEL_NAME=meta-llama/Meta-Llama-3.1-8B-Instruct
ENV TENSOR_PARALLEL=1
ENV MAX_MODEL_LEN=4096
ENV GPU_MEMORY_UTILIZATION=0.9

EXPOSE 8000

CMD python -m vllm.entrypoints.openai.api_server \
    --model $MODEL_NAME \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size $TENSOR_PARALLEL \
    --max-model-len $MAX_MODEL_LEN \
    --gpu-memory-utilization $GPU_MEMORY_UTILIZATION
EOF

docker build -t vllm-server -f Dockerfile.vllm .
docker run --gpus all -p 8000:8000 vllm-server

# 4. Test API (OpenAI-compatible)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is Kubernetes?"}
    ],
    "max_tokens": 500,
    "temperature": 0.7
  }'

# 5. Kubernetes Deployment
cat > k8s/vllm-deployment.yaml << 'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
  namespace: llm-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-server
  template:
    metadata:
      labels:
        app: vllm-server
        version: v1
      annotations:
        sidecar.istio.io/inject: "true"
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - "--model"
            - "meta-llama/Meta-Llama-3.1-8B-Instruct"
            - "--host"
            - "0.0.0.0"
            - "--port"
            - "8000"
            - "--tensor-parallel-size"
            - "1"
            - "--max-model-len"
            - "4096"
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: "24Gi"
            requests:
              nvidia.com/gpu: 1
              memory: "16Gi"
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120
            periodSeconds: 10
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
            - name: shm
              mountPath: /dev/shm
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache-pvc
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: "8Gi"
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
EOF

echo "vLLM installed and configured"
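The OpenAI-compatible endpoint tested with curl above can also be called from Python. A minimal stdlib-only sketch, assuming the server from step 2 is running on localhost:8000:

```python
import json
import urllib.request

API_URL = "http://localhost:8000/v1/chat/completions"  # assumed local vLLM server

def build_chat_request(prompt, model="meta-llama/Meta-Llama-3.1-8B-Instruct",
                       max_tokens=500, temperature=0.7):
    """Build an OpenAI-compatible chat completion payload."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

def ask(prompt):
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(ask("What is Kubernetes?"))
```

The official `openai` client works the same way: point its `base_url` at `http://localhost:8000/v1` and pass any placeholder API key (vLLM ignores it unless `--api-key` is set).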

Service Mesh Architecture for LLM

Deploy an Istio service mesh for LLM inference:

# === Istio Service Mesh for LLM ===

# 1. Install Istio
# "default" is the profile recommended for production (there is no "production" profile)
istioctl install --set profile=default
kubectl label namespace llm-inference istio-injection=enabled

# 2. Virtual Service (Traffic Routing)
cat > istio/virtual-service.yaml << 'EOF'
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: vllm-routing
  namespace: llm-inference
spec:
  hosts:
    - vllm-server
  http:
    - match:
        - headers:
            x-model-version:
              exact: "v2"
      route:
        - destination:
            host: vllm-server
            subset: v2
          weight: 100
    - route:
        - destination:
            host: vllm-server
            subset: v1
          weight: 90
        - destination:
            host: vllm-server
            subset: v2
          weight: 10
      timeout: 60s
      retries:
        attempts: 2
        perTryTimeout: 30s
        retryOn: 5xx,reset,connect-failure
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: vllm-destination
  namespace: llm-inference
spec:
  host: vllm-server
  trafficPolicy:
    connectionPool:
      http:
        h2UpgradePolicy: UPGRADE
        maxRequestsPerConnection: 100
      tcp:
        maxConnections: 200
    outlierDetection:
      consecutive5xxErrors: 3
      interval: 30s
      baseEjectionTime: 60s
      maxEjectionPercent: 50
    loadBalancer:
      simple: LEAST_REQUEST
  subsets:
    - name: v1
      labels:
        version: v1
    - name: v2
      labels:
        version: v2
EOF

# 3. Rate Limiting
cat > istio/rate-limit.yaml << 'EOF'
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: llm-rate-limit
  namespace: llm-inference
spec:
  workloadSelector:
    labels:
      app: vllm-server
  configPatches:
    - applyTo: HTTP_FILTER
      match:
        context: SIDECAR_INBOUND
      patch:
        operation: INSERT_BEFORE
        value:
          name: envoy.filters.http.local_ratelimit
          typed_config:
            "@type": type.googleapis.com/udpa.type.v1.TypedStruct
            type_url: type.googleapis.com/envoy.extensions.filters.http.local_ratelimit.v3.LocalRateLimit
            value:
              stat_prefix: http_local_rate_limiter
              token_bucket:
                max_tokens: 100
                tokens_per_fill: 50
                fill_interval: 60s
EOF

kubectl apply -f istio/

echo "Service mesh configured"
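The header-based canary route above can be verified by sending requests with and without `x-model-version: v2` and checking which subset answers. A stdlib-only sketch; the in-mesh service URL is an assumption (use `kubectl port-forward` when testing from outside the cluster):

```python
import json
import urllib.request

# Assumed in-mesh service URL; port-forward for local testing
BASE_URL = "http://vllm-server.llm-inference.svc.cluster.local:8000"

def build_headers(model_version=None):
    """x-model-version is matched by the VirtualService to pin a subset."""
    headers = {"Content-Type": "application/json"}
    if model_version:
        headers["x-model-version"] = model_version
    return headers

def completion(prompt, model_version=None):
    payload = {
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 50,
    }
    req = urllib.request.Request(
        BASE_URL + "/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers=build_headers(model_version),
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)

if __name__ == "__main__":
    completion("ping")        # takes the weighted 90/10 route
    completion("ping", "v2")  # pinned to subset v2 by the header match
```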

Load Balancing and Auto-Scaling

Load-balancing and scaling strategy for LLM serving:

#!/usr/bin/env python3
# llm_scaling.py - LLM Inference Scaling Strategy
import json
import logging
from typing import Dict, List

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("scaling")

class LLMScalingStrategy:
    def __init__(self):
        self.models = {}
    
    def scaling_config(self):
        return {
            "horizontal_scaling": {
                "metric": "GPU utilization + queue depth",
                "min_replicas": 2,
                "max_replicas": 20,
                "scale_up_threshold": "GPU > 80% OR queue > 50 requests",
                "scale_down_threshold": "GPU < 30% AND queue < 5",
                "cooldown_period": "5 minutes",
                "gpu_type": "A10G (24GB VRAM)",
            },
            "model_routing": {
                "small_requests": {
                    "max_tokens": 100,
                    "model": "Llama-3.1-8B",
                    "gpu": "T4 (16GB)",
                    "latency_target": "< 500ms",
                },
                "medium_requests": {
                    "max_tokens": 1000,
                    "model": "Llama-3.1-8B",
                    "gpu": "A10G (24GB)",
                    "latency_target": "< 3s",
                },
                "large_requests": {
                    "max_tokens": 4096,
                    "model": "Llama-3.1-70B",
                    "gpu": "A100 (80GB) x2",
                    "latency_target": "< 15s",
                },
            },
            "cost_optimization": {
                "spot_instances": "Use for batch/async requests (60-90% cheaper)",
                "on_demand": "Use for real-time requests (SLA guarantee)",
                "reserved": "Base capacity (1-3 year commitment, 30-60% cheaper)",
                "model_quantization": "INT8/INT4 reduces GPU memory 2-4x",
            },
        }
    
    def capacity_planning(self, requests_per_second, avg_tokens, model_size_b):
        """Estimate required GPU instances"""
        # Rough estimates based on model size
        tokens_per_second_per_gpu = {
            7: 150,   # 7B model on A10G
            8: 130,   # 8B model on A10G
            13: 80,   # 13B model on A10G
            70: 30,   # 70B model on A100x2
        }
        
        closest = min(tokens_per_second_per_gpu.keys(), key=lambda x: abs(x - model_size_b))
        tps = tokens_per_second_per_gpu[closest]
        
        total_tokens_per_second = requests_per_second * avg_tokens
        gpus_needed = total_tokens_per_second / tps
        
        # Add 30% headroom
        gpus_with_headroom = gpus_needed * 1.3
        
        return {
            "model_size_b": model_size_b,
            "requests_per_second": requests_per_second,
            "avg_tokens_per_request": avg_tokens,
            "total_tokens_per_second": total_tokens_per_second,
            "tokens_per_second_per_gpu": tps,
            "gpus_needed": round(gpus_needed, 1),
            "gpus_with_headroom": round(gpus_with_headroom),
            "estimated_monthly_cost": round(gpus_with_headroom * 730 * 1.0, 2),  # ~730 hrs/month at ~$1/GPU-hour
        }

strategy = LLMScalingStrategy()
config = strategy.scaling_config()
print("Scaling Config:")
for tier, info in config["model_routing"].items():
    print(f"  {tier}: {info['model']} on {info['gpu']}")

plan = strategy.capacity_planning(requests_per_second=10, avg_tokens=200, model_size_b=8)
print(f"\nCapacity Plan: {plan['gpus_with_headroom']} GPUs needed")
print(f"Est. monthly cost: ${plan['estimated_monthly_cost']}")
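The `model_routing` tiers in `scaling_config()` can drive a simple dispatcher that picks a serving tier from the request's `max_tokens`. A sketch mirroring those thresholds (the tier names and limits are taken from the config above):

```python
# Tier thresholds mirroring LLMScalingStrategy.scaling_config()["model_routing"]
TIERS = [
    (100,  "small_requests"),   # Llama-3.1-8B on T4
    (1000, "medium_requests"),  # Llama-3.1-8B on A10G
    (4096, "large_requests"),   # Llama-3.1-70B on A100 x2
]

def route_request(max_tokens):
    """Return the first tier whose token limit covers the request."""
    for limit, tier in TIERS:
        if max_tokens <= limit:
            return tier
    raise ValueError(f"max_tokens={max_tokens} exceeds the largest tier (4096)")

print(route_request(50))    # small_requests
print(route_request(500))   # medium_requests
print(route_request(3000))  # large_requests
```

In practice the dispatcher would sit in a gateway in front of the per-tier vLLM deployments, so cheap GPUs absorb short completions and the 70B cluster only sees genuinely large requests.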

Performance Optimization

Optimize vLLM performance

# === vLLM Performance Optimization ===

# 1. Model Quantization (reduce memory, increase throughput)
# AWQ Quantization (4-bit)
pip install autoawq
python -c "
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = 'meta-llama/Meta-Llama-3.1-8B-Instruct'
quant_path = 'llama-3.1-8b-awq'

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
model.quantize(tokenizer, quant_config={'zero_point': True, 'q_group_size': 128, 'w_bit': 4})
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
print('Quantization complete')
"

# Serve quantized model
python -m vllm.entrypoints.openai.api_server \
  --model llama-3.1-8b-awq \
  --quantization awq \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.9

# 2. Prefix Caching (for shared system prompts)
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --enable-prefix-caching \
  --max-model-len 4096

# 3. Speculative Decoding (use small model to draft)
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-70B-Instruct \
  --speculative-model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --num-speculative-tokens 5 \
  --tensor-parallel-size 2

# 4. Benchmark
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct &
sleep 30  # give the server time to load the model

# Run benchmark
# benchmark_serving.py ships in the vLLM source repo under benchmarks/
python benchmarks/benchmark_serving.py \
  --backend openai \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --dataset-name sharegpt \
  --num-prompts 1000 \
  --request-rate 10

# Key metrics to compare:
# - Throughput (requests/sec)
# - Time to First Token (TTFT)
# - Time per Output Token (TPOT)
# - Inter-token Latency (ITL)

echo "Performance optimization complete"
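TTFT and TPOT from the metric list above can also be measured client-side by streaming tokens and recording arrival timestamps. A sketch of the arithmetic, with a simulated timestamp series standing in for a real streaming client:

```python
def ttft_tpot(timestamps):
    """Derive TTFT and mean TPOT (seconds) from per-token arrival times.

    timestamps[0] is the request send time; the rest are token arrivals,
    e.g. collected by timestamping each chunk of a streaming response.
    """
    if len(timestamps) < 3:
        raise ValueError("need the send time plus at least two tokens")
    ttft = timestamps[1] - timestamps[0]
    # mean inter-token gap over the remaining tokens
    gaps = [b - a for a, b in zip(timestamps[1:], timestamps[2:])]
    return ttft, sum(gaps) / len(gaps)

# Simulated: send at t=0, first token at 180 ms, then one token every 25 ms
ts = [0.0, 0.180, 0.205, 0.230, 0.255]
ttft, tpot = ttft_tpot(ts)
print(f"TTFT: {ttft*1000:.0f} ms, TPOT: {tpot*1000:.0f} ms")
# TTFT 180 ms (target < 200 ms), TPOT 25 ms (target < 30 ms)
```

Measuring on the client captures network overhead too, which is exactly what users experience; comparing against vLLM's server-side histograms isolates where latency is added.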

Monitoring and Observability

Monitor LLM inference pipeline

#!/usr/bin/env python3
# llm_monitoring.py - LLM Inference Monitoring
import json
import logging
from datetime import datetime
from typing import Dict

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("monitor")

class LLMMonitoringDashboard:
    def __init__(self):
        self.metrics = {}
    
    def key_metrics(self):
        return {
            "latency": {
                "ttft_ms": "Time to First Token (target: < 200ms)",
                "tpot_ms": "Time per Output Token (target: < 30ms)",
                "e2e_latency_ms": "End-to-end latency (target: < 5000ms)",
                "query": 'histogram_quantile(0.95, rate(vllm_request_latency_seconds_bucket[5m]))',
            },
            "throughput": {
                "requests_per_second": "Total requests per second",
                "tokens_per_second": "Total tokens generated per second",
                "query": 'rate(vllm_request_success_total[5m])',
            },
            "gpu": {
                "utilization_pct": "GPU compute utilization",
                "memory_used_gb": "GPU memory used",
                "kv_cache_usage_pct": "KV cache memory usage",
                "query": 'vllm_gpu_cache_usage_perc',
            },
            "queue": {
                "pending_requests": "Requests waiting for processing",
                "running_requests": "Requests currently being processed",
                "query": 'vllm_num_requests_running',
            },
            "errors": {
                "error_rate": "Percentage of failed requests",
                "timeout_rate": "Requests exceeding timeout",
                "oom_count": "Out of memory errors",
                "query": 'rate(vllm_request_failure_total[5m])',
            },
        }
    
    def alert_rules(self):
        return [
            {
                "name": "HighLatency",
                "expr": 'histogram_quantile(0.95, rate(vllm_request_latency_seconds_bucket[5m])) > 10',
                "severity": "warning",
                "summary": "P95 latency > 10s",
            },
            {
                "name": "GPUMemoryHigh",
                "expr": 'vllm_gpu_cache_usage_perc > 0.95',
                "severity": "critical",
                "summary": "GPU KV cache usage > 95%",
            },
            {
                "name": "HighErrorRate",
                "expr": 'rate(vllm_request_failure_total[5m]) / rate(vllm_request_success_total[5m]) > 0.05',
                "severity": "critical",
                "summary": "Error rate > 5%",
            },
            {
                "name": "QueueBacklog",
                "expr": 'vllm_num_requests_waiting > 100',
                "severity": "warning",
                "summary": "More than 100 requests in queue",
            },
        ]

dashboard = LLMMonitoringDashboard()
metrics = dashboard.key_metrics()
print("Key Metrics:")
for category, items in metrics.items():
    print(f"\n  {category}:")
    for k, v in items.items():
        if k != "query":
            print(f"    {k}: {v}")

alerts = dashboard.alert_rules()
print(f"\nAlert Rules: {len(alerts)} rules")
for a in alerts:
    print(f"  {a['name']}: {a['summary']} ({a['severity']})")
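The alert rule dicts above map directly onto a Prometheus alerting-rules file. A minimal renderer following the standard `groups:`/`rules:` schema (the 5-minute `for:` hold-off is an assumption, tune per alert):

```python
def render_prometheus_rules(alerts, group_name="llm-inference"):
    """Render alert dicts into a Prometheus alerting-rules YAML string."""
    lines = ["groups:", f"  - name: {group_name}", "    rules:"]
    for a in alerts:
        lines += [
            f"      - alert: {a['name']}",
            f"        expr: {a['expr']}",
            "        for: 5m",  # assumed hold-off; tune per alert
            "        labels:",
            f"          severity: {a['severity']}",
            "        annotations:",
            f"          summary: {a['summary']}",
        ]
    return "\n".join(lines)

sample = [{"name": "QueueBacklog",
           "expr": "vllm_num_requests_waiting > 100",
           "severity": "warning",
           "summary": "More than 100 requests in queue"}]
print(render_prometheus_rules(sample))
```

Feeding `dashboard.alert_rules()` through this renderer produces a file you can drop into a Prometheus `rule_files:` entry or a PrometheusRule CR.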

FAQ: Frequently Asked Questions

Q: How does vLLM differ from TGI (Text Generation Inference)?

A: vLLM uses PagedAttention to manage the KV cache, giving roughly 2-4x higher throughput in high-concurrency scenarios; it supports many models, has a large community, and exposes an OpenAI-compatible API. TGI comes from HuggingFace, integrates tightly with the HuggingFace Hub, and has built-in features such as watermarking and grammar-guided generation. For throughput-focused production serving, pick vLLM; for prototypes that lean on HuggingFace integration, pick TGI. Both are free and open source.

Q: What if GPU memory is not enough for the model?

A: Several options:
- Quantization: drop precision from FP16 to INT8 (2x smaller) or INT4 (4x smaller) using AWQ, GPTQ, or GGUF
- Tensor Parallelism: split the model across GPUs, e.g. a 70B model on 2x A100 (80GB)
- Pipeline Parallelism: spread layers across GPUs when the model is too large even for tensor parallelism
- KV Cache Optimization: set max_model_len to actual usage; don't reserve 128K when 4K is enough
- Offloading: move some layers to CPU RAM (much slower)
As a rule of thumb, an 8B model fits on 1x A10G (24GB), while a 70B model needs 2x A100, or a single A100 when quantized to INT4.

Q: Is a service mesh necessary for LLM inference?

A: Not at small scale. With 1-2 instances, a plain Kubernetes Service + Ingress is enough. From around 5+ instances, or when you need canary deployments, circuit breaking, mTLS, or detailed observability, Istio or Linkerd starts to pay off. A service mesh gives you:
- A/B testing of models: send 10% of traffic to a new model version
- Circuit breaking: automatically eject a failing GPU instance from the pool
- Rate limiting per user/tenant
- mTLS security between services
- Observability: latency distribution per model version
The trade-off is extra operational complexity and a small added latency (roughly 1-3ms per request).

Q: How do you optimize costs for LLM inference?

A: The main levers:
- Model quantization: INT4/INT8 cuts GPU requirements 2-4x, saving roughly 50-75%
- Spot/preemptible instances: use for batch/async requests, saving 60-90%
- Request batching: vLLM's continuous batching is automatic and lifts throughput 2-4x
- Caching: cache responses for common queries to reduce GPU usage
- Model selection: use a small model (8B) for simple tasks and a large one (70B) only for complex tasks
- Auto-scaling: scale down when traffic drops so GPUs don't sit idle
As a ballpark, an 8B model quantized to INT4 on a spot A10G (~$0.30/hr) can serve around 100 req/s.
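These trade-offs can be compared with a cost-per-million-tokens estimate. The hourly prices and throughput figures below are illustrative assumptions, not cloud quotes:

```python
def cost_per_million_tokens(gpu_hourly_usd, tokens_per_second):
    """USD per 1M generated tokens at full utilization (illustrative)."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000

# Illustrative scenarios: prices and token rates are assumptions
scenarios = {
    "A10G on-demand, FP16": (1.00, 130),
    "A10G spot, FP16":      (0.30, 130),
    "A10G spot, INT4":      (0.30, 260),  # assume quantization ~doubles throughput
}
for name, (price, tps) in scenarios.items():
    print(f"{name}: ${cost_per_million_tokens(price, tps):.2f} / 1M tokens")
```

Under these assumptions the spot + INT4 combination is roughly 6-7x cheaper per token than on-demand FP16, which is why quantization and spot capacity are usually the first two levers to pull.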
