vLLM and Service Mesh for LLM Inference
vLLM is a high-throughput LLM inference engine built for serving Large Language Models in production. Its PagedAttention algorithm manages the KV cache efficiently, cutting memory waste and letting a single deployment handle far more concurrent requests than traditional inference frameworks.
The core ideas behind vLLM: PagedAttention manages the KV cache with paging, much like virtual memory in an OS, which removes most memory fragmentation; Continuous Batching admits new requests into the running batch as soon as slots free up instead of waiting for the whole batch to finish; Tensor Parallelism splits a model across multiple GPUs when it does not fit on a single GPU; and the OpenAI-compatible API lets existing OpenAI clients work by changing only the base URL.
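Because the API is OpenAI-compatible, an existing OpenAI SDK client only needs a different base URL. A minimal sketch (assuming a vLLM server is already running on localhost:8000, as set up in the next section, and serving the model named below):

#!/usr/bin/env python3
# openai_client_example.py - call a local vLLM server through the OpenAI SDK
from openai import OpenAI

# Point the standard OpenAI client at the vLLM server; the API key is not
# checked by default, but the SDK requires a non-empty value.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Kubernetes?"},
    ],
    max_tokens=200,
    temperature=0.7,
)
print(response.choices[0].message.content)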
A Service Mesh complements LLM inference with traffic routing, load balancing, circuit breaking, and observability, which matters once inference instances need to scale out and fail over automatically. It also provides mTLS to secure traffic between services.
Setting Up vLLM for LLM Inference
Set up a vLLM server for production
# === vLLM Installation ===
# 1. Install vLLM
pip install vllm
# 2. Start vLLM Server (OpenAI-compatible)
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 1 \
--max-model-len 4096 \
--gpu-memory-utilization 0.9 \
--dtype auto \
--max-num-seqs 256 \
--max-num-batched-tokens 32768
# 3. Docker Deployment
cat > Dockerfile.vllm << 'EOF'
FROM vllm/vllm-openai:latest
ENV MODEL_NAME=meta-llama/Meta-Llama-3.1-8B-Instruct
ENV TENSOR_PARALLEL=1
ENV MAX_MODEL_LEN=4096
ENV GPU_MEMORY_UTILIZATION=0.9
EXPOSE 8000
CMD python -m vllm.entrypoints.openai.api_server \
--model $MODEL_NAME \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size $TENSOR_PARALLEL \
--max-model-len $MAX_MODEL_LEN \
--gpu-memory-utilization $GPU_MEMORY_UTILIZATION
EOF
docker build -t vllm-server -f Dockerfile.vllm .
docker run --gpus all -p 8000:8000 vllm-server
# 4. Test API (OpenAI-compatible)
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is Kubernetes?"}
],
"max_tokens": 500,
"temperature": 0.7
}'
# 5. Kubernetes Deployment
cat > k8s/vllm-deployment.yaml << 'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
  namespace: llm-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-server
  template:
    metadata:
      labels:
        app: vllm-server
        version: v1
      annotations:
        sidecar.istio.io/inject: "true"
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
        - "--model"
        - "meta-llama/Meta-Llama-3.1-8B-Instruct"
        - "--host"
        - "0.0.0.0"
        - "--port"
        - "8000"
        - "--tensor-parallel-size"
        - "1"
        - "--max-model-len"
        - "4096"
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: "24Gi"
          requests:
            nvidia.com/gpu: 1
            memory: "16Gi"
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 120
          periodSeconds: 10
        volumeMounts:
        - name: model-cache
          mountPath: /root/.cache/huggingface
        - name: shm
          mountPath: /dev/shm
      volumes:
      - name: model-cache
        persistentVolumeClaim:
          claimName: model-cache-pvc
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: "8Gi"
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
EOF
echo "vLLM installed and configured"
Service Mesh Architecture for LLM
Deploy an Istio Service Mesh for LLM inference
# === Istio Service Mesh for LLM ===
# 1. Install Istio (the 'default' profile is Istio's recommended baseline for production)
istioctl install --set profile=default
kubectl label namespace llm-inference istio-injection=enabled
# 2. Virtual Service (Traffic Routing)
cat > istio/virtual-service.yaml << 'EOF'
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: vllm-routing
  namespace: llm-inference
spec:
  hosts:
  - vllm-server
  http:
  - match:
    - headers:
        x-model-version:
          exact: "v2"
    route:
    - destination:
        host: vllm-server
        subset: v2
      weight: 100
  - route:
    - destination:
        host: vllm-server
        subset: v1
      weight: 90
    - destination:
        host: vllm-server
        subset: v2
      weight: 10
    timeout: 60s
    retries:
      attempts: 2
      perTryTimeout: 30s
      retryOn: 5xx,reset,connect-failure
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: vllm-destination
  namespace: llm-inference
spec:
  host: vllm-server
  trafficPolicy:
    connectionPool:
      http:
        h2UpgradePolicy: UPGRADE
        maxRequestsPerConnection: 100
      tcp:
        maxConnections: 200
    outlierDetection:
      consecutive5xxErrors: 3
      interval: 30s
      baseEjectionTime: 60s
      maxEjectionPercent: 50
    loadBalancer:
      simple: LEAST_REQUEST
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2
EOF
# 3. Rate Limiting
cat > istio/rate-limit.yaml << 'EOF'
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: llm-rate-limit
  namespace: llm-inference
spec:
  workloadSelector:
    labels:
      app: vllm-server
  configPatches:
  - applyTo: HTTP_FILTER
    match:
      context: SIDECAR_INBOUND
    patch:
      operation: INSERT_BEFORE
      value:
        name: envoy.filters.http.local_ratelimit
        typed_config:
          "@type": type.googleapis.com/udpa.type.v1.TypedStruct
          type_url: type.googleapis.com/envoy.extensions.filters.http.local_ratelimit.v3.LocalRateLimit
          value:
            stat_prefix: http_local_rate_limiter
            token_bucket:
              max_tokens: 100
              tokens_per_fill: 50
              fill_interval: 60s
EOF
kubectl apply -f istio/
echo "Service mesh configured"
Load Balancing and Auto-Scaling
Design load balancing and scaling for LLM inference
#!/usr/bin/env python3
# llm_scaling.py - LLM Inference Scaling Strategy
import json
import logging
from typing import Dict, List

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("scaling")


class LLMScalingStrategy:
    def __init__(self):
        self.models = {}

    def scaling_config(self):
        return {
            "horizontal_scaling": {
                "metric": "GPU utilization + queue depth",
                "min_replicas": 2,
                "max_replicas": 20,
                "scale_up_threshold": "GPU > 80% OR queue > 50 requests",
                "scale_down_threshold": "GPU < 30% AND queue < 5",
                "cooldown_period": "5 minutes",
                "gpu_type": "A10G (24GB VRAM)",
            },
            "model_routing": {
                "small_requests": {
                    "max_tokens": 100,
                    "model": "Llama-3.1-8B",
                    "gpu": "T4 (16GB)",
                    "latency_target": "< 500ms",
                },
                "medium_requests": {
                    "max_tokens": 1000,
                    "model": "Llama-3.1-8B",
                    "gpu": "A10G (24GB)",
                    "latency_target": "< 3s",
                },
                "large_requests": {
                    "max_tokens": 4096,
                    "model": "Llama-3.1-70B",
                    "gpu": "A100 (80GB) x2",
                    "latency_target": "< 15s",
                },
            },
            "cost_optimization": {
                "spot_instances": "Use for batch/async requests (60-90% cheaper)",
                "on_demand": "Use for real-time requests (SLA guarantee)",
                "reserved": "Base capacity (1-3 year commitment, 30-60% cheaper)",
                "model_quantization": "INT8/INT4 reduces GPU memory 2-4x",
            },
        }

    def capacity_planning(self, requests_per_second, avg_tokens, model_size_b):
        """Estimate required GPU instances"""
        # Rough estimates based on model size
        tokens_per_second_per_gpu = {
            7: 150,   # 7B model on A10G
            8: 130,   # 8B model on A10G
            13: 80,   # 13B model on A10G
            70: 30,   # 70B model on A100x2
        }
        closest = min(tokens_per_second_per_gpu.keys(), key=lambda x: abs(x - model_size_b))
        tps = tokens_per_second_per_gpu[closest]
        total_tokens_per_second = requests_per_second * avg_tokens
        gpus_needed = total_tokens_per_second / tps
        # Add 30% headroom
        gpus_with_headroom = gpus_needed * 1.3
        return {
            "model_size_b": model_size_b,
            "requests_per_second": requests_per_second,
            "avg_tokens_per_request": avg_tokens,
            "total_tokens_per_second": total_tokens_per_second,
            "tokens_per_second_per_gpu": tps,
            "gpus_needed": round(gpus_needed, 1),
            "gpus_with_headroom": round(gpus_with_headroom),
            "estimated_monthly_cost": round(gpus_with_headroom * 730 * 1.0, 2),
        }


strategy = LLMScalingStrategy()
config = strategy.scaling_config()
print("Scaling Config:")
for tier, info in config["model_routing"].items():
    print(f" {tier}: {info['model']} on {info['gpu']}")

plan = strategy.capacity_planning(requests_per_second=10, avg_tokens=200, model_size_b=8)
print(f"\nCapacity Plan: {plan['gpus_with_headroom']} GPUs needed")
print(f"Est. monthly cost: ")
Performance Optimization
Optimize vLLM performance
# === vLLM Performance Optimization ===
# 1. Model Quantization (reduce memory, increase throughput)
# AWQ Quantization (4-bit)
pip install autoawq
python -c "
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_path = 'meta-llama/Meta-Llama-3.1-8B-Instruct'
quant_path = 'llama-3.1-8b-awq'
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
model.quantize(tokenizer, quant_config={'zero_point': True, 'q_group_size': 128, 'w_bit': 4})
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
print('Quantization complete')
"
# Serve quantized model
python -m vllm.entrypoints.openai.api_server \
--model llama-3.1-8b-awq \
--quantization awq \
--max-model-len 4096 \
--gpu-memory-utilization 0.9
# 2. Prefix Caching (for shared system prompts)
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3.1-8B-Instruct \
--enable-prefix-caching \
--max-model-len 4096
# 3. Speculative Decoding (use small model to draft)
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3.1-70B-Instruct \
--speculative-model meta-llama/Meta-Llama-3.1-8B-Instruct \
--num-speculative-tokens 5 \
--tensor-parallel-size 2
# 4. Benchmark
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3.1-8B-Instruct &
sleep 30
# Run benchmark (benchmark_serving.py ships with the vLLM source repo under benchmarks/)
python benchmarks/benchmark_serving.py \
--backend openai \
--model meta-llama/Meta-Llama-3.1-8B-Instruct \
--dataset-name sharegpt \
--num-prompts 1000 \
--request-rate 10
# Key metrics to compare:
# - Throughput (requests/sec)
# - Time to First Token (TTFT)
# - Time per Output Token (TPOT)
# - Inter-token Latency (ITL)
echo "Performance optimization complete"
Monitoring and Observability
Monitor LLM inference pipeline
#!/usr/bin/env python3
# llm_monitoring.py - LLM Inference Monitoring
import json
import logging
from datetime import datetime
from typing import Dict

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("monitor")


class LLMMonitoringDashboard:
    def __init__(self):
        self.metrics = {}

    def key_metrics(self):
        return {
            "latency": {
                "ttft_ms": "Time to First Token (target: < 200ms)",
                "tpot_ms": "Time per Output Token (target: < 30ms)",
                "e2e_latency_ms": "End-to-end latency (target: < 5000ms)",
                "query": 'histogram_quantile(0.95, rate(vllm_request_latency_seconds_bucket[5m]))',
            },
            "throughput": {
                "requests_per_second": "Total requests per second",
                "tokens_per_second": "Total tokens generated per second",
                "query": 'rate(vllm_request_success_total[5m])',
            },
            "gpu": {
                "utilization_pct": "GPU compute utilization",
                "memory_used_gb": "GPU memory used",
                "kv_cache_usage_pct": "KV cache memory usage",
                "query": 'vllm_gpu_cache_usage_perc',
            },
            "queue": {
                "pending_requests": "Requests waiting for processing",
                "running_requests": "Requests currently being processed",
                "query": 'vllm_num_requests_running',
            },
            "errors": {
                "error_rate": "Percentage of failed requests",
                "timeout_rate": "Requests exceeding timeout",
                "oom_count": "Out of memory errors",
                "query": 'rate(vllm_request_failure_total[5m])',
            },
        }

    def alert_rules(self):
        return [
            {
                "name": "HighLatency",
                "expr": 'histogram_quantile(0.95, rate(vllm_request_latency_seconds_bucket[5m])) > 10',
                "severity": "warning",
                "summary": "P95 latency > 10s",
            },
            {
                "name": "GPUMemoryHigh",
                "expr": 'vllm_gpu_cache_usage_perc > 0.95',
                "severity": "critical",
                "summary": "GPU KV cache usage > 95%",
            },
            {
                "name": "HighErrorRate",
                "expr": 'rate(vllm_request_failure_total[5m]) / rate(vllm_request_success_total[5m]) > 0.05',
                "severity": "critical",
                "summary": "Error rate > 5%",
            },
            {
                "name": "QueueBacklog",
                "expr": 'vllm_num_requests_waiting > 100',
                "severity": "warning",
                "summary": "More than 100 requests in queue",
            },
        ]


dashboard = LLMMonitoringDashboard()
metrics = dashboard.key_metrics()
print("Key Metrics:")
for category, items in metrics.items():
    print(f"\n {category}:")
    for k, v in items.items():
        if k != "query":
            print(f" {k}: {v}")

alerts = dashboard.alert_rules()
print(f"\nAlert Rules: {len(alerts)} rules")
for a in alerts:
    print(f" {a['name']}: {a['summary']} ({a['severity']})")
FAQ: Frequently Asked Questions
Q: How do vLLM and TGI (Text Generation Inference) differ, and which should I choose?
A: vLLM uses PagedAttention to manage the KV cache and typically delivers 2-4x higher throughput in high-concurrency scenarios; it supports a wide range of models, has a large community, and exposes an OpenAI-compatible API. TGI comes from HuggingFace and integrates tightly with the HuggingFace Hub, with built-in features such as watermarking and grammar-guided generation. For production workloads where throughput is the priority, vLLM is usually the better fit; for prototypes or projects that lean heavily on HuggingFace integration, TGI works well. Both are free and open source.
Q: How much GPU memory do I need, and what can I do if the model does not fit?
A: Several options help. Quantization lowers precision from FP16 to INT8 (about 2x smaller) or INT4 (about 4x smaller) using formats such as AWQ, GPTQ, or GGUF, usually with only a small quality loss. Tensor Parallelism splits the model across multiple GPUs, e.g. a 70B model on 2x A100 (80GB). Pipeline Parallelism splits layers across GPUs for models that are still too large. KV cache tuning: set max_model_len to match actual usage; do not configure a 128K context when 4K is enough. Offloading moves some layers to CPU RAM, but is much slower. As a rule of thumb, an 8B model fits on 1x A10G (24GB), while a 70B model needs 2x A100, or a single A100 when quantized to INT4.
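A back-of-the-envelope estimate of the weight memory makes these numbers concrete (a minimal sketch; it covers weights only and ignores the KV cache, activations, and framework overhead, which is why real deployments need extra headroom):

#!/usr/bin/env python3
# gpu_memory_estimate.py - rough weight-memory estimate per precision
BYTES_PER_PARAM = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

def weight_memory_gb(params_billion: float, precision: str) -> float:
    # parameters (billions) x bytes per parameter, converted to GiB
    return params_billion * 1e9 * BYTES_PER_PARAM[precision] / (1024 ** 3)

for size in (8, 70):
    for precision in ("FP16", "INT8", "INT4"):
        print(f"{size}B @ {precision}: ~{weight_memory_gb(size, precision):.0f} GB")
# 8B  @ FP16 is ~15 GB, fitting a 24 GB A10G with room for KV cache;
# 70B @ FP16 is ~130 GB (2x A100 80GB), while INT4 (~33 GB) fits a single A100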
Q: Is a Service Mesh really necessary for LLM inference?
A: It depends on scale. With only 1-2 instances, a plain Kubernetes Service plus an Ingress is usually enough. From roughly 5+ instances, or when you need canary deployments, circuit breaking, mTLS, and detailed observability, a mesh such as Istio or Linkerd starts to pay off. Concretely, a service mesh makes it easy to A/B test models (send 10% of traffic to a new model version), circuit-break a failing GPU instance so traffic stops reaching it, rate limit per user or tenant, secure service-to-service traffic with mTLS, and observe latency distributions per model version. The trade-off is extra operational complexity and a small amount of added latency (roughly 1-3ms per request).
Q: How can the cost of LLM inference be optimized?
A: The main levers are: model quantization (INT8/INT4) cuts GPU memory 2-4x and can reduce cost by roughly 50-75%; spot/preemptible instances suit batch or async requests and are 60-90% cheaper; request batching is automatic with vLLM's continuous batching and lifts throughput 2-4x; caching responses for common queries avoids GPU usage entirely; model selection means using a small model (8B) for simple tasks and a large model (70B) only for complex ones; and auto-scaling scales down when traffic is low so GPUs are not sitting idle overnight or on weekends. As a rough example, an 8B model quantized to INT4 on a spot A10G costs about $0.30/hr and can serve on the order of 100 req/s.
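Putting those example figures together, cost per request and per million output tokens is a simple division, which makes comparing configurations easier (a minimal sketch; the price and throughput are the rough estimates quoted above, not measured values):

#!/usr/bin/env python3
# cost_per_request.py - back-of-the-envelope serving cost from the figures above
GPU_COST_PER_HOUR = 0.30       # spot A10G (assumed)
REQUESTS_PER_SECOND = 100      # 8B INT4 model, rough estimate
AVG_TOKENS_PER_REQUEST = 200   # assumed average output length

requests_per_hour = REQUESTS_PER_SECOND * 3600
cost_per_request = GPU_COST_PER_HOUR / requests_per_hour
tokens_per_hour = requests_per_hour * AVG_TOKENS_PER_REQUEST
cost_per_million_tokens = GPU_COST_PER_HOUR / tokens_per_hour * 1e6

print(f"Cost per request: ${cost_per_request:.7f}")                   # ~$0.0000008
print(f"Cost per 1M output tokens: ${cost_per_million_tokens:.4f}")   # ~$0.0042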
