SiamCafe.net Blog
Technology

LLM Inference vLLM Micro-segmentation

2025-12-15 · อ. บอม, SiamCafe.net · 9,724 words

vLLM Micro-segmentation

Serving LLM inference with vLLM (PagedAttention, continuous batching) behind Zero Trust micro-segmentation: Kubernetes Network Policy with Calico and Cilium, GPU model serving, and an OpenAI-compatible API.

| Engine    | Throughput | Memory          | API               | Features            |
|-----------|------------|-----------------|-------------------|---------------------|
| vLLM      | Very high  | PagedAttention  | OpenAI-compatible | Continuous batching |
| TGI       | High       | Tensor parallel | Custom + OpenAI   | Watermarking        |
| Ollama    | Medium     | Quantized       | Custom            | Very easy to use    |
| llama.cpp | Medium     | GGUF quantized  | Simple HTTP       | CPU support         |
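vLLM's throughput edge comes mainly from PagedAttention: the KV cache is stored in small fixed-size blocks (16 tokens per block by default) instead of one contiguous buffer reserved at the maximum sequence length. A minimal sketch of the memory saving; the 2,048-token per-request reservation and the sample sequence lengths are illustrative assumptions:

```python
import math

BLOCK_SIZE = 16  # vLLM's default KV-cache block size (tokens per block)

def blocks_needed(seq_len: int, block_size: int = BLOCK_SIZE) -> int:
    """Blocks actually allocated for a sequence under paged allocation."""
    return math.ceil(seq_len / block_size)

# Contiguous pre-allocation must reserve the maximum length up front;
# paged allocation grows block by block as tokens are generated.
max_len = 2048                    # assumed per-request reservation
seq_lens = [37, 180, 512, 1500]   # actual generated lengths

contiguous = len(seq_lens) * max_len
paged = sum(blocks_needed(n) * BLOCK_SIZE for n in seq_lens)

print(f"contiguous tokens reserved: {contiguous}")
print(f"paged tokens reserved:      {paged}")
print(f"memory saved: {1 - paged / contiguous:.0%}")
```

The reclaimed memory holds KV cache for more concurrent requests, which is what lets continuous batching keep the GPU busy.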

vLLM Deployment

# === vLLM Setup ===

# Install
# pip install vllm

# Run vLLM Server
# python -m vllm.entrypoints.openai.api_server \
#     --model meta-llama/Llama-3.1-8B-Instruct \
#     --tensor-parallel-size 1 \
#     --gpu-memory-utilization 0.9 \
#     --max-model-len 8192 \
#     --port 8000
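Rough arithmetic for how --gpu-memory-utilization and --max-model-len interact on this setup. The Llama-3.1-8B figures (32 layers, 8 KV heads via GQA, head dim 128, ~16 GB of fp16 weights) are architecture facts; the rest is a back-of-envelope sketch, not vLLM's exact allocator:

```python
GIB = 1024**3

# Llama-3.1-8B (GQA): 32 layers, 8 KV heads, head_dim 128, fp16 KV cache
layers, kv_heads, head_dim, dtype_bytes = 32, 8, 128, 2
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # K and V

gpu_mem = 80 * GIB            # A100 80GB
weights = 16 * GIB            # ~8B params in fp16/bf16
usable = gpu_mem * 9 // 10    # --gpu-memory-utilization 0.9
kv_budget = usable - weights  # what remains for KV cache

max_model_len = 8192
tokens = kv_budget // kv_bytes_per_token
print(f"KV bytes/token: {kv_bytes_per_token // 1024} KiB")
print(f"KV-cache capacity: {tokens:,} tokens "
      f"(~{tokens // max_model_len} concurrent {max_model_len}-token sequences)")
```

This is why raising --max-model-len trades away concurrency: the KV budget is fixed, so longer sequences mean fewer of them in flight.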

# Docker
# docker run --gpus all \
#     -v ~/.cache/huggingface:/root/.cache/huggingface \
#     -p 8000:8000 \
#     vllm/vllm-openai:latest \
#     --model meta-llama/Llama-3.1-8B-Instruct

# Kubernetes Deployment
# apiVersion: apps/v1
# kind: Deployment
# metadata:
#   name: vllm-llama
#   namespace: ai-inference
# spec:
#   replicas: 2
#   selector:
#     matchLabels: { app: vllm-llama }
#   template:
#     metadata:
#       labels: { app: vllm-llama }   # must match the selector above
#     spec:
#       containers:
#         - name: vllm
#           image: vllm/vllm-openai:latest
#           args:
#             - "--model"
#             - "meta-llama/Llama-3.1-8B-Instruct"
#             - "--tensor-parallel-size"
#             - "1"
#           resources:
#             limits:
#               nvidia.com/gpu: 1
#           ports:
#             - containerPort: 8000
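The Deployment alone gives the gateway and network policies nothing to target; a Service is needed in front of the pods. A minimal sketch, where the name vllm-service matches the HTTPRoute backend later in this article, and /health is the vLLM OpenAI server's health endpoint:

```yaml
# Service in front of the vLLM pods (sketch)
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
  namespace: ai-inference
spec:
  selector: { app: vllm-llama }
  ports:
    - port: 8000
      targetPort: 8000
---
# Optional: add to the vllm container spec so traffic only arrives
# after the model has finished loading (which can take minutes)
# readinessProbe:
#   httpGet: { path: /health, port: 8000 }
#   initialDelaySeconds: 60
#   periodSeconds: 10
```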

# OpenAI Compatible API Call
# import openai
#
# client = openai.OpenAI(base_url="http://vllm:8000/v1", api_key="dummy")
# response = client.chat.completions.create(
#     model="meta-llama/Llama-3.1-8B-Instruct",
#     messages=[{"role": "user", "content": "Hello"}],
#     max_tokens=512,
# )
# print(response.choices[0].message.content)

from dataclasses import dataclass

@dataclass
class ModelConfig:
    model: str
    gpu: str
    vram_gb: int
    throughput_tps: int
    latency_ms: int
    cost_hr: float

models = [
    ModelConfig("Llama-3.1-8B", "A100 80GB", 16, 150, 45, 3.40),
    ModelConfig("Llama-3.1-70B", "A100 80GB x2", 140, 40, 120, 6.80),
    ModelConfig("Mistral-7B", "A10G 24GB", 14, 180, 35, 1.02),
    ModelConfig("Qwen2-72B", "A100 80GB x2", 145, 35, 130, 6.80),
    ModelConfig("Phi-3-mini-4k", "T4 16GB", 8, 200, 25, 0.53),
]

print("=== vLLM Model Configs ===")
for m in models:
    print(f"  [{m.model}] {m.gpu}")
    print(f"    VRAM: {m.vram_gb}GB | TPS: {m.throughput_tps} | "
          f"Latency: {m.latency_ms}ms | ${m.cost_hr}/hr")
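The same configs can drive a simple picker: the cheapest GPU setup that still meets a latency SLO and a throughput floor. A hypothetical helper (data copied from the table above; the SLO numbers are illustrative):

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    model: str
    gpu: str
    vram_gb: int
    throughput_tps: int
    latency_ms: int
    cost_hr: float

models = [
    ModelConfig("Llama-3.1-8B", "A100 80GB", 16, 150, 45, 3.40),
    ModelConfig("Llama-3.1-70B", "A100 80GB x2", 140, 40, 120, 6.80),
    ModelConfig("Mistral-7B", "A10G 24GB", 14, 180, 35, 1.02),
    ModelConfig("Phi-3-mini-4k", "T4 16GB", 8, 200, 25, 0.53),
]

def cheapest_for_slo(configs, max_latency_ms: int, min_tps: int):
    """Cheapest config meeting a latency SLO and a throughput floor."""
    ok = [c for c in configs
          if c.latency_ms <= max_latency_ms and c.throughput_tps >= min_tps]
    return min(ok, key=lambda c: c.cost_hr) if ok else None

pick = cheapest_for_slo(models, max_latency_ms=50, min_tps=100)
print(f"picked: {pick.model} on {pick.gpu} at ${pick.cost_hr}/hr")
```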

Micro-segmentation Policy

# === Kubernetes Network Policy ===

# Calico Network Policy: restrict the LLM pod
# apiVersion: projectcalico.org/v3
# kind: NetworkPolicy
# metadata:
#   name: vllm-inference-policy
#   namespace: ai-inference
# spec:
#   selector: app == 'vllm-llama'
#   types: [Ingress, Egress]
#   ingress:
#     - action: Allow
#       protocol: TCP        # Calico requires a protocol when ports are set
#       source:
#         selector: app == 'api-gateway'
#       destination:
#         ports: [8000]
#   egress:
#     - action: Allow
#       protocol: TCP
#       destination:
#         selector: app == 'model-cache'
#         ports: [6379]
#     - action: Deny         # default-deny everything else

# Cilium Network Policy
# apiVersion: cilium.io/v2
# kind: CiliumNetworkPolicy
# metadata:
#   name: vllm-l7-policy
#   namespace: ai-inference
# spec:
#   endpointSelector:
#     matchLabels: { app: vllm-llama }
#   ingress:
#     - fromEndpoints:
#         - matchLabels: { app: api-gateway }
#       toPorts:
#         - ports: [{ port: "8000", protocol: TCP }]
#           rules:
#             http:
#               - method: POST
#                 path: "/v1/chat/completions"
#               - method: GET
#                 path: "/health"

@dataclass
class NetworkRule:
    name: str
    direction: str
    source: str
    destination: str
    port: int
    action: str

rules = [
    NetworkRule("API -> vLLM", "Ingress", "api-gateway", "vllm-llama", 8000, "Allow"),
    NetworkRule("vLLM -> Cache", "Egress", "vllm-llama", "model-cache", 6379, "Allow"),
    NetworkRule("vLLM -> Metrics", "Egress", "vllm-llama", "prometheus", 9090, "Allow"),
    NetworkRule("vLLM -> Internet", "Egress", "vllm-llama", "0.0.0.0/0", 443, "Deny"),
    NetworkRule("vLLM -> Other Pods", "Egress", "vllm-llama", "*", 0, "Deny"),
    NetworkRule("Other -> vLLM", "Ingress", "*", "vllm-llama", 8000, "Deny"),
]

print("\n=== Network Policies ===")
for r in rules:
    print(f"  [{r.action}] {r.name}")
    print(f"    {r.direction}: {r.source} -> {r.destination}:{r.port}")
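A rule table like this evaluates first-match-wins with default deny, which is the Zero Trust behavior the Calico policy above encodes. A hypothetical simulator for reasoning about flows, not Calico's actual engine (port 0 stands for "any port", as in the table):

```python
from dataclasses import dataclass

@dataclass
class NetworkRule:
    name: str
    direction: str
    source: str
    destination: str
    port: int
    action: str

rules = [
    NetworkRule("API -> vLLM", "Ingress", "api-gateway", "vllm-llama", 8000, "Allow"),
    NetworkRule("vLLM -> Cache", "Egress", "vllm-llama", "model-cache", 6379, "Allow"),
    NetworkRule("vLLM -> Metrics", "Egress", "vllm-llama", "prometheus", 9090, "Allow"),
]

def evaluate(src: str, dst: str, port: int) -> str:
    """First matching rule wins; anything unmatched is denied (Zero Trust)."""
    for r in rules:
        if (r.source in (src, "*")
                and r.destination in (dst, "*")
                and r.port in (port, 0)):
            return r.action
    return "Deny"  # default deny: no rule means no traffic

print(evaluate("api-gateway", "vllm-llama", 8000))  # allowed path
print(evaluate("vllm-llama", "internet", 443))      # blocked egress
```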

Production Architecture

# === Production LLM + Zero Trust ===

# Architecture:
# Client -> WAF -> API Gateway (auth+rate limit)
# -> vLLM Pod (GPU, Network Policy isolated)
# -> Response -> Client
#
# Monitoring: Prometheus -> Grafana
# Logging: vLLM -> Fluentbit -> Elasticsearch
# Audit: API Gateway logs all requests

# Rate Limiting & Token Budget
# apiVersion: gateway.networking.k8s.io/v1
# kind: HTTPRoute
# metadata:
#   name: llm-route
# spec:
#   rules:
#     - matches:
#         - path: { type: PathPrefix, value: "/v1/chat/completions" }
#       filters:
#         - type: ExtensionRef
#           extensionRef:
#             name: rate-limit-100rpm
#       backendRefs:
#         - name: vllm-service
#           port: 8000
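The rate-limit-100rpm extension above is gateway-specific, but the mechanism underneath is commonly a token bucket: requests spend tokens, tokens refill at a fixed rate, and bursts are capped by the bucket size. A minimal sketch (the rate and burst numbers are illustrative):

```python
import time

class TokenBucket:
    """Allow `rate_per_min` requests per minute, bursting up to `capacity`."""
    def __init__(self, rate_per_min: float, capacity: int):
        self.rate = rate_per_min / 60.0   # tokens refilled per second
        self.capacity = capacity
        self.tokens = float(capacity)     # start full
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

bucket = TokenBucket(rate_per_min=100, capacity=10)  # 100 rpm, burst of 10
granted = sum(bucket.allow() for _ in range(15))
print(f"granted {granted}/15 burst requests")
```

A token budget for LLM usage works the same way with cost equal to the tokens consumed per request instead of 1.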

security_layers = {
    "WAF": "Block SQL Injection, XSS, Bot Detection",
    "API Gateway": "Authentication, Rate Limiting, Token Budget",
    "Network Policy": "Micro-segmentation, Ingress/Egress Control",
    "Pod Security": "Non-root, Read-only FS, No Privilege Escalation",
    "Encryption": "mTLS between services, TLS 1.3 external",
    "Audit Log": "All API calls logged with user, tokens, cost",
    "Model Protection": "Weights encrypted at rest, no egress allowed",
}

print("Zero Trust Security Layers:")
for layer, desc in security_layers.items():
    print(f"  [{layer}]: {desc}")

# Cost Monitoring
costs = {
    "GPU (A100 x2)": "$6.80/hr = $4,896/mo",
    "API Gateway": "$50/mo",
    "Monitoring": "$30/mo",
    "Network (egress)": "$100/mo",
    "Total": "~$5,076/mo",
    "Cost per 1M tokens": "~$0.15 (vs OpenAI $3.00)",
    "Savings": "95% vs OpenAI API",
}

print("\n\nCost Analysis:")
for item, cost in costs.items():
    print(f"  {item}: {cost}")
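The ~$0.15 per 1M tokens only works out if the cluster sustains an aggregate (batched) throughput far above the per-request TPS in the table earlier, which is exactly what continuous batching provides. The arithmetic, with the aggregate throughput as a stated assumption:

```python
hours_per_month = 24 * 30            # billing approximation used above
gpu_hr = 6.80                        # A100 80GB x2
gpu_mo = gpu_hr * hours_per_month
total_mo = gpu_mo + 50 + 30 + 100    # + gateway, monitoring, egress

# Aggregate throughput across all concurrent requests; illustrative
# assumption, not a benchmark result
aggregate_tps = 13_000
tokens_per_hr = aggregate_tps * 3600
cost_per_1m = total_mo / hours_per_month / tokens_per_hr * 1_000_000

print(f"GPU: ${gpu_mo:,.0f}/mo, total: ${total_mo:,.0f}/mo")
print(f"~${cost_per_1m:.2f} per 1M tokens at {aggregate_tps:,} tok/s aggregate")
print(f"savings vs $3.00/1M: {1 - cost_per_1m / 3.00:.0%}")
```

If utilization drops (nights, weekends), the per-token cost rises proportionally, since the GPUs bill by the hour either way.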

Tips

What is vLLM?

An LLM inference engine built around PagedAttention and continuous batching, delivering roughly 10-24x the throughput of naive serving. It exposes an OpenAI-compatible API and serves Llama, Mistral, Qwen, and other models on GPUs.

What is micro-segmentation?

Splitting the network into small segments with policies that control which workloads may talk to each other. It is a Zero Trust practice, implemented on Kubernetes with Network Policy (Calico, Cilium) to govern East-West traffic and block lateral movement.

Why micro-segmentation for LLM serving?

GPUs are expensive and model weights are valuable. You need API access control, rate limits, and token budgets, plus pod isolation with strict ingress/egress control to prevent data exfiltration.

How do vLLM and TGI differ?

vLLM's PagedAttention and continuous batching give the highest throughput, with an OpenAI-compatible API out of the box. TGI offers tensor parallelism, streaming, and watermarking. Both are solid; choose based on your workload.

Summary

vLLM delivers high-throughput LLM inference through PagedAttention and continuous batching, while micro-segmentation applies Zero Trust to the serving stack: Calico or Cilium Network Policy isolates the GPU pods on Kubernetes, and the OpenAI-compatible API sits behind rate limits and token budgets for end-to-end security.
