LLM Inference with vLLM and Micro-segmentation: AI Serving and Network Security

This article covers LLM inference with vLLM (PagedAttention, continuous batching, an OpenAI-compatible API) and micro-segmentation of GPU model-serving workloads on Kubernetes, using Zero Trust network policies with Calico and Cilium.

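Engine     | Throughput | Memory          | API               | Features
-----------|------------|-----------------|-------------------|--------------------
vLLM       | Very high  | PagedAttention  | OpenAI-compatible | Continuous batching
TGI        | High       | Tensor parallel | Custom + OpenAI   | Watermarking
Ollama     | Medium     | Quantized       | Custom            | Very easy to use
llama.cpp  | Medium     | GGUF quantized  | Simple HTTP       | CPU support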

vLLM Deployment

# === vLLM Setup ===

# Install
pip install vllm

# Run vLLM Server
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.9 \
  --max-model-len 8192 \
  --port 8000

# Docker
docker run --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct

# Kubernetes Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama
  namespace: ai-inference
spec:
  replicas: 2
  selector:
    matchLabels: { app: vllm-llama }
  template:
    metadata:
      labels: { app: vllm-llama }  # must match spec.selector, or the Deployment is rejected
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - "--model"
            - "meta-llama/Llama-3.1-8B-Instruct"
            - "--tensor-parallel-size"
            - "1"
          resources:
            limits:
              nvidia.com/gpu: 1
          ports:
            - containerPort: 8000

# OpenAI Compatible API Call
import openai

client = openai.OpenAI(base_url="http://vllm:8000/v1", api_key="dummy")
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=512,
)
print(response.choices[0].message.content)

from dataclasses import dataclass

@dataclass
class ModelConfig:
    model: str
    gpu: str
    vram_gb: int          # approximate VRAM for the weights
    throughput_tps: int   # tokens per second
    latency_ms: int
    cost_hr: float        # USD per hour

models = [
    ModelConfig("Llama-3.1-8B", "A100 80GB", 16, 150, 45, 3.40),
    ModelConfig("Llama-3.1-70B", "A100 80GB x2", 140, 40, 120, 6.80),
    ModelConfig("Mistral-7B", "A10G 24GB", 14, 180, 35, 1.02),
    ModelConfig("Qwen2-72B", "A100 80GB x2", 145, 35, 130, 6.80),
    ModelConfig("Phi-3-mini-4k", "T4 16GB", 8, 200, 25, 0.53),
]

print("=== vLLM Model Configs ===")
for m in models:
    print(f" [{m.model}] {m.gpu}")
    print(f" VRAM: {m.vram_gb}GB | TPS: {m.throughput_tps} | Latency: {m.latency_ms}ms | ${m.cost_hr:.2f}/hr")
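The VRAM figures in the model list above are consistent with 16-bit weights, at roughly 2 bytes per parameter. The sketch below reproduces that estimate and then approximates how many tokens of KV cache fit in the memory that remains. It rests on assumptions: parameter counts are the nominal ones (8B), the architecture values (32 layers, 8 KV heads, head dimension 128) are taken from the published Llama-3.1-8B configuration, and activations and framework overhead are ignored. The function names are illustrative and not part of vLLM.

```python
def weight_memory_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for the model weights, in GB (10^9 bytes)."""
    return params_billion * bytes_per_param


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """KV cache bytes per token: one key and one value vector per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_token_capacity(gpu_gb: float, utilization: float, weights_gb: float,
                      per_token_bytes: int) -> int:
    """Tokens of KV cache that fit in the memory left after the weights."""
    free_bytes = (gpu_gb * utilization - weights_gb) * 1e9
    return max(0, int(free_bytes // per_token_bytes))


# Assumed values: Llama-3.1-8B at fp16 on one 80 GB GPU,
# with --gpu-memory-utilization 0.9 and --max-model-len 8192.
weights = weight_memory_gb(8)
per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)
capacity = kv_token_capacity(80, 0.9, weights, per_token)

print(f"weights: ~{weights:.0f} GB, KV cache: {per_token // 1024} KiB per token")
print(f"~{capacity:,} cacheable tokens, about {capacity // 8192} full-length sequences")
```

Treat the result as an upper bound: real deployments lose some memory to activations, CUDA graphs, and fragmentation.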

Micro-segmentation Policy

# === Kubernetes Network Policy ===

# Calico Network Policy: Restrict LLM Pod
apiVersion: projectcalico.org/v3
kind: NetworkPolicy
metadata:
  name: vllm-inference-policy
  namespace: ai-inference
spec:
  selector: app == 'vllm-llama'
  types: [Ingress, Egress]
  ingress:
    - action: Allow
      protocol: TCP  # Calico requires a protocol when ports are listed
      source:
        selector: app == 'api-gateway'
      destination:
        ports: [8000]
  egress:
    - action: Allow
      protocol: TCP
      destination:
        selector: app == 'model-cache'
        ports: [6379]
    - action: Deny

# Cilium Network Policy
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: vllm-l7-policy
  namespace: ai-inference
spec:
  endpointSelector:
    matchLabels: { app: vllm-llama }
  ingress:
    - fromEndpoints:
        - matchLabels: { app: api-gateway }
      toPorts:
        - ports: [{ port: "8000", protocol: TCP }]
          rules:
            http:
              - method: POST
                path: "/v1/chat/completions"
              - method: GET
                path: "/health"

from dataclasses import dataclass

@dataclass
class NetworkRule:
    name: str
    direction: str
    source: str
    destination: str
    port: int      # 0 means any port
    action: str

rules = [
    NetworkRule("API -> vLLM", "Ingress", "api-gateway", "vllm-llama", 8000, "Allow"),
    NetworkRule("vLLM -> Cache", "Egress", "vllm-llama", "model-cache", 6379, "Allow"),
    NetworkRule("vLLM -> Metrics", "Egress", "vllm-llama", "prometheus", 9090, "Allow"),
    NetworkRule("vLLM -> Internet", "Egress", "vllm-llama", "0.0.0.0/0", 443, "Deny"),
    NetworkRule("vLLM -> Other Pods", "Egress", "vllm-llama", "*", 0, "Deny"),
    NetworkRule("Other -> vLLM", "Ingress", "*", "vllm-llama", 8000, "Deny"),
]

print("\n=== Network Policies ===")
for r in rules:
    print(f" [{r.action}] {r.name}")
    print(f" {r.direction}: {r.source} -> {r.destination}:{r.port}")
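To see how an ordered rule list like the one above turns into a decision for a single connection, here is a minimal sketch of first-match evaluation with a default deny. It is a simplification made for illustration: real engines such as Calico and Cilium add their own semantics (policy ordering, tiers, L7 rules), and the `Rule` class, the `decide` function, and the workload names `batch-job` and `payments-db` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    direction: str      # "Ingress" or "Egress"
    source: str         # workload label, or "*" for any
    destination: str    # workload label, or "*" for any
    port: int           # 0 means any port
    action: str         # "Allow" or "Deny"


def matches(rule: Rule, direction: str, source: str, destination: str, port: int) -> bool:
    """True when the connection falls under the rule, honoring wildcards."""
    return (
        rule.direction == direction
        and rule.source in ("*", source)
        and rule.destination in ("*", destination)
        and rule.port in (0, port)
    )


def decide(rules: list[Rule], direction: str, source: str, destination: str, port: int) -> str:
    """Return the action of the first matching rule, or "Deny" when none match."""
    for rule in rules:
        if matches(rule, direction, source, destination, port):
            return rule.action
    return "Deny"


policy = [
    Rule("Ingress", "api-gateway", "vllm-llama", 8000, "Allow"),
    Rule("Egress", "vllm-llama", "model-cache", 6379, "Allow"),
    Rule("Egress", "vllm-llama", "prometheus", 9090, "Allow"),
    Rule("Egress", "vllm-llama", "*", 0, "Deny"),
    Rule("Ingress", "*", "vllm-llama", 8000, "Deny"),
]

print(decide(policy, "Ingress", "api-gateway", "vllm-llama", 8000))
print(decide(policy, "Ingress", "batch-job", "vllm-llama", 8000))
print(decide(policy, "Egress", "vllm-llama", "model-cache", 6379))
print(decide(policy, "Egress", "vllm-llama", "payments-db", 5432))
```

Because evaluation stops at the first match, the specific Allow rules have to come before the catch-all Deny rules.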

Production Architecture

# === Production LLM + Zero Trust ===

# Architecture:
# Client -> WAF -> API Gateway (auth+rate limit)
# -> vLLM Pod (GPU, Network Policy isolated)
# -> Response -> Client
#
# Monitoring: Prometheus -> Grafana
# Logging: vLLM -> Fluentbit -> Elasticsearch
# Audit: API Gateway logs all requests

# Rate Limiting & Token Budget
# (extensionRef also requires the group and kind of the gateway
#  implementation's rate-limit resource; they are omitted here)
# apiVersion: gateway.networking.k8s.io/v1
# kind: HTTPRoute
# metadata:
#   name: llm-route
# spec:
#   rules:
#     - matches:
#         - path: { value: "/v1/chat/completions" }
#       filters:
#         - type: ExtensionRef
#           extensionRef:
#             name: rate-limit-100rpm
#       backendRefs:
#         - name: vllm-service
#           port: 8000

security_layers = {
    "WAF": "Block SQL Injection, XSS, Bot Detection",
    "API Gateway": "Authentication, Rate Limiting, Token Budget",
    "Network Policy": "Micro-segmentation, Ingress/Egress Control",
    "Pod Security": "Non-root, Read-only FS, No Privilege Escalation",
    "Encryption": "mTLS between services, TLS 1.3 external",
    "Audit Log": "All API calls logged with user, tokens, cost",
    "Model Protection": "Weights encrypted at rest, no egress allowed",
}

print("Zero Trust Security Layers:")
for layer, desc in security_layers.items():
    print(f" [{layer}]: {desc}")

# Cost Monitoring
costs = {
    "GPU (A100 x2)": "$6.80/hr = $4,896/mo",
    "API Gateway": "$50/mo",
    "Monitoring": "$30/mo",
    "Network (egress)": "$100/mo",
    "Total": "~$5,076/mo",
    "Cost per 1M tokens": "~$0.15 (vs OpenAI $3.00)",
    "Savings": "95% vs OpenAI API",
}

print("\n\nCost Analysis:")
for item, cost in costs.items():
    print(f" {item}: {cost}")
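The cost-per-token figure above depends heavily on how many tokens the GPUs actually serve. As a rough worked example, the sketch below derives the monthly total from the listed items and computes the sustained aggregate throughput needed to reach about $0.15 per 1M tokens. It assumes a 720-hour month (the same basis as $6.80/hr = $4,896/mo) and full-time utilization; the function names are illustrative.

```python
HOURS_PER_MONTH = 720  # 30 days, matching $6.80/hr -> $4,896/mo


def cost_per_million_tokens(monthly_cost_usd: float, tokens_per_second: float) -> float:
    """USD per 1M tokens at a given sustained aggregate throughput."""
    tokens_per_month = tokens_per_second * 3600 * HOURS_PER_MONTH
    return monthly_cost_usd / tokens_per_month * 1_000_000


def required_tokens_per_second(monthly_cost_usd: float, target_usd_per_million: float) -> float:
    """Sustained throughput needed to hit a target cost per 1M tokens."""
    tokens_per_month = monthly_cost_usd / target_usd_per_million * 1_000_000
    return tokens_per_month / (3600 * HOURS_PER_MONTH)


# GPU + API gateway + monitoring + network egress, as listed above
monthly_total = 6.80 * HOURS_PER_MONTH + 50 + 30 + 100
needed = required_tokens_per_second(monthly_total, 0.15)
at_1k = cost_per_million_tokens(monthly_total, 1000)

print(f"monthly total: ${monthly_total:,.0f}")
print(f"throughput needed for $0.15 per 1M tokens: {needed:,.0f} tokens/s")
print(f"cost at 1,000 tokens/s sustained: ${at_1k:.2f} per 1M tokens")
```

The target is only reachable with high, sustained batch throughput around the clock. At lower utilization the cost per token rises proportionally.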

Tips

  • PagedAttention: vLLM manages KV cache memory more efficiently, leaving room for larger models or more concurrent requests on the same GPU
  • Deny All: start from a default-deny policy, then allow only the flows you need
  • Rate Limit: cap requests per user to prevent abuse
  • Egress Block: deny LLM pods access to the internet to prevent data leaks
  • Audit: log every request for compliance and cost tracking
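The rate-limit tip can be illustrated with a token bucket kept per user. This is a minimal in-memory sketch, not the mechanism of any particular API gateway: the class names are hypothetical, the clock is passed in explicitly so the behavior is deterministic, and the rate of roughly 100 requests per minute with a burst of 5 is an assumed setting.

```python
from dataclasses import dataclass, field


@dataclass
class TokenBucket:
    capacity: float            # maximum burst, in requests
    refill_per_sec: float      # sustained rate, in requests per second
    tokens: float = field(init=False)
    updated_at: float = 0.0    # timestamp of the last refill, in seconds

    def __post_init__(self) -> None:
        self.tokens = self.capacity  # start with a full bucket

    def allow(self, now: float) -> bool:
        """Refill for the elapsed time, then spend one token if available."""
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated_at = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class PerUserLimiter:
    """Keeps one independent bucket per user."""

    def __init__(self, capacity: float, refill_per_sec: float) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.buckets: dict[str, TokenBucket] = {}

    def allow(self, user: str, now: float) -> bool:
        if user not in self.buckets:
            self.buckets[user] = TokenBucket(self.capacity, self.refill_per_sec, updated_at=now)
        return self.buckets[user].allow(now)


limiter = PerUserLimiter(capacity=5, refill_per_sec=100 / 60)
burst = [limiter.allow("alice", now=0.0) for _ in range(7)]
print(burst)                               # the burst beyond capacity is rejected
print(limiter.allow("alice", now=3.0))     # tokens have refilled after 3 seconds
print(limiter.allow("bob", now=0.0))       # other users are unaffected
```

In production the timestamp would come from a monotonic clock, and the bucket state would typically live in a shared store so that all gateway replicas see the same counts.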

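What is vLLM?

vLLM is an LLM inference engine built around PagedAttention and continuous batching. Throughput gains on the order of 10-24x are cited for it (the baseline depends on the benchmark). It exposes an OpenAI-compatible API and serves models such as Llama, Mistral, and Qwen on GPUs.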
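The idea behind PagedAttention is to store each sequence's KV cache in fixed-size blocks that need not be contiguous, with a per-sequence block table mapping logical positions to physical blocks. The toy allocator below sketches only that bookkeeping. It is not vLLM's implementation: `PagedKVCache` and its methods are hypothetical names, and the block size of 16 tokens is an illustrative value.

```python
class PagedKVCache:
    """Toy allocator: each sequence owns a list of fixed-size physical blocks."""

    def __init__(self, num_blocks: int, block_size: int = 16) -> None:
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}
        self.lengths: dict[str, int] = {}

    def append_token(self, seq_id: str) -> bool:
        """Reserve room for one more token; return False when memory is exhausted."""
        length = self.lengths.get(seq_id, 0)
        table = self.block_tables.get(seq_id, [])
        if length == len(table) * self.block_size:  # no block yet, or last block is full
            if not self.free_blocks:
                return False
            table.append(self.free_blocks.pop())
        self.block_tables[seq_id] = table
        self.lengths[seq_id] = length + 1
        return True

    def free(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=16)
for _ in range(20):
    cache.append_token("a")   # 20 tokens occupy 2 blocks
for _ in range(5):
    cache.append_token("b")   # 5 tokens occupy 1 block
print(cache.block_tables, "free blocks:", len(cache.free_blocks))

cache.free("a")
print("free blocks after releasing a:", len(cache.free_blocks))
```

Since blocks are allocated only as tokens arrive, a short sequence never reserves memory for the full maximum context length, which is where the memory savings come from.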

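What is micro-segmentation?

Micro-segmentation splits the network into small segments and uses policies to control which workloads may communicate, following Zero Trust principles. On Kubernetes it is implemented with network policies, for example through Calico or Cilium, that restrict east-west traffic and so limit lateral movement.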
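One way to reason about lateral movement is to treat the allowed flows as a directed graph and ask which workloads are reachable from a compromised pod. The sketch below compares a flat network with a segmented one using a breadth-first search. The workload names `billing-db` and `admin-ui` are hypothetical, and the model ignores ports and protocols.

```python
from collections import deque


def reachable(allowed: dict[str, set[str]], start: str) -> set[str]:
    """Workloads an attacker could reach from `start` by following allowed flows."""
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in allowed.get(current, set()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen - {start}


workloads = ["api-gateway", "vllm-llama", "model-cache", "billing-db", "admin-ui"]

# Flat network: every workload can talk to every other workload.
flat = {w: {o for o in workloads if o != w} for w in workloads}

# Segmented network: only the flows the policies explicitly allow.
segmented = {
    "api-gateway": {"vllm-llama"},
    "vllm-llama": {"model-cache"},
}

print("flat:", sorted(reachable(flat, "vllm-llama")))
print("segmented:", sorted(reachable(segmented, "vllm-llama")))
```

The smaller the reachable set, the less an attacker gains from compromising a single inference pod.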

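Why does LLM serving need micro-segmentation?

GPUs are expensive and model weights are valuable, so access to the API has to be controlled with rate limits and token budgets. Isolating the serving pods and controlling their ingress and egress also guards against data exfiltration.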
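A token budget can be enforced by reserving the worst-case usage of a request before it reaches the GPU and refunding what was not generated afterwards. The sketch below shows that accounting under assumed names and limits: `TokenBudget`, the daily limit of 1,000 tokens, and the user `alice` are all illustrative, and a real gateway would persist the counters and reset them on a schedule.

```python
class TokenBudget:
    """Per-user token accounting with reserve-then-settle semantics."""

    def __init__(self, daily_limit: int) -> None:
        self.daily_limit = daily_limit
        self.used: dict[str, int] = {}

    def remaining(self, user: str) -> int:
        return self.daily_limit - self.used.get(user, 0)

    def reserve(self, user: str, prompt_tokens: int, max_tokens: int) -> bool:
        """Admit the request only if its worst-case usage fits the budget."""
        worst_case = prompt_tokens + max_tokens
        if worst_case > self.remaining(user):
            return False
        self.used[user] = self.used.get(user, 0) + worst_case
        return True

    def settle(self, user: str, max_tokens: int, completion_tokens: int) -> None:
        """Refund the part of the reservation the model did not generate."""
        self.used[user] -= max(0, max_tokens - completion_tokens)


budget = TokenBudget(daily_limit=1000)
first = budget.reserve("alice", prompt_tokens=100, max_tokens=512)
budget.settle("alice", max_tokens=512, completion_tokens=200)
second = budget.reserve("alice", prompt_tokens=300, max_tokens=512)

print(first, budget.remaining("alice"), second)
```

Reserving up front keeps a burst of concurrent requests from overshooting the budget while earlier responses are still being generated.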

How do vLLM and TGI differ?

vLLM centers on PagedAttention and continuous batching for the highest throughput, with an OpenAI-compatible API. TGI offers tensor parallelism, streaming, and watermarking. Both are solid choices; pick based on your workload.
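The effect of continuous batching can be shown with a small simulation that counts decode steps. With static batching, a batch holds its slots until the longest request finishes. With continuous batching, a finished request frees its slot for the next waiting request right away. This is a simplified model, not a benchmark of vLLM or TGI: it assumes every decode step costs the same regardless of how full the batch is, and the output lengths are made up.

```python
def static_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Each batch runs until its longest request has finished."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths: list[int], batch_size: int) -> int:
    """A finished request frees its slot immediately for the next waiting one."""
    waiting = list(lengths)
    running: list[int] = []   # tokens still to generate, per active request
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.pop(0))
        steps += 1            # one decode step produces one token per active request
        running = [r - 1 for r in running if r > 1]
    return steps


# Hypothetical output lengths in tokens: two long requests among short ones.
output_lengths = [100, 10, 10, 10, 100, 10, 10, 10]

static_steps = static_batching_steps(output_lengths, batch_size=4)
continuous_steps = continuous_batching_steps(output_lengths, batch_size=4)
print("static:", static_steps, "continuous:", continuous_steps)
```

The gap widens as output lengths become more uneven, because static batches spend more steps with idle slots.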

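Summary

vLLM delivers high-throughput LLM inference through PagedAttention and continuous batching, behind an OpenAI-compatible API. For secure GPU serving on Kubernetes, pair it with Zero Trust micro-segmentation using Calico or Cilium network policies, plus rate limits and token budgets at the API gateway.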

XM Legend · Forex trader and instructor for 13 years

Founder of SiamCafe since 1997 · Forex trader for more than 13 years, recognized as an XM Legend · Shares knowledge on Forex, IT, AI, and trading from real experience in real markets