SiamCafe.net Blog
Technology

TensorRT Optimization SaaS Architecture: Build a Fast, Cost-Effective AI Inference Platform

2025-07-02 · Ajarn Bom — SiamCafe.net · 876 words

What Is TensorRT and Why Does It Matter?

TensorRT is a high-performance deep learning inference SDK from NVIDIA. It optimizes neural network models to run on NVIDIA GPUs, typically delivering 2-10x faster inference than running the original framework (PyTorch, TensorFlow) directly. Its main techniques are layer fusion (merging multiple layers into a single kernel), precision calibration (reducing precision from FP32 to FP16/INT8 while keeping accuracy close to the original), kernel auto-tuning (choosing the fastest kernel implementation for the specific GPU), and memory optimization (shrinking the memory footprint).
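Layer fusion is easiest to see in the classic conv + batch-norm fold: the BN statistics are absorbed into the convolution's weight and bias, so one kernel does the work of two. A minimal scalar sketch in plain Python (illustrative only, not TensorRT code):

```python
import math

def fold_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold batch-norm parameters into a conv weight/bias pair."""
    scale = gamma / math.sqrt(var + eps)
    return w * scale, (b - mean) * scale + beta

# Unfused: a (scalar) conv followed by batch-norm
w, b = 2.0, 0.5
gamma, beta, mean, var = 1.5, 0.1, 0.3, 4.0
x = 3.0
conv = w * x + b
bn = gamma * (conv - mean) / math.sqrt(var + 1e-5) + beta

# Fused: a single conv with folded parameters gives the same output
wf, bf = fold_bn(w, b, gamma, beta, mean, var)
fused = wf * x + bf
assert abs(bn - fused) < 1e-9
```

TensorRT applies the same idea (and many more fusions, e.g. conv + bias + ReLU) across the whole graph, which is where much of its speedup comes from.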

For a SaaS (Software as a Service) business serving AI inference, TensorRT pays off directly: lower latency improves user experience, lower GPU cost means one GPU can serve far more requests, and higher throughput absorbs more concurrent requests. It suits production workloads such as chatbot APIs, image recognition services, recommendation systems, and real-time video analytics.
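The cost argument can be made concrete: when TensorRT raises per-GPU throughput, the same traffic needs fewer GPUs. A rough sizing sketch with illustrative numbers (the QPS figures echo the ResNet50 benchmarks later in this post; the hourly rate is an assumed T4 on-demand price):

```python
import math

def gpus_needed(traffic_qps, per_gpu_qps):
    """Number of GPUs required to serve a sustained request rate."""
    return math.ceil(traffic_qps / per_gpu_qps)

def monthly_cost(traffic_qps, per_gpu_qps, price_hr):
    """Monthly GPU bill at a given per-GPU throughput and hourly price."""
    return gpus_needed(traffic_qps, per_gpu_qps) * price_hr * 24 * 30

# Illustrative: 1000 QPS of traffic on T4-class GPUs at $0.526/hr
baseline = monthly_cost(1000, 80, 0.526)    # unoptimized: ~80 QPS per GPU
optimized = monthly_cost(1000, 476, 0.526)  # TensorRT FP16: ~476 QPS per GPU
print(f"baseline ${baseline:,.0f}/mo vs optimized ${optimized:,.0f}/mo")
```

The fleet shrinks from 13 GPUs to 3 in this sketch, which is the whole business case in one division.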

Installing and Setting Up TensorRT

Setting up TensorRT for production inference

# === TensorRT Installation ===

# 1. Install via NVIDIA Container (recommended)
docker pull nvcr.io/nvidia/tensorrt:24.05-py3

# Run container
docker run --gpus all -it --rm \
  -v $(pwd)/models:/models \
  -v $(pwd)/data:/data \
  nvcr.io/nvidia/tensorrt:24.05-py3

# 2. Install via pip (for development)
pip install tensorrt==10.0.1
pip install torch torchvision onnx onnxruntime-gpu

# 3. Verify installation
python3 -c "
import tensorrt as trt
print(f'TensorRT version: {trt.__version__}')

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
print(f'Platform has fast FP16: {builder.platform_has_fast_fp16}')
print(f'Platform has fast INT8: {builder.platform_has_fast_int8}')
"

# 4. Docker Compose for TensorRT Service
cat > docker-compose.yml << 'EOF'
version: '3.8'
services:
  triton:
    image: nvcr.io/nvidia/tritonserver:24.05-py3
    ports:
      - "8000:8000"   # HTTP
      - "8001:8001"   # gRPC
      - "8002:8002"   # Metrics
    volumes:
      - ./model_repository:/models
    command: tritonserver --model-repository=/models
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  api:
    image: python:3.12-slim
    ports:
      - "8080:8080"
    volumes:
      - ./api:/app
    working_dir: /app
    # note: python:3.12-slim has no uvicorn preinstalled; install at startup or bake a custom image
    command: sh -c "pip install fastapi uvicorn && uvicorn app.main:app --host 0.0.0.0 --port 8080"
    depends_on:
      - triton
EOF

echo "TensorRT environment ready"

Model Optimization Pipeline

Steps to optimize a model with TensorRT

#!/usr/bin/env python3
# tensorrt_optimize.py - Model Optimization Pipeline
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("optimize")

class TensorRTOptimizer:
    """TensorRT Model Optimization Pipeline"""
    
    def __init__(self):
        self.optimization_steps = []
    
    def optimization_pipeline(self):
        return {
            "step_1_export_onnx": {
                "description": "Export PyTorch model to ONNX format",
                "code": """
import torch
import torchvision.models as models

model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.eval()

dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model, dummy_input, 'resnet50.onnx',
    input_names=['input'],
    output_names=['output'],
    dynamic_axes={'input': {0: 'batch_size'}, 'output': {0: 'batch_size'}},
    opset_version=17,
)
                """,
            },
            "step_2_optimize_trt": {
                "description": "Convert ONNX to TensorRT engine",
                "code": """
# Using trtexec CLI
trtexec --onnx=resnet50.onnx \\
    --saveEngine=resnet50_fp16.engine \\
    --fp16 \\
    --memPoolSize=workspace:4096M \\
    --minShapes=input:1x3x224x224 \\
    --optShapes=input:8x3x224x224 \\
    --maxShapes=input:32x3x224x224 \\
    --verbose
                """,
            },
            "step_3_int8_calibration": {
                "description": "INT8 quantization with calibration",
                "code": """
# --calib expects a calibration cache file produced by an IInt8Calibrator run
trtexec --onnx=resnet50.onnx \\
    --saveEngine=resnet50_int8.engine \\
    --int8 \\
    --calib=calibration.cache \\
    --memPoolSize=workspace:4096M
                """,
            },
            "step_4_benchmark": {
                "description": "Benchmark optimized engine",
                "code": """
trtexec --loadEngine=resnet50_fp16.engine \\
    --shapes=input:8x3x224x224 \\
    --iterations=1000 \\
    --warmUp=500 \\
    --duration=10
                """,
            },
        }
    
    def performance_comparison(self):
        return {
            "resnet50": {
                "pytorch_fp32": {"latency_ms": 12.5, "throughput_fps": 80, "gpu_mem_mb": 1200},
                "tensorrt_fp32": {"latency_ms": 4.2, "throughput_fps": 238, "gpu_mem_mb": 800},
                "tensorrt_fp16": {"latency_ms": 2.1, "throughput_fps": 476, "gpu_mem_mb": 450},
                "tensorrt_int8": {"latency_ms": 1.3, "throughput_fps": 769, "gpu_mem_mb": 280},
            },
            "bert_base": {
                "pytorch_fp32": {"latency_ms": 8.5, "throughput_qps": 118},
                "tensorrt_fp16": {"latency_ms": 2.8, "throughput_qps": 357},
                "tensorrt_int8": {"latency_ms": 1.9, "throughput_qps": 526},
            },
            "gpu": "NVIDIA A100 40GB",
        }

optimizer = TensorRTOptimizer()
pipeline = optimizer.optimization_pipeline()
print("TensorRT Optimization Pipeline:")
for step, info in pipeline.items():
    print(f"  {step}: {info['description']}")

perf = optimizer.performance_comparison()
print(f"\nPerformance (ResNet50 on {perf['gpu']}):")
for config, metrics in perf["resnet50"].items():
    print(f"  {config}: {metrics['latency_ms']}ms, {metrics['throughput_fps']} FPS")
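A benchmark table like the one above can feed a deployment decision directly: choose the most numerically conservative precision that still meets the latency SLA. A small selector sketch using the illustrative ResNet50 latencies:

```python
# Illustrative latencies (ResNet50, ms), ordered from most to least
# numerically conservative
CONFIGS = [
    ("tensorrt_fp32", 4.2),
    ("tensorrt_fp16", 2.1),
    ("tensorrt_int8", 1.3),
]

def pick_precision(sla_ms):
    """Return the most conservative precision whose latency meets the SLA."""
    for name, latency_ms in CONFIGS:
        if latency_ms <= sla_ms:
            return name
    return None  # nothing fits: need a bigger GPU or a smaller model

print(pick_precision(5.0))  # FP32 already fits a relaxed SLA
print(pick_precision(3.0))  # a tighter SLA forces FP16
print(pick_precision(1.0))  # even INT8 cannot meet this one
```

In practice you would re-measure latencies on your own GPU and batch sizes before trusting the selection.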

SaaS Architecture for AI Inference

Designing a SaaS inference platform

# === AI Inference SaaS Architecture ===

cat > saas_architecture.yaml << 'EOF'
ai_inference_saas:
  components:
    api_gateway:
      technology: "Kong / AWS API Gateway"
      features:
        - "Rate limiting per API key/plan"
        - "Authentication (API key, JWT, OAuth2)"
        - "Request routing and load balancing"
        - "Request/response transformation"
        - "Usage metering for billing"
      
    inference_service:
      technology: "NVIDIA Triton Inference Server"
      features:
        - "Multi-model serving"
        - "Dynamic batching (combine requests)"
        - "Model versioning and A/B testing"
        - "GPU sharing across models"
        - "gRPC and HTTP endpoints"
      config:
        dynamic_batching:
          preferred_batch_size: [4, 8, 16]
          max_queue_delay_microseconds: 100000
        instance_group:
          - kind: KIND_GPU
            count: 2
      
    model_repository:
      storage: "S3 / GCS / Azure Blob"
      structure: |
        model_repository/
          resnet50/
            config.pbtxt
            1/
              model.plan  (TensorRT engine)
          bert_base/
            config.pbtxt
            1/
              model.plan
      
    autoscaler:
      technology: "Kubernetes HPA + KEDA"
      metrics:
        - "GPU utilization > 80%"
        - "Request queue depth > 100"
        - "P99 latency > SLA threshold"
      scaling:
        min_replicas: 1
        max_replicas: 10
        scale_up_time: "60s"
        scale_down_time: "300s"
      
    billing:
      model: "Pay per inference"
      tiers:
        free: "100 requests/day"
        basic: "$0.001/request, 10K/day"
        pro: "$0.0005/request, 100K/day"
        enterprise: "Custom pricing, SLA"
      
  infrastructure:
    gpu_instances:
      - "NVIDIA T4 (cost-effective inference)"
      - "NVIDIA A10G (balanced)"
      - "NVIDIA A100 (high throughput)"
      - "NVIDIA H100 (latest, fastest)"
    
    kubernetes:
      cluster: "EKS / GKE with GPU node pools"
      gpu_operator: "NVIDIA GPU Operator"
      device_plugin: "NVIDIA Device Plugin"
EOF

echo "SaaS architecture defined"
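Triton's dynamic batching (configured above with `preferred_batch_size` and `max_queue_delay_microseconds`) trades a bounded queueing delay for larger batches. Its effect can be approximated with a toy greedy simulator (illustrative only, not Triton's actual scheduler):

```python
def batch_requests(arrival_times_ms, max_delay_ms, max_batch):
    """Greedy batcher: a batch closes once it is full or the oldest
    queued request would exceed the allowed queueing delay."""
    batches, queue = [], []
    for t in arrival_times_ms:
        if queue and (t - queue[0] > max_delay_ms or len(queue) == max_batch):
            batches.append(queue)
            queue = []
        queue.append(t)
    if queue:
        batches.append(queue)
    return batches

# 10 requests arriving 2 ms apart, 8 ms max queue delay, batch cap of 4
arrivals = [i * 2 for i in range(10)]
batches = batch_requests(arrivals, max_delay_ms=8, max_batch=4)
print([len(b) for b in batches])  # fewer, larger GPU launches than 10 singles
```

Larger batches raise GPU utilization at the cost of up to `max_delay_ms` of added latency per request, which is exactly the knob the Triton config exposes.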

Scaling and Cost Optimization

Strategies for reducing the cost per inference

#!/usr/bin/env python3
# cost_optimizer.py - AI Inference Cost Optimizer
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("cost")

class InferenceCostOptimizer:
    def __init__(self):
        pass
    
    def gpu_pricing(self):
        return {
            "aws": {
                "g4dn.xlarge": {"gpu": "T4", "price_hr": 0.526, "vram_gb": 16, "best_for": "Cost-effective inference"},
                "g5.xlarge": {"gpu": "A10G", "price_hr": 1.006, "vram_gb": 24, "best_for": "Balanced performance"},
                "p4d.24xlarge": {"gpu": "A100x8", "price_hr": 32.77, "vram_gb": 320, "best_for": "High throughput"},
            },
            "spot_savings": "60-90% discount with Spot instances",
        }
    
    def cost_per_inference(self, gpu_price_hr, throughput_qps):
        """Calculate cost per inference"""
        cost_per_second = gpu_price_hr / 3600
        cost_per_inference = cost_per_second / throughput_qps
        cost_per_1k = cost_per_inference * 1000
        
        return {
            "cost_per_inference": round(cost_per_inference, 6),
            "cost_per_1k_requests": round(cost_per_1k, 4),
            "cost_per_1m_requests": round(cost_per_inference * 1000000, 2),
        }
    
    def optimization_strategies(self):
        return {
            "model_optimization": {
                "fp16_quantization": "cuts latency ~50%, ~2x throughput, halves memory",
                "int8_quantization": "cuts latency ~60%, ~3x throughput, ~70% less memory",
                "model_pruning": "shrinks the model 30-50% with little accuracy loss",
                "knowledge_distillation": "train a small model to mimic a large one",
            },
            "infrastructure": {
                "spot_instances": "60-90% cheaper GPUs (good for batch, risky for real-time)",
                "auto_scaling": "scale down when traffic is low (nights, weekends)",
                "gpu_sharing": "run multiple models on one GPU (MIG, MPS)",
                "right_sizing": "match the GPU to the workload (T4 vs A100)",
            },
            "serving": {
                "dynamic_batching": "combine requests into batches to raise GPU utilization",
                "response_caching": "cache results for repeated inputs",
                "model_warmup": "pre-load models to avoid cold starts",
                "request_coalescing": "deduplicate identical in-flight requests",
            },
        }

optimizer = InferenceCostOptimizer()

# Cost comparison: T4 vs A100
t4_cost = optimizer.cost_per_inference(0.526, 476)  # T4 + TensorRT FP16
a100_cost = optimizer.cost_per_inference(3.673, 2000)  # A100 + TensorRT FP16

print("Cost per 1M Inferences:")
print(f"  T4 (FP16):   ${t4_cost['cost_per_1m_requests']}")
print(f"  A100 (FP16): ${a100_cost['cost_per_1m_requests']}")

strategies = optimizer.optimization_strategies()
print("\nOptimization Strategies:")
for category, items in strategies.items():
    print(f"\n  {category}:")
    for name, desc in list(items.items())[:2]:
        print(f"    {name}: {desc}")

Monitoring and Performance Tuning

Setting up monitoring for inference performance

# === Inference Monitoring ===

# 1. Triton Metrics (Prometheus format)
cat > prometheus-triton.yml << 'EOF'
scrape_configs:
  - job_name: 'triton'
    scrape_interval: 5s
    static_configs:
      - targets: ['triton:8002']
    metrics_path: /metrics
EOF

# 2. Key Metrics to Monitor
cat > monitoring_config.yaml << 'EOF'
inference_metrics:
  latency:
    - name: "nv_inference_request_duration_us"
      description: "End-to-end inference latency"
      alert_threshold: "P99 > 100ms"
    - name: "nv_inference_queue_duration_us"
      description: "Time spent in queue"
      alert_threshold: "P99 > 50ms"
    - name: "nv_inference_compute_infer_duration_us"
      description: "GPU compute time"
  
  throughput:
    - name: "nv_inference_request_success"
      description: "Successful inferences per second"
    - name: "nv_inference_request_failure"
      description: "Failed inferences (should be ~0)"
  
  gpu:
    - name: "DCGM_FI_DEV_GPU_UTIL"
      description: "GPU utilization %"
      target: "60-80%"
    - name: "DCGM_FI_DEV_FB_USED"
      description: "GPU memory used"
    - name: "DCGM_FI_DEV_GPU_TEMP"
      description: "GPU temperature"
      alert: "> 85C"
  
  business:
    - name: "requests_per_api_key"
      description: "Requests per customer"
    - name: "cost_per_inference"
      description: "Running cost per inference"
    - name: "sla_compliance"
      description: "% requests within SLA latency"
EOF

# 3. Grafana Dashboard Query Examples
cat > grafana_queries.txt << 'EOF'
# P99 Latency
histogram_quantile(0.99, rate(nv_inference_request_duration_us_bucket[5m]))

# Throughput (requests/sec)
rate(nv_inference_request_success[1m])

# GPU Utilization
avg(DCGM_FI_DEV_GPU_UTIL)

# Error Rate
rate(nv_inference_request_failure[5m]) / (rate(nv_inference_request_success[5m]) + rate(nv_inference_request_failure[5m])) * 100

# Queue Depth
nv_inference_pending_request_count
EOF

echo "Monitoring configured"
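The P99 alerts above can be sanity-checked offline whenever raw latency samples are available, e.g. from access logs. A minimal nearest-rank percentile helper in plain Python:

```python
def percentile(samples, p):
    """Nearest-rank percentile (p in percent) of a list of samples."""
    ordered = sorted(samples)
    k = -(-len(ordered) * p // 100) - 1  # ceil(n * p / 100) - 1
    return ordered[max(0, k)]

# 98 fast requests plus two slow outliers: P99 catches what P50 hides
latencies_ms = [10] * 98 + [250, 300]
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
print(f"P50={p50}ms P99={p99}ms")
```

This is why the alert thresholds above are phrased in terms of P99 rather than the average: a handful of slow requests barely moves the mean but dominates the tail.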

FAQ: Frequently Asked Questions

Q: How do TensorRT and ONNX Runtime differ, and which should I choose?

A: TensorRT is an NVIDIA-specific optimizer that delivers the best performance on NVIDIA GPUs, thanks to FP16/INT8 quantization, layer fusion, and kernel auto-tuning matched to the specific GPU. Its drawbacks: it only runs on NVIDIA GPUs, and a TensorRT engine is tied to a GPU architecture (you must rebuild it when you change GPUs). ONNX Runtime supports multiple backends (CPU, GPU, NPU) and is far more portable: it runs on Intel, AMD, ARM, and NVIDIA hardware and is easier to deploy, but on an NVIDIA GPU it is typically 20-40% slower than TensorRT. Rule of thumb: use TensorRT for production inference on NVIDIA GPUs, and ONNX Runtime for multi-platform deployment or CPU inference.

Q: Should I use FP16 or INT8 quantization?

A: FP16 (half precision) cuts latency by roughly 50% with negligible accuracy loss (< 0.1%) and needs no calibration, so it is easy to adopt; it suits most models and balances speed against accuracy. INT8 (8-bit integer) cuts latency by roughly 60-70% but loses more accuracy than FP16 (0.5-2%) and requires calibration with a representative dataset; with poor calibration data, accuracy drops further. It works best for CNN models (ResNet, YOLO) that tolerate quantization well. Recommendation: start with FP16 (easy and safe); if latency is still too high, try INT8 and validate accuracy before deploying.
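Why calibration matters can be seen in miniature: the calibrator's job is to pick a scale so that real activations map onto the 256 available INT8 levels with minimal error. A toy symmetric-quantization sketch (plain Python, not the TensorRT calibrator API):

```python
def quantize_dequantize(values, scale):
    """Symmetric INT8 round trip: scale, round, clamp to [-127, 127], rescale."""
    return [max(-127, min(127, round(v / scale))) * scale for v in values]

def max_error(values, scale):
    """Worst-case reconstruction error for a given quantization scale."""
    return max(abs(v - d) for v, d in zip(values, quantize_dequantize(values, scale)))

# Mostly small activations with one large outlier
activations = [0.01 * i for i in range(100)] + [4.0]
good_scale = max(abs(v) for v in activations) / 127  # calibrated to the data
bad_scale = good_scale * 10                          # as if calibrated on the wrong data
print(max_error(activations, good_scale) < max_error(activations, bad_scale))
```

A real calibrator picks the scale per tensor from histograms of representative inputs, which is why the calibration dataset must resemble production traffic.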

Q: Do I really need Triton Inference Server?

A: Not strictly required, but strongly recommended for production. Triton gives you dynamic batching (combining requests to raise GPU utilization), model versioning (deploy new models without downtime), multi-model serving (several models sharing one GPU), plus health checks, metrics, and model management out of the box. Without Triton you would have to implement all of this yourself, which takes considerable time. Alternatives include TorchServe (PyTorch), TensorFlow Serving, BentoML, and Ray Serve, but Triton pairs best with TensorRT since both come from NVIDIA. For a prototype, FastAPI plus the TensorRT SDK is enough; for production, use Triton.

Q: Which GPU should I choose for an inference SaaS?

A: It depends on workload and budget. NVIDIA T4 is the cheapest option ($0.526/hr on AWS) with 16GB VRAM and FP16/INT8 support, good for small-to-medium models and cost-sensitive workloads. NVIDIA A10G is the balanced choice (~$1/hr) with 24GB VRAM, suited to medium models and moderate throughput. NVIDIA A100 is high-end ($3.67/hr for 40GB) with 40/80GB VRAM, suited to large models (LLMs) and high throughput. NVIDIA H100 is the latest and fastest ($6+/hr) with 80GB VRAM, suited to LLM inference and latency-critical workloads. For a startup, begin with T4 on Spot instances to keep costs low; as traffic grows, upgrade to A10G/A100 and use auto-scaling to match GPU capacity to demand.
