What is TensorRT?
TensorRT is NVIDIA's high-performance deep learning inference SDK. It optimizes trained neural network models so they run inference on NVIDIA GPUs, typically 2-10x faster than running the same model directly in frameworks such as PyTorch or TensorFlow. Its core optimizations are: Layer fusion, which merges multiple layers into a single kernel to cut launch overhead; Precision calibration, which lowers precision from FP32 to FP16/INT8 while preserving accuracy; Kernel auto-tuning, which benchmarks candidate kernels and picks the fastest for the target GPU; and Memory optimization, which reduces the memory footprint.
For a SaaS (Software as a Service) business offering AI inference, TensorRT directly improves the economics: lower latency means better user experience, lower GPU cost means serving more requests per GPU, and higher throughput means handling more concurrent requests. Common production use cases include chatbot APIs, image recognition services, recommendation systems, and real-time video analytics.
Installing and Setting Up TensorRT
Setting up TensorRT for production inference
# === TensorRT Installation ===
# 1. Install via NVIDIA Container (recommended)
docker pull nvcr.io/nvidia/tensorrt:24.05-py3
# Run container
docker run --gpus all -it --rm \
-v $(pwd)/models:/models \
-v $(pwd)/data:/data \
nvcr.io/nvidia/tensorrt:24.05-py3
# 2. Install via pip (for development)
pip install tensorrt==10.0.1
pip install torch torchvision onnx onnxruntime-gpu
# 3. Verify installation
python3 -c "
import tensorrt as trt
print(f'TensorRT version: {trt.__version__}')
logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
print(f'Platform has fast FP16: {builder.platform_has_fast_fp16}')
print(f'Platform has fast INT8: {builder.platform_has_fast_int8}')
"
# 4. Docker Compose for TensorRT Service
cat > docker-compose.yml << 'EOF'
version: '3.8'
services:
  triton:
    image: nvcr.io/nvidia/tritonserver:24.05-py3
    ports:
      - "8000:8000"   # HTTP
      - "8001:8001"   # gRPC
      - "8002:8002"   # Metrics
    volumes:
      - ./model_repository:/models
    command: tritonserver --model-repository=/models --strict-model-config=false
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
  api:
    image: python:3.12-slim
    ports:
      - "8080:8080"
    volumes:
      - ./api:/app
    command: uvicorn app.main:app --host 0.0.0.0 --port 8080
    depends_on:
      - triton
EOF
echo "TensorRT environment ready"
Model Optimization Pipeline
The steps for optimizing a model with TensorRT
#!/usr/bin/env python3
# tensorrt_optimize.py - Model Optimization Pipeline

import json
import logging
from typing import Dict, List

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("optimize")


class TensorRTOptimizer:
    """TensorRT Model Optimization Pipeline"""

    def __init__(self):
        self.optimization_steps = []

    def optimization_pipeline(self):
        return {
            "step_1_export_onnx": {
                "description": "Export PyTorch model to ONNX format",
                "code": """
import torch
import torchvision.models as models

model = models.resnet50(pretrained=True)
model.eval()

dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model, dummy_input, 'resnet50.onnx',
    input_names=['input'],
    output_names=['output'],
    dynamic_axes={'input': {0: 'batch_size'}, 'output': {0: 'batch_size'}},
    opset_version=17,
)
""",
            },
            "step_2_optimize_trt": {
                "description": "Convert ONNX to TensorRT engine",
                "code": """
# Using trtexec CLI
trtexec --onnx=resnet50.onnx \\
    --saveEngine=resnet50_fp16.engine \\
    --fp16 \\
    --workspace=4096 \\
    --minShapes=input:1x3x224x224 \\
    --optShapes=input:8x3x224x224 \\
    --maxShapes=input:32x3x224x224 \\
    --verbose
""",
            },
            "step_3_int8_calibration": {
                "description": "INT8 quantization with calibration",
                "code": """
trtexec --onnx=resnet50.onnx \\
    --saveEngine=resnet50_int8.engine \\
    --int8 \\
    --calib=calibration_data/ \\
    --workspace=4096
""",
            },
            "step_4_benchmark": {
                "description": "Benchmark optimized engine",
                "code": """
trtexec --loadEngine=resnet50_fp16.engine \\
    --shapes=input:8x3x224x224 \\
    --iterations=1000 \\
    --warmUp=500 \\
    --duration=10
""",
            },
        }

    def performance_comparison(self):
        return {
            "resnet50": {
                "pytorch_fp32": {"latency_ms": 12.5, "throughput_fps": 80, "gpu_mem_mb": 1200},
                "tensorrt_fp32": {"latency_ms": 4.2, "throughput_fps": 238, "gpu_mem_mb": 800},
                "tensorrt_fp16": {"latency_ms": 2.1, "throughput_fps": 476, "gpu_mem_mb": 450},
                "tensorrt_int8": {"latency_ms": 1.3, "throughput_fps": 769, "gpu_mem_mb": 280},
            },
            "bert_base": {
                "pytorch_fp32": {"latency_ms": 8.5, "throughput_qps": 118},
                "tensorrt_fp16": {"latency_ms": 2.8, "throughput_qps": 357},
                "tensorrt_int8": {"latency_ms": 1.9, "throughput_qps": 526},
            },
            "gpu": "NVIDIA A100 40GB",
        }


optimizer = TensorRTOptimizer()
pipeline = optimizer.optimization_pipeline()
print("TensorRT Optimization Pipeline:")
for step, info in pipeline.items():
    print(f"  {step}: {info['description']}")

perf = optimizer.performance_comparison()
print(f"\nPerformance (ResNet50 on {perf['gpu']}):")
for config, metrics in perf["resnet50"].items():
    print(f"  {config}: {metrics['latency_ms']}ms, {metrics['throughput_fps']} FPS")
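The gains in the comparison above can be sanity-checked with simple arithmetic: speedup is just the latency ratio against the PyTorch FP32 baseline. A quick sketch using the ResNet50 numbers from the table:

```python
# Speedup of each TensorRT configuration relative to the PyTorch FP32
# baseline, using the ResNet50 latencies from the comparison above.
baseline_latency_ms = 12.5  # pytorch_fp32

configs = {
    "tensorrt_fp32": 4.2,
    "tensorrt_fp16": 2.1,
    "tensorrt_int8": 1.3,
}

for name, latency_ms in configs.items():
    speedup = baseline_latency_ms / latency_ms
    print(f"{name}: {speedup:.1f}x faster")
# tensorrt_fp32: 3.0x, tensorrt_fp16: 6.0x, tensorrt_int8: 9.6x
```

All three land inside the 2-10x range claimed in the introduction, with INT8 near the top end.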
SaaS Architecture for AI Inference
The architecture of an AI inference SaaS platform
# === AI Inference SaaS Architecture ===
cat > saas_architecture.yaml << 'EOF'
ai_inference_saas:
  components:
    api_gateway:
      technology: "Kong / AWS API Gateway"
      features:
        - "Rate limiting per API key/plan"
        - "Authentication (API key, JWT, OAuth2)"
        - "Request routing and load balancing"
        - "Request/response transformation"
        - "Usage metering for billing"
    inference_service:
      technology: "NVIDIA Triton Inference Server"
      features:
        - "Multi-model serving"
        - "Dynamic batching (combine requests)"
        - "Model versioning and A/B testing"
        - "GPU sharing across models"
        - "gRPC and HTTP endpoints"
      config:
        dynamic_batching:
          preferred_batch_size: [4, 8, 16]
          max_queue_delay_microseconds: 100000
        instance_group:
          - kind: KIND_GPU
            count: 2
    model_repository:
      storage: "S3 / GCS / Azure Blob"
      structure: |
        model_repository/
          resnet50/
            config.pbtxt
            1/
              model.plan  (TensorRT engine)
          bert_base/
            config.pbtxt
            1/
              model.plan
    autoscaler:
      technology: "Kubernetes HPA + KEDA"
      metrics:
        - "GPU utilization > 80%"
        - "Request queue depth > 100"
        - "P99 latency > SLA threshold"
      scaling:
        min_replicas: 1
        max_replicas: 10
        scale_up_time: "60s"
        scale_down_time: "300s"
    billing:
      model: "Pay per inference"
      tiers:
        free: "100 requests/day"
        basic: "$0.001/request, 10K/day"
        pro: "$0.0005/request, 100K/day"
        enterprise: "Custom pricing, SLA"
  infrastructure:
    gpu_instances:
      - "NVIDIA T4 (cost-effective inference)"
      - "NVIDIA A10G (balanced)"
      - "NVIDIA A100 (high throughput)"
      - "NVIDIA H100 (latest, fastest)"
    kubernetes:
      cluster: "EKS / GKE with GPU node pools"
      gpu_operator: "NVIDIA GPU Operator"
      device_plugin: "NVIDIA Device Plugin"
EOF
echo "SaaS architecture defined"
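Each entry in the model_repository above needs a config.pbtxt. A minimal sketch for the resnet50 entry, mirroring the dynamic-batching and instance-group settings in the architecture YAML (the input/output shapes assume the ResNet50 export from earlier; adjust for your model):

```shell
mkdir -p model_repository/resnet50/1
cat > model_repository/resnet50/config.pbtxt << 'EOF'
name: "resnet50"
platform: "tensorrt_plan"
max_batch_size: 32
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]
dynamic_batching {
  preferred_batch_size: [ 4, 8, 16 ]
  max_queue_delay_microseconds: 100000
}
instance_group [
  {
    kind: KIND_GPU
    count: 2
  }
]
EOF
```

The TensorRT engine itself then goes in model_repository/resnet50/1/model.plan.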
Scaling and Cost Optimization
Strategies for cutting inference cost without sacrificing performance
#!/usr/bin/env python3
# cost_optimizer.py - AI Inference Cost Optimizer

import json
import logging
from typing import Dict

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("cost")


class InferenceCostOptimizer:
    def __init__(self):
        pass

    def gpu_pricing(self):
        return {
            "aws": {
                "g4dn.xlarge": {"gpu": "T4", "price_hr": 0.526, "vram_gb": 16, "best_for": "Cost-effective inference"},
                "g5.xlarge": {"gpu": "A10G", "price_hr": 1.006, "vram_gb": 24, "best_for": "Balanced performance"},
                "p4d.24xlarge": {"gpu": "A100x8", "price_hr": 32.77, "vram_gb": 320, "best_for": "High throughput"},
            },
            "spot_savings": "60-90% discount with Spot instances",
        }

    def cost_per_inference(self, gpu_price_hr, throughput_qps):
        """Calculate cost per inference"""
        cost_per_second = gpu_price_hr / 3600
        cost_per_inference = cost_per_second / throughput_qps
        cost_per_1k = cost_per_inference * 1000
        return {
            "cost_per_inference": round(cost_per_inference, 6),
            "cost_per_1k_requests": round(cost_per_1k, 4),
            "cost_per_1m_requests": round(cost_per_inference * 1_000_000, 2),
        }

    def optimization_strategies(self):
        return {
            "model_optimization": {
                "fp16_quantization": "Cuts latency ~50%, roughly doubles throughput, halves memory",
                "int8_quantization": "Cuts latency ~60%, ~3x throughput, ~70% less memory",
                "model_pruning": "Shrinks the model 30-50% with little accuracy loss",
                "knowledge_distillation": "Train a small model to mimic a large one",
            },
            "infrastructure": {
                "spot_instances": "GPUs 60-90% cheaper (good for batch, risky for real-time)",
                "auto_scaling": "Scale down during low-traffic periods (nights, weekends)",
                "gpu_sharing": "Run multiple models on one GPU (MIG, MPS)",
                "right_sizing": "Pick a GPU that matches the workload (T4 vs A100)",
            },
            "serving": {
                "dynamic_batching": "Combine requests into batches to raise GPU utilization",
                "response_caching": "Cache results for repeated inputs",
                "model_warmup": "Pre-load models to avoid cold starts",
                "request_coalescing": "Merge duplicate requests",
            },
        }


optimizer = InferenceCostOptimizer()

# Cost comparison: T4 vs A100
t4_cost = optimizer.cost_per_inference(0.526, 476)     # T4 + TensorRT FP16
a100_cost = optimizer.cost_per_inference(3.673, 2000)  # A100 + TensorRT FP16
print("Cost per 1M Inferences:")
print(f"  T4 (FP16):   ${t4_cost['cost_per_1m_requests']}")
print(f"  A100 (FP16): ${a100_cost['cost_per_1m_requests']}")

strategies = optimizer.optimization_strategies()
print("\nOptimization Strategies:")
for category, items in strategies.items():
    print(f"\n  {category}:")
    for name, desc in list(items.items())[:2]:
        print(f"    {name}: {desc}")
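The right_sizing strategy above can be made concrete: given each instance's hourly price, measured throughput, and latency, pick the cheapest option that still meets the latency SLA. A minimal sketch (the A10G and A100 throughput and latency figures here are illustrative placeholders, not benchmarks):

```python
# Pick the cheapest GPU instance whose measured latency meets the SLA.
# price_hr in USD, throughput in QPS, latency in ms (illustrative numbers).
instances = {
    "g4dn.xlarge (T4)": {"price_hr": 0.526, "throughput_qps": 476, "latency_ms": 2.1},
    "g5.xlarge (A10G)": {"price_hr": 1.006, "throughput_qps": 900, "latency_ms": 1.5},
    "p4d (A100 slice)": {"price_hr": 4.10, "throughput_qps": 2000, "latency_ms": 0.9},
}

def cheapest_within_sla(instances, sla_latency_ms):
    """Return (name, cost per 1M requests) of the cheapest instance meeting the SLA."""
    candidates = {
        name: spec["price_hr"] / 3600 / spec["throughput_qps"] * 1_000_000
        for name, spec in instances.items()
        if spec["latency_ms"] <= sla_latency_ms
    }
    if not candidates:
        return None
    name = min(candidates, key=candidates.get)
    return name, round(candidates[name], 2)

print(cheapest_within_sla(instances, sla_latency_ms=2.5))
print(cheapest_within_sla(instances, sla_latency_ms=1.0))
```

Note that a tighter SLA can flip the answer: the A100 costs more per request, yet becomes the only valid choice once the latency budget drops below the smaller GPUs' latencies.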
Monitoring and Performance Tuning
Tracking inference performance in production
# === Inference Monitoring ===
# 1. Triton Metrics (Prometheus format)
cat > prometheus-triton.yml << 'EOF'
scrape_configs:
  - job_name: 'triton'
    scrape_interval: 5s
    static_configs:
      - targets: ['triton:8002']
    metrics_path: /metrics
EOF

# 2. Key Metrics to Monitor
cat > monitoring_config.yaml << 'EOF'
inference_metrics:
  latency:
    - name: "nv_inference_request_duration_us"
      description: "End-to-end inference latency"
      alert_threshold: "P99 > 100ms"
    - name: "nv_inference_queue_duration_us"
      description: "Time spent in queue"
      alert_threshold: "P99 > 50ms"
    - name: "nv_inference_compute_infer_duration_us"
      description: "GPU compute time"
  throughput:
    - name: "nv_inference_request_success"
      description: "Successful inferences per second"
    - name: "nv_inference_request_failure"
      description: "Failed inferences (should be ~0)"
  gpu:
    - name: "DCGM_FI_DEV_GPU_UTIL"
      description: "GPU utilization %"
      target: "60-80%"
    - name: "DCGM_FI_DEV_FB_USED"
      description: "GPU memory used"
    - name: "DCGM_FI_DEV_GPU_TEMP"
      description: "GPU temperature"
      alert: "> 85C"
  business:
    - name: "requests_per_api_key"
      description: "Requests per customer"
    - name: "cost_per_inference"
      description: "Running cost per inference"
    - name: "sla_compliance"
      description: "% requests within SLA latency"
EOF
# 3. Grafana Dashboard Query Examples
cat > grafana_queries.txt << 'EOF'
# P99 Latency
histogram_quantile(0.99, rate(nv_inference_request_duration_us_bucket[5m]))
# Throughput (requests/sec)
rate(nv_inference_request_success[1m])
# GPU Utilization
avg(DCGM_FI_DEV_GPU_UTIL)
# Error Rate
rate(nv_inference_request_failure[5m]) / rate(nv_inference_request_success[5m]) * 100
# Queue Depth
nv_inference_pending_request_count
EOF
echo "Monitoring configured"
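The error-rate query above can also be computed client-side from the raw text that Triton's metrics endpoint returns. A minimal sketch that parses Prometheus text format and derives an error rate from the two counters (the sample payload is made up for illustration):

```python
# Parse Prometheus text-format counters and compute an error rate,
# mirroring the Grafana error-rate query above. Payload is illustrative.
sample_metrics = """\
# HELP nv_inference_request_success Number of successful inference requests
# TYPE nv_inference_request_success counter
nv_inference_request_success{model="resnet50",version="1"} 99850
nv_inference_request_failure{model="resnet50",version="1"} 150
"""

def parse_counters(text):
    """Return {metric_name: value}, ignoring comment lines and labels."""
    counters = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        name_part, value = line.rsplit(" ", 1)
        name = name_part.split("{")[0]
        counters[name] = float(value)
    return counters

counters = parse_counters(sample_metrics)
success = counters["nv_inference_request_success"]
failure = counters["nv_inference_request_failure"]
error_rate = failure / (success + failure) * 100
print(f"Error rate: {error_rate:.2f}%")  # Error rate: 0.15%
```

In production you would fetch the payload from http://triton:8002/metrics instead of the hard-coded string; Prometheus's `rate()` then turns these cumulative counters into per-second rates.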
FAQ: Frequently Asked Questions
Q: How is TensorRT different from ONNX Runtime?
A: TensorRT is an NVIDIA-specific optimizer that delivers the best performance on NVIDIA GPUs: FP16/INT8 quantization, layer fusion, and kernel auto-tuning matched to the specific GPU. Its limitations are that it only runs on NVIDIA GPUs, and a TensorRT engine is tied to the GPU architecture it was built for (you must rebuild it when you change GPUs). ONNX Runtime supports multiple backends (CPU, GPU, NPU) and is portable: it runs on Intel, AMD, ARM, and NVIDIA hardware and is easier to deploy, but on NVIDIA GPUs its performance typically trails TensorRT by 20-40%. Use TensorRT for production NVIDIA GPU inference; use ONNX Runtime for multi-platform deployment or CPU inference.
Q: When should I use FP16 versus INT8 quantization?
A: FP16 (half precision) cuts latency by ~50% with negligible accuracy loss (< 0.1%), requires no calibration, and works with nearly every model; it is a good balance of speed and accuracy. INT8 (8-bit integer) cuts latency by ~60-70% but loses more accuracy than FP16 (0.5-2%) and requires calibration with a representative dataset. It works well for CNN models (ResNet, YOLO) that can tolerate some accuracy loss. Recommendation: start with FP16 (easy and safe); if latency is still too high, try INT8 and validate accuracy before deploying.
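To make the INT8 trade-off concrete, here is the basic symmetric-quantization arithmetic that calibration relies on: pick a scale from the observed value range, map floats to int8, and measure the round-trip error. This is a toy sketch of the idea, not TensorRT's actual calibrator:

```python
# Toy symmetric INT8 quantization: scale = max|x| / 127.
# Illustrates the arithmetic behind calibration, not TensorRT's implementation.
def quantize_int8(values):
    scale = max(abs(v) for v in values) / 127
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

activations = [0.82, -1.27, 0.003, 0.51, -0.94]
q, scale = quantize_int8(activations)
restored = dequantize(q, scale)

# Small values near zero round away entirely; that rounding is
# exactly the accuracy loss INT8 trades for speed.
max_err = max(abs(a - r) for a, r in zip(activations, restored))
print(f"scale={scale:.5f}, max round-trip error={max_err:.5f}")
```

Calibration with a representative dataset exists to choose that scale well: a range dominated by rare outliers wastes most of the 256 integer levels and inflates the round-trip error on typical activations.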
Q: Do I need Triton Inference Server?
A: It is not required, but strongly recommended for production. Triton provides dynamic batching (automatically combining requests to raise GPU utilization), model versioning (deploying a new model without downtime), multi-model serving (several models sharing one GPU), plus health checks, metrics, and model management. Without Triton you would have to implement these features yourself, which takes significant effort. Alternatives include TorchServe (PyTorch), TensorFlow Serving, BentoML, and Ray Serve, but Triton integrates best with TensorRT since both come from NVIDIA. For a prototype, FastAPI plus the TensorRT SDK is enough; for production, use Triton.
Q: Which GPU should I choose for an inference SaaS?
A: It depends on your workload and budget. NVIDIA T4 is the cheapest option ($0.526/hr on AWS) with 16GB VRAM and FP16/INT8 support, suited to small-to-medium models and cost-sensitive workloads. NVIDIA A10G is the balanced choice (~$1/hr) with 24GB VRAM, suited to medium models and moderate throughput. NVIDIA A100 is high-end ($3.67/hr for 40GB) with 40/80GB VRAM, suited to large models (LLMs) and high throughput. NVIDIA H100 is the latest and fastest ($6+/hr) with 80GB VRAM, suited to LLM inference and latency-critical workloads. For a startup, begin with T4 (on Spot instances) to keep costs down; as traffic grows, upgrade to A10G/A100 with auto-scaling to match GPU capacity to demand.
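The decision above largely reduces to a lookup: filter GPUs by the VRAM your model needs, then take the cheapest. A minimal sketch using the approximate on-demand rates quoted in this answer (the 40GB A100 figure is used; the helper is illustrative, not a sizing tool):

```python
# Pick the cheapest GPU with enough VRAM for the model, using the
# approximate on-demand prices from the answer above.
gpus = {
    "T4":   {"vram_gb": 16, "price_hr": 0.526},
    "A10G": {"vram_gb": 24, "price_hr": 1.006},
    "A100": {"vram_gb": 40, "price_hr": 3.67},
    "H100": {"vram_gb": 80, "price_hr": 6.00},
}

def pick_gpu(required_vram_gb):
    """Cheapest GPU that fits the model, or None if nothing fits."""
    fits = {name: g for name, g in gpus.items() if g["vram_gb"] >= required_vram_gb}
    if not fits:
        return None
    return min(fits, key=lambda name: fits[name]["price_hr"])

print(pick_gpu(8))   # small CNN -> T4
print(pick_gpu(30))  # large model -> A100
```

In practice you would also factor in measured throughput (as in the cost optimizer above) and Spot discounts, but VRAM fit is the hard constraint that prunes the list first.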
