Whisper Speech SRE
Site reliability engineering for a Whisper speech-to-text API in production: GPU cluster infrastructure, auto-scaling, monitoring, incident response, and capacity planning.
Service Level Objectives
| SLI | SLO Target | Measurement | Alert Threshold |
|---|---|---|---|
| Availability | 99.9% uptime | Successful requests / Total | < 99.5% (15min window) |
| Latency (short audio) | p99 < 5s for < 30s audio | Request duration histogram | p99 > 8s |
| Latency (long audio) | p99 < 30s for < 5min audio | Request duration histogram | p99 > 45s |
| Error Rate | < 0.1% | 5xx / Total requests | > 0.5% |
| Queue Wait | p95 < 10s | Time in queue before processing | p95 > 20s |
| GPU Utilization | 60-80% | nvidia-smi metrics | > 90% or < 30% |
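The availability row maps directly onto a ratio of request counters. A minimal sketch of the SLI computation and its alert check (the counts are hypothetical; in production they would come from Prometheus over the 15-minute window):

```python
# Minimal SLI computation sketch (hypothetical counts; in production these
# would come from request counters over a 15-minute window).

def availability_sli(successful: int, total: int) -> float:
    """Availability SLI = successful requests / total requests."""
    return successful / total if total else 1.0

def should_alert(sli: float, threshold: float = 0.995) -> bool:
    """Alert when the windowed SLI drops below 99.5%."""
    return sli < threshold

sli = availability_sli(successful=9_940, total=10_000)    # 0.994
print(f"SLI: {sli:.3%}, alert: {should_alert(sli)}")      # alert: True
```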
Infrastructure Architecture
# === Whisper API Infrastructure ===
# Kubernetes Deployment
# apiVersion: apps/v1
# kind: Deployment
# metadata:
#   name: whisper-api
# spec:
#   replicas: 3
#   selector:
#     matchLabels:
#       app: whisper-api
#   template:
#     metadata:
#       labels:
#         app: whisper-api
#     spec:
#       containers:
#       - name: whisper
#         image: registry.io/whisper-api:v2.1
#         ports:
#         - containerPort: 8000
#         resources:
#           requests:
#             cpu: "4"
#             memory: "16Gi"
#             nvidia.com/gpu: "1"
#           limits:
#             cpu: "8"
#             memory: "32Gi"
#             nvidia.com/gpu: "1"
#         env:
#         - name: MODEL_SIZE
#           value: "large-v3"
#         - name: COMPUTE_TYPE
#           value: "float16"
#         - name: NUM_WORKERS
#           value: "2"
#         readinessProbe:
#           httpGet:
#             path: /health
#             port: 8000
#           initialDelaySeconds: 30
#         livenessProbe:
#           httpGet:
#             path: /health
#             port: 8000
#           periodSeconds: 30
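Both probes above hit /health. A sketch of what such a handler might check; the `model_loaded` flag and `gpu_mem_fraction` inputs are hypothetical and would come from the loaded model and NVML in a real service:

```python
# Sketch of /health check logic for the readiness/liveness probes above.
# Inputs are hypothetical; a real handler would read them from the loaded
# model object and from nvidia-smi/NVML.

def health_status(model_loaded: bool, gpu_mem_fraction: float) -> dict:
    if not model_loaded:
        return {"status": "unhealthy", "reason": "model not loaded"}
    if gpu_mem_fraction > 0.95:
        return {"status": "unhealthy", "reason": "GPU memory exhausted"}
    return {"status": "ok"}

print(health_status(model_loaded=True, gpu_mem_fraction=0.6))   # {'status': 'ok'}
```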
# HPA with GPU metrics
# apiVersion: autoscaling/v2
# kind: HorizontalPodAutoscaler
# metadata:
#   name: whisper-hpa
# spec:
#   scaleTargetRef:
#     apiVersion: apps/v1
#     kind: Deployment
#     name: whisper-api
#   minReplicas: 2
#   maxReplicas: 10
#   metrics:
#   - type: Pods
#     pods:
#       metric:
#         name: gpu_utilization
#       target:
#         type: AverageValue
#         averageValue: "75"
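Kubernetes computes the desired replica count for this HPA as ceil(currentReplicas * currentMetric / targetMetric), clamped to the min/max bounds; a sketch:

```python
import math

# The HPA above scales on average GPU utilization. Kubernetes computes
# desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric),
# then clamps the result to [minReplicas, maxReplicas].

def hpa_desired_replicas(current_replicas: int, current_metric: float,
                         target_metric: float, min_r: int = 2, max_r: int = 10) -> int:
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_r, min(max_r, desired))

# Average GPU utilization at 90% with 4 replicas and a 75% target -> scale to 5.
print(hpa_desired_replicas(4, 90, 75))   # 5
```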
from dataclasses import dataclass

@dataclass
class InfraComponent:
    component: str
    spec: str
    replicas: str
    purpose: str

infra = [
    InfraComponent("Whisper API Pod", "T4 GPU + 16GB RAM", "2-10 (HPA)", "Transcription processing"),
    InfraComponent("Load Balancer", "Nginx Ingress", "2 (HA)", "Route requests, SSL termination"),
    InfraComponent("Redis Queue", "8GB RAM", "3 (Sentinel)", "Job queue, rate limiting"),
    InfraComponent("Object Storage", "S3/MinIO", "HA", "Audio file upload storage"),
    InfraComponent("Prometheus", "4 CPU 16GB", "2 (HA)", "Metrics collection"),
    InfraComponent("Grafana", "2 CPU 4GB", "1", "Dashboard visualization"),
]

print("=== Infrastructure ===")
for i in infra:
    print(f"  [{i.component}] Spec: {i.spec}")
    print(f"    Replicas: {i.replicas} | Purpose: {i.purpose}")
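The Redis queue in the table decouples upload from GPU processing so the API never blocks on transcription. The pattern can be sketched with a stdlib `queue.Queue` standing in for Redis; in production the enqueue would be a Redis LPUSH and the worker a BRPOP loop:

```python
import queue
import threading

# Stdlib stand-in for the Redis job queue: the API thread enqueues and
# returns immediately; a worker thread drains the queue and transcribes.

jobs: "queue.Queue" = queue.Queue()
results = []

def enqueue_upload(job_id: str, audio_path: str) -> None:
    """API side: accept the upload, enqueue the job, return immediately."""
    jobs.put({"id": job_id, "path": audio_path})

def worker() -> None:
    """Worker side: pull jobs and run (simulated) transcription."""
    while True:
        job = jobs.get()
        if job is None:          # shutdown sentinel
            break
        results.append(f"transcribed {job['id']}")
        jobs.task_done()

t = threading.Thread(target=worker)
t.start()
enqueue_upload("job-1", "/uploads/a.wav")
enqueue_upload("job-2", "/uploads/b.wav")
jobs.join()      # wait for the worker to drain the queue
jobs.put(None)   # signal shutdown
t.join()
print(results)   # ['transcribed job-1', 'transcribed job-2']
```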
Monitoring and Alerting
# === Prometheus Metrics and Alerts ===
# Custom metrics in FastAPI
# from prometheus_client import Histogram, Counter, Gauge
#
# TRANSCRIPTION_DURATION = Histogram(
#     'whisper_transcription_seconds',
#     'Transcription processing time',
#     ['model_size', 'audio_duration_bucket'],
#     buckets=[1, 2, 5, 10, 20, 30, 60, 120]
# )
# TRANSCRIPTION_ERRORS = Counter(
#     'whisper_transcription_errors_total',
#     'Total transcription errors',
#     ['error_type']
# )
# GPU_MEMORY_USED = Gauge(
#     'whisper_gpu_memory_bytes',
#     'GPU memory usage'
# )
# QUEUE_LENGTH = Gauge(
#     'whisper_queue_length',
#     'Number of jobs waiting in queue'
# )
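Prometheus histograms like TRANSCRIPTION_DURATION are cumulative: each observation increments every bucket whose upper bound is at or above the observed value, plus the implicit +Inf bucket. A small sketch of that bucketing against the configured bucket list:

```python
import bisect

# The histogram above uses buckets [1, 2, 5, 10, 20, 30, 60, 120].
# An observation lands in every cumulative bucket with upper bound >= value.

BUCKETS = [1, 2, 5, 10, 20, 30, 60, 120]

def incremented_buckets(seconds: float) -> list:
    """Upper bounds of the cumulative buckets one observation increments."""
    i = bisect.bisect_left(BUCKETS, seconds)
    return BUCKETS[i:] + [float("inf")]

print(incremented_buckets(7.3))   # [10, 20, 30, 60, 120, inf]
```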
# Prometheus Alert Rules
# groups:
# - name: whisper-sre
#   rules:
#   - alert: WhisperHighLatency
#     expr: histogram_quantile(0.99, sum by (le) (rate(whisper_transcription_seconds_bucket[5m]))) > 30
#     for: 5m
#     labels: { severity: critical }
#     annotations:
#       summary: "Whisper p99 latency > 30s"
#
#   - alert: WhisperHighErrorRate
#     expr: rate(whisper_transcription_errors_total[5m]) / rate(whisper_transcription_seconds_count[5m]) > 0.005
#     for: 5m
#     labels: { severity: critical }
#
#   - alert: WhisperGPUOverload
#     expr: avg(nvidia_gpu_utilization) > 90
#     for: 10m
#     labels: { severity: warning }
#     annotations:
#       summary: "GPU utilization > 90%, consider scaling up"
#
#   - alert: WhisperQueueBacklog
#     expr: whisper_queue_length > 50
#     for: 5m
#     labels: { severity: warning }
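Prometheus keeps an alert in pending until its expression has held for the full `for:` duration, and only then fires. A toy sketch of that state machine over scrape samples (the sample values and scrape interval are illustrative):

```python
# Sketch of Prometheus' pending -> firing behaviour: the expression must be
# continuously true for `for_samples` consecutive scrapes before firing.

def alert_state(samples, threshold: float, for_samples: int) -> str:
    """samples: metric values at a fixed scrape interval, newest last."""
    breaching = 0
    for v in samples:
        breaching = breaching + 1 if v > threshold else 0
    if breaching >= for_samples:
        return "firing"
    return "pending" if breaching > 0 else "inactive"

# p99 latency above 30s for the last 5 scrapes -> firing
print(alert_state([10, 35, 40, 42, 38, 41, 36], threshold=30, for_samples=5))
```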
@dataclass
class AlertConfig:
    alert: str
    condition: str
    severity: str
    response: str
    auto_action: str

alerts = [
    AlertConfig("High Latency", "p99 > 30s for 5min", "P1 Critical",
                "Scale up GPU pods, check queue", "HPA auto-scale"),
    AlertConfig("High Error Rate", "> 0.5% errors for 5min", "P1 Critical",
                "Check GPU OOM, model loading, disk space", "Circuit breaker"),
    AlertConfig("GPU Overload", "> 90% utilization for 10min", "P2 Warning",
                "Scale up or optimize batch size", "HPA auto-scale"),
    AlertConfig("Queue Backlog", "> 50 jobs waiting for 5min", "P2 Warning",
                "Scale up workers, check processing speed", "HPA auto-scale"),
    AlertConfig("GPU OOM", "GPU memory > 95%", "P1 Critical",
                "Reduce batch size, restart pod", "Auto-restart"),
    AlertConfig("Pod Crash", "Container restart > 3 in 10min", "P1 Critical",
                "Check logs, GPU driver, model file", "Alert on-call"),
]

print("Alert Configuration:")
for a in alerts:
    print(f"  [{a.alert}] {a.condition}")
    print(f"    Severity: {a.severity}")
    print(f"    Response: {a.response}")
    print(f"    Auto: {a.auto_action}")
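The severity column implies a routing policy: P1 pages the on-call immediately, P2 goes to chat and a ticket. A toy sketch (the channel names are illustrative, not from the source):

```python
# Severity-based alert routing sketch; channel names are hypothetical.

def route_alert(severity: str) -> str:
    routes = {
        "P1 Critical": "page on-call (PagerDuty)",
        "P2 Warning": "Slack #whisper-sre + ticket",
    }
    return routes.get(severity, "log only")

print(route_alert("P1 Critical"))   # page on-call (PagerDuty)
```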
Incident Response
# === Incident Runbook ===
@dataclass
class Runbook:
    incident: str
    detection: str
    diagnosis: str
    mitigation: str
    prevention: str

runbooks = [
    Runbook("GPU Out of Memory",
            "Alert: GPU memory > 95%, pod OOMKilled",
            "kubectl logs pod | grep OOM; nvidia-smi on node",
            "Reduce BATCH_SIZE, restart pod; scale horizontally",
            "Set memory limits, use streaming for long audio"),
    Runbook("High Latency Spike",
            "Alert: p99 latency > 30s",
            "Check queue length, GPU util, concurrent requests",
            "Scale up HPA, enable request queuing with timeout",
            "Capacity planning, pre-warm instances before peak"),
    Runbook("Model Loading Failure",
            "Alert: pod not ready, health check fail",
            "kubectl describe pod; check model download, disk space",
            "Rollback to previous image, pre-cache model in PVC",
            "Use init container to download model, PVC cache"),
    Runbook("Audio Processing Error",
            "Alert: error rate > 0.5%",
            "Check error logs, audio format, ffmpeg issues",
            "Add input validation, fallback to smaller model",
            "Validate audio format before processing, retry logic"),
]

print("=== Incident Runbooks ===")
for r in runbooks:
    print(f"  [{r.incident}]")
    print(f"    Detect: {r.detection}")
    print(f"    Diagnose: {r.diagnosis}")
    print(f"    Mitigate: {r.mitigation}")
    print(f"    Prevent: {r.prevention}")
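The "fallback to smaller model" mitigation in the Audio Processing Error runbook can be sketched as a retry chain; `transcribe_fn` and the failure simulation below are hypothetical stand-ins:

```python
# Retry-chain sketch: try progressively smaller models until one succeeds.
# transcribe_fn is a hypothetical callable (audio, model_name) -> text.

def transcribe_with_fallback(audio, transcribe_fn,
                             models=("large-v3", "medium", "small")):
    last_err = None
    for model in models:
        try:
            return transcribe_fn(audio, model)
        except RuntimeError as e:   # e.g. CUDA OOM surfaces as RuntimeError
            last_err = e
    raise last_err

# Simulated failure: large-v3 OOMs, medium succeeds.
def fake_transcribe(audio, model):
    if model == "large-v3":
        raise RuntimeError("CUDA out of memory")
    return f"text via {model}"

print(transcribe_with_fallback("a.wav", fake_transcribe))   # text via medium
```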
# Error Budget
budget = {
    "Monthly SLO": "99.9% availability = 43.2 minutes downtime allowed",
    "Current Month": "99.95% = 21.6 minutes used (50% budget remaining)",
    "Burn Rate": "Normal (< 1x)",
    "Action": "Continue feature work (budget healthy)",
    "If > 80% burned": "Freeze deploys, focus on reliability",
    "If > 100% burned": "All hands on reliability, no features",
}

print("\nError Budget:")
for k, v in budget.items():
    print(f"  [{k}]: {v}")
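The budget numbers above follow directly from the SLO arithmetic; a quick check:

```python
# A 99.9% monthly SLO over a 30-day month allows (1 - 0.999) * 30 * 24 * 60
# minutes of downtime. Burn rate compares budget consumed to month elapsed:
# 1.0x means the budget is being spent exactly as fast as the month passes.

def allowed_downtime_min(slo: float, days: int = 30) -> float:
    return (1 - slo) * days * 24 * 60

def burn_rate(minutes_down: float, slo: float, elapsed_fraction: float) -> float:
    return (minutes_down / allowed_downtime_min(slo)) / elapsed_fraction

print(f"{allowed_downtime_min(0.999):.1f} min allowed")   # 43.2 min allowed
print(f"burn rate: {burn_rate(21.6, 0.999, 1.0):.2f}x")   # 0.50x
```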
Tips
- Faster-Whisper: use Faster-Whisper (CTranslate2); it is roughly 4x faster than the reference implementation and uses less RAM
- Queue: use a Redis queue to separate upload from processing so the API never blocks
- Cache: cache the model in a PVC so pods do not re-download it on every start
- Streaming: use streaming transcription for long audio, returning results segment by segment
- Error Budget: track the error budget; if more than 80% is burned, stop deploying features
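The streaming tip amounts to splitting long audio into overlapping windows and transcribing each window as it arrives; a sketch of the chunking (30s windows with 2s overlap are illustrative defaults, not from the source):

```python
# Split long audio into overlapping (start, end) windows so transcription
# results can be returned segment by segment instead of all at once.

def chunk_spans(total_s: float, chunk_s: float = 30.0, overlap_s: float = 2.0):
    spans, start = [], 0.0
    while start < total_s:
        end = min(start + chunk_s, total_s)
        spans.append((start, end))
        if end >= total_s:
            break
        start = end - overlap_s   # overlap avoids cutting words at boundaries
    return spans

print(chunk_spans(70.0))   # [(0.0, 30.0), (28.0, 58.0), (56.0, 70.0)]
```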
What is Whisper Speech?
Whisper is OpenAI's speech recognition model, supporting 99 languages including Thai. It ships in tiny, base, small, medium, and large sizes, handles both transcription (STT) and translation, and runs locally on a GPU as a Transformer model; Faster-Whisper speeds inference up about 4x.
What does SRE for a speech API involve?
Keeping uptime at 99.9%: operating the GPU cluster, auto-scaling, monitoring and alerting on latency and error rate, incident response, capacity planning, performance tuning, and load balancing.
How do you set up monitoring?
Track latency (p50/p95/p99), WER, GPU utilization (target 60-80%), queue length, error rate (target < 0.1%), and throughput in audio minutes, using Prometheus, Grafana, PagerDuty, and SLOs.
How do you handle incidents?
24/7 on-call via PagerDuty with runbooks for GPU OOM, high latency, and queue overflow; P1 response within 5 minutes, P2 within 15 minutes; post-mortems and error budget tracking afterwards.
Summary
Running Whisper in production means GPU infrastructure with HPA auto-scaling, Prometheus/Grafana monitoring and alerting, incident runbooks, error budgets, capacity planning, and Faster-Whisper for efficient inference.
