Whisper Speech Site Reliability (SRE): Operations and Maintenance
This guide covers site reliability engineering (SRE) for a Whisper speech service in production: infrastructure, auto-scaling, monitoring, incident response, capacity planning, and GPU cluster operations.
| SLI | SLO Target | Measurement | Alert Threshold |
|---|---|---|---|
| Availability | 99.9% uptime | Successful requests / Total | < 99.5% (15min window) |
| Latency (short audio) | p99 < 5s for < 30s audio | Request duration histogram | p99 > 8s |
| Latency (long audio) | p99 < 30s for < 5min audio | Request duration histogram | p99 > 45s |
| Error Rate | < 0.1% | 5xx / Total requests | > 0.5% |
| Queue Wait | p95 < 10s | Time in queue before processing | p95 > 20s |
| GPU Utilization | 60-80% | nvidia-smi metrics | > 90% or < 30% |
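
As a rough illustration of how the availability and error-rate rows can be checked, the sketch below computes both SLIs from request counts in one measurement window and compares them with the alert thresholds in the table. The names `WindowStats` and `breached_slis` are hypothetical, and the sketch assumes a "successful" request is any request that did not return a 5xx response.

```python
from dataclasses import dataclass

# Alert thresholds copied from the SLI/SLO table above.
AVAILABILITY_ALERT = 0.995  # alert when availability drops below 99.5%
ERROR_RATE_ALERT = 0.005    # alert when the 5xx rate exceeds 0.5%


@dataclass
class WindowStats:
    """Request counts for one measurement window (for example 15 minutes)."""
    total_requests: int
    server_errors: int  # responses with a 5xx status


def availability(stats: WindowStats) -> float:
    """Successful requests / total requests; an empty window counts as healthy."""
    if stats.total_requests == 0:
        return 1.0
    return (stats.total_requests - stats.server_errors) / stats.total_requests


def error_rate(stats: WindowStats) -> float:
    """5xx responses / total requests."""
    if stats.total_requests == 0:
        return 0.0
    return stats.server_errors / stats.total_requests


def breached_slis(stats: WindowStats) -> list:
    """Return the names of the SLIs whose alert threshold is crossed."""
    breached = []
    if availability(stats) < AVAILABILITY_ALERT:
        breached.append("Availability")
    if error_rate(stats) > ERROR_RATE_ALERT:
        breached.append("Error Rate")
    return breached


if __name__ == "__main__":
    print(breached_slis(WindowStats(total_requests=10_000, server_errors=60)))
```

Under that assumption the two SLIs are complements of each other, so both alerts fire together; a service that also counts timeouts or rejected requests as failures would see them diverge.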
Infrastructure Architecture
# === Whisper API Infrastructure ===

# Kubernetes Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: whisper-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: whisper-api
  template:
    metadata:
      labels:
        app: whisper-api   # must match spec.selector.matchLabels
    spec:
      containers:
        - name: whisper
          image: registry.io/whisper-api:v2.1
          ports:
            - containerPort: 8000
          resources:
            requests:
              cpu: "4"
              memory: "16Gi"
              nvidia.com/gpu: "1"
            limits:
              cpu: "8"
              memory: "32Gi"
              nvidia.com/gpu: "1"
          env:
            - name: MODEL_SIZE
              value: "large-v3"
            - name: COMPUTE_TYPE
              value: "float16"
            - name: NUM_WORKERS
              value: "2"
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 30
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            periodSeconds: 30

# HPA with GPU metrics
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: whisper-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: whisper-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: gpu_utilization
        target:
          type: AverageValue
          averageValue: "75"
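
The HPA above targets an average `gpu_utilization` of 75 per pod. Kubernetes documents the core scaling rule as desired replicas = ceil(current replicas × current metric / target metric), with no change when the ratio is within a tolerance of 1.0 (10% by default). The sketch below reproduces only that arithmetic with the min/max bounds from the manifest. The function name `desired_replicas` is illustrative, and the real controller additionally handles stabilization windows, unready pods, and missing metrics.

```python
import math


def desired_replicas(current_replicas: int,
                     current_avg: float,
                     target_avg: float = 75.0,
                     min_replicas: int = 2,
                     max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """Approximate the HPA replica calculation for an AverageValue target."""
    ratio = current_avg / target_avg
    # Within the tolerance band the HPA leaves the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the minReplicas / maxReplicas bounds of the HPA spec.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    for util in (20, 78, 95, 150):
        print(util, desired_replicas(current_replicas=3, current_avg=util))
```

A `Pods` metric such as `gpu_utilization` is only visible to the HPA when a custom metrics adapter exposes it through the custom metrics API. The source does not say which adapter is used, so that part is left out of the sketch.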

from dataclasses import dataclass


@dataclass
class InfraComponent:
    component: str
    spec: str
    replicas: str
    purpose: str


infra = [
    InfraComponent("Whisper API Pod", "T4 GPU + 16GB RAM", "2-10 (HPA)", "Transcription processing"),
    InfraComponent("Load Balancer", "Nginx Ingress", "2 (HA)", "Route requests, SSL termination"),
    InfraComponent("Redis Queue", "8GB RAM", "3 (Sentinel)", "Job queue, rate limiting"),
    InfraComponent("Object Storage", "S3/MinIO", "HA", "Audio file upload storage"),
    InfraComponent("Prometheus", "4 CPU 16GB", "2 (HA)", "Metrics collection"),
    InfraComponent("Grafana", "2 CPU 4GB", "1", "Dashboard visualization"),
]

print("=== Infrastructure ===")
for i in infra:
    print(f" [{i.component}] Spec: {i.spec}")
    print(f" Replicas: {i.replicas} | Purpose: {i.purpose}")
Monitoring and Alerting
# === Prometheus Metrics and Alerts ===

# Custom metrics in FastAPI
from prometheus_client import Histogram, Counter, Gauge

TRANSCRIPTION_DURATION = Histogram(
    'whisper_transcription_seconds',
    'Transcription processing time',
    ['model_size', 'audio_duration_bucket'],
    buckets=[1, 2, 5, 10, 20, 30, 60, 120]
)

TRANSCRIPTION_ERRORS = Counter(
    'whisper_transcription_errors_total',
    'Total transcription errors',
    ['error_type']
)

GPU_MEMORY_USED = Gauge(
    'whisper_gpu_memory_bytes',
    'GPU memory usage'
)

QUEUE_LENGTH = Gauge(
    'whisper_queue_length',
    'Number of jobs waiting in queue'
)
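
The histogram above carries an `audio_duration_bucket` label, but the source does not show how its value is derived. One possible approach, sketched here, maps the audio length onto a small fixed set of label values aligned with the short-audio (< 30s) and long-audio (< 5min) SLOs. The helper name and the label values `lt_30s`, `30s_to_5m`, and `ge_5m` are assumptions made for illustration.

```python
def audio_duration_bucket(audio_seconds: float) -> str:
    """Map an audio length to a coarse label value (hypothetical naming).

    A small fixed set of values keeps Prometheus label cardinality low,
    which would not be the case if the raw duration were used as a label.
    """
    if audio_seconds < 0:
        raise ValueError("audio duration must be non-negative")
    if audio_seconds < 30:
        return "lt_30s"      # covered by the short-audio latency SLO
    if audio_seconds < 300:
        return "30s_to_5m"   # covered by the long-audio latency SLO
    return "ge_5m"           # outside both latency SLOs


if __name__ == "__main__":
    for seconds in (12.5, 30, 299.9, 1800):
        print(seconds, audio_duration_bucket(seconds))
```

With such a helper, an observation could be recorded as `TRANSCRIPTION_DURATION.labels(model_size, audio_duration_bucket(n)).observe(elapsed)`, which lets the short-audio and long-audio SLOs be alerted on separately.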

# Prometheus Alert Rules
groups:
  - name: whisper-sre
    rules:
      - alert: WhisperHighLatency
        # histogram_quantile needs per-bucket rates aggregated by the "le" label
        expr: histogram_quantile(0.99, sum by (le) (rate(whisper_transcription_seconds_bucket[5m]))) > 30
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "Whisper p99 latency > 30s"
      - alert: WhisperHighErrorRate
        # sum() both sides: the two metrics carry different labels and would not match otherwise
        expr: sum(rate(whisper_transcription_errors_total[5m])) / sum(rate(whisper_transcription_seconds_count[5m])) > 0.005
        for: 5m
        labels: { severity: critical }
      - alert: WhisperGPUOverload
        expr: avg(nvidia_gpu_utilization) > 90
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: "GPU utilization > 90%, consider scaling up"
      - alert: WhisperQueueBacklog
        expr: whisper_queue_length > 50
        for: 5m
        labels: { severity: warning }
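
The latency alert relies on `histogram_quantile`, which does not see individual request durations. It estimates the quantile from cumulative bucket counts, interpolating linearly inside the bucket that contains the target rank. The sketch below is a simplified re-implementation of that idea, not Prometheus's actual code, and the sample counts are made up for illustration.

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    `buckets` must be sorted by upper bound and end with (math.inf, total).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Linear interpolation between the bucket's lower and upper bound.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return buckets[-2][0]


if __name__ == "__main__":
    # Bucket bounds match the histogram definition; counts are invented.
    sample = [(1, 100), (2, 300), (5, 600), (10, 800), (20, 900),
              (30, 950), (60, 990), (120, 1000), (math.inf, 1000)]
    print(histogram_quantile(0.99, sample))
    print(histogram_quantile(0.50, sample))
```

Because the estimate can only move within bucket boundaries, the resolution around the 30s threshold depends on the neighboring bounds (20, 30, 60). Adding a bound near an SLO target tightens the estimate there.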

from dataclasses import dataclass


@dataclass
class AlertConfig:
    alert: str
    condition: str
    severity: str
    response: str
    auto_action: str


alerts = [
    AlertConfig("High Latency", "p99 > 30s for 5min", "P1 Critical",
                "Scale up GPU pods, check queue", "HPA auto-scale"),
    AlertConfig("High Error Rate", "> 0.5% errors for 5min", "P1 Critical",
                "Check GPU OOM, model loading, disk space", "Circuit breaker"),
    AlertConfig("GPU Overload", "> 90% utilization for 10min", "P2 Warning",
                "Scale up or optimize batch size", "HPA auto-scale"),
    AlertConfig("Queue Backlog", "> 50 jobs waiting for 5min", "P2 Warning",
                "Scale up workers, check processing speed", "HPA auto-scale"),
    AlertConfig("GPU OOM", "GPU memory > 95%", "P1 Critical",
                "Reduce batch size, restart pod", "Auto-restart"),
    AlertConfig("Pod Crash", "Container restart > 3 in 10min", "P1 Critical",
                "Check logs, GPU driver, model file", "Alert on-call"),
]

print("Alert Configuration:")
for a in alerts:
    print(f" [{a.alert}] {a.condition}")
    print(f" Severity: {a.severity}")
    print(f" Response: {a.response}")
    print(f" Auto: {a.auto_action}")
Incident Response
# === Incident Runbook ===
from dataclasses import dataclass


@dataclass
class Runbook:
    incident: str
    detection: str
    diagnosis: str
    mitigation: str
    prevention: str


runbooks = [
    Runbook("GPU Out of Memory",
            "Alert: GPU memory > 95%, pod OOMKilled",
            "kubectl logs pod | grep OOM; nvidia-smi on node",
            "Reduce BATCH_SIZE, restart pod; scale horizontally",
            "Set memory limits, use streaming for long audio"),
    Runbook("High Latency Spike",
            "Alert: p99 latency > 30s",
            "Check queue length, GPU util, concurrent requests",
            "Scale up HPA, enable request queuing with timeout",
            "Capacity planning, pre-warm instances before peak"),
    Runbook("Model Loading Failure",
            "Alert: pod not ready, health check fail",
            "kubectl describe pod; check model download, disk space",
            "Rollback to previous image, pre-cache model in PVC",
            "Use init container to download model, PVC cache"),
    Runbook("Audio Processing Error",
            "Alert: error rate > 0.5%",
            "Check error logs, audio format, ffmpeg issues",
            "Add input validation, fallback to smaller model",
            "Validate audio format before processing, retry logic"),
]

print("=== Incident Runbooks ===")
for r in runbooks:
    print(f" [{r.incident}]")
    print(f" Detect: {r.detection}")
    print(f" Diagnose: {r.diagnosis}")
    print(f" Mitigate: {r.mitigation}")
    print(f" Prevent: {r.prevention}")

# Error Budget
budget = {
    "Monthly SLO": "99.9% availability = 43.2 minutes downtime allowed",
    "Current Month": "99.95% = 21.6 minutes used (50% budget remaining)",
    "Burn Rate": "Normal (< 1x)",
    "Action": "Continue feature work (budget healthy)",
    "If > 80% burned": "Freeze deploys, focus on reliability",
    "If > 100% burned": "All hands on reliability, no features",
}

print("\n\nError Budget:")
for k, v in budget.items():
    print(f" [{k}]: {v}")
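
The figures in the error budget table follow from simple arithmetic over a 30-day month: 30 × 24 × 60 = 43,200 minutes, of which a 99.9% SLO allows 0.1%, or 43.2 minutes. The sketch below reproduces that calculation and applies the 80% and 100% policy thresholds from the table. The function names are illustrative and the 30-day window is an assumption.

```python
def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over `days` days."""
    return (1.0 - slo) * days * 24 * 60


def budget_status(slo: float, measured_availability: float, days: int = 30) -> dict:
    """Report budget used so far and the policy action from the table above."""
    budget = error_budget_minutes(slo, days)
    used = (1.0 - measured_availability) * days * 24 * 60
    burned = used / budget  # fraction of the budget consumed
    if burned > 1.0:
        action = "All hands on reliability, no features"
    elif burned > 0.8:
        action = "Freeze deploys, focus on reliability"
    else:
        action = "Continue feature work (budget healthy)"
    return {
        "budget_minutes": round(budget, 1),
        "used_minutes": round(used, 1),
        "burned_fraction": round(burned, 2),
        "action": action,
    }


if __name__ == "__main__":
    print(budget_status(slo=0.999, measured_availability=0.9995))
```

For a 99.9% SLO and a measured availability of 99.95%, this yields 43.2 minutes of budget, 21.6 minutes used, and half of the budget consumed, matching the table.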
Tips
- Faster-Whisper: Use Faster-Whisper (CTranslate2); it is up to about 4x faster and uses less memory.
- Queue: Use a Redis queue to separate upload from processing so the API is never blocked.
- Cache: Cache the model in a PVC so it does not have to be downloaded every time a pod starts.
- Streaming: Use streaming transcription for long audio and return results in chunks.
- Error Budget: Track the error budget; once more than 80% is consumed, stop deploying features.
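
The queue tip can be illustrated without Redis. The sketch below uses Python's in-process `queue.Queue` as a stand-in so that the upload path only enqueues a job and returns, while a worker thread does the slow transcription. `fake_transcribe`, `submit`, and the job IDs are hypothetical; a production setup would replace the in-memory queue and results dictionary with Redis or a similar broker.

```python
import queue
import threading

jobs = queue.Queue(maxsize=100)  # bounded, so overload is rejected rather than hidden
results = {}


def fake_transcribe(audio_path: str) -> str:
    """Placeholder for the real (slow, GPU-bound) Whisper call."""
    return f"transcript of {audio_path}"


def submit(job_id: str, audio_path: str) -> bool:
    """Upload path: enqueue and return immediately; False means the queue is full."""
    try:
        jobs.put_nowait((job_id, audio_path))
    except queue.Full:
        return False
    return True


def worker() -> None:
    """Processing path: pull jobs until a None sentinel arrives."""
    while True:
        item = jobs.get()
        if item is None:
            jobs.task_done()
            break
        job_id, audio_path = item
        results[job_id] = fake_transcribe(audio_path)
        jobs.task_done()


if __name__ == "__main__":
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    submit("job-1", "meeting.wav")
    jobs.join()      # wait until the worker has drained the queue
    jobs.put(None)   # ask the worker to stop
    thread.join()
    print(results)
```

Returning `False` on a full queue gives the API a natural point to answer with a retry-later response instead of letting queue wait time grow without bound.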
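What is Whisper Speech?
Whisper is OpenAI's speech recognition model. It supports 99 languages, including Thai, and comes in tiny, base, small, medium, and large sizes. It handles speech-to-text (STT) and translation, is built on the Transformer architecture, and can run locally on a GPU. Faster-Whisper runs it up to about 4x faster.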