Whisper Speech Site Reliability (SRE): Maintenance

Published 28 May 2026 (B.E. 2569)

Whisper Speech SRE

Running Whisper speech recognition in production is a site reliability (SRE) problem. It covers infrastructure, auto-scaling, monitoring, incident response, and capacity planning for a GPU cluster in production operations.

| SLI | SLO Target | Measurement | Alert Threshold |
|---|---|---|---|
| Availability | 99.9% uptime | Successful requests / Total | < 99.5% (15min window) |
| Latency (short audio) | p99 < 5s for < 30s audio | Request duration histogram | p99 > 8s |
| Latency (long audio) | p99 < 30s for < 5min audio | Request duration histogram | p99 > 45s |
| Error Rate | < 0.1% | 5xx / Total requests | > 0.5% |
| Queue Wait | p95 < 10s | Time in queue before processing | p95 > 20s |
| GPU Utilization | 60-80% | nvidia-smi metrics | > 90% or < 30% |
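The availability and error-rate rows of the table can be checked with a few lines of code. The sketch below is illustrative only: `WindowStats` and `breached` are hypothetical names, and the request counts are made-up sample values, not measurements from a real service.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Request counts observed over one evaluation window (e.g. 15 minutes)."""
    total: int
    successful: int
    server_errors: int  # responses with a 5xx status


def availability(stats: WindowStats) -> float:
    """Availability SLI: successful requests / total requests."""
    return stats.successful / stats.total if stats.total else 1.0


def error_rate(stats: WindowStats) -> float:
    """Error-rate SLI: 5xx responses / total requests."""
    return stats.server_errors / stats.total if stats.total else 0.0


def breached(stats: WindowStats) -> list[str]:
    """Return the SLIs that cross the alert thresholds from the table."""
    alerts = []
    if availability(stats) < 0.995:  # alert threshold: < 99.5%
        alerts.append("availability")
    if error_rate(stats) > 0.005:    # alert threshold: > 0.5%
        alerts.append("error_rate")
    return alerts


# Sample window (assumed numbers): 70 failures out of 10,000 requests.
window = WindowStats(total=10_000, successful=9_930, server_errors=70)
print(breached(window))
```

Note that the alert thresholds are deliberately looser than the SLO targets, so an alert fires only when the service is clearly off target rather than on every small dip.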

Infrastructure Architecture

# === Whisper API Infrastructure ===

# Kubernetes Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: whisper-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: whisper-api
  template:
    metadata:
      labels:
        app: whisper-api   # must match spec.selector.matchLabels
    spec:
      containers:
        - name: whisper
          image: registry.io/whisper-api:v2.1
          ports:
            - containerPort: 8000
          resources:
            requests:
              cpu: "4"
              memory: "16Gi"
              nvidia.com/gpu: "1"
            limits:
              cpu: "8"
              memory: "32Gi"
              nvidia.com/gpu: "1"
          env:
            - name: MODEL_SIZE
              value: "large-v3"
            - name: COMPUTE_TYPE
              value: "float16"
            - name: NUM_WORKERS
              value: "2"
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 30
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            periodSeconds: 30

---
# HPA with GPU metrics
# (a Pods metric such as gpu_utilization requires a custom metrics adapter)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: whisper-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: whisper-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: gpu_utilization
        target:
          type: AverageValue
          averageValue: "75"

from dataclasses import dataclass

@dataclass
class InfraComponent:
    component: str
    spec: str
    replicas: str
    purpose: str

infra = [
    InfraComponent("Whisper API Pod", "T4 GPU + 16GB RAM", "2-10 (HPA)", "Transcription processing"),
    InfraComponent("Load Balancer", "Nginx Ingress", "2 (HA)", "Route requests, SSL termination"),
    InfraComponent("Redis Queue", "8GB RAM", "3 (Sentinel)", "Job queue, rate limiting"),
    InfraComponent("Object Storage", "S3/MinIO", "HA", "Audio file upload storage"),
    InfraComponent("Prometheus", "4 CPU 16GB", "2 (HA)", "Metrics collection"),
    InfraComponent("Grafana", "2 CPU 4GB", "1", "Dashboard visualization"),
]

print("=== Infrastructure ===")
for i in infra:
    print(f" [{i.component}] Spec: {i.spec}")
    print(f" Replicas: {i.replicas} | Purpose: {i.purpose}")
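To see how the HPA settings above (target 75, 2 to 10 replicas) turn a GPU utilization reading into a replica count, here is a minimal sketch of the scaling formula described in the Kubernetes documentation: `desired = ceil(current * currentValue / targetValue)`, skipped when the ratio is within a tolerance (10% by default). The function name and the sample readings are assumptions for illustration, and the real controller also applies stabilization windows and readiness handling that are not modeled here.

```python
import math


def desired_replicas(current_replicas: int, current_value: float,
                     target_value: float = 75.0,
                     min_replicas: int = 2, max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """Approximate the HPA replica calculation for an AverageValue target."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        wanted = current_replicas          # close enough to target: no change
    else:
        wanted = math.ceil(current_replicas * ratio)
    # Clamp to the minReplicas/maxReplicas bounds of the HPA.
    return max(min_replicas, min(max_replicas, wanted))


# Sample readings (assumed values): average GPU utilization per pod.
for replicas, gpu_util in [(3, 95.0), (3, 78.0), (4, 30.0), (8, 150.0)]:
    print(replicas, gpu_util, "->", desired_replicas(replicas, gpu_util))
```

Because GPU pods take a while to load the model, scaling on utilization alone reacts late; the queue length metric is a useful second signal.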

Monitoring and Alerting

# === Prometheus Metrics and Alerts ===

# Custom metrics in FastAPI
from prometheus_client import Histogram, Counter, Gauge

TRANSCRIPTION_DURATION = Histogram(
    'whisper_transcription_seconds',
    'Transcription processing time',
    ['model_size', 'audio_duration_bucket'],
    buckets=[1, 2, 5, 10, 20, 30, 60, 120]
)

TRANSCRIPTION_ERRORS = Counter(
    'whisper_transcription_errors_total',
    'Total transcription errors',
    ['error_type']
)

GPU_MEMORY_USED = Gauge(
    'whisper_gpu_memory_bytes',
    'GPU memory usage'
)

QUEUE_LENGTH = Gauge(
    'whisper_queue_length',
    'Number of jobs waiting in queue'
)
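The histogram declares an `audio_duration_bucket` label, but the article does not say how its values are chosen. One possible approach, sketched below, is to map the audio length onto a small fixed set of label values so that label cardinality stays low. The bucket names are hypothetical; the 30-second and 5-minute boundaries are borrowed from the latency SLOs.

```python
def audio_duration_bucket(duration_seconds: float) -> str:
    """Map an audio length to a coarse label value (names are hypothetical)."""
    if duration_seconds < 30:
        return "lt_30s"    # short audio: p99 < 5s SLO applies
    if duration_seconds < 300:
        return "lt_5min"   # long audio: p99 < 30s SLO applies
    return "ge_5min"       # beyond the ranges covered by the SLO table


for seconds in (12.5, 30, 299.9, 300):
    print(seconds, audio_duration_bucket(seconds))
```

With `prometheus_client`, the result would be passed as a label when recording a sample, for example `TRANSCRIPTION_DURATION.labels(model_size="large-v3", audio_duration_bucket=audio_duration_bucket(12.5)).observe(elapsed)`, where `elapsed` is the measured processing time.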

# Prometheus Alert Rules
groups:
  - name: whisper-sre
    rules:
      - alert: WhisperHighLatency
        # histogram_quantile expects per-bucket rates aggregated by "le"
        expr: histogram_quantile(0.99, sum by (le) (rate(whisper_transcription_seconds_bucket[5m]))) > 30
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "Whisper p99 latency > 30s"
      - alert: WhisperHighErrorRate
        # sum() both sides so the differing label sets can be divided
        expr: sum(rate(whisper_transcription_errors_total[5m])) / sum(rate(whisper_transcription_seconds_count[5m])) > 0.005
        for: 5m
        labels: { severity: critical }
      - alert: WhisperGPUOverload
        # the GPU metric name depends on the exporter in use
        expr: avg(nvidia_gpu_utilization) > 90
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: "GPU utilization > 90%, consider scaling up"
      - alert: WhisperQueueBacklog
        expr: whisper_queue_length > 50
        for: 5m
        labels: { severity: warning }

from dataclasses import dataclass

@dataclass
class AlertConfig:
    alert: str
    condition: str
    severity: str
    response: str
    auto_action: str

alerts = [
    AlertConfig("High Latency", "p99 > 30s for 5min", "P1 Critical",
        "Scale up GPU pods, check queue", "HPA auto-scale"),
    AlertConfig("High Error Rate", "> 0.5% errors for 5min", "P1 Critical",
        "Check GPU OOM, model loading, disk space", "Circuit breaker"),
    AlertConfig("GPU Overload", "> 90% utilization for 10min", "P2 Warning",
        "Scale up or optimize batch size", "HPA auto-scale"),
    AlertConfig("Queue Backlog", "> 50 jobs waiting for 5min", "P2 Warning",
        "Scale up workers, check processing speed", "HPA auto-scale"),
    AlertConfig("GPU OOM", "GPU memory > 95%", "P1 Critical",
        "Reduce batch size, restart pod", "Auto-restart"),
    AlertConfig("Pod Crash", "Container restart > 3 in 10min", "P1 Critical",
        "Check logs, GPU driver, model file", "Alert on-call"),
]

print("Alert Configuration:")
for a in alerts:
    print(f" [{a.alert}] {a.condition}")
    print(f" Severity: {a.severity}")
    print(f" Response: {a.response}")
    print(f" Auto: {a.auto_action}")
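The high-latency alert depends on estimating p99 from histogram buckets, which is an approximation rather than an exact percentile. The sketch below mimics the linear interpolation that Prometheus's `histogram_quantile` performs on cumulative buckets, using the bucket boundaries defined for `whisper_transcription_seconds`. The observation counts are invented sample data, and edge cases that Prometheus handles (such as non-monotonic buckets) are left out.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, int]]) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # quantile falls in +Inf: report last finite bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Cumulative counts per bucket (assumed sample data, 1000 requests in total).
sample = [(1, 100), (2, 400), (5, 800), (10, 950), (20, 985),
          (30, 995), (60, 1000), (120, 1000), (math.inf, 1000)]
print(histogram_quantile(0.99, sample))
```

For this sample the estimate lands around 25 seconds, between the 20 s and 30 s boundaries. The precision of the estimate is limited by bucket width, so boundaries should sit close to the thresholds that matter (5 s, 8 s, 30 s, 45 s in the SLO table).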

Incident Response

# === Incident Runbook ===
from dataclasses import dataclass

@dataclass
class Runbook:
    incident: str
    detection: str
    diagnosis: str
    mitigation: str
    prevention: str

runbooks = [
    Runbook("GPU Out of Memory",
        "Alert: GPU memory > 95%, pod OOMKilled",
        "kubectl logs pod | grep OOM; nvidia-smi on node",
        "Reduce BATCH_SIZE, restart pod; scale horizontally",
        "Set memory limits, use streaming for long audio"),
    Runbook("High Latency Spike",
        "Alert: p99 latency > 30s",
        "Check queue length, GPU util, concurrent requests",
        "Scale up HPA, enable request queuing with timeout",
        "Capacity planning, pre-warm instances before peak"),
    Runbook("Model Loading Failure",
        "Alert: pod not ready, health check fail",
        "kubectl describe pod; check model download, disk space",
        "Rollback to previous image, pre-cache model in PVC",
        "Use init container to download model, PVC cache"),
    Runbook("Audio Processing Error",
        "Alert: error rate > 0.5%",
        "Check error logs, audio format, ffmpeg issues",
        "Add input validation, fallback to smaller model",
        "Validate audio format before processing, retry logic"),
]

print("=== Incident Runbooks ===")
for r in runbooks:
    print(f"  [{r.incident}]")
    print(f"    Detect: {r.detection}")
    print(f"    Diagnose: {r.diagnosis}")
    print(f"    Mitigate: {r.mitigation}")
    print(f"    Prevent: {r.prevention}")

# Error Budget
budget = {
    "Monthly SLO": "99.9% availability = 43.2 minutes downtime allowed",
    "Current Month": "99.95% = 21.6 minutes used (50% budget remaining)",
    "Burn Rate": "Normal (< 1x)",
    "Action": "Continue feature work (budget healthy)",
    "If > 80% burned": "Freeze deploys, focus on reliability",
    "If > 100% burned": "All hands on reliability, no features",
}

print("\n\nError Budget:")
for k, v in budget.items():
    print(f"  [{k}]: {v}")
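The figures in the error budget follow from simple arithmetic: a 99.9% SLO over a 30-day month leaves 0.1% of 43,200 minutes, which is 43.2 minutes. The sketch below reproduces that calculation and the 80% and 100% policy thresholds. The function names are hypothetical, and the 30-day window is an assumption that matches the 43.2-minute figure.

```python
def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    return (1.0 - slo) * days * 24 * 60


def budget_consumed(slo: float, downtime_minutes: float, days: int = 30) -> float:
    """Fraction of the error budget already used (1.0 means fully burned)."""
    return downtime_minutes / error_budget_minutes(slo, days)


def burn_rate(consumed: float, elapsed_days: float, days: int = 30) -> float:
    """Budget used relative to time elapsed; above 1.0 burns too fast."""
    return consumed / (elapsed_days / days)


def policy(consumed: float) -> str:
    """Map budget consumption to the actions listed in the budget table."""
    if consumed > 1.0:
        return "All hands on reliability, no features"
    if consumed > 0.8:
        return "Freeze deploys, focus on reliability"
    return "Continue feature work"


used = budget_consumed(0.999, downtime_minutes=21.6)
print(round(error_budget_minutes(0.999), 1), round(used, 2), policy(used))
```

A burn rate computed this way is more actionable than the raw percentage: 50% consumed is healthy on day 25 of the window but alarming on day 5.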

Tips

  • Faster-Whisper: Use Faster-Whisper (built on CTranslate2). It is up to 4x faster and uses less memory.
  • Queue: Use a Redis queue to separate upload from processing so the API is never blocked.
  • Cache: Cache the model in a PVC so it does not have to be downloaded every time a pod starts.
  • Streaming: Use streaming transcription for long audio and return results segment by segment.
  • Error Budget: Track the error budget. If more than 80% is used, stop deploying features.
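The queue tip is about decoupling: the upload path only enqueues a job and returns, while a separate worker does the slow transcription. The sketch below shows the pattern with Python's in-process `queue.Queue` standing in for Redis, which is an assumption made to keep the example self-contained. `submit`, `worker`, and `transcribe` are hypothetical names, and `transcribe` is a placeholder rather than a real Whisper call.

```python
import queue
import threading

jobs: queue.Queue = queue.Queue(maxsize=100)  # bounded, so backlog is visible
results: dict[str, str] = {}


def submit(audio_id: str) -> str:
    """Upload path: enqueue the job and return immediately."""
    jobs.put_nowait(audio_id)  # raises queue.Full when the backlog is too long
    return "queued"


def transcribe(audio_id: str) -> str:
    """Placeholder for the real transcription call (hypothetical)."""
    return f"transcript of {audio_id}"


def worker() -> None:
    """Processing path: pull jobs until the None sentinel arrives."""
    while True:
        audio_id = jobs.get()
        if audio_id is None:
            jobs.task_done()
            break
        results[audio_id] = transcribe(audio_id)
        jobs.task_done()


thread = threading.Thread(target=worker, daemon=True)
thread.start()
for name in ("a.wav", "b.wav"):
    submit(name)
jobs.put(None)  # tell the worker to stop
jobs.join()     # wait until every queued job has been processed
print(results)
```

The bounded queue matters for reliability: rejecting uploads when the backlog is full keeps queue wait within the p95 target instead of letting latency grow without limit.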

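What is Whisper Speech?

Whisper is OpenAI's speech recognition model. It supports 99 languages, including Thai, and comes in tiny, base, small, medium, and large sizes. It is a Transformer-based model that handles speech-to-text (STT) and translation, and it can run locally on a GPU. Faster-Whisper runs it up to 4x faster.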