SiamCafe.net Blog
Technology

Whisper Speech Site Reliability SRE

2025-06-25 · อ. บอม — SiamCafe.net · 11,703 words

Tags: Infrastructure, Auto-scaling, Monitoring, Incident Response, Capacity Planning, GPU Cluster, Production Operations

SLI                   | SLO Target                  | Measurement                      | Alert Threshold
----------------------|-----------------------------|----------------------------------|-----------------------
Availability          | 99.9% uptime                | Successful requests / Total      | < 99.5% (15min window)
Latency (short audio) | p99 < 5s for < 30s audio    | Request duration histogram       | p99 > 8s
Latency (long audio)  | p99 < 30s for < 5min audio  | Request duration histogram       | p99 > 45s
Error Rate            | < 0.1%                      | 5xx / Total requests             | > 0.5%
Queue Wait            | p95 < 10s                   | Time in queue before processing  | p95 > 20s
GPU Utilization       | 60-80%                      | nvidia-smi metrics               | > 90% or < 30%
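To make the availability row concrete, here is a minimal sketch (function names are illustrative, not from the production service) that computes the availability SLI and checks it against the 99.5% alert threshold from the table:

```python
def availability_sli(successful: int, total: int) -> float:
    """Availability SLI: successful requests / total requests."""
    if total == 0:
        return 1.0  # no traffic in the window counts as healthy
    return successful / total

def availability_alert(successful: int, total: int, threshold: float = 0.995) -> bool:
    """True when availability in the 15-minute window drops below the alert line."""
    return availability_sli(successful, total) < threshold

# 997 successes out of 1000 requests -> 99.7%, above the 99.5% alert threshold
print(availability_alert(997, 1000))  # False
```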

Infrastructure Architecture

# === Whisper API Infrastructure ===

# Kubernetes Deployment
# apiVersion: apps/v1
# kind: Deployment
# metadata:
#   name: whisper-api
# spec:
#   replicas: 3
#   selector:
#     matchLabels:
#       app: whisper-api
#   template:
#     metadata:
#       labels:
#         app: whisper-api
#     spec:
#       containers:
#         - name: whisper
#           image: registry.io/whisper-api:v2.1
#           ports:
#             - containerPort: 8000
#           resources:
#             requests:
#               cpu: "4"
#               memory: "16Gi"
#               nvidia.com/gpu: "1"
#             limits:
#               cpu: "8"
#               memory: "32Gi"
#               nvidia.com/gpu: "1"
#           env:
#             - name: MODEL_SIZE
#               value: "large-v3"
#             - name: COMPUTE_TYPE
#               value: "float16"
#             - name: NUM_WORKERS
#               value: "2"
#           readinessProbe:
#             httpGet:
#               path: /health
#               port: 8000
#             initialDelaySeconds: 30
#           livenessProbe:
#             httpGet:
#               path: /health
#               port: 8000
#             periodSeconds: 30

# HPA with GPU metrics
# apiVersion: autoscaling/v2
# kind: HorizontalPodAutoscaler
# metadata:
#   name: whisper-hpa
# spec:
#   scaleTargetRef:
#     apiVersion: apps/v1
#     kind: Deployment
#     name: whisper-api
#   minReplicas: 2
#   maxReplicas: 10
#   metrics:
#     - type: Pods
#       pods:
#         metric:
#           name: gpu_utilization
#         target:
#           type: AverageValue
#           averageValue: "75"
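The HPA above scales on average GPU utilization against a target of 75. The controller's core arithmetic is desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), clamped to minReplicas/maxReplicas. A quick sketch of that formula (illustrative, not the controller's actual code):

```python
import math

def hpa_desired_replicas(current_replicas: int, current_metric: float,
                         target_metric: float, min_replicas: int = 2,
                         max_replicas: int = 10) -> int:
    """Kubernetes HPA scaling formula, clamped to the configured bounds."""
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, desired))

# 3 pods at 90% average GPU utilization against a 75% target -> scale up to 4
print(hpa_desired_replicas(3, 90, 75))  # 4
```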

from dataclasses import dataclass

@dataclass
class InfraComponent:
    component: str
    spec: str
    replicas: str
    purpose: str

infra = [
    InfraComponent("Whisper API Pod", "T4 GPU + 16GB RAM", "2-10 (HPA)", "Transcription processing"),
    InfraComponent("Load Balancer", "Nginx Ingress", "2 (HA)", "Route requests, SSL termination"),
    InfraComponent("Redis Queue", "8GB RAM", "3 (Sentinel)", "Job queue, rate limiting"),
    InfraComponent("Object Storage", "S3/MinIO", "HA", "Audio file upload storage"),
    InfraComponent("Prometheus", "4 CPU 16GB", "2 (HA)", "Metrics collection"),
    InfraComponent("Grafana", "2 CPU 4GB", "1", "Dashboard visualization"),
]

print("=== Infrastructure ===")
for i in infra:
    print(f"  [{i.component}] Spec: {i.spec}")
    print(f"    Replicas: {i.replicas} | Purpose: {i.purpose}")

Monitoring and Alerting

# === Prometheus Metrics and Alerts ===

# Custom metrics in FastAPI
# from prometheus_client import Histogram, Counter, Gauge
#
# TRANSCRIPTION_DURATION = Histogram(
#     'whisper_transcription_seconds',
#     'Transcription processing time',
#     ['model_size', 'audio_duration_bucket'],
#     buckets=[1, 2, 5, 10, 20, 30, 60, 120]
# )
# TRANSCRIPTION_ERRORS = Counter(
#     'whisper_transcription_errors_total',
#     'Total transcription errors',
#     ['error_type']
# )
# GPU_MEMORY_USED = Gauge(
#     'whisper_gpu_memory_bytes',
#     'GPU memory usage'
# )
# QUEUE_LENGTH = Gauge(
#     'whisper_queue_length',
#     'Number of jobs waiting in queue'
# )
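The audio_duration_bucket label on TRANSCRIPTION_DURATION needs a bucketing rule. A minimal sketch of one possible rule, plus a stdlib timing wrapper standing in for the prometheus_client observe call (the bucket names and boundaries are assumptions, chosen to line up with the short/long-audio SLOs above):

```python
import time
from contextlib import contextmanager

def audio_duration_bucket(seconds: float) -> str:
    """Map raw audio length to a label bucket for the histogram."""
    if seconds < 30:
        return "short"      # covered by the p99 < 5s SLO
    if seconds < 300:
        return "long"       # covered by the p99 < 30s SLO
    return "very_long"

@contextmanager
def observe_duration(samples: list):
    """Stand-in for TRANSCRIPTION_DURATION.labels(...).time()."""
    start = time.perf_counter()
    try:
        yield
    finally:
        samples.append(time.perf_counter() - start)

samples = []
with observe_duration(samples):
    pass  # the actual transcription call would go here
print(audio_duration_bucket(12.0))  # short
```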

# Prometheus Alert Rules
# groups:
#   - name: whisper-sre
#     rules:
#       - alert: WhisperHighLatency
#         expr: histogram_quantile(0.99, sum(rate(whisper_transcription_seconds_bucket[5m])) by (le)) > 30
#         for: 5m
#         labels: { severity: critical }
#         annotations:
#           summary: "Whisper p99 latency > 30s"
#
#       - alert: WhisperHighErrorRate
#         expr: rate(whisper_transcription_errors_total[5m]) / rate(whisper_transcription_seconds_count[5m]) > 0.005
#         for: 5m
#         labels: { severity: critical }
#
#       - alert: WhisperGPUOverload
#         expr: avg(nvidia_gpu_utilization) > 90
#         for: 10m
#         labels: { severity: warning }
#         annotations:
#           summary: "GPU utilization > 90%, consider scaling up"
#
#       - alert: WhisperQueueBacklog
#         expr: whisper_queue_length > 50
#         for: 5m
#         labels: { severity: warning }

@dataclass
class AlertConfig:
    alert: str
    condition: str
    severity: str
    response: str
    auto_action: str

alerts = [
    AlertConfig("High Latency", "p99 > 30s for 5min", "P1 Critical",
        "Scale up GPU pods, check queue", "HPA auto-scale"),
    AlertConfig("High Error Rate", "> 0.5% errors for 5min", "P1 Critical",
        "Check GPU OOM, model loading, disk space", "Circuit breaker"),
    AlertConfig("GPU Overload", "> 90% utilization for 10min", "P2 Warning",
        "Scale up or optimize batch size", "HPA auto-scale"),
    AlertConfig("Queue Backlog", "> 50 jobs waiting for 5min", "P2 Warning",
        "Scale up workers, check processing speed", "HPA auto-scale"),
    AlertConfig("GPU OOM", "GPU memory > 95%", "P1 Critical",
        "Reduce batch size, restart pod", "Auto-restart"),
    AlertConfig("Pod Crash", "Container restart > 3 in 10min", "P1 Critical",
        "Check logs, GPU driver, model file", "Alert on-call"),
]

print("Alert Configuration:")
for a in alerts:
    print(f"  [{a.alert}] {a.condition}")
    print(f"    Severity: {a.severity}")
    print(f"    Response: {a.response}")
    print(f"    Auto: {a.auto_action}")
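The "Circuit breaker" auto-action in the table can be as simple as tripping open after N consecutive failures and shedding load until a cooldown elapses. A minimal sketch (the class, thresholds, and half-open behavior are illustrative assumptions, not the service's actual implementation):

```python
import time
from typing import Optional

class CircuitBreaker:
    """Trip open after max_failures consecutive errors; probe again after cooldown."""
    def __init__(self, max_failures: int = 5, cooldown: float = 30.0):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown:
            self.opened_at = None  # half-open: let one request probe the backend
            self.failures = 0
            return True
        return False

    def record_success(self) -> None:
        self.failures = 0

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.monotonic()

cb = CircuitBreaker(max_failures=2)
cb.record_failure()
cb.record_failure()
print(cb.allow())  # False -> shed load instead of hammering a failing GPU pod
```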

Incident Response

# === Incident Runbook ===

@dataclass
class Runbook:
    incident: str
    detection: str
    diagnosis: str
    mitigation: str
    prevention: str

runbooks = [
    Runbook("GPU Out of Memory",
        "Alert: GPU memory > 95%, pod OOMKilled",
        "kubectl logs pod | grep OOM; nvidia-smi on node",
        "Reduce BATCH_SIZE, restart pod; scale horizontally",
        "Set memory limits, use streaming for long audio"),
    Runbook("High Latency Spike",
        "Alert: p99 latency > 30s",
        "Check queue length, GPU util, concurrent requests",
        "Scale up HPA, enable request queuing with timeout",
        "Capacity planning, pre-warm instances before peak"),
    Runbook("Model Loading Failure",
        "Alert: pod not ready, health check fail",
        "kubectl describe pod; check model download, disk space",
        "Rollback to previous image, pre-cache model in PVC",
        "Use init container to download model, PVC cache"),
    Runbook("Audio Processing Error",
        "Alert: error rate > 0.5%",
        "Check error logs, audio format, ffmpeg issues",
        "Add input validation, fallback to smaller model",
        "Validate audio format before processing, retry logic"),
]

print("=== Incident Runbooks ===")
for r in runbooks:
    print(f"  [{r.incident}]")
    print(f"    Detect: {r.detection}")
    print(f"    Diagnose: {r.diagnosis}")
    print(f"    Mitigate: {r.mitigation}")
    print(f"    Prevent: {r.prevention}")
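The audio-processing runbook lists "retry logic" as a prevention step. A minimal exponential-backoff sketch of that idea (the helper name and delays are illustrative; the sleep hook is injectable so it can be tested without waiting):

```python
import time

def retry_with_backoff(fn, max_attempts: int = 3, base_delay: float = 0.5,
                       sleep=time.sleep):
    """Retry a flaky call with exponential backoff: 0.5s, 1s, 2s, ..."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * (2 ** attempt))

calls = {"n": 0}
def flaky_transcribe():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient decode error")
    return "transcript"

print(retry_with_backoff(flaky_transcribe, sleep=lambda _: None))  # transcript
```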

# Error Budget
budget = {
    "Monthly SLO": "99.9% availability = 43.2 minutes downtime allowed",
    "Current Month": "99.95% = 21.6 minutes used (50% budget remaining)",
    "Burn Rate": "Normal (< 1x)",
    "Action": "Continue feature work (budget healthy)",
    "If > 80% burned": "Freeze deploys, focus on reliability",
    "If > 100% burned": "All hands on reliability, no features",
}

print("\n\nError Budget:")
for k, v in budget.items():
    print(f"  [{k}]: {v}")
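The budget numbers above follow directly from the SLO: allowed downtime = (1 − SLO) × period, and burn rate = (fraction of budget consumed) / (fraction of the period elapsed), with 1.0x meaning the budget burns exactly on pace. A sketch of that arithmetic (function names are illustrative):

```python
def monthly_downtime_budget_min(slo: float, period_days: float = 30.0) -> float:
    """Allowed downtime minutes for an availability SLO, e.g. 99.9% -> 43.2."""
    return (1 - slo) * period_days * 24 * 60

def burn_rate(budget_used_min: float, budget_total_min: float,
              days_elapsed: float, period_days: float = 30.0) -> float:
    """Error-budget burn rate; 1.0 means burning exactly on pace."""
    used_fraction = budget_used_min / budget_total_min
    time_fraction = days_elapsed / period_days
    return used_fraction / time_fraction

# 99.9% SLO -> 43.2 allowed minutes; 21.6 used by month end -> 0.5x (normal)
print(round(monthly_downtime_budget_min(0.999), 1))   # 43.2
print(burn_rate(21.6, 43.2, days_elapsed=30))         # 0.5
```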

Tips

What is Whisper Speech?

Whisper is OpenAI's speech recognition model. It supports 99 languages, including Thai, and ships in tiny, base, small, medium, and large sizes. It handles both speech-to-text (STT) and translation, is built on the Transformer architecture, and can run on a local GPU; Faster-Whisper speeds inference up by roughly 4x.

What does SRE for a Speech API involve?

Keeping uptime at 99.9%, auto-scaling the GPU cluster, monitoring and alerting on latency and error rate, incident response, capacity planning, performance tuning, and load balancing.

How do you set up monitoring?

Track latency (p50, p95, p99), WER, GPU utilization (target 60-80%), queue length, error rate (target < 0.1%), and throughput in audio minutes. Collect metrics with Prometheus, visualize them in Grafana, page through PagerDuty, and measure everything against SLOs.

How do you handle incidents?

Run 24/7 on-call via PagerDuty, backed by runbooks for GPU OOM, high latency, and queue overflow. Respond to P1 incidents within 5 minutes and P2 within 15 minutes, hold post-mortems, and track the error budget.

Summary

Whisper Speech SRE covers GPU infrastructure with HPA auto-scaling, monitoring with Prometheus and Grafana, alerting, incident runbooks, error budgets, capacity planning, and Faster-Whisper in production.
