TTS Coqui Machine Learning Pipeline

What Coqui TTS Is and How to Use It

Coqui TTS is an open-source text-to-speech library that continues the Mozilla TTS project. It supports several model families, including Tacotron2, VITS, GlowTTS, FastSpeech2, and XTTS. It can synthesize speech from text in many languages, and it offers voice cloning that imitates a speaker from an audio sample only a few seconds long.

XTTS (Cross-lingual TTS) is Coqui's most recent model. It supports 17 languages, can clone a voice from about 6 seconds of reference audio, produces very natural-sounding speech, and supports streaming output for real-time applications.

A machine learning pipeline for TTS has six main stages: data collection (gathering audio-text pairs), data preprocessing (cleaning audio and normalizing text), model training, evaluation (measuring audio quality with a MOS score), model serving (deploying the model for inference), and monitoring (tracking quality metrics).
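
The evaluation stage is usually summarized as a Mean Opinion Score (MOS). As a minimal sketch, assuming listeners rate each clip on a 1-5 scale and that a normal-approximation 95% interval is acceptable, the score could be computed like this (the function name and the sample ratings are illustrative, not real results):

```python
import math
from statistics import mean, stdev


def mos_with_ci(ratings, z=1.96):
    """Return (MOS, half-width of an approximate 95% confidence interval)."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be on the 1-5 scale")
    mos = mean(ratings)
    # Standard error of the mean, scaled by the z value for ~95% coverage
    half_width = z * stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Made-up listener ratings for a single system
ratings = [4, 5, 3, 4, 4, 5, 4, 3]
mos, ci = mos_with_ci(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

Reporting the interval alongside the mean makes it easier to tell whether two checkpoints really differ or whether the listener panel was simply too small.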

Use cases for TTS include audiobook generation, virtual assistants, accessibility tools for visually impaired users, content creation (podcasts, videos), customer service IVR systems, and language learning applications.

Installing Coqui TTS and Testing Models

Installation steps and getting started

# Install Coqui TTS
pip install TTS

# Verify the install by listing the available models
tts --list_models

# === Command Line Usage ===

# Synthesize with the default model
tts --text "Hello, this is a test of text to speech." \
    --out_path output.wav

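# Use a specific model. XTTS v2 is multi-speaker, so it needs either a
# built-in speaker (--speaker_idx) or a reference clip (--speaker_wav).
# The text means "Hello, this is a test of the text-to-speech system."
# Confirm that "th" is in the checkpoint's language list before relying on it.
tts --text "สวัสดีครับ นี่คือการทดสอบระบบแปลงข้อความเป็นเสียง" \
    --model_name tts_models/multilingual/multi-dataset/xtts_v2 \
    --speaker_idx "Ana Florence" \
    --language_idx th \
    --out_path thai_output.wav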

# Voice cloning (uses a reference audio clip)
tts --text "This is voice cloning test" \
    --model_name tts_models/multilingual/multi-dataset/xtts_v2 \
    --speaker_wav reference_voice.wav \
    --language_idx en \
    --out_path cloned_output.wav

# List available models
tts --list_models | head -20

# === Python API ===
from TTS.api import TTS
import torch

# Check GPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load XTTS v2 model
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

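# Basic synthesis. XTTS v2 is multi-speaker, so pass a built-in speaker
# name (or speaker_wav, as in the cloning example below)
tts.tts_to_file(
    text="Hello world, this is Coqui TTS.",
    file_path="output.wav",
    speaker="Ana Florence",
    language="en",
)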

# Voice cloning
tts.tts_to_file(
    text="This sounds like the reference speaker.",
    speaker_wav="reference.wav",
    language="en",
    file_path="cloned.wav",
)

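# Streaming (for real-time). The high-level TTS class has no streaming method;
# XTTS exposes it on the underlying Xtts model, roughly:
# xtts = tts.synthesizer.tts_model
# gpt_cond_latent, speaker_embedding = xtts.get_conditioning_latents(
#     audio_path=["reference.wav"]
# )
# for chunk in xtts.inference_stream(
#     "This is streamed output.", "en", gpt_cond_latent, speaker_embedding
# ):
#     audio_player.play(chunk)  # audio_player is a placeholder for playback code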

# === Docker Setup ===
# docker run --rm -it -p 5002:5002 \
#   --gpus all \
#   ghcr.io/coqui-ai/tts \
#   --model_name tts_models/multilingual/multi-dataset/xtts_v2

# TTS Server API:
# POST http://localhost:5002/api/tts
# {"text": "Hello", "speaker_wav": "base64...", "language": "en"}

Building an ML Pipeline for TTS Training

A pipeline for training a TTS model, from data preparation through evaluation

#!/usr/bin/env python3
# tts_pipeline.py — TTS Training Pipeline
import os
import json
import subprocess
import logging
from pathlib import Path
from dataclasses import dataclass
from typing import List, Optional
import librosa
import soundfile as sf
import numpy as np

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("tts_pipeline")

@dataclass
class AudioSample:
    audio_path: str
    text: str
    speaker_id: str = "default"
    duration: float = 0.0

class DataPreprocessor:
    def __init__(self, output_dir="data/processed", sample_rate=22050):
        self.output_dir = Path(output_dir)
        self.sample_rate = sample_rate
        self.output_dir.mkdir(parents=True, exist_ok=True)
    
    def validate_audio(self, audio_path):
        try:
            y, sr = librosa.load(audio_path, sr=None)
            duration = len(y) / sr
            
            if duration < 1.0:
                return False, "Too short (< 1s)"
            if duration > 15.0:
                return False, "Too long (> 15s)"
            
            rms = np.sqrt(np.mean(y**2))
            if rms < 0.01:
                return False, "Too quiet (silence)"
            
            return True, f"OK (duration={duration:.1f}s, rms={rms:.3f})"
        except Exception as e:
            return False, str(e)
    
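    def normalize_audio(self, audio_path, output_path):
        # Resample to the target rate while loading
        y, sr = librosa.load(audio_path, sr=self.sample_rate)
        
        # Trim leading and trailing silence (anything 20 dB below the peak)
        y = librosa.effects.trim(y, top_db=20)[0]
        
        # Scale to a common RMS level so clips have similar loudness
        target_rms = 0.1
        current_rms = np.sqrt(np.mean(y**2)) if y.size else 0.0
        if current_rms > 0:
            y = y * (target_rms / current_rms)
        
        # Guard against samples pushed outside [-1, 1] by the gain
        y = np.clip(y, -1.0, 1.0)
        
        sf.write(output_path, y, self.sample_rate)
        return len(y) / self.sample_rate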
    
    def prepare_dataset(self, samples: List[AudioSample]):
        valid_samples = []
        
        for i, sample in enumerate(samples):
            ok, msg = self.validate_audio(sample.audio_path)
            if not ok:
                logger.warning(f"Skipping {sample.audio_path}: {msg}")
                continue
            
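            output_name = f"audio_{i:05d}.wav"
            # The "ljspeech" formatter looks for audio under <dataset>/wavs/
            wav_dir = self.output_dir / "wavs"
            wav_dir.mkdir(parents=True, exist_ok=True)
            output_path = wav_dir / output_name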
            
            duration = self.normalize_audio(sample.audio_path, str(output_path))
            
            valid_samples.append(AudioSample(
                audio_path=str(output_path),
                text=sample.text,
                speaker_id=sample.speaker_id,
                duration=duration,
            ))
            
            if (i + 1) % 100 == 0:
                logger.info(f"Processed {i+1}/{len(samples)}")
        
        metadata_path = self.output_dir / "metadata.csv"
        with open(metadata_path, "w", encoding="utf-8") as f:
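            # LJSpeech metadata columns: file id|raw text|normalized text
            for s in valid_samples:
                # "|" is the column separator, so strip it (and newlines) from the text
                text = " ".join(s.text.replace("|", " ").split())
                f.write(f"{Path(s.audio_path).stem}|{text}|{text}\n")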
        
        logger.info(f"Dataset ready: {len(valid_samples)}/{len(samples)} samples")
        return valid_samples

class TTSTrainer:
    def __init__(self, model_type="vits", output_dir="models/tts"):
        self.model_type = model_type
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(parents=True, exist_ok=True)
    
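    def create_config(self, dataset_path, epochs=1000):
        # Simplified, illustrative layout. Real Coqui configs are built from the
        # model's config class (e.g. VitsConfig), whose field names differ.
        config = {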
            "model": self.model_type,
            "run_name": f"tts_{self.model_type}",
            "output_path": str(self.output_dir),
            "datasets": [{
                "formatter": "ljspeech",
                "meta_file_train": "metadata.csv",
                "path": str(dataset_path),
                "language": "th",
            }],
            "audio": {
                "sample_rate": 22050,
                "win_length": 1024,
                "hop_length": 256,
                "num_mels": 80,
                "mel_fmin": 0,
                "mel_fmax": 8000,
            },
            "training": {
                "epochs": epochs,
                "batch_size": 32,
                "eval_batch_size": 16,
                "lr": 0.0002,
                "print_step": 50,
                "eval_split_size": 0.05,
            },
        }
        
        config_path = self.output_dir / "config.json"
        with open(config_path, "w") as f:
            json.dump(config, f, indent=2)
        
        return config_path
    
    def train(self, config_path):
        cmd = [
            "python", "-m", "TTS.bin.train_tts",
            "--config_path", str(config_path),
        ]
        
        logger.info(f"Starting training: {' '.join(cmd)}")
        process = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        
        for line in iter(process.stdout.readline, b""):
            logger.info(line.decode().strip())
        
        process.wait()
        return process.returncode == 0

# Usage
preprocessor = DataPreprocessor()
trainer = TTSTrainer(model_type="vits")
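
The `DataPreprocessor` above works on the audio side; the preprocessing stage also calls for normalized text. A minimal sketch, assuming Unicode NFC normalization, whitespace collapsing, and removal of the `|` metadata separator are enough for the dataset (the `normalize_text` helper is hypothetical and not part of Coqui TTS):

```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    """Clean a transcript before it is written to metadata.csv."""
    # Use one canonical form for composed characters (matters for Thai marks)
    text = unicodedata.normalize("NFC", text)
    # "|" separates metadata columns, so it must not appear inside the text
    text = text.replace("|", " ")
    # Collapse runs of spaces, tabs, and newlines into single spaces
    text = re.sub(r"\s+", " ", text)
    return text.strip()


print(normalize_text("  Hello |  world\n"))
```

Language-specific steps such as expanding numbers or abbreviations would be added on top of this, depending on the corpus.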

Fine-tuning a Model with a Custom Dataset

Fine-tune XTTS v2 with your own voice

#!/usr/bin/env python3
# finetune_xtts.py — Fine-tune XTTS v2 with Custom Voice
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts
from TTS.utils.manage import ModelManager
from TTS.tts.datasets import load_tts_samples
import torch
import json
from pathlib import Path

class XTTSFineTuner:
    def __init__(self, output_dir="models/xtts_finetuned"):
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(parents=True, exist_ok=True)
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
    
    def prepare_dataset(self, audio_dir, metadata_file):
        """
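        metadata.csv format (LJSpeech style: file id|raw text|normalized text):
        audio_0001|This is the transcription of the first audio.|This is the transcription of the first audio.
        audio_0002|This is another transcription.|This is another transcription.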
        
        Audio files in wav format, 22050 Hz, mono
        Minimum 2 hours of audio recommended
        """
        
        audio_files = list(Path(audio_dir).glob("*.wav"))
        print(f"Found {len(audio_files)} audio files")
        
        import librosa  # imported once, outside the loop
        
        total_duration = 0.0
        for af in audio_files:
            y, sr = librosa.load(str(af), sr=None)
            total_duration += len(y) / sr
        
        print(f"Total audio duration: {total_duration/3600:.1f} hours")
        
        if total_duration < 600:
            print("WARNING: Less than 10 minutes of audio. Quality may be poor.")
        
        return {
            "audio_dir": str(audio_dir),
            "metadata": str(metadata_file),
            "total_duration_hours": total_duration / 3600,
            "num_samples": len(audio_files),
        }
    
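    def create_training_config(self, dataset_info, epochs=50, batch_size=4):
        # Simplified sketch. The official XTTS fine-tuning recipe uses the GPT
        # trainer classes, so treat this config as illustrative.
        from TTS.config.shared_configs import BaseDatasetConfig
        
        config = XttsConfig()
        config.model_dir = str(self.output_dir)
        config.output_path = str(self.output_dir)
        
        # Hyperparameters are flat fields on the config object; a nested
        # "training" dict is not a declared field and may not be saved
        config.epochs = epochs
        config.batch_size = batch_size
        config.eval_batch_size = 2
        config.lr = 5e-6
        config.grad_clip = 1.0
        config.optimizer_params = {"weight_decay": 1e-2}
        
        config.datasets = [BaseDatasetConfig(
            formatter="ljspeech",
            path=dataset_info["audio_dir"],
            meta_file_train=dataset_info["metadata"],
            language="th",
        )]
        
        config_path = self.output_dir / "config.json"
        config.save_json(str(config_path))
        
        return config_path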
    
    def evaluate(self, model_path, test_texts):
        print("Loading fine-tuned model...")
        config = XttsConfig()
        config.load_json(str(self.output_dir / "config.json"))
        
        model = Xtts.init_from_config(config)
        model.load_checkpoint(config, checkpoint_dir=str(self.output_dir))
        model.to(self.device)
        model.eval()
        
        results = []
        for i, text in enumerate(test_texts):
            output_path = self.output_dir / f"eval_{i:03d}.wav"
            
            outputs = model.synthesize(
                text,
                config,
                speaker_wav=str(self.output_dir / "reference.wav"),
                language="th",
            )
            
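            import soundfile as sf
            # XTTS generates 24 kHz audio; read the rate from the config
            # instead of hard-coding 22050
            sf.write(str(output_path), outputs["wav"], config.audio.output_sample_rate)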
            results.append(str(output_path))
            print(f"Generated: {output_path}")
        
        return results

# Training script
# python -m TTS.bin.train_tts \
#   --config_path models/xtts_finetuned/config.json \
#   --restore_path path/to/xtts_v2_checkpoint.pth

# Quick alternative: zero-shot voice cloning with the TTS API (no training)
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# Cloning call (simplified). The text means "Testing Thai speech."
# tts.tts_to_file(
#     text="ทดสอบเสียงภาษาไทย",
#     speaker_wav="my_voice.wav",
#     language="th",
#     file_path="output.wav"
# )
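
Decoding every clip just to measure its length can be slow on a large dataset. As an alternative sketch, assuming the clips are uncompressed PCM WAV files, the standard-library `wave` module can read durations from the file headers alone (`total_wav_duration` is an illustrative helper, not part of Coqui TTS):

```python
import wave
from pathlib import Path


def total_wav_duration(audio_dir) -> float:
    """Sum the duration in seconds of every .wav file in a directory."""
    total = 0.0
    for path in sorted(Path(audio_dir).glob("*.wav")):
        with wave.open(str(path), "rb") as wf:
            # Frames divided by frame rate gives seconds; no samples are decoded
            total += wf.getnframes() / wf.getframerate()
    return total
```

A call such as `total_wav_duration("data/processed/wavs") / 3600` would give the hours of audio available, which can then be compared against the 2-hour recommendation in the docstring above.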

Building a TTS API Service for Production

A REST API for the TTS service

#!/usr/bin/env python3
# tts_api.py — Production TTS API Service
from fastapi import FastAPI, HTTPException, BackgroundTasks
from fastapi.responses import StreamingResponse, FileResponse
from pydantic import BaseModel, Field
from typing import Optional
import uvicorn
import torch
import io
import hashlib
import time
import logging
from pathlib import Path
from datetime import datetime

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("tts_api")

app = FastAPI(title="Coqui TTS API", version="1.0")

# Global model (loaded once)
tts_model = None
CACHE_DIR = Path("cache")
CACHE_DIR.mkdir(exist_ok=True)

class TTSRequest(BaseModel):
    text: str = Field(..., min_length=1, max_length=5000)
    language: str = Field(default="en")
    speaker_wav_url: Optional[str] = None
    speed: float = Field(default=1.0, ge=0.5, le=2.0)
    format: str = Field(default="wav")

class TTSResponse(BaseModel):
    audio_url: str
    duration_seconds: float
    processing_time_ms: float
    cached: bool

def get_model():
    global tts_model
    if tts_model is None:
        from TTS.api import TTS
        device = "cuda" if torch.cuda.is_available() else "cpu"
        tts_model = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)
        logger.info(f"Model loaded on {device}")
    return tts_model

def get_cache_key(text, language, speed):
    content = f"{text}|{language}|{speed}"
    return hashlib.md5(content.encode()).hexdigest()

@app.on_event("startup")
async def startup():
    get_model()

@app.post("/synthesize")
async def synthesize(request: TTSRequest):
    start = time.time()
    
    cache_key = get_cache_key(request.text, request.language, request.speed)
    cache_path = CACHE_DIR / f"{cache_key}.wav"
    
    if cache_path.exists():
        import soundfile as sf
        # Read the duration from the file header; librosa's `filename=` argument is deprecated
        duration = sf.info(str(cache_path)).duration
        return TTSResponse(
            audio_url=f"/audio/{cache_key}.wav",
            duration_seconds=round(duration, 2),
            processing_time_ms=round((time.time() - start) * 1000, 2),
            cached=True,
        )
    
    try:
        model = get_model()
        model.tts_to_file(
            text=request.text,
            language=request.language,
            speaker="Ana Florence",  # XTTS v2 needs a speaker; built-in voice used as a default
            file_path=str(cache_path),
        )
        
        import soundfile as sf
        # Read the duration from the file header; librosa's `filename=` argument is deprecated
        duration = sf.info(str(cache_path)).duration
        
        processing_time = (time.time() - start) * 1000
        logger.info(f"Synthesized: {len(request.text)} chars, {duration:.1f}s audio, {processing_time:.0f}ms")
        
        return TTSResponse(
            audio_url=f"/audio/{cache_key}.wav",
            duration_seconds=round(duration, 2),
            processing_time_ms=round(processing_time, 2),
            cached=False,
        )
    except Exception as e:
        logger.error(f"Synthesis failed: {e}")
        raise HTTPException(500, f"Synthesis failed: {str(e)}")

@app.get("/audio/{filename}")
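async def get_audio(filename: str):
    import re
    # Only serve hex-named .wav files, which is what the cache produces
    if not re.fullmatch(r"[0-9a-f]+\.wav", filename):
        raise HTTPException(404, "Audio not found")
    filepath = CACHE_DIR / filename
    if not filepath.is_file():
        raise HTTPException(404, "Audio not found")
    return FileResponse(str(filepath), media_type="audio/wav")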

@app.post("/synthesize/stream")
async def synthesize_stream(request: TTSRequest):
    try:
        model = get_model()
        
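        # Note: the whole utterance is synthesized before the response starts,
        # so this sends audio from memory rather than streaming chunk by chunk
        buffer = io.BytesIO()
        model.tts_to_file(
            text=request.text,
            language=request.language,
            speaker="Ana Florence",  # XTTS v2 needs a speaker; built-in voice used as a default
            file_path=buffer,
        )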
        buffer.seek(0)
        
        return StreamingResponse(buffer, media_type="audio/wav")
    except Exception as e:
        raise HTTPException(500, str(e))

@app.get("/health")
async def health():
    return {
        "status": "healthy",
        "model_loaded": tts_model is not None,
        "gpu_available": torch.cuda.is_available(),
        "cache_size": len(list(CACHE_DIR.glob("*.wav"))),
    }

@app.get("/metrics")
async def metrics():
    cache_files = list(CACHE_DIR.glob("*.wav"))
    cache_size_mb = sum(f.stat().st_size for f in cache_files) / (1024 * 1024)
    
    return {
        "cached_items": len(cache_files),
        "cache_size_mb": round(cache_size_mb, 2),
        "gpu_memory_used_mb": round(torch.cuda.memory_allocated() / 1024**2, 2) if torch.cuda.is_available() else 0,
    }

# Docker Compose:
# services:
#   tts-api:
#     build: .
#     ports: ["8000:8000"]
#     volumes:
#       - ./cache:/app/cache
#       - ./models:/app/models
#     deploy:
#       resources:
#         reservations:
#           devices:
#             - driver: nvidia
#               count: 1
#               capabilities: [gpu]

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
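
The cache key in the service above covers text, language, and speed. If the service later honors `speaker_wav_url`, two requests with the same text but different voices would share one cache entry. A sketch of a key that also includes the speaker reference, assuming SHA-256 over a JSON payload (the function name and payload fields are illustrative):

```python
import hashlib
import json
from typing import Optional


def cache_key(text: str, language: str, speed: float,
              speaker_ref: Optional[str] = None) -> str:
    """Build a deterministic cache key from everything that changes the audio."""
    payload = json.dumps(
        {
            "text": text,
            "language": language,
            "speed": round(speed, 2),
            "speaker": speaker_ref or "",
        },
        sort_keys=True,       # stable field order gives a stable hash
        ensure_ascii=False,   # keep non-ASCII text as-is before encoding
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


key_a = cache_key("Hello", "en", 1.0, "voices/alice.wav")
key_b = cache_key("Hello", "en", 1.0, "voices/bob.wav")
print(key_a != key_b)
```

Serializing a dictionary rather than joining fields with a separator avoids collisions when the text itself contains the separator character.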

Monitoring and Model Versioning

A system for monitoring audio quality and versioning models

#!/usr/bin/env python3
# tts_monitor.py — TTS Quality Monitoring
import numpy as np
import librosa
from pathlib import Path
from datetime import datetime
import json
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("tts_monitor")

class TTSQualityMonitor:
    def __init__(self):
        self.metrics_log = []
    
    def analyze_audio(self, audio_path):
        y, sr = librosa.load(audio_path, sr=22050)
        
        duration = len(y) / sr
        rms = np.sqrt(np.mean(y**2))
        
        # Frame-level spectral features, averaged over the whole clip
        spectral_centroid = np.mean(librosa.feature.spectral_centroid(y=y, sr=sr))
        spectral_bandwidth = np.mean(librosa.feature.spectral_bandwidth(y=y, sr=sr))
        zero_crossing_rate = np.mean(librosa.feature.zero_crossing_rate(y))
        
        # Mean of the first 13 MFCCs, used later as a rough timbre fingerprint
        mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
        mfcc_mean = np.mean(mfccs, axis=1).tolist()
        
        # Fraction of samples whose amplitude is below the silence threshold
        silence_threshold = 0.01
        is_silence = np.abs(y) < silence_threshold
        silence_ratio = np.sum(is_silence) / len(y)
        
        metrics = {
            "duration_s": round(duration, 2),
            "rms_energy": round(float(rms), 4),
            "spectral_centroid": round(float(spectral_centroid), 2),
            "spectral_bandwidth": round(float(spectral_bandwidth), 2),
            "zero_crossing_rate": round(float(zero_crossing_rate), 4),
            "silence_ratio": round(float(silence_ratio), 4),
            "mfcc_mean": [round(m, 4) for m in mfcc_mean],
        }
        
        return metrics
    
    def check_quality(self, metrics):
        issues = []
        
        if metrics["duration_s"] < 0.5:
            issues.append("Audio too short (< 0.5s)")
        
        if metrics["rms_energy"] < 0.01:
            issues.append("Audio too quiet")
        elif metrics["rms_energy"] > 0.5:
            issues.append("Audio possibly clipping")
        
        if metrics["silence_ratio"] > 0.5:
            issues.append(f"Too much silence ({metrics['silence_ratio']:.0%})")
        
        if metrics["spectral_centroid"] < 500:
            issues.append("Low spectral centroid (muffled audio)")
        
        return {
            "passed": len(issues) == 0,
            "issues": issues,
            "score": max(0, 1.0 - len(issues) * 0.25),
        }
    
    def compare_with_reference(self, generated_path, reference_path):
        gen_metrics = self.analyze_audio(generated_path)
        ref_metrics = self.analyze_audio(reference_path)
        
        gen_mfcc = np.array(gen_metrics["mfcc_mean"])
        ref_mfcc = np.array(ref_metrics["mfcc_mean"])
        mfcc_distance = np.sqrt(np.sum((gen_mfcc - ref_mfcc)**2))
        
        energy_diff = abs(gen_metrics["rms_energy"] - ref_metrics["rms_energy"])
        centroid_diff = abs(gen_metrics["spectral_centroid"] - ref_metrics["spectral_centroid"])
        
        # Heuristic: map the MFCC distance onto [0, 1]; the divisor 100 is an arbitrary scale
        similarity = max(0, 1.0 - mfcc_distance / 100)
        
        return {
            "mfcc_distance": round(float(mfcc_distance), 4),
            "energy_difference": round(float(energy_diff), 4),
            "centroid_difference": round(float(centroid_diff), 2),
            "similarity_score": round(float(similarity), 4),
        }
    
    def batch_evaluate(self, audio_dir, reference_path=None):
        audio_files = sorted(Path(audio_dir).glob("*.wav"))
        results = []
        
        for af in audio_files:
            metrics = self.analyze_audio(str(af))
            quality = self.check_quality(metrics)
            
            result = {
                "file": af.name,
                "metrics": metrics,
                "quality": quality,
            }
            
            if reference_path:
                comparison = self.compare_with_reference(str(af), reference_path)
                result["comparison"] = comparison
            
            results.append(result)
        
        passed = sum(1 for r in results if r["quality"]["passed"])
        # np.mean([]) is NaN, so handle an empty directory explicitly
        avg_score = np.mean([r["quality"]["score"] for r in results]) if results else 0.0
        
        summary = {
            "total_files": len(results),
            "passed": passed,
            "failed": len(results) - passed,
            "avg_quality_score": round(float(avg_score), 4),
            "timestamp": datetime.utcnow().isoformat(),
        }
        
        print(f"\nTTS Quality Report")
        print(f"Total: {summary['total_files']}, Pass: {passed}, Fail: {summary['failed']}")
        print(f"Avg Score: {summary['avg_quality_score']:.2f}")
        
        return {"summary": summary, "details": results}

monitor = TTSQualityMonitor()
# result = monitor.batch_evaluate("output_audio/", "reference.wav")
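
The monitor above covers quality checks; the versioning half of this section has no code of its own. One lightweight approach, sketched here under the assumption that a JSON manifest per checkpoint is sufficient (the `register_model` helper, the `registry` directory, and the manifest fields are all hypothetical):

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def register_model(checkpoint_path, version, metrics, registry_dir="registry"):
    """Write a manifest that ties a version label to a checkpoint hash and its metrics."""
    checkpoint = Path(checkpoint_path)
    sha256 = hashlib.sha256()
    with open(checkpoint, "rb") as f:
        # Hash in 1 MiB blocks so large checkpoints are not loaded into memory
        for block in iter(lambda: f.read(1 << 20), b""):
            sha256.update(block)

    manifest = {
        "version": version,
        "checkpoint": checkpoint.name,
        "sha256": sha256.hexdigest(),
        "metrics": metrics,
        "registered_at": datetime.now(timezone.utc).isoformat(),
    }

    registry = Path(registry_dir)
    registry.mkdir(parents=True, exist_ok=True)
    manifest_path = registry / f"{version}.json"
    manifest_path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return manifest_path
```

The `summary` dictionary returned by `batch_evaluate` could be passed as `metrics`, so each version carries the quality report it was released with. The checkpoint hash makes it possible to confirm later that a deployed file matches the registered one.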

FAQ: Frequently Asked Questions

Q: Does Coqui TTS support Thai?

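A: The examples in this article pass `th` to XTTS v2, but check the model's supported-language list before relying on it: the 17 languages published for XTTS v2 do not appear to include Thai. Where Thai output is available, expect lower quality than English because far less Thai training data exists. For better results, fine-tune on an additional Thai dataset, using at least 2-5 hours of audio.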

Q: Is a GPU required?

A: Inference (generating audio) works on a CPU but is very slow (10-30x slower); a GPU makes real-time synthesis feasible. An NVIDIA GPU with at least 4 GB of VRAM is recommended for inference and 8 GB for fine-tuning. Training from scratch needs 16 GB of VRAM or more, for example an A100 or V100.
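
Whether a machine is fast enough is often expressed as the real-time factor (RTF): synthesis time divided by the duration of the audio produced, where values below 1.0 mean faster than real time. A minimal sketch (the timings are made up for illustration and are not measurements of any model):

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


# Hypothetical timings for a 6-second clip
rtf_gpu = real_time_factor(1.5, 6.0)
rtf_cpu = real_time_factor(30.0, 6.0)
print(f"GPU RTF = {rtf_gpu:.2f}, CPU RTF = {rtf_cpu:.2f}")
```

In the API service, `processing_time_ms / 1000` and `duration_seconds` from the response already provide both inputs, so RTF can be logged per request.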

Q: How much audio does voice cloning need?

A: XTTS v2 can clone a voice from as little as 6 seconds of audio, but quality improves considerably with 30 seconds to 2 minutes. The reference should be clear speech with no background noise or music, recorded as a mono channel. For production use that demands high quality, fine-tune the model on 2-5 hours of audio.
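
The length and channel recommendations are easy to check before cloning. A sketch using the standard-library `wave` module, assuming the reference clip is a PCM WAV file (`check_reference_clip` and its default threshold are illustrative; noise and music cannot be detected this way):

```python
import wave


def check_reference_clip(path, min_seconds=6.0):
    """Return a list of problems found in a reference clip (empty if none)."""
    problems = []
    with wave.open(str(path), "rb") as wf:
        channels = wf.getnchannels()
        duration = wf.getnframes() / wf.getframerate()
    if channels != 1:
        problems.append(f"expected mono, found {channels} channels")
    if duration < min_seconds:
        problems.append(f"too short: {duration:.1f}s < {min_seconds:.1f}s")
    return problems
```

An empty list means the clip passes these two checks; anything else can be shown to the user before synthesis is attempted.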

Q: How does Coqui TTS differ from ElevenLabs?

A: ElevenLabs is a commercial service with very good audio quality and an easy-to-use API, but it charges per character. Coqui TTS is free and open source, can be self-hosted so data never leaves your machine, and is highly customizable, but you have to manage the infrastructure yourself and the audio quality may be slightly lower. For production workloads with high volume or strict privacy requirements, Coqui is the more cost-effective choice.

Q: What are the legal considerations for voice cloning?

A: Cloning someone's voice without permission may be illegal. Obtain consent from the voice owner, and never use a cloned voice for impersonation or fraud. In Thailand, the Computer Crime Act covers misuse of personal data, and the PDPA protects biometric data, which includes voice. Always disclose that the audio is AI-generated.