Speech & TTS
Two adjacent problems:
- ASR (Automatic Speech Recognition): audio → text.
- TTS (Text-to-Speech): text → audio.
And increasingly, end-to-end voice agents that combine both with an LLM in real time.
ASR
Whisper
OpenAI’s Whisper (2022) was the breakthrough — open-weights, multi-lingual, robust to noise and accents. Still the de facto default in 2026.
import whisper

model = whisper.load_model("large-v3")  # downloads the checkpoint on first use
result = model.transcribe("audio.mp3", language="en")
print(result["text"])
Variants:
- Whisper Large v3 / v3 Turbo: best quality; large-v3 has ~1.5B parameters (Turbo trades a little accuracy for much faster decoding).
- Whisper Tiny / Base / Small / Medium: progressively smaller, faster, less accurate.
- Whisper.cpp: C++ implementation, runs on CPU and edge.
- Faster-Whisper (CTranslate2-based): 4× faster GPU inference, same model.
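A minimal faster-whisper call (assumes a CUDA GPU; set device="cpu" and compute_type="int8" otherwise):

from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.mp3", language="en")
for seg in segments:  # a lazy generator; decoding happens as you iterate
    print(f"[{seg.start:.2f}s -> {seg.end:.2f}s] {seg.text}")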
Modern ASR contenders (2025+)
- NVIDIA Parakeet: open, very accurate.
- Distil-Whisper: distilled, faster.
- AssemblyAI Universal-2: commercial, strong on real-world audio.
- Deepgram Nova 3: commercial, low-latency.
- AWS Transcribe, Azure Speech: enterprise options.
For most teams: Whisper Large v3 Turbo for batch, faster-whisper for serving, commercial APIs for speed-critical real-time.
Streaming ASR
Real-time transcription requires processing audio incrementally — Whisper is batch-mode by default. For streaming:
- whisper-streaming: open-source streaming wrapper.
- NVIDIA Riva / Deepgram / AssemblyAI: streaming-optimized.
Latency targets: sub-500 ms reads as "real-time"; 1-3 s is acceptable for many use cases.
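A toy version of the local-agreement trick used by whisper-streaming: re-transcribe a growing buffer and commit only the word prefix that two consecutive hypotheses agree on. Here audio_chunks() is a hypothetical source of 16 kHz float32 mono chunks; faster-whisper accepts NumPy arrays directly.

import numpy as np
from faster_whisper import WhisperModel

model = WhisperModel("small", device="cpu", compute_type="int8")
buffer = np.zeros(0, dtype=np.float32)
prev_words: list[str] = []

for chunk in audio_chunks():  # hypothetical: yields ~1 s float32 chunks at 16 kHz
    buffer = np.concatenate([buffer, chunk])
    segments, _ = model.transcribe(buffer, language="en")
    words = " ".join(seg.text for seg in segments).split()
    # Commit the longest common prefix of the last two hypotheses.
    n = 0
    while n < min(len(words), len(prev_words)) and words[n] == prev_words[n]:
        n += 1
    prev_words = words
    print("stable:", " ".join(words[:n]))

Real implementations also trim committed audio from the buffer so each re-transcription stays cheap.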
Speaker diarization
“Who said what” — useful for meeting transcription:
- pyannote.audio: open, popular.
- NeMo Speaker Diarization (NVIDIA).
- AssemblyAI, Deepgram: commercial integrated diarization.
Often combined with ASR: transcribe → align → diarize.
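A minimal pyannote.audio run (the pipeline is gated on Hugging Face, so a token is required; the keyword is use_auth_token in 3.x, token in newer releases):

from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="hf_...",  # accept the model terms on Hugging Face first
)
diarization = pipeline("meeting.wav")
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s-{turn.end:.1f}s {speaker}")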
TTS
Older TTS sounded robotic (concatenative or HMM-based). Modern neural TTS sounds human.
Architectures
- Tacotron / FastSpeech: predict mel-spectrograms; convert to audio with a vocoder.
- VALL-E (Microsoft): autoregressive over discrete codec tokens; Voicebox (Meta): non-autoregressive flow matching.
- Diffusion TTS: NaturalSpeech, etc.
- Style-conditioned TTS: voice cloning, prosody control.
By 2026, the dominant pattern is discrete audio token prediction (codec-based) by an autoregressive transformer.
Frontier TTS (early 2026)
Closed
- OpenAI TTS (tts-1-hd; voice modes via the GPT-4o realtime API) (example after this list).
- ElevenLabs: industry-leading voice cloning and quality.
- PlayHT, Resemble AI: commercial alternatives.
- Google Cloud TTS: enterprise integration.
- Anthropic voice (Claude): paired with their realtime API.
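For example, the OpenAI endpoint (the streaming variant lets playback begin before synthesis finishes):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
with client.audio.speech.with_streaming_response.create(
    model="tts-1-hd",
    voice="alloy",
    input="Hello from a neural TTS model.",
) as response:
    response.stream_to_file("hello.mp3")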
Open
- Kokoro-82M: tiny, decent quality.
- OpenVoice v2: voice cloning, multilingual.
- F5-TTS: open, high quality.
- CosyVoice 2 (Alibaba): strong multilingual.
- Higgs Audio: open multimodal speech.
- Sesame CSM: open conversational speech.
Voice cloning
Modern TTS can clone a voice from a few seconds of reference audio:
# Illustrative pseudocode; exact cloning APIs differ by provider.
clone = tts.clone_voice("sample_30s.wav")
output = tts.synthesize("Hello world", voice=clone)
Used for personalization, accessibility, audiobooks.
Ethical concerns:
- Deepfakes; impersonation; fraud.
- Consent: cloning a voice without permission.
- Detection: SynthID-Audio, watermarking schemes.
Most reputable providers require consent confirmation for voice cloning.
Real-time voice agents
Combining ASR + LLM + TTS in real time:
User speaks → ASR (streaming) → LLM (streaming) → TTS (streaming) → user hears
↑___________________________________________________|
(interrupt detection, turn-taking)
Critical for low latency:
- Interruption handling (user starts speaking while AI is talking).
- Streaming everywhere — don’t wait for full sentences.
- VAD (voice activity detection) to know when the user is done.
- Phrase-level TTS to start audio output as soon as a sentence is ready.
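Sketched as an event loop. Every object below (mic, vad, asr, llm, tts_player) is a hypothetical placeholder, not a real library; the point is the control flow, especially barge-in.

# Hypothetical components; swap in your actual ASR/LLM/TTS clients.
while True:
    frame = mic.read_frame()                     # e.g. 20 ms of PCM
    if vad.is_speech(frame):
        if tts_player.is_playing():              # barge-in: user interrupted
            tts_player.stop()
            llm_stream.cancel()
        asr.feed(frame)
    elif asr.has_pending_speech() and vad.silence_ms() > 600:
        user_turn = asr.finalize()               # VAD decided the turn is over
        history.append(user_turn)
        llm_stream = llm.stream_reply(history)
        for sentence in split_sentences(llm_stream):  # phrase-level TTS
            tts_player.play(tts.synthesize(sentence))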
Native end-to-end voice models
The new frontier: models that go directly audio → audio, skipping text in the middle:
- GPT-4o realtime / GPT-Realtime: native multimodal, ~300ms latency.
- Gemini Live: similar.
- Sesame CSM (Conversational Speech Model).
Benefits:
- Lower latency.
- Captures non-verbal cues (laughter, hesitation, tone).
- Naturally handles interruptions.
Drawbacks:
- Less interpretable than text-mediated approach.
- Training data scarce for many languages.
Hybrid systems still dominate production today; native-multimodal is gaining ground.
Audio codecs
For modern TTS / audio LLMs, the model operates on discrete audio tokens from a neural codec:
- EnCodec (Meta): 1.5–24 kbps audio tokens.
- SoundStream (Google).
- DAC (Descript Audio Codec): high quality.
- Semantic tokens from self-supervised encoders such as WavLM (research).
The codec encodes audio to a sequence of integers; the LM predicts; the decoder reconstructs audio.
This is the same recipe as language modeling, applied to audio.
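A round trip through EnCodec via Hugging Face transformers makes the recipe concrete (model id facebook/encodec_24khz; random noise stands in for 1 s of 24 kHz mono audio):

import torch
from transformers import EncodecModel, AutoProcessor

model = EncodecModel.from_pretrained("facebook/encodec_24khz")
processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")

audio = torch.randn(24000).numpy()  # stand-in for 1 s of 24 kHz mono audio
inputs = processor(raw_audio=audio, sampling_rate=24000, return_tensors="pt")

# Encode: waveform -> discrete codebook indices (the "audio tokens" an LM predicts).
encoded = model.encode(inputs["input_values"], inputs["padding_mask"])
print(encoded.audio_codes.shape)

# Decode: tokens -> waveform.
reconstructed = model.decode(
    encoded.audio_codes, encoded.audio_scales, inputs["padding_mask"]
)[0]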
Music generation
A close cousin:
- Suno, Udio: commercial music generation.
- MusicGen (Meta), MusicLM (Google): research.
- Stable Audio (Stability): open.
Same diffusion / autoregressive paradigms as image/video generation. Quality has improved dramatically since 2022-2023.
Practical patterns
Chat with voice UX
For a customer-support voicebot:
- ASR streaming → text chunks.
- VAD detects end of user turn.
- Send accumulated text to LLM with conversation history.
- Stream LLM response → TTS → audio out.
- Detect user interruption; cut TTS, restart.
Off-the-shelf platforms: Vapi, Retell, Bland.ai, OpenAI Realtime API.
Meeting summarization
- Record audio (or pull from Zoom/Meet API).
- Whisper transcription with diarization.
- Send transcript to LLM with prompt for summary, action items.
- Output structured notes.
This is a “solved” problem in 2026 — many SaaS products do it well.
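A compressed version of that pipeline, assuming whisper plus the OpenAI chat API (diarization omitted for brevity; the model name is a placeholder for whatever you deploy):

import whisper
from openai import OpenAI

transcript = whisper.load_model("base").transcribe("meeting.mp3")["text"]

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[
        {"role": "system",
         "content": "Summarize this meeting transcript: key points, decisions, action items."},
        {"role": "user", "content": transcript},
    ],
)
print(response.choices[0].message.content)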
Audiobook generation
- Take a book / blog → split by paragraph.
- TTS each paragraph (with consistent voice).
- Add intro/outro.
- Optionally character voices for dialogue.
ElevenLabs, OpenAI TTS, F5-TTS all do this well.
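A naive batch loop over paragraphs, reusing the OpenAI TTS call shown earlier (one file per paragraph keeps retries cheap; stitch afterwards with ffmpeg or similar):

from openai import OpenAI

client = OpenAI()
paragraphs = [p for p in open("book.txt").read().split("\n\n") if p.strip()]

for i, para in enumerate(paragraphs):
    with client.audio.speech.with_streaming_response.create(
        model="tts-1-hd", voice="alloy", input=para  # one consistent voice
    ) as response:
        response.stream_to_file(f"part_{i:04d}.mp3")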
Voice avatars
For digital humans / virtual presenters:
- TTS for the voice.
- Lip-sync model (Wav2Lip, EMO, HeyGen) for face animation.
- Vision model for the visual.
End-to-end: HeyGen, Synthesia, D-ID (production); EMO, Hallo (open).
Evaluation
ASR:
- Word Error Rate (WER): standard metric.
- CER (character error rate) for some languages.
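WER is word-level edit distance (substitutions + insertions + deletions) divided by reference length; the jiwer package computes it directly:

from jiwer import wer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"
print(wer(reference, hypothesis))  # 2 substitutions / 9 words ≈ 0.22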
TTS:
- MOS (Mean Opinion Score): human ratings of naturalness.
- MUSHRA, CMOS: more sensitive comparative tests.
For voice agents:
- End-to-end latency.
- Interruption handling success rate.
- Task completion rate.
Pitfalls
- Whisper hallucination on silence: it sometimes invents phrases like "thank you for watching" on quiet audio. Filter silence out first (VAD sketch after this list).
- TTS prosody on unusual text: numbers, emails, and code can be read awkwardly. Pre-process with text normalization.
- Latency hidden in chunking: real-time latency accumulates across many small buffering and chunking choices; profile end-to-end.
- Missing language support: many "multilingual" models are weak on low-resource languages.
- Background noise: real-world audio is noisy; make sure training and evaluation data reflect it.
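For the first pitfall, a silence pre-filter with webrtcvad (expects 16-bit mono PCM in 10/20/30 ms frames; pass only the speech regions on to Whisper):

import webrtcvad

vad = webrtcvad.Vad(2)  # aggressiveness: 0 (permissive) to 3 (strict)
SAMPLE_RATE = 16000
FRAME_BYTES = SAMPLE_RATE * 30 // 1000 * 2  # 30 ms of 16-bit samples

def speech_frames(pcm: bytes):
    """Yield only the frames VAD classifies as speech."""
    for off in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        frame = pcm[off:off + FRAME_BYTES]
        if vad.is_speech(frame, SAMPLE_RATE):
            yield frame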
See also
- Stage 06 — Transformers
- Stage 11 — Agents — voice agents are agents
- Stage 14 — Applications