Speech & TTS

Two adjacent problems:

  • ASR (Automatic Speech Recognition): audio → text.
  • TTS (Text-to-Speech): text → audio.

And increasingly, end-to-end voice agents that combine both with an LLM in real time.

ASR

Whisper

OpenAI’s Whisper (2022) was the breakthrough — open-weights, multi-lingual, robust to noise and accents. Still the de facto default in 2026.

import whisper
model = whisper.load_model("large-v3")
result = model.transcribe("audio.mp3", language="en")
print(result["text"])

Variants:

  • Whisper Large v3 / v3 Turbo: best quality; large-v3 is ~1.5B parameters, Turbo is a pruned, faster variant.
  • Whisper Tiny / Base / Small / Medium: progressively smaller, faster, less accurate.
  • Whisper.cpp: C++ implementation, runs on CPU and edge.
  • Faster-Whisper (CTranslate2-based): 4× faster GPU inference, same model.
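
A minimal faster-whisper sketch (assumes a CUDA GPU and the faster-whisper package; use device="cpu" and drop compute_type otherwise):

from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.mp3", beam_size=5)
for seg in segments:
    print(f"[{seg.start:.1f}s -> {seg.end:.1f}s] {seg.text}")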

Modern ASR contenders (2025+)

  • NVIDIA Parakeet: open, very accurate.
  • Distil-Whisper: distilled, faster.
  • AssemblyAI Universal-2: commercial, strong on real-world audio.
  • Deepgram Nova 3: commercial, low-latency.
  • AWS Transcribe, Azure Speech: enterprise options.

For most teams: Whisper Large v3 Turbo for batch, faster-whisper for serving, commercial APIs for speed-critical real-time.

Streaming ASR

Real-time transcription requires processing audio incrementally — Whisper is batch-mode by default. For streaming:

  • whisper-streaming: open-source streaming wrapper.
  • NVIDIA Riva / Deepgram / AssemblyAI: streaming-optimized.

Latency targets: sub-500 ms feels “real-time”; 1–3 s is acceptable for many use cases.
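
A naive chunked-streaming sketch with faster-whisper and sounddevice (assumptions: 16 kHz mono microphone input; real streaming systems use proper VAD, overlap handling, and incremental decoding instead of re-transcribing the whole buffer each pass):

import numpy as np
import sounddevice as sd
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
SAMPLE_RATE = 16000
CHUNK_SECONDS = 2.0                      # how much new audio to accumulate per pass
buffer = np.zeros(0, dtype=np.float32)

with sd.InputStream(samplerate=SAMPLE_RATE, channels=1, dtype="float32") as stream:
    while True:                          # Ctrl-C to stop
        audio, _ = stream.read(int(SAMPLE_RATE * CHUNK_SECONDS))
        buffer = np.concatenate([buffer, audio[:, 0]])
        # Re-transcribe the growing buffer and print the latest hypothesis.
        segments, _ = model.transcribe(buffer, language="en", beam_size=1)
        print(" ".join(seg.text.strip() for seg in segments), flush=True)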

Speaker diarization

“Who said what” — useful for meeting transcription:

  • pyannote.audio: open, popular.
  • NeMo Speaker Diarization (NVIDIA).
  • AssemblyAI, Deepgram: commercial integrated diarization.

Often combined with ASR: transcribe → align → diarize.
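
A minimal diarization sketch with pyannote.audio (assumes a Hugging Face token with access to the pretrained pipeline; the checkpoint name matches the 3.x release and may change):

from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1",
                                    use_auth_token="hf_...")
diarization = pipeline("meeting.wav")
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{speaker}: {turn.start:.1f}s - {turn.end:.1f}s")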

TTS

Older TTS sounded robotic (concatenative or HMM-based). Modern neural TTS sounds human.

Architectures

  • Tacotron / FastSpeech: predict mel-spectrograms; convert to audio with a vocoder.
  • VALL-E (Microsoft): autoregressive over neural-codec audio tokens.
  • Voicebox (Meta): non-autoregressive flow matching over audio.
  • Diffusion TTS: NaturalSpeech, etc.
  • Style-conditioned TTS: voice cloning, prosody control.

By 2026, the dominant pattern is discrete audio token prediction (codec-based) by an autoregressive transformer.

Frontier TTS (early 2026)

Closed

  • OpenAI TTS (tts-1-hd, voice modes via GPT-4o realtime).
  • ElevenLabs: industry-leading voice cloning and quality.
  • PlayHT, Resemble AI: commercial alternatives.
  • Google Cloud TTS: enterprise integration.
  • Anthropic voice (Claude): paired with their realtime API.

Open

  • Kokoro-82M: tiny, decent quality.
  • OpenVoice v2: voice cloning, multilingual.
  • F5-TTS: open, high quality.
  • CosyVoice 2 (Alibaba): strong multilingual.
  • Higgs Audio: open multimodal speech.
  • Sesame CSM: open conversational speech.

Voice cloning

Modern TTS can clone a voice from a few seconds of reference audio. For example, with the open XTTS v2 model via the Coqui TTS library (one concrete option; most commercial APIs expose a similar clone-then-synthesize flow):

from TTS.api import TTS
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
tts.tts_to_file(text="Hello world", speaker_wav="sample_30s.wav",
                language="en", file_path="cloned.wav")

Used for personalization, accessibility, audiobooks.

Ethical concerns:

  • Deepfakes; impersonation; fraud.
  • Consent: cloning a voice without permission.
  • Detection: SynthID-Audio, watermarking schemes.

Most reputable providers require consent confirmation for voice cloning.

Real-time voice agents

Combining ASR + LLM + TTS in real time:

User speaks → ASR (streaming) → LLM (streaming) → TTS (streaming) → user hears
                ↑___________________________________________________|
            (interrupt detection, turn-taking)

Critical for low latency:

  • Interruption handling (user starts speaking while AI is talking).
  • Streaming everywhere — don’t wait for full sentences.
  • VAD (voice activity detection) to know when the user is done.
  • Phrase-level TTS to start audio output as soon as a sentence is ready.
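
A minimal sketch of phrase-level chunking: buffer streamed LLM text deltas and hand each complete sentence to TTS the moment it ends (stream_tts and llm_token_stream are placeholders, not a specific library API):

import re

SENTENCE_END = re.compile(r"([.!?])\s")

def sentences_from_stream(token_stream):
    """Yield complete sentences as soon as they end, from a stream of text deltas."""
    buf = ""
    for token in token_stream:
        buf += token
        while (m := SENTENCE_END.search(buf)):
            yield buf[: m.end()].strip()     # complete sentence -> send to TTS now
            buf = buf[m.end():]
    if buf.strip():
        yield buf.strip()                    # flush whatever is left at the end

# for sentence in sentences_from_stream(llm_token_stream):
#     stream_tts(sentence)                   # hypothetical TTS call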

Native end-to-end voice models

The new frontier: models that go directly audio → audio, skipping text in the middle:

  • GPT-4o realtime / GPT-Realtime: native multimodal, ~300ms latency.
  • Gemini Live: similar.
  • Sesame CSM (Conversational Speech Model).

Benefits:

  • Lower latency.
  • Captures non-verbal cues (laughter, hesitation, tone).
  • Naturally handles interruptions.

Drawbacks:

  • Less interpretable than the text-mediated approach.
  • Training data scarce for many languages.

Hybrid systems still dominate production today; native-multimodal is gaining ground.

Audio codecs

For modern TTS / audio LLMs, the model operates on discrete audio tokens from a neural codec:

  • EnCodec (Meta): 1.5–24 kbps audio tokens.
  • SoundStream (Google).
  • DAC (Descript Audio Codec): high quality.
  • WavLM tokens (research).

The codec encodes audio to a sequence of integers; the LM predicts; the decoder reconstructs audio.

This is the same recipe as language modeling, applied to audio.
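
A sketch of the encode step with EnCodec, following the usage in Meta's encodec package (bandwidth and file names are arbitrary choices):

import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)                  # kbps; controls tokens per second

wav, sr = torchaudio.load("speech.wav")
wav = convert_audio(wav, sr, model.sample_rate, model.channels).unsqueeze(0)

with torch.no_grad():
    encoded_frames = model.encode(wav)
# Discrete codes of shape [batch, n_codebooks, time] — the “text” an audio LM predicts.
codes = torch.cat([frame[0] for frame in encoded_frames], dim=-1)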

Music generation

A close cousin:

  • Suno, Udio: commercial music generation.
  • MusicGen (Meta), MusicLM (Google): research.
  • Stable Audio (Stability): open.

Same diffusion / autoregressive paradigms as image/video. Quality has improved dramatically since 2022–2023.

Practical patterns

Chat with voice UX

For a customer-support voicebot:

  1. ASR streaming → text chunks.
  2. VAD detects end of user turn.
  3. Send accumulated text to LLM with conversation history.
  4. Stream LLM response → TTS → audio out.
  5. Detect user interruption; cut TTS, restart.

Off-the-shelf platforms: Vapi, Retell, Bland.ai, OpenAI Realtime API.
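
For step 2, a rough end-of-turn check with webrtcvad (assumes 16 kHz, 16-bit mono PCM split into 30 ms frames; production systems typically use model-based VAD or dedicated turn-detection):

import webrtcvad

vad = webrtcvad.Vad(2)                   # aggressiveness 0-3
SILENCE_FRAMES_TO_END_TURN = 25          # ~750 ms of silence ends the turn

def turn_is_over(frames, sample_rate=16000):
    """frames: iterable of 30 ms raw PCM byte chunks from the mic."""
    silent = 0
    for frame in frames:
        silent = 0 if vad.is_speech(frame, sample_rate) else silent + 1
        if silent >= SILENCE_FRAMES_TO_END_TURN:
            return True
    return False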

Meeting summarization

  1. Record audio (or pull from Zoom/Meet API).
  2. Whisper transcription with diarization.
  3. Send transcript to LLM with prompt for summary, action items.
  4. Output structured notes.

This is a “solved” problem in 2026 — many SaaS products do it well.
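
A minimal sketch of step 3, assuming the OpenAI Python client (any chat-completions-style API works; the model name is just an example):

from openai import OpenAI

client = OpenAI()
transcript = open("transcript.txt").read()       # diarized transcript from step 2
resp = client.chat.completions.create(
    model="gpt-4o-mini",                         # example model; swap for your own
    messages=[{"role": "user", "content":
               "Summarize this meeting. Return a short summary, decisions made, "
               "and action items with owners.\n\n" + transcript}],
)
print(resp.choices[0].message.content)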

Audiobook generation

  1. Take a book / blog → split by paragraph.
  2. TTS each paragraph (with consistent voice).
  3. Add intro/outro.
  4. Optionally character voices for dialogue.

ElevenLabs, OpenAI TTS, F5-TTS all do this well.
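
A rough sketch of steps 1–3 with XTTS v2 (via the Coqui TTS library) and pydub for concatenation; paths, pause lengths, and the narrator sample are placeholders:

from pathlib import Path
from pydub import AudioSegment
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
paragraphs = [p for p in Path("book.txt").read_text().split("\n\n") if p.strip()]

book = AudioSegment.silent(duration=500)                 # short lead-in
for i, para in enumerate(paragraphs):
    tts.tts_to_file(text=para, speaker_wav="narrator.wav",
                    language="en", file_path=f"para_{i}.wav")
    book += AudioSegment.from_wav(f"para_{i}.wav") + AudioSegment.silent(duration=400)

book.export("audiobook.mp3", format="mp3")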

Voice avatars

For digital humans / virtual presenters:

  • TTS for the voice.
  • Lip-sync model (Wav2Lip, EMO, HeyGen) for face animation.
  • An image/video generation model for the visual appearance.

End-to-end: HeyGen, Synthesia, D-ID (production); EMO, Hallo (open).

Evaluation

ASR:

  • Word Error Rate (WER): standard metric.
  • CER (character error rate) for some languages.
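
WER is easy to compute with the jiwer package, for example:

from jiwer import wer

reference  = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"
print(wer(reference, hypothesis))    # 2 substitutions / 9 reference words ≈ 0.22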

TTS:

  • MOS (Mean Opinion Score): human ratings of naturalness.
  • MUSHRA, CMOS: more sensitive comparative tests.

For voice agents:

  • End-to-end latency.
  • Interruption handling success rate.
  • Task completion rate.

Pitfalls

  • Whisper hallucination on silence: it sometimes invents “thank you for watching” on quiet audio. Use silence detection.
  • TTS prosody on unusual text: numbers, emails, code can be read awkwardly. Pre-process.
  • Latency hidden in chunking: real-time latency accumulates across many small buffering and chunking choices; profile end-to-end.
  • Missing language support: many “multilingual” models are weak on low-resource languages.
  • Background noise: real-world audio is noisy; make sure training and evaluation data reflect it.

See also