Speech & TTS

Two adjacent problems:

  • ASR (Automatic Speech Recognition): audio → text.
  • TTS (Text-to-Speech): text → audio.

And increasingly, end-to-end voice agents that combine both with an LLM in real time.

ASR

Whisper

OpenAI’s Whisper (2022) was the breakthrough — open-weights, multi-lingual, robust to noise and accents. Still the de facto default in 2026.

import whisper
model = whisper.load_model("large-v3")
result = model.transcribe("audio.mp3", language="en")
print(result["text"])

Variants:

  • Whisper Large v3 / v3 Turbo: best quality; large-v3 is ~1.5B parameters, Turbo is a pruned, faster variant.
  • Whisper Tiny / Base / Small / Medium: progressively smaller, faster, less accurate.
  • Whisper.cpp: C++ implementation, runs on CPU and edge.
  • Faster-Whisper (CTranslate2-based): 4× faster GPU inference, same model.
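
A minimal faster-whisper sketch (assumes a CUDA GPU and the faster-whisper package; use device="cpu" and drop compute_type otherwise):

from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.mp3", beam_size=5)
for seg in segments:
    print(f"[{seg.start:.1f}s -> {seg.end:.1f}s] {seg.text}")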

Modern ASR contenders (2025+)

  • NVIDIA Parakeet: open, very accurate.
  • Distil-Whisper: distilled, faster.
  • AssemblyAI Universal-2: commercial, strong on real-world audio.
  • Deepgram Nova 3: commercial, low-latency.
  • AWS Transcribe, Azure Speech: enterprise options.

For most teams: Whisper Large v3 Turbo for batch, faster-whisper for serving, commercial APIs for speed-critical real-time.

Streaming ASR

Real-time transcription requires processing audio incrementally — Whisper is batch-mode by default. For streaming:

  • whisper-streaming: open-source streaming wrapper.
  • NVIDIA Riva / Deepgram / AssemblyAI: streaming-optimized.

Latency targets: sub-500 ms feels “real-time”; 1–3 s is acceptable for many use cases.
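
A naive chunked-streaming sketch with faster-whisper and sounddevice (assumptions: 16 kHz mono microphone input; real streaming systems use proper VAD, overlap handling, and incremental decoding instead of re-transcribing the whole buffer each pass):

import numpy as np
import sounddevice as sd
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
SAMPLE_RATE = 16000
CHUNK_SECONDS = 2.0                      # how much new audio to accumulate per pass
buffer = np.zeros(0, dtype=np.float32)

with sd.InputStream(samplerate=SAMPLE_RATE, channels=1, dtype="float32") as stream:
    while True:                          # Ctrl-C to stop
        audio, _ = stream.read(int(SAMPLE_RATE * CHUNK_SECONDS))
        buffer = np.concatenate([buffer, audio[:, 0]])
        # Re-transcribe the growing buffer and print the latest hypothesis.
        segments, _ = model.transcribe(buffer, language="en", beam_size=1)
        print(" ".join(seg.text.strip() for seg in segments), flush=True)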

Speaker diarization

“Who said what” — useful for meeting transcription:

  • pyannote.audio: open, popular.
  • NeMo Speaker Diarization (NVIDIA).
  • AssemblyAI, Deepgram: commercial integrated diarization.

Often combined with ASR: transcribe → align → diarize.
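
A minimal diarization sketch with pyannote.audio (assumes a Hugging Face token with access to the pretrained pipeline; the checkpoint name matches the 3.x release and may change):

from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1",
                                    use_auth_token="hf_...")
diarization = pipeline("meeting.wav")
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{speaker}: {turn.start:.1f}s - {turn.end:.1f}s")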

TTS

Older TTS sounded robotic (concatenative or HMM-based). Modern neural TTS sounds human.

Architectures

  • Tacotron / FastSpeech: predict mel-spectrograms; convert to audio with a vocoder.
  • VALL-E (Microsoft): autoregressive over neural-codec audio tokens.
  • Voicebox (Meta): non-autoregressive flow matching over audio.
  • Diffusion TTS: NaturalSpeech, etc.
  • Style-conditioned TTS: voice cloning, prosody control.

By 2026, the dominant pattern is discrete audio token prediction (codec-based) by an autoregressive transformer.

Frontier TTS (early 2026)

Closed

  • OpenAI TTS (tts-1-hd, voice modes via GPT-4o realtime).
  • ElevenLabs: industry-leading voice cloning and quality.
  • PlayHT, Resemble AI: commercial alternatives.
  • Google Cloud TTS: enterprise integration.
  • Anthropic voice (Claude): paired with their realtime API.

Open

  • Kokoro-82M: tiny, decent quality.
  • OpenVoice v2: voice cloning, multilingual.
  • F5-TTS: open, high quality.
  • CosyVoice 2 (Alibaba): strong multilingual.
  • Higgs Audio: open multimodal speech.
  • Sesame CSM: open conversational speech.

Voice cloning

Modern TTS can clone a voice from a few seconds of reference audio. For example, with the open XTTS v2 model via the Coqui TTS library (one concrete option; most commercial APIs expose a similar clone-then-synthesize flow):

from TTS.api import TTS
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
tts.tts_to_file(text="Hello world", speaker_wav="sample_30s.wav",
                language="en", file_path="cloned.wav")

Used for personalization, accessibility, audiobooks.

Ethical concerns:

  • Deepfakes; impersonation; fraud.
  • Consent: cloning a voice without permission.
  • Detection: SynthID-Audio, watermarking schemes.

Most reputable providers require consent confirmation for voice cloning.

Real-time voice agents

Combining ASR + LLM + TTS in real time:

User speaks → ASR (streaming) → LLM (streaming) → TTS (streaming) → user hears
                ↑___________________________________________________|
            (interrupt detection, turn-taking)

Critical for low latency:

  • Interruption handling (user starts speaking while AI is talking).
  • Streaming everywhere — don’t wait for full sentences.
  • VAD (voice activity detection) to know when the user is done.
  • Phrase-level TTS to start audio output as soon as a sentence is ready.
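
A minimal sketch of phrase-level chunking: buffer streamed LLM text deltas and hand each complete sentence to TTS the moment it ends (stream_tts and llm_token_stream are placeholders, not a specific library API):

import re

SENTENCE_END = re.compile(r"([.!?])\s")

def sentences_from_stream(token_stream):
    """Yield complete sentences as soon as they end, from a stream of text deltas."""
    buf = ""
    for token in token_stream:
        buf += token
        while (m := SENTENCE_END.search(buf)):
            yield buf[: m.end()].strip()     # complete sentence -> send to TTS now
            buf = buf[m.end():]
    if buf.strip():
        yield buf.strip()                    # flush whatever is left at the end

# for sentence in sentences_from_stream(llm_token_stream):
#     stream_tts(sentence)                   # hypothetical TTS call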

Native end-to-end voice models

The new frontier: models that go directly audio → audio, skipping text in the middle:

  • GPT-4o realtime / GPT-Realtime: native multimodal, ~300ms latency.
  • Gemini Live: similar.
  • Sesame CSM (Conversational Speech Model).

Benefits:

  • Lower latency.
  • Captures non-verbal cues (laughter, hesitation, tone).
  • Naturally handles interruptions.

Drawbacks:

  • Less interpretable than the text-mediated approach.
  • Training data scarce for many languages.

Hybrid systems still dominate production today; native-multimodal is gaining ground.

Audio codecs

For modern TTS / audio LLMs, the model operates on discrete audio tokens from a neural codec:

  • EnCodec (Meta): 1.5–24 kbps audio tokens.
  • SoundStream (Google).
  • DAC (Descript Audio Codec): high quality.
  • WavLM tokens (research).

The codec encodes audio to a sequence of integers; the LM predicts; the decoder reconstructs audio.

This is the same recipe as language modeling, applied to audio.
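
A sketch of the encode step with EnCodec, following the usage in Meta's encodec package (bandwidth and file names are arbitrary choices):

import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)                  # kbps; controls tokens per second

wav, sr = torchaudio.load("speech.wav")
wav = convert_audio(wav, sr, model.sample_rate, model.channels).unsqueeze(0)

with torch.no_grad():
    encoded_frames = model.encode(wav)
# Discrete codes of shape [batch, n_codebooks, time] — the “text” an audio LM predicts.
codes = torch.cat([frame[0] for frame in encoded_frames], dim=-1)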

Music generation

A close cousin:

  • Suno, Udio: commercial music generation.
  • MusicGen (Meta), MusicLM (Google): research.
  • Stable Audio (Stability): open.

Same diffusion / autoregressive paradigms as image/video. Quality has improved dramatically since 2022–2023.

Practical patterns

Chat with voice UX

For a customer-support voicebot:

  1. ASR streaming → text chunks.
  2. VAD detects end of user turn.
  3. Send accumulated text to LLM with conversation history.
  4. Stream LLM response → TTS → audio out.
  5. Detect user interruption; cut TTS, restart.

Off-the-shelf platforms: Vapi, Retell, Bland.ai, OpenAI Realtime API.
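
For step 2, a rough end-of-turn check with webrtcvad (assumes 16 kHz, 16-bit mono PCM split into 30 ms frames; production systems typically use model-based VAD or dedicated turn-detection):

import webrtcvad

vad = webrtcvad.Vad(2)                   # aggressiveness 0-3
SILENCE_FRAMES_TO_END_TURN = 25          # ~750 ms of silence ends the turn

def turn_is_over(frames, sample_rate=16000):
    """frames: iterable of 30 ms raw PCM byte chunks from the mic."""
    silent = 0
    for frame in frames:
        silent = 0 if vad.is_speech(frame, sample_rate) else silent + 1
        if silent >= SILENCE_FRAMES_TO_END_TURN:
            return True
    return False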

Meeting summarization

  1. Record audio (or pull from Zoom/Meet API).
  2. Whisper transcription with diarization.
  3. Send transcript to LLM with prompt for summary, action items.
  4. Output structured notes.

This is a “solved” problem in 2026 — many SaaS products do it well.
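
A minimal sketch of step 3, assuming the OpenAI Python client (any chat-completions-style API works; the model name is just an example):

from openai import OpenAI

client = OpenAI()
transcript = open("transcript.txt").read()       # diarized transcript from step 2
resp = client.chat.completions.create(
    model="gpt-4o-mini",                         # example model; swap for your own
    messages=[{"role": "user", "content":
               "Summarize this meeting. Return a short summary, decisions made, "
               "and action items with owners.\n\n" + transcript}],
)
print(resp.choices[0].message.content)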

Audiobook generation

  1. Take a book / blog → split by paragraph.
  2. TTS each paragraph (with consistent voice).
  3. Add intro/outro.
  4. Optionally character voices for dialogue.

ElevenLabs, OpenAI TTS, F5-TTS all do this well.
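
A rough sketch of steps 1–3 with XTTS v2 (via the Coqui TTS library) and pydub for concatenation; paths, pause lengths, and the narrator sample are placeholders:

from pathlib import Path
from pydub import AudioSegment
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
paragraphs = [p for p in Path("book.txt").read_text().split("\n\n") if p.strip()]

book = AudioSegment.silent(duration=500)                 # short lead-in
for i, para in enumerate(paragraphs):
    tts.tts_to_file(text=para, speaker_wav="narrator.wav",
                    language="en", file_path=f"para_{i}.wav")
    book += AudioSegment.from_wav(f"para_{i}.wav") + AudioSegment.silent(duration=400)

book.export("audiobook.mp3", format="mp3")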

Voice avatars

For digital humans / virtual presenters:

  • TTS for the voice.
  • Lip-sync model (Wav2Lip, EMO, HeyGen) for face animation.
  • An image/video generation model for the visual appearance.

End-to-end: HeyGen, Synthesia, D-ID (production); EMO, Hallo (open).

Evaluation

ASR:

  • Word Error Rate (WER): standard metric.
  • CER (character error rate) for some languages.
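
WER is easy to compute with the jiwer package, for example:

from jiwer import wer

reference  = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"
print(wer(reference, hypothesis))    # 2 substitutions / 9 reference words ≈ 0.22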

TTS:

  • MOS (Mean Opinion Score): human ratings of naturalness.
  • MUSHRA, CMOS: more sensitive comparative tests.

For voice agents:

  • End-to-end latency.
  • Interruption handling success rate.
  • Task completion rate.

Pitfalls

  • Whisper hallucination on silence: it sometimes invents “thank you for watching” on quiet audio. Use silence detection.
  • TTS prosody on unusual text: numbers, emails, code can be read awkwardly. Pre-process.
  • Latency hidden in chunking: real-time latency accumulates across many small buffering and chunking choices; profile end-to-end.
  • Missing language support: many “multilingual” models are weak on low-resource languages.
  • Background noise: real-world audio is noisy; make sure training and evaluation data reflect it.

See also