VibeVoice TTS

Microsoft's multi-speaker long-form model that generates up to 90 minutes of speech with up to 4 distinct speakers.

VibeVoice from Microsoft is built for long-form, multi-speaker audio. Its 1.5B model can generate up to 90 minutes of speech with as many as 4 distinct speakers, using speaker tags to drive multi-turn dialogue. That makes it a strong fit for podcasts, audiobooks, and conversations that need speaker consistency across long passages. A separate Realtime 0.5B variant reaches roughly 300 ms latency for interactive use. On TTS.ai it covers English and Chinese and accepts up to 50,000 characters per request, so an entire episode can be scripted in one pass.
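To make the speaker-tag idea concrete, here is a minimal sketch of assembling a multi-turn script and checking it against the limits quoted above (4 speakers, 50,000 characters). The `Speaker N:` line format is an assumption based on the voice names listed on this page, and `build_script` is a hypothetical helper, not part of any TTS.ai or VibeVoice API. Check the service's documentation for the exact tag syntax it expects.

```python
MAX_SPEAKERS = 4      # speaker limit stated on this page
MAX_CHARS = 50_000    # per-request character limit stated on this page


def build_script(turns):
    """Turn a list of (speaker_number, text) pairs into a tagged script.

    The "Speaker N: text" tag format is an assumption for illustration.
    """
    for speaker, _ in turns:
        if not isinstance(speaker, int) or not 1 <= speaker <= MAX_SPEAKERS:
            raise ValueError(f"speaker must be an integer from 1 to {MAX_SPEAKERS}, got {speaker!r}")

    # One tagged line per dialogue turn, in the order given.
    script = "\n".join(f"Speaker {speaker}: {text.strip()}" for speaker, text in turns)

    if len(script) > MAX_CHARS:
        raise ValueError(f"script is {len(script)} characters; the limit is {MAX_CHARS}")
    return script


if __name__ == "__main__":
    print(build_script([
        (1, "Welcome back to the show."),
        (2, "Glad to be here."),
        (1, "Let's get started."),
    ]))
```

Validating before submission keeps an over-long or mis-tagged script from being rejected after a full episode has already been drafted.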

At a glance

Developer: Microsoft
License: MIT
Tier: standard
Speed: fast
Voice cloning: No
Languages: English, Chinese
Max characters: 50,000

VibeVoice AI Voices

Speaker 1: English, Standard, Neutral
Speaker 1 (Chinese): Chinese, Standard, Neutral
Speaker 2: English, Standard, Neutral
Speaker 2 (Chinese): Chinese, Standard, Neutral
Speaker 3: English, Standard, Neutral
Speaker 4: English, Standard, Neutral

Best for

Podcasts, dialogues, long-form narration, multi-speaker content

VibeVoice TTS — FAQ

VibeVoice supports up to 4 distinct speakers and up to 90 minutes of continuous output, with speaker tags for multi-turn dialogue. It is built for podcasts and long-form narration, and accepts up to 50,000 characters per request.

VibeVoice can also be used in real time: alongside the 1.5B long-form model, a Realtime 0.5B variant achieves roughly 300 ms latency for interactive use.

VibeVoice is MIT-licensed. It supports English and Chinese and does not currently support voice cloning.
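For scripts that exceed the 50,000-character request limit, one option is to split the text at turn boundaries so that no speaker line is cut in half. The sketch below assumes one dialogue turn per line; `split_script` is a hypothetical helper, and this page does not state whether voices stay consistent across separate requests, so treat chunking as a fallback rather than a recommended workflow.

```python
MAX_CHARS = 50_000  # per-request character limit stated on this page


def split_script(script, max_chars=MAX_CHARS):
    """Split a script into chunks of at most max_chars, breaking only between lines."""
    chunks = []
    current = []       # lines collected for the chunk being built
    current_len = 0    # length of "\n".join(current)

    for line in script.splitlines():
        if len(line) > max_chars:
            raise ValueError("a single line exceeds the character limit")

        # Joining to an existing chunk costs one extra newline character.
        added = len(line) + (1 if current else 0)
        if current_len + added > max_chars:
            chunks.append("\n".join(current))
            current, current_len = [line], len(line)
        else:
            current.append(line)
            current_len += added

    if current:
        chunks.append("\n".join(current))
    return chunks


if __name__ == "__main__":
    demo = "\n".join(f"Speaker {1 + i % 2}: line {i}" for i in range(6))
    for n, chunk in enumerate(split_script(demo, max_chars=40), start=1):
        print(f"--- request {n} ({len(chunk)} chars) ---")
        print(chunk)
```

Each returned chunk can then be submitted as its own request, and the resulting audio files concatenated in order.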