VibeVoice

VibeVoice TTS

Microsoft's multi-speaker, long-form model that generates up to 90 minutes of speech with up to 4 distinct speakers.

VibeVoice from Microsoft is built for long-form, multi-speaker audio. Its 1.5B model can generate up to 90 minutes of speech with as many as 4 distinct speakers, using speaker tags to drive multi-turn dialogue. That makes it a strong fit for podcasts, audiobooks, and conversations that need speaker consistency across long passages. A separate Realtime 0.5B variant reaches roughly 300 ms latency for interactive use. On TTS.ai it covers English and Chinese and accepts up to 50,000 characters per request, so an entire episode can be scripted in one pass.
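
As a rough illustration of the speaker-tag workflow, the sketch below assembles a multi-turn script and checks it against the limits stated above (4 speakers, 50,000 characters). The `Speaker N: text` line format and the helper name `build_script` are assumptions made for illustration; they are not a documented TTS.ai input format.

```python
MAX_SPEAKERS = 4      # speaker limit stated for VibeVoice
MAX_CHARS = 50_000    # per-request character limit stated for TTS.ai


def build_script(turns):
    """Join (speaker_number, text) turns into one speaker-tagged script.

    The "Speaker N: text" line format is an assumed convention.
    """
    lines = []
    for speaker, text in turns:
        if not 1 <= speaker <= MAX_SPEAKERS:
            raise ValueError(f"speaker must be 1..{MAX_SPEAKERS}, got {speaker}")
        lines.append(f"Speaker {speaker}: {text.strip()}")
    script = "\n".join(lines)
    if len(script) > MAX_CHARS:
        raise ValueError(f"script is {len(script)} characters; limit is {MAX_CHARS}")
    return script


script = build_script([
    (1, "Welcome back to the show."),
    (2, "Thanks, glad to be here."),
    (1, "Let's get started."),
])
print(script)
```

Validating before submission keeps a rejected request from costing a round trip; the exact tag syntax should be confirmed against the service's own documentation.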

At a glance

Developer
Microsoft
License
MIT
Tier
standard
Speed
fast
Voice cloning
No
Languages
English, Chinese
Max characters
50,000

VibeVoice voices

Speaker 1

English
Standard Neutral

Speaker 1 (Chinese)

Chinese
Standard Neutral

Speaker 2

English
Standard Neutral

Speaker 2 (Chinese)

Chinese
Standard Neutral

Speaker 3

English
Standard Neutral

Speaker 4

English
Standard Neutral

Best for

Podcasts, dialogues, long-form narration, multi-speaker content

VibeVoice TTS FAQ

VibeVoice supports up to 4 distinct speakers and up to 90 minutes of continuous output, with speaker tags for multi-turn dialogue — built for podcasts and long-form narration. It accepts up to 50,000 characters per request.

Yes, a low-latency option exists: alongside the 1.5B long-form model, a Realtime 0.5B variant achieves roughly 300 ms latency for interactive use.

VibeVoice is MIT-licensed. It supports English and Chinese and does not currently support voice cloning.
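
For scripts longer than the 50,000-character per-request limit mentioned above, one possible approach is to split the text at line boundaries so that no dialogue turn is cut in half. The following is a minimal sketch under that assumption; `split_script` is a hypothetical helper, and whether speaker voices stay consistent across separate requests is not something this page confirms.

```python
MAX_CHARS = 50_000  # per-request character limit stated above


def split_script(lines, limit=MAX_CHARS):
    """Group script lines into chunks whose joined length stays within limit."""
    chunks, current, size = [], [], 0
    for line in lines:
        if len(line) > limit:
            raise ValueError("a single line exceeds the per-request limit")
        # +1 accounts for the newline joining this line to the previous one
        added = len(line) + (1 if current else 0)
        if size + added > limit:
            # Close the current chunk and start a new one with this line
            chunks.append("\n".join(current))
            current, size = [line], len(line)
        else:
            current.append(line)
            size += added
    if current:
        chunks.append("\n".join(current))
    return chunks


# Twelve long turns alternating between two speakers (synthetic filler text)
lines = [f"Speaker {i % 2 + 1}: " + "word " * 2000 for i in range(12)]
chunks = split_script(lines)
print([len(c) for c in chunks])
```

Each chunk can then be submitted as its own request and the resulting audio concatenated in order.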