VibeVoice TTS
Microsoft's multi-speaker long-form model that generates up to 90 minutes with 4 distinct speakers.
VibeVoice from Microsoft is built for long-form, multi-speaker audio. Its 1.5B model can generate up to 90 minutes of speech with as many as 4 distinct speakers, using speaker tags to drive multi-turn dialogue, which makes it a strong fit for podcasts, audiobooks, and conversations that need speaker consistency across long passages. A separate Realtime 0.5B variant reaches roughly 300 ms latency for interactive use. On TTS.ai it covers English and Chinese and accepts up to 50,000 characters per request, so an entire episode can be scripted in one pass.
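As a rough illustration of scripting a multi-speaker episode within these limits, the sketch below assembles dialogue turns into a single tagged script and checks it against the 4-speaker and 50,000-character limits stated above. The `Speaker N:` tag format and the `build_script` helper are assumptions made for illustration; they are not confirmed by this page, so check the TTS.ai documentation for the exact tag syntax before relying on it.

```python
MAX_CHARS = 50_000   # per-request character limit stated on this page
MAX_SPEAKERS = 4     # speaker limit stated on this page


def build_script(turns):
    """Join (speaker_number, text) pairs into one tagged script.

    The "Speaker N:" tag format is an assumption for illustration only.
    """
    speakers = {speaker for speaker, _ in turns}
    if len(speakers) > MAX_SPEAKERS:
        raise ValueError(f"{len(speakers)} speakers exceeds the limit of {MAX_SPEAKERS}")

    # One line per turn, each prefixed with its (hypothetical) speaker tag.
    script = "\n".join(f"Speaker {speaker}: {text.strip()}" for speaker, text in turns)

    if len(script) > MAX_CHARS:
        raise ValueError(f"script is {len(script)} characters; the limit is {MAX_CHARS}")
    return script


if __name__ == "__main__":
    episode = [
        (1, "Welcome back to the show."),
        (2, "Thanks for having me."),
        (1, "Let's get started."),
    ]
    print(build_script(episode))
```

Validating the script locally before submitting it avoids sending a request that would exceed the stated limits.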
At a glance
- Developer: Microsoft
- License: MIT
- Tier: Standard
- Speed: Fast
- Voice cloning: No
- Languages: English, Chinese
- Max characters: 50,000
VibeVoice voices
Best for
Podcasts, dialogues, long-form narration, multi-speaker content