VITS TTS
The end-to-end TTS architecture that combines a variational autoencoder, normalizing flows, and adversarial training.
VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) was introduced by Jaehyeon Kim and collaborators in 2021 and became a foundational architecture for modern neural speech synthesis. Instead of the older two-stage pipeline, in which an acoustic model predicts a spectrogram and a separate vocoder turns it into a waveform, VITS generates audio directly from text in a single parallel pass. It pairs a variational autoencoder with normalizing flows and uses GAN-style adversarial training to improve naturalness. At about 25M parameters and trained on ~585 hours of speech, it produces natural prosody at fast inference speeds and supports multiple speakers. It is a solid, free, general-purpose baseline and underpins many later models such as Piper and MeloTTS.
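To make the "normalizing flows" ingredient more concrete, here is a minimal sketch of one common flow building block, a shift-only (additive) coupling layer, in plain Python. It is an illustration of the general idea and not the VITS implementation: the `toy_shift` function is a hypothetical stand-in for a learned neural network, and all function names here are made up for this example.

```python
import math


def toy_shift(xa):
    """Hypothetical stand-in for a learned network (an assumption for this sketch).

    It can be any function of xa, because the coupling layer never inverts it.
    """
    s = sum(xa)
    return [math.tanh(s + i) for i in range(len(xa))]


def coupling_forward(x):
    """Keep the first half unchanged; shift the second half by a function of the first."""
    if len(x) % 2 != 0:
        raise ValueError("this sketch expects an even-length vector")
    half = len(x) // 2
    xa, xb = x[:half], x[half:]
    shift = toy_shift(xa)
    return xa + [b + m for b, m in zip(xb, shift)]


def coupling_inverse(y):
    """Undo coupling_forward exactly: recompute the same shift and subtract it."""
    if len(y) % 2 != 0:
        raise ValueError("this sketch expects an even-length vector")
    half = len(y) // 2
    ya, yb = y[:half], y[half:]
    shift = toy_shift(ya)
    return ya + [b - m for b, m in zip(yb, shift)]


x = [0.5, -1.2, 2.0, 0.3]
y = coupling_forward(x)
x_back = coupling_inverse(y)
# Round-trip error should be zero up to floating-point rounding.
print(max(abs(a - b) for a, b in zip(x, x_back)))
```

Because the first half passes through untouched, the inverse can recompute the same shift and subtract it, so the transform is invertible no matter how complex the shift function is. A shift-only layer also leaves volume unchanged (its Jacobian determinant is one), which keeps the likelihood computation simple.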
At a glance
- Developer: Jaehyeon Kim et al.
- License: MIT
- Tier: Free
- Speed: Fast
- Voice cloning: No
- Languages: English, German, Spanish, French, Portuguese, Dutch, Finnish, Hungarian, Bulgarian, Japanese, Polish
- Max characters: 2000
VITS voices
Best for
General-purpose text-to-speech with natural prosody