VITS TTS
The end-to-end TTS architecture that combines a variational autoencoder, normalizing flows, and adversarial training.
VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) was introduced by Jaehyeon Kim and collaborators in 2021 and became a foundational architecture for modern neural speech. Instead of the older two-stage pipeline, in which an acoustic model predicts a spectrogram and a separate vocoder turns it into audio, VITS synthesizes the waveform in a single, parallel, end-to-end pass. It pairs a variational autoencoder with normalizing flows and uses GAN-style adversarial training to improve naturalness. At about 25M parameters and trained on ~585 hours of speech, it produces natural prosody at fast inference speeds and supports multiple speakers. It is a solid, free, general-purpose baseline and underpins many later models such as Piper and MeloTTS.
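To make the "normalizing flows" part more concrete, here is a toy sketch of an additive coupling layer, one common invertible building block for flows. It is not VITS's code: the `shift_net` function is a hypothetical stand-in for a learned neural network, and real models operate on latent sequences rather than short lists. The point it illustrates is that the transform can be undone exactly, which is what lets a flow map between a simple distribution and a more expressive one.

```python
from typing import Callable, List

Vector = List[float]


def shift_net(xa: Vector) -> Vector:
    """Hypothetical stand-in for a learned network; any function of xa works."""
    return [0.5 * v * v + 1.0 for v in xa]


def coupling_forward(x: Vector, shift: Callable[[Vector], Vector] = shift_net) -> Vector:
    """Leave the first half unchanged; shift the second half by a function of the first."""
    if len(x) % 2 != 0:
        raise ValueError("this toy layer expects an even-length vector")
    half = len(x) // 2
    xa, xb = x[:half], x[half:]
    m = shift(xa)
    return xa + [b + s for b, s in zip(xb, m)]


def coupling_inverse(y: Vector, shift: Callable[[Vector], Vector] = shift_net) -> Vector:
    """Recompute the same shift from the untouched half and subtract it."""
    if len(y) % 2 != 0:
        raise ValueError("this toy layer expects an even-length vector")
    half = len(y) // 2
    ya, yb = y[:half], y[half:]
    m = shift(ya)
    return ya + [b - s for b, s in zip(yb, m)]


x = [1.0, 2.0, 3.0, 4.0]
y = coupling_forward(x)
x_back = coupling_inverse(y)
```

Because the first half passes through unchanged, the inverse can always recompute the shift, no matter how complicated `shift_net` is. An additive shift like this also leaves volume unchanged, so its Jacobian determinant is one.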
At a glance
- Developer: Jaehyeon Kim et al.
- License: MIT
- Tier: free
- Speed: fast
- Voice cloning: No
- Languages: English, German, Spanish, French, Portuguese, Dutch, Finnish, Hungarian, Bulgarian, Japanese, Polish
- Max characters: 2000
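Inputs longer than the 2000-character limit listed above need to be split before synthesis. The sketch below shows one way to do that in Python, breaking on sentence boundaries where possible. It assumes the limit applies per request and counts plain characters; `chunk_text` and `MAX_CHARS` are illustrative names, not part of any VITS API.

```python
import re
from typing import List

MAX_CHARS = 2000  # assumed per-request limit, taken from the listing above


def chunk_text(text: str, max_chars: int = MAX_CHARS) -> List[str]:
    """Split text into chunks of at most max_chars, preferring sentence boundaries."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        # A single sentence longer than the limit is hard-split as a fallback.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


demo = chunk_text("One. Two. Three.", max_chars=9)
```

Each chunk can then be synthesized separately and the resulting audio concatenated. Splitting on sentence boundaries keeps each chunk's prosody self-contained, which tends to sound better than cutting mid-sentence.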
Best for
General-purpose text-to-speech with natural prosody