VITS TTS
The end-to-end TTS architecture that combines a variational autoencoder, normalizing flows, and adversarial training.
VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) was introduced by Jaehyeon Kim and collaborators in 2021 and became a foundational architecture for modern neural speech synthesis. Instead of the older two-stage pipeline (an acoustic model followed by a separate vocoder), it synthesizes audio in a single, parallel, end-to-end pass, combining a variational autoencoder with normalizing flows and GAN-style adversarial training to improve naturalness. At about 25M parameters and trained on roughly 585 hours of audio, it produces natural prosody at fast inference speeds and supports multiple speakers. It is a solid, free, general-purpose baseline and underpins many later models such as Piper and MeloTTS.
At a glance
- Developer
- Jaehyeon Kim et al.
- License
- MIT
- Tier
- free
- Speed
- fast
- Voice cloning
- No.
- Languages
- English, German, Spanish, French, Portuguese, Dutch, Finnish, Hungarian, Bulgarian, Japanese, Polish
- Max characters
- 2000
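Because input is capped at 2000 characters, longer texts have to be split before synthesis. The sketch below shows one possible way to do that, preferring sentence boundaries so prosody is not cut mid-sentence. The function name `split_for_tts` and the constant `MAX_CHARS` are hypothetical and not part of any VITS API; the sentence-splitting rule is a simple assumption that may need adjusting for languages with different punctuation.

```python
import re

# Assumption: limit taken from the "Max characters" entry above.
MAX_CHARS = 2000


def split_for_tts(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Split text into chunks of at most max_chars, preferring sentence ends.

    Hypothetical helper for illustration; not part of any VITS library.
    """
    # Split after ., ! or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # A single sentence longer than the limit is hard-split,
        # which may break it in the middle of a word.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # Adding this sentence would exceed the limit: start a new chunk.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be synthesized separately and the resulting audio concatenated in order.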
VITS voices
Best for
General-purpose text-to-speech with natural prosody