VITS TTS

The end-to-end TTS architecture that combines a variational autoencoder, normalizing flows, and adversarial training.

VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) was introduced by Jaehyeon Kim and collaborators in 2021 and became a foundational architecture for modern neural speech synthesis. Instead of the older two-stage pipeline, it synthesizes audio in a single parallel, end-to-end pass, pairing a variational autoencoder with normalizing flows and GAN-style adversarial training to improve naturalness. At about 25M parameters and trained on ~585 hours of audio, it produces natural prosody at fast inference speeds and supports multiple speakers. It is a solid, free, general-purpose baseline and underpins many later models such as Piper and MeloTTS.
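Normalizing flows depend on transforms that can be inverted exactly. As an illustration only, the sketch below shows a single additive coupling step in plain Python. It is not taken from the VITS code base: `shift_net` is a hypothetical stand-in for a learned network, and the function names are assumptions made for this example.

```python
import math


def shift_net(xa):
    # Hypothetical stand-in for a learned network. It can be any function,
    # because the coupling step never needs to invert it.
    return [0.5 * math.tanh(v) + 0.1 * i for i, v in enumerate(xa)]


def coupling_forward(x):
    """Shift the second half of x by a function of the first half."""
    if len(x) % 2 != 0:
        raise ValueError("this sketch expects an even-length input")
    half = len(x) // 2
    xa, xb = x[:half], x[half:]
    shift = shift_net(xa)
    # The first half passes through unchanged; only the second half moves.
    return xa + [b + s for b, s in zip(xb, shift)]


def coupling_inverse(y):
    """Undo coupling_forward by recomputing the same shift and subtracting it."""
    if len(y) % 2 != 0:
        raise ValueError("this sketch expects an even-length input")
    half = len(y) // 2
    ya, yb = y[:half], y[half:]
    shift = shift_net(ya)  # ya equals the original xa, so the shift matches
    return ya + [b - s for b, s in zip(yb, shift)]


if __name__ == "__main__":
    x = [0.3, -1.2, 0.8, 2.0]
    y = coupling_forward(x)
    print(coupling_inverse(y))  # recovers x up to floating-point rounding
```

Because the first half is passed through untouched, the inverse can always recompute the shift, which is what makes a stack of such steps usable as a flow.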

At a glance

Developer: Jaehyeon Kim et al.
License: MIT
Tier: free
Speed: fast
Voice cloning: No
Languages: English, German, Spanish, French, Portuguese, Dutch, Finnish, Hungarian, Bulgarian, Japanese, Polish
Max characters: 2000
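Text longer than the 2000-character limit has to be split before synthesis. The following is a minimal sketch of one way to do that, breaking at sentence boundaries where possible. The helper name `split_for_tts` and the splitting strategy are assumptions for illustration; this is not part of any TTS.ai or VITS API.

```python
import re

MAX_CHARS = 2000  # the "Max characters" value listed for VITS


def split_for_tts(text, max_chars=MAX_CHARS):
    """Split text into chunks of at most max_chars, preferring sentence ends."""
    # Break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # A single sentence longer than the limit is cut at a hard boundary.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # Adding this sentence would overflow, so start a new chunk.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = "Hello world. " * 300
    for i, chunk in enumerate(split_for_tts(sample)):
        print(i, len(chunk))
```

Each chunk can then be submitted as a separate synthesis request and the resulting audio concatenated in order.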

VITS AI Voices

CSS10 (Dutch): Dutch, Free, Neutral
CSS10 (Finnish): Finnish, Free, Neutral
CSS10 (French): French, Free, Neutral
CSS10 (German): German, Free, Neutral
CSS10 (Hungarian): Hungarian, Free, Neutral
CSS10 (Spanish): Spanish, Free, Neutral
Common Voice (Bulgarian): Bulgarian, Free, Neutral
Common Voice (Portuguese): Portuguese, Free, Neutral
Default: English, Free, Neutral
MAI (Polish): Polish, Free, Female
MAI (Ukrainian): Ukrainian, Free, Neutral

Best for

General-purpose text-to-speech with natural prosody

VITS TTS — FAQ

VITS stands for Variational Inference with adversarial learning for end-to-end Text-to-Speech. It generates audio in a single parallel pass using a variational autoencoder, normalizing flows, and adversarial (GAN) training, rather than a two-stage pipeline.

VITS can be used commercially: it is MIT-licensed and available in the free tier.

On TTS.ai, VITS covers 11 languages including English, German, Spanish, French, Portuguese, Dutch, Finnish, Hungarian, Bulgarian, Japanese, and Polish, with multi-speaker support. It does not do voice cloning.