IndexTTS-2

IndexTTS-2 TTS

A zero-shot TTS model with fine-grained emotion control via emotion vectors that requires no emotion-specific training data.

IndexTTS-2, from the Index Team, is an expressive text-to-speech system that pairs zero-shot voice synthesis with precise emotional control. Rather than relying on emotion-labeled training data, it uses emotion vectors to dial in tones like happy, sad, angry, or fearful independently of the voice itself. Built on a Qwen2 backbone with BigVGAN as the vocoder, it supports English and Chinese and can clone a voice from roughly five seconds of reference audio. It suits audiobooks, virtual assistants, and any content where the same voice needs to shift emotional register. Its weights are released under the Bilibili Model License, which permits commercial use for products below defined user and revenue thresholds.
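To make the idea of emotion control that is independent of voice identity concrete, here is a minimal sketch in Python. It is not the IndexTTS-2 API: the emotion ordering, the 0 to 1 weight range, and the names `EMOTIONS`, `make_emotion_vector`, and `build_request` are all assumptions made up for illustration, and only the four emotions named on this page are included.

```python
# Hypothetical illustration only: the vector layout and request fields below
# are assumptions, not the real IndexTTS-2 interface.
EMOTIONS = ("happy", "sad", "angry", "fearful")  # assumed ordering


def make_emotion_vector(**weights: float) -> list[float]:
    """Build a fixed-order emotion vector from named weights in [0, 1]."""
    unknown = set(weights) - set(EMOTIONS)
    if unknown:
        raise ValueError(f"unknown emotions: {sorted(unknown)}")
    vector = [float(weights.get(name, 0.0)) for name in EMOTIONS]
    if any(w < 0.0 or w > 1.0 for w in vector):
        raise ValueError("emotion weights must lie in [0, 1]")
    return vector


def build_request(text: str, reference_audio: str, **weights: float) -> dict:
    """Bundle text, a voice reference, and an emotion vector (hypothetical shape)."""
    return {
        "text": text,
        "reference_audio": reference_audio,  # voice identity comes from here
        "emotion_vector": make_emotion_vector(**weights),  # tone comes from here
    }


# Same voice reference, two different emotional registers.
calm_fear = build_request("Who is there?", "speaker.wav", fearful=0.3)
strong_fear = build_request("Who is there?", "speaker.wav", fearful=0.9)
```

The point of the sketch is the separation of inputs: the reference audio stays the same while only the vector changes, which mirrors how the page describes emotion being set independently of the voice.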

At a glance

Developer: Index Team
License: Bilibili Model License
Tier: standard
Speed: medium
Voice cloning: Yes
Languages: English, Chinese
Max characters: 1000
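Because the page lists a 1000-character maximum, longer texts such as audiobook chapters would need to be split before synthesis. The following is a minimal sketch of one way to do that, assuming the limit applies per request and is counted in Python string characters. The function name `split_for_tts` is hypothetical and not part of any IndexTTS-2 tooling.

```python
import re

MAX_CHARS = 1000  # limit listed on this page; assumed to apply per request


def split_for_tts(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Split text into chunks of at most max_chars, preferring sentence ends."""
    # Split after Latin sentence punctuation followed by whitespace, or
    # directly after CJK sentence punctuation (which has no trailing space).
    sentences = re.split(r"(?<=[.!?])\s+|(?<=[。！？])", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        # A single sentence longer than the limit is cut at the limit.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        # Sentences are re-joined with a single space.
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


chunks = split_for_tts("Hello there. How are you?", max_chars=15)
# chunks == ["Hello there.", "How are you?"]
```

Splitting at sentence boundaries rather than at a fixed offset is a design choice aimed at keeping each request a natural unit of speech; the hard cut is only a fallback for sentences that exceed the limit on their own.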

IndexTTS-2 voices

Chinese Default

Chinese
Standard Neutral

Default

English
Standard Neutral

Best for

Emotionally expressive content, audiobooks, virtual assistants

IndexTTS-2 TTS FAQ

IndexTTS-2 uses emotion vectors that let you specify tones such as happy, sad, angry, or fearful without emotion-specific training data. Emotional expression is controlled independently of the voice identity.

IndexTTS-2 supports voice cloning. It performs zero-shot cloning from a short reference clip, typically around five seconds of audio, in English or Chinese.

The IndexTTS-2 weights are released under the Bilibili Model License, which allows commercial use for products below defined user and revenue thresholds. Larger deployments should review the license terms.