IndexTTS-2

IndexTTS-2 TTS

A zero-shot TTS model with fine-grained emotion control via emotion vectors, no emotion-specific training data required.

IndexTTS-2, from the Index Team, is an expressive text-to-speech system that pairs zero-shot voice synthesis with precise emotional control. Rather than relying on emotion-labeled training data, it uses emotion vectors to dial in tones like happy, sad, angry, or fearful independently of the voice itself. Built on a Qwen2 backbone with BigVGAN as the vocoder, it supports English and Chinese and can clone a voice from roughly five seconds of reference audio. It suits audiobooks, virtual assistants, and any content where the same voice needs to shift emotional register. Its weights are released under the Bilibili Model License, which permits commercial use for products below defined user and revenue thresholds.
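To illustrate how emotion and voice identity are kept separate, here is a minimal sketch in Python. It is not the IndexTTS-2 API: the names `EMOTIONS`, `emotion_vector`, and `SynthesisRequest` are hypothetical, and the four-slot layout simply mirrors the tones named above. The real model's emotion set, ordering, and value range may differ.

```python
from dataclasses import dataclass

# Hypothetical layout: only the four tones named in the description.
# The actual model may use a different set and ordering.
EMOTIONS = ("happy", "sad", "angry", "fearful")


def emotion_vector(**weights: float) -> list[float]:
    """Build a per-emotion weight list from keyword arguments (assumed range 0.0 to 1.0)."""
    unknown = set(weights) - set(EMOTIONS)
    if unknown:
        raise ValueError(f"unknown emotions: {sorted(unknown)}")
    for name, value in weights.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0")
    # Emotions that are not mentioned default to 0.0.
    return [float(weights.get(name, 0.0)) for name in EMOTIONS]


@dataclass(frozen=True)
class SynthesisRequest:
    """Hypothetical request: voice identity and emotion are separate inputs."""
    text: str
    reference_audio: str   # path to a short reference clip (about five seconds)
    emotion: list[float]   # emotion weights, independent of the voice


# Same voice, two emotional registers.
calm_happy = SynthesisRequest("Welcome back!", "narrator.wav", emotion_vector(happy=0.8))
worried = SynthesisRequest("Welcome back!", "narrator.wav", emotion_vector(fearful=0.6, sad=0.2))
```

Both requests point at the same reference clip, so only the emotion weights change between them.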

At a glance

Developer: Index Team
License: Bilibili Model License
Tier: standard
Speed: medium
Voice cloning: Yes
Languages: English, Chinese
Max characters: 1000
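Text longer than the 1000-character limit has to be split before synthesis. The following is a minimal sketch of one way to do that in Python, assuming the limit applies per request and counts characters; `split_for_tts` is a hypothetical helper, not part of any IndexTTS-2 tooling. It prefers sentence boundaries (English and Chinese punctuation) and hard-wraps only when a single sentence exceeds the limit.

```python
import re

MAX_CHARS = 1000  # "Max characters" from the listing; assumed to be a per-request limit


def split_for_tts(text: str, limit: int = MAX_CHARS) -> list[str]:
    """Split text into chunks of at most `limit` characters, preferring sentence boundaries."""
    # Split after ., !, ? when followed by whitespace, or directly after Chinese 。！？
    parts = re.split(r"(?<=[.!?])\s+|(?<=[。！？])", text)
    sentences = [p.strip() for p in parts if p.strip()]

    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # A single sentence longer than the limit is cut at the limit.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        # Sentences within a chunk are re-joined with a single space.
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be sent as its own synthesis request and the resulting audio concatenated in order.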

IndexTTS-2 AI Voices

Chinese Default

Chinese
Standard Neutral

Default

English
Standard Neutral

Best for

Emotionally expressive content, audiobooks, virtual assistants

IndexTTS-2 TTS — FAQ

IndexTTS-2 uses emotion vectors that let you specify tones such as happy, sad, angry, or fearful without needing emotion-specific training data. The emotional expression is controlled independently of the voice identity.

IndexTTS-2 supports zero-shot voice cloning from a short reference clip, typically around five seconds of audio, in English or Chinese.

The IndexTTS-2 weights are released under the Bilibili Model License, which allows commercial use for products below defined user and revenue thresholds. Larger deployments should review the license terms.