CosyVoice 2 TTS
Alibaba Tongyi Lab's streaming TTS model, offering human-comparable naturalness, low-latency synthesis, and zero-shot voice cloning.
CosyVoice 2, from Alibaba's Tongyi Lab, is designed to make high-quality speech synthesis viable in real time. It combines finite scalar quantization (FSQ) with flow matching to support low-latency streaming synthesis, and its naturalness is described as comparable to human speech and better than many commercial systems in subjective listening tests. It also offers zero-shot voice cloning from about 3 seconds of reference audio, cross-lingual synthesis, and fine-grained emotion control. With 8 supported languages and a 1,000-character input cap, it is a strong fit for voice assistants, streaming TTS, and other real-time applications.
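To make the quantization idea concrete, here is a minimal sketch of finite scalar quantization in general: each dimension of a small latent vector is squashed into a fixed range and rounded to one of a few integer levels, and the per-dimension digits are combined into a single token index. The function name `fsq_quantize`, the three-dimensional input, and the level counts are illustrative assumptions, not CosyVoice 2's actual configuration or code.

```python
import math

def fsq_quantize(z, levels):
    """Illustrative FSQ step (not CosyVoice 2's implementation).

    z      : list of floats, one per latent dimension
    levels : list of odd ints, number of quantization levels per dimension
    Returns (digits, index): per-dimension level in [0, L-1] and a single token id.
    """
    digits = []
    for value, n_levels in zip(z, levels):
        half = (n_levels - 1) / 2               # e.g. 1.0 for 3 levels
        bounded = math.tanh(value) * half       # squash into [-half, half]
        digits.append(int(round(bounded) + half))  # shift to 0 .. n_levels - 1

    # Combine the digits into one index using mixed-radix encoding.
    index, base = 0, 1
    for digit, n_levels in zip(digits, levels):
        index += digit * base
        base *= n_levels
    return digits, index

digits, index = fsq_quantize([0.0, 2.0, -2.0], [3, 3, 3])
print(digits, index)
```

Because every combination of levels is a valid code, the codebook is implicit (here 3 × 3 × 3 = 27 entries) and nothing has to be learned or looked up for the quantization step itself.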
At a glance
- Developer: Alibaba (Tongyi Lab)
- License: Apache 2.0
- Tier: standard
- Speed: medium
- Voice cloning: Yes
- Languages: English, Chinese, Japanese, Korean, French, German, Italian, Spanish
- Max characters: 1000
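Longer texts have to be split before synthesis to stay under the 1,000-character maximum. The sketch below shows one way to do that, preferring sentence boundaries and hard-wrapping only when a single sentence is too long. The helper name `split_for_tts` is hypothetical, and it assumes the limit counts characters the way Python's `len` does, which this page does not specify.

```python
import re

MAX_CHARS = 1000  # limit listed above; how characters are counted is an assumption

def split_for_tts(text: str, limit: int = MAX_CHARS) -> list[str]:
    """Split text into chunks of at most `limit` characters (hypothetical helper)."""
    # Split after Latin sentence punctuation followed by whitespace,
    # or directly after CJK sentence punctuation.
    pieces = re.split(r"(?<=[.!?])\s+|(?<=[。！？])", text)
    sentences = [p.strip() for p in pieces if p.strip()]

    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # A single sentence longer than the limit is hard-wrapped.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        # Sentences are re-joined with a single space.
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks

for chunk in split_for_tts("First sentence. Second sentence! 第三句。第四句。", limit=20):
    print(len(chunk), chunk)
```

Each chunk can then be sent as its own synthesis request and the resulting audio concatenated in order. Splitting at sentence boundaries keeps prosody more natural than cutting mid-sentence.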
Best for
Real-time applications, streaming TTS, voice assistants