CosyVoice 2 TTS

A streaming TTS model from Alibaba's Tongyi Lab with human-comparable naturalness, very low latency, and zero-shot voice cloning.

CosyVoice 2, from Alibaba's Tongyi Lab, was designed to make high-quality speech synthesis viable in real time. It represents speech as tokens produced by finite scalar quantization (FSQ) and pairs that with flow matching to support streaming synthesis at very low latency. Its naturalness is comparable to human speech and, in subjective listening tests, ahead of many commercial systems. Beyond quality, it offers zero-shot voice cloning from about 3 seconds of reference audio, cross-lingual synthesis, and fine-grained emotion control. It covers 8 languages and accepts up to 1,000 characters of input, which makes it a strong fit for voice assistants, streaming TTS, and other real-time applications.
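To make the term concrete, here is a minimal sketch of the general idea behind finite scalar quantization: each dimension of a small latent vector is squashed into a bounded range and rounded to one of a few integer levels, and the per-dimension codes are packed into a single token id. This is an illustration of the technique only, not CosyVoice 2's actual tokenizer; the function names, the `tanh` bounding, and the level counts are assumptions chosen for the example.

```python
import math

def fsq_quantize(z, levels):
    """Round each latent dimension to one of levels[i] integer codes.

    Assumes every entry of `levels` is odd, so codes are symmetric
    integers in [-(n - 1) / 2, (n - 1) / 2].
    """
    codes = []
    for value, n in zip(z, levels):
        if n % 2 == 0:
            raise ValueError("this sketch only handles odd level counts")
        half = (n - 1) // 2
        bounded = math.tanh(value) * half  # squash into (-half, half)
        codes.append(int(round(bounded)))  # snap to the nearest integer level
    return codes

def codes_to_index(codes, levels):
    """Pack per-dimension codes into one token id (mixed-radix number)."""
    index, base = 0, 1
    for code, n in zip(codes, levels):
        digit = code + (n - 1) // 2  # shift from [-half, half] to [0, n - 1]
        index += digit * base
        base *= n
    return index

# Hypothetical 3-dimensional latent with 3 levels per dimension (27 tokens).
levels = [3, 3, 3]
codes = fsq_quantize([0.0, 5.0, -5.0], levels)   # [0, 1, -1]
token_id = codes_to_index(codes, levels)          # 7
```

Because the codebook is just the grid of all level combinations, every token id is reachable by construction, which is the usual argument for FSQ over a learned codebook.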

At a glance

Developer: Alibaba (Tongyi Lab)
License: Apache 2.0
Tier: standard
Speed: medium
Voice cloning: Yes
Languages: English, Chinese, Japanese, Korean, French, German, Italian, Spanish
Max characters: 1,000
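Text longer than the 1,000-character maximum has to be split before synthesis. The sketch below shows one way to do that, breaking at sentence boundaries where possible and hard-wrapping only when a single sentence exceeds the limit. The function name `split_for_tts` and the punctuation set are assumptions for illustration; the page does not say how the limit is enforced or whether it counts characters exactly this way.

```python
import re

MAX_CHARS = 1000  # maximum input length listed on this page

# Zero-width split points: after Latin sentence punctuation that is followed
# by whitespace, or directly after CJK sentence punctuation.
_BOUNDARY = re.compile(r"(?<=[.!?])(?=\s)|(?<=[。！？])")

def split_for_tts(text, max_chars=MAX_CHARS):
    """Split text into chunks of at most max_chars characters."""
    chunks, current = [], ""

    def flush():
        nonlocal current
        if current.strip():
            chunks.append(current.strip())
        current = ""

    for piece in _BOUNDARY.split(text):
        # Hard-wrap a single sentence that is itself longer than the cap.
        while len(piece) > max_chars:
            flush()
            chunks.append(piece[:max_chars])
            piece = piece[max_chars:]
        if len(current) + len(piece) > max_chars:
            flush()
        current += piece
    flush()
    return chunks

# Usage: each chunk can then be sent as a separate synthesis request.
parts = split_for_tts("Hello world. " * 200)
```

Splitting on sentence boundaries rather than at a fixed offset keeps prosody intact within each request, at the cost of chunks that are usually a little shorter than the limit.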

CosyVoice 2 AI Voices

Chinese Female: Chinese, Standard, Female
Chinese Male: Chinese, Standard, Male
English Female: English, Standard, Female
English Male: English, Standard, Male
French Female: French, Standard, Female
German Female: German, Standard, Female
Italian Female: Italian, Standard, Female
Japanese Female: Japanese, Standard, Female
Korean Female: Korean, Standard, Female
Spanish Female: Spanish, Standard, Female

Best for

Real-time applications, streaming TTS, voice assistants

CosyVoice 2 TTS — FAQ

CosyVoice 2 supports real-time streaming. It combines finite scalar quantization with flow matching to synthesize speech at very low latency, which is what makes it suitable for voice assistants and other real-time applications.

CosyVoice 2 supports voice cloning. It offers zero-shot cloning from roughly 3 seconds of reference audio, plus cross-lingual synthesis and emotion control.

CosyVoice 2 is released under the Apache 2.0 license. It supports 8 languages: English, Chinese, Japanese, Korean, French, German, Italian, and Spanish.
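Since cloning relies on roughly 3 seconds of reference audio, it can help to check a clip's length before submitting it. The following is a minimal sketch using Python's standard `wave` module; it only handles uncompressed PCM WAV files, and the 3.0-second threshold and helper names are assumptions based on the "about 3 seconds" figure above, not a documented requirement.

```python
import io
import wave

MIN_REFERENCE_SECONDS = 3.0  # rough floor taken from "about 3 seconds"

def wav_duration_seconds(source):
    """Return the duration of a PCM WAV file (a path or binary file object)."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()

def is_long_enough(source, minimum=MIN_REFERENCE_SECONDS):
    """True if the clip is at least `minimum` seconds long."""
    return wav_duration_seconds(source) >= minimum

def make_silent_wav(seconds, rate=16000):
    """Build an in-memory mono 16-bit silent WAV clip, for demonstration only."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(rate * seconds))
    buffer.seek(0)
    return buffer

# Usage: a 4-second clip passes the check, a 1-second clip does not.
ok = is_long_enough(make_silent_wav(4))
too_short = is_long_enough(make_silent_wav(1))
```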