Ming-Omni TTS

A compact 0.5B omni-modal speech model with near-CD-quality 44.1kHz output and zero-shot voice cloning.

Ming-omni-tts-0.5B by inclusionAI is a compact omni-modal speech model built on the BailingMM dense backbone with a patch-by-patch flow-matching audio decoder. Despite its small 500M-parameter size, it outputs 44.1kHz audio approaching CD quality and supports zero-shot voice cloning from a reference clip of three seconds or more. Emotion, dialect, and background music are controlled through JSON instructions. The model reports a 0.83% word error rate on Chinese benchmarks, which suggests stable output. With Apache 2.0 licensing and modest VRAM needs of about 3GB, it suits high-fidelity bilingual narration, emotion-controlled voice acting, and Chinese audiobook production.
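The page does not show what a JSON instruction looks like, so the following is only a sketch of how such a control payload might be assembled. Every field name here (`text`, `emotion`, `dialect`, `background_music`) and the helper `build_instruction` are hypothetical placeholders, not the model's documented schema; check the model's own documentation for the real keys.

```python
import json


def build_instruction(text, emotion=None, dialect=None, background_music=None):
    """Assemble a JSON instruction string.

    All keys are hypothetical; the real schema is not given on this page.
    """
    payload = {"text": text}
    optional = {
        "emotion": emotion,
        "dialect": dialect,
        "background_music": background_music,
    }
    # Only include controls the caller actually set.
    payload.update({key: value for key, value in optional.items() if value is not None})
    # ensure_ascii=False keeps Chinese text readable instead of \u escapes.
    return json.dumps(payload, ensure_ascii=False)


if __name__ == "__main__":
    print(build_instruction("你好，世界。", emotion="happy"))
```

Leaving unset controls out of the payload, rather than sending nulls, keeps the instruction minimal; whether the model prefers one or the other is not stated here.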

At a glance

Developer: inclusionAI
License: Apache 2.0
Tier: free
Speed: medium
Voice cloning: Yes
Languages: English, Chinese
Max characters: 1000
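
Because input is capped at 1000 characters, longer scripts need to be split before synthesis. The sketch below is one way to do that, breaking on English and Chinese sentence punctuation where possible. The function name `split_text` is illustrative, and it assumes the limit counts Unicode characters the way Python's `len` does, which the page does not confirm.

```python
import re

MAX_CHARS = 1000  # "Max characters" value listed on this page


def split_text(text, max_chars=MAX_CHARS):
    """Split text into chunks of at most max_chars, preferring sentence ends.

    Concatenating the returned chunks reproduces the input exactly.
    """
    # A sentence ends at ., ! or ? followed by whitespace, at a Chinese
    # full stop / exclamation / question mark, or at the end of the text.
    sentences = re.findall(r".+?(?:[.!?]\s+|[。！？]\s*|\Z)", text, flags=re.S)
    chunks = []
    current = ""
    for sentence in sentences:
        if len(current) + len(sentence) <= max_chars:
            current += sentence
            continue
        if current:
            chunks.append(current)
        # Hard-cut a single sentence that exceeds the limit on its own.
        while len(sentence) > max_chars:
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        current = sentence
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    parts = split_text("Hello world. " * 200)
    print([len(part) for part in parts])
```

Chunks keep their trailing whitespace so nothing is lost when they are joined back together; strip each chunk before sending it if the service is sensitive to that.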

Ming-Omni TTS AI Voices

Default

English
Free Neutral

Default (Chinese)

Chinese
Free Neutral

Best for

High-fidelity bilingual narration, emotion-controlled voice acting, Chinese audiobook content

Ming-Omni TTS FAQ

Ming-Omni TTS outputs 44.1kHz audio, close to CD quality, which is high for a model of only 0.5B parameters. This is credited to its patch-by-patch flow-matching audio decoder.

Beyond voice cloning, it supports emotion, dialect, and background-music control via JSON instructions. It also reports a 0.83% word error rate on Chinese benchmarks, which suggests stable output.

It supports English and Chinese, with zero-shot voice cloning from a reference clip of three seconds or longer.
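
Since cloning needs a reference of at least three seconds, it can help to check a clip's length before uploading it. The sketch below does this with Python's standard `wave` module. It assumes the reference is an uncompressed PCM WAV file, which this page does not specify, and the helper names are illustrative.

```python
import wave

MIN_REFERENCE_SECONDS = 3.0  # minimum reference length stated on this page


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_usable_reference(path, min_seconds=MIN_REFERENCE_SECONDS):
    """True if the clip is long enough to serve as a cloning reference."""
    return wav_duration_seconds(path) >= min_seconds
```

Other formats such as MP3 or FLAC would need a different reader; `wave` only handles WAV.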