Ming-Omni TTS
A compact 0.5B omni-modal speech model with near-CD-quality 44.1kHz output and zero-shot voice cloning.
Ming-omni-tts-0.5B by inclusionAI is a compact omni-modal speech model built on the BailingMM dense backbone with a patch-by-patch flow-matching audio decoder. Despite having only 500M parameters, it outputs 44.1kHz audio approaching CD quality and supports zero-shot voice cloning from a reference clip of three seconds or more. Emotion, dialect, and even background music are controlled through JSON instructions, and output is notably stable, with a reported 0.83% word error rate on Chinese benchmarks. With Apache 2.0 licensing and a modest 3GB VRAM requirement, it suits high-fidelity bilingual narration, emotion-controlled voice acting, and Chinese audiobook production.
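As a rough illustration of what a JSON-driven control request might look like, the sketch below assembles an instruction string and checks the three-second minimum for a cloning reference. The function name `build_instruction` and every key in the payload (`text`, `emotion`, `dialect`, `background_music`, `voice_clone`) are hypothetical placeholders chosen for this example; the model's actual instruction schema is not given here and may differ.

```python
import json

# From the description above: cloning needs a reference of three seconds or more.
MIN_REFERENCE_SECONDS = 3.0


def build_instruction(text, emotion=None, dialect=None,
                      background_music=None, reference_seconds=None):
    """Assemble a JSON instruction string.

    All key names are illustrative assumptions, not the model's real schema.
    """
    if reference_seconds is not None and reference_seconds < MIN_REFERENCE_SECONDS:
        raise ValueError(
            f"reference clip is {reference_seconds:.1f}s; "
            f"at least {MIN_REFERENCE_SECONDS:.0f}s is needed for cloning"
        )

    payload = {"text": text}
    # Only include the controls the caller actually set.
    optional = (
        ("emotion", emotion),
        ("dialect", dialect),
        ("background_music", background_music),
    )
    for key, value in optional:
        if value is not None:
            payload[key] = value
    if reference_seconds is not None:
        payload["voice_clone"] = True

    # ensure_ascii=False keeps Chinese text readable in the output string.
    return json.dumps(payload, ensure_ascii=False)


if __name__ == "__main__":
    print(build_instruction("你好，世界", emotion="happy", reference_seconds=4.2))
```

Validating the reference length before building the request keeps an obviously too-short clip from reaching the model at all.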
At a glance
- Developer: inclusionAI
- License: Apache 2.0
- Tier: free
- Speed: medium
- Voice cloning: Yes
- Languages: English, Chinese
- Max characters: 1000
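Longer scripts need to be split to stay within the 1000-character limit listed above. The following is a minimal sketch of one way to do that, assuming the limit applies per request and counts characters as Python's `len` does; the helper name `split_for_tts` is hypothetical. It breaks at English and Chinese sentence-ending punctuation where possible and falls back to a hard cut for any single sentence that exceeds the limit.

```python
import re

MAX_CHARS = 1000  # "Max characters" from the list above

# Zero-width split points: after Chinese sentence punctuation, or after
# English sentence punctuation that is followed by whitespace (so a
# decimal such as "0.83" is not split).
_SENTENCE_BOUNDARY = re.compile(r"(?<=[。！？])|(?<=[.!?])(?=\s)")


def split_for_tts(text, max_chars=MAX_CHARS):
    """Split text into chunks of at most max_chars, preferring sentence breaks."""
    sentences = [s for s in _SENTENCE_BOUNDARY.split(text) if s]
    chunks, current = [], ""
    for sentence in sentences:
        # A single over-long sentence is cut at the limit as a last resort.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        if len(current) + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current += sentence
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    script = "First sentence. " * 100
    for i, chunk in enumerate(split_for_tts(script)):
        print(i, len(chunk))
```

Because the split points are zero-width, joining the chunks reproduces the input exactly; a chunk may therefore begin with a space, which can be stripped before synthesis if needed.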
Best for
High-fidelity bilingual narration, emotion-controlled voice acting, Chinese audiobook content