Ming-Omni TTS

A compact 0.5B omni-modal speech model with near-CD-quality 44.1kHz output and zero-shot voice cloning.

SSML Mode (Speech Synthesis Markup Language for fine control)

Wrap your text in SSML tags for precise control:

<speak><prosody rate="slow">Slow speech</prosody></speak>
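
The SSML wrapper shown above can also be produced programmatically. The following is a minimal sketch, assuming only the `<speak>` and `<prosody rate="...">` tags from the example; `wrap_ssml` is a hypothetical helper, and which other tags or rate values the service accepts is not stated here. Escaping matters because characters such as `&` or `<` in the input text would otherwise produce malformed markup.

```python
from xml.sax.saxutils import escape, quoteattr


def wrap_ssml(text: str, rate: str = "slow") -> str:
    """Wrap plain text in <speak><prosody rate=...>, escaping XML special characters.

    Hypothetical helper: only the tags from the page's own example are used.
    """
    return f"<speak><prosody rate={quoteattr(rate)}>{escape(text)}</prosody></speak>"


print(wrap_ssml("Slow speech"))
```

With the default arguments this reproduces the example string exactly; text containing `&` is emitted as `&amp;`.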

Pitch: 0 (range -12 to +12)

Speed: 1.0x (range 0.5x to 2.0x)

About Ming-Omni TTS

Ming-omni-tts-0.5B by inclusionAI is a compact omni-modal speech model built on the BailingMM dense backbone with a patch-by-patch flow-matching audio decoder. Despite its small 500M-parameter size, it outputs 44.1kHz audio approaching CD quality and supports zero-shot voice cloning from a reference of three seconds or more. It includes built-in emotion, dialect, and even background-music control driven by JSON instructions, and is notably stable — reporting a 0.83% word error rate on Chinese benchmarks. With Apache 2.0 licensing and modest 3GB VRAM needs, it fits high-fidelity bilingual narration, emotion-controlled voice acting, and Chinese audiobook production.
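
Since voice cloning needs a reference of three seconds or more, it can help to check a clip's length before uploading it. The sketch below uses Python's standard `wave` module and assumes the reference is an uncompressed WAV file; the function names are hypothetical, and the formats the service actually accepts are not stated here.

```python
import wave

# Minimum reference length, taken from the description above.
MIN_REFERENCE_SECONDS = 3.0


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a WAV file: frame count divided by frame rate."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_long_enough(path: str, minimum: float = MIN_REFERENCE_SECONDS) -> bool:
    """True if the clip meets the minimum reference length for cloning."""
    return wav_duration_seconds(path) >= minimum
```

Calling `is_long_enough("reference.wav")` on a local file (the filename is illustrative) returns `False` for clips shorter than three seconds.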

Best for: High-fidelity bilingual narration, emotion-controlled voice acting, Chinese audiobook content
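
The emotion, dialect, and background-music controls are described as driven by JSON instructions, but the schema is not given here. The sketch below only illustrates the general shape of such an instruction: the key names (`emotion`, `dialect`, `background_music`) and the example values are assumptions, not the model's documented format, so check the model's own documentation before relying on them.

```python
import json
from typing import Optional


def build_instruction(
    emotion: Optional[str] = None,
    dialect: Optional[str] = None,
    background_music: Optional[str] = None,
) -> str:
    """Serialize the chosen controls as a JSON string, omitting unset ones.

    All key names are hypothetical placeholders for the real schema.
    """
    fields = {
        "emotion": emotion,
        "dialect": dialect,
        "background_music": background_music,
    }
    chosen = {key: value for key, value in fields.items() if value is not None}
    # ensure_ascii=False keeps Chinese values readable in the output.
    return json.dumps(chosen, ensure_ascii=False)


print(build_instruction(emotion="happy", dialect="Cantonese"))
```

Leaving unset controls out of the object, rather than sending nulls, keeps the instruction minimal.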

At a glance

Developer: inclusionAI
License: Apache 2.0
Tier: free
Speed: medium
Voice cloning: Yes
Languages: English, Chinese
Max characters: 1000
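
Given the 1000-character maximum listed above, longer scripts have to be split before generation. The following is a rough sketch of one way to do that, greedily packing whole sentences into chunks; `split_for_tts` is a hypothetical helper, and it assumes the limit counts Unicode characters as Python's `len` does, which is not confirmed here.

```python
import re

MAX_CHARS = 1000  # "Max characters" from the list above


def split_for_tts(text: str, limit: int = MAX_CHARS) -> list[str]:
    """Greedily pack sentences into chunks of at most `limit` characters."""
    # Split after Latin punctuation followed by whitespace, or directly
    # after Chinese sentence-ending punctuation.
    parts = re.split(r"(?<=[.!?])\s+|(?<=[。！？])", text)
    sentences = [part.strip() for part in parts if part.strip()]

    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # A single sentence longer than the limit is hard-wrapped.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be submitted as a separate generation. Sentences are rejoined with a single space, which may be unwanted between Chinese sentences.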

Ming-Omni TTS voices

Default: English, Free, Neutral
Default (Chinese): Chinese, Free, Neutral

Ming-Omni TTS FAQ

What audio quality does Ming-Omni TTS produce?

It outputs 44.1kHz audio, close to CD quality, thanks to its patch-by-patch flow-matching audio decoder. That is high for a model of only 0.5B parameters.

What control does Ming-Omni TTS offer?

Beyond voice cloning, it supports emotion, dialect, and background-music control via JSON instructions. It is also very stable, reporting a 0.83% word error rate on Chinese benchmarks.

Which languages does Ming-Omni TTS support?

English and Chinese, with zero-shot voice cloning from a reference clip of three seconds or longer.