VoxCPM TTS

A tokenizer-free TTS model that works in continuous space, outputs 44.1kHz audio, and stays consistent across paragraphs.

VoxCPM 1.5 by OpenBMB takes an unusual approach: instead of converting speech into discrete tokens, it operates directly in continuous space, which helps it preserve fine acoustic detail. It produces high-fidelity 44.1kHz audio, supports zero-shot voice cloning from three to ten seconds of reference audio, and maintains a consistent voice across long passages, a common failure point for other models on multi-paragraph text. Its cross-language cloning lets an English reference voice speak Chinese and vice versa. With Apache 2.0 licensing and LoRA fine-tuning support, it is well suited to audiobooks and other long-form content where voice consistency over many paragraphs is essential.

At a glance

Developer: OpenBMB
License: Apache 2.0
Tier: standard
Speed: fast
Voice cloning: Yes
Languages: English, Chinese
Max characters: 2000
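
Because each request is capped at 2000 characters, long-form text such as an audiobook chapter has to be sent in pieces. The following is a minimal sketch of one way to do that, assuming the limit applies to the raw character count of each request and that splitting at sentence boundaries is acceptable. The function name split_text is hypothetical and is not part of VoxCPM or any published API.

```python
import re

MAX_CHARS = 2000  # per-request limit listed under "At a glance"

def split_text(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Split text into chunks of at most max_chars, preferring sentence boundaries."""
    # Split after Latin sentence punctuation followed by whitespace,
    # or directly after CJK sentence punctuation (no whitespace needed).
    parts = re.split(r"(?<=[.!?])\s+|(?<=[。！？])", text)
    sentences = [p.strip() for p in parts if p and p.strip()]

    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # A single sentence longer than the limit is hard-wrapped.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        # Sentences are rejoined with a single space inside a chunk.
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be synthesized as its own request, in order, with the same voice. Keeping sentences whole avoids cutting the audio mid-phrase at chunk joins.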

VoxCPM AI Voices

Default: English, Neutral
Default Chinese: Chinese, Neutral

Best for

High-fidelity audio, audiobooks, long-form content with voice consistency

VoxCPM TTS — FAQ

Rather than discretizing speech into tokens, VoxCPM models audio in continuous space using flow matching. This helps it retain subtle acoustic detail and produce clean 44.1kHz output.

VoxCPM is specifically designed to keep the voice consistent across paragraphs, which makes it well suited to audiobooks and other long passages where other models tend to drift.

VoxCPM supports cross-lingual cloning between English and Chinese from three to ten seconds of reference audio. For example, an English reference voice can be applied to Chinese speech.
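
Since cloning works from three to ten seconds of reference audio, it can help to check a clip's length before submitting it. The sketch below uses only Python's standard wave module and assumes the reference is an uncompressed WAV file. Treating three and ten seconds as hard bounds is an assumption made for illustration, and the function names are hypothetical.

```python
import wave

# Reference length range quoted above; using it as a hard limit is an assumption.
MIN_SECONDS, MAX_SECONDS = 3.0, 10.0

def wav_duration_seconds(path: str) -> float:
    """Return the duration of an uncompressed WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()

def is_usable_reference(path: str) -> bool:
    """True if the clip length falls within the three-to-ten-second range."""
    return MIN_SECONDS <= wav_duration_seconds(path) <= MAX_SECONDS
```

A clip that fails the check can be trimmed or re-recorded before it is used as a cloning reference.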