VoxCPM TTS

A tokenizer-free TTS model that works in continuous space, outputs 44.1kHz audio, and stays consistent across paragraphs.

VoxCPM 1.5 by OpenBMB takes an unusual approach: instead of converting speech into discrete tokens, it operates directly in continuous space, which helps it preserve fine acoustic detail. It produces high-fidelity 44.1kHz audio, supports zero-shot voice cloning from three to ten seconds of reference audio, and maintains a consistent voice across long passages, a common failure point for other models on multi-paragraph text. Its cross-language cloning lets an English reference voice speak Chinese and vice versa. With Apache 2.0 licensing and LoRA fine-tuning support, it is well suited to audiobooks and long-form content where voice consistency over many paragraphs is essential.

At a glance

Developer: OpenBMB
License: Apache 2.0
Tier: standard
Speed: fast
Voice cloning: Supported
Languages: English, Chinese
Max characters: 2000
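
For long-form text such as audiobook chapters, input longer than the 2000-character maximum listed above has to be divided before synthesis. The sketch below shows one way to do that, assuming the limit applies to each individual synthesis request (the page does not say so explicitly). The helper name `split_for_tts` and its packing strategy are illustrative and not part of VoxCPM: it packs whole paragraphs where they fit, falls back to sentence boundaries for oversized paragraphs, and hard-splits only as a last resort.

```python
import re

# Limit taken from the "at a glance" list; assumed to apply per request.
MAX_CHARS = 2000

# Sentence ends: Latin punctuation followed by whitespace, or CJK
# full-width punctuation (which is usually not followed by a space).
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+|(?<=[。！？])")


def split_for_tts(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Split text into chunks of at most max_chars characters.

    Hypothetical helper, not a VoxCPM API. Pieces are joined with a
    single space when packed into the same chunk.
    """
    chunks: list[str] = []
    current = ""
    for paragraph in re.split(r"\n\s*\n", text):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        if len(paragraph) <= max_chars:
            pieces = [paragraph]
        else:
            # Oversized paragraph: break at sentence ends, then hard-split
            # any single sentence that is still too long.
            pieces = []
            for sentence in _SENTENCE_END.split(paragraph):
                sentence = sentence.strip()
                pieces.extend(
                    sentence[i:i + max_chars]
                    for i in range(0, len(sentence), max_chars)
                )
        for piece in pieces:
            candidate = f"{current} {piece}" if current else piece
            if len(candidate) <= max_chars:
                current = candidate
            else:
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be sent as its own request using the same voice. Splitting at paragraph and sentence boundaries keeps the cuts at natural pauses, so the separately synthesized clips are easier to concatenate.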

VoxCPM voices

Default

English
Standard Neutral

Default Chinese

Chinese
Standard Neutral

Best for

High-fidelity audio, audiobooks, long-form content with voice consistency

VoxCPM TTS FAQ

Rather than discretizing speech into tokens, VoxCPM models audio in continuous space using flow matching. This helps it retain subtle acoustic detail and produce clean 44.1kHz output.

Yes. It is specifically designed to keep the voice consistent across paragraphs, which makes it well suited to audiobooks and other long passages where other models tend to drift.

Yes. Given three to ten seconds of reference audio, it supports cross-lingual cloning between English and Chinese, for example applying an English reference voice to Chinese speech.