Pocket TTS

A compact 100M-parameter, CPU-friendly TTS model from Kyutai (makers of Moshi) with single-sample voice cloning.

Pocket TTS comes from Kyutai, the lab behind the Moshi speech model, and is built around a transformer paired with the Mimi codec. At just 100M parameters it runs efficiently on CPU, yet it still supports zero-shot voice cloning from a single audio sample, which is unusual at this size. It covers English and French, accepts up to 1,000 characters per request, and is rated fast (roughly 2 seconds). The small footprint (about 1GB of VRAM when a GPU is used) makes it a natural fit for edge deployment and for low-resource or CPU-only environments where quick voice cloning is needed.
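As a rough sanity check on the footprint, the weights-only memory of a 100M-parameter model can be estimated from the parameter count and the storage width per parameter. This is a minimal sketch: the page does not say which precision the released weights use, so the widths below are assumptions, and `weight_megabytes` is a hypothetical helper rather than part of any Pocket TTS tooling.

```python
PARAMS = 100_000_000  # parameter count stated on this page

# Bytes per parameter for common storage formats (assumed, not from the page).
BYTES_PER_PARAM = {"float32": 4, "float16/bfloat16": 2, "int8": 1}


def weight_megabytes(params: int, bytes_per_param: int) -> float:
    """Return the size of the weights alone in megabytes (1 MB = 10**6 bytes)."""
    return params * bytes_per_param / 1e6


for name, width in BYTES_PER_PARAM.items():
    print(f"{name}: {weight_megabytes(PARAMS, width):.0f} MB")
```

Even at 32-bit precision the weights come to about 400 MB, which is consistent with the roughly 1GB figure once runtime overhead is added on top. The estimate covers weights only and says nothing about activations, the codec, or framework overhead.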

At a glance

Developer: Kyutai
License: MIT
Tier: Free
Speed: Fast
Voice cloning: Yes
Languages: English, French
Max characters: 1,000
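Because each request is capped at 1,000 characters, longer scripts need to be split before synthesis. The sketch below shows one way to do that, preferring sentence boundaries and hard-wrapping only when a single sentence exceeds the limit. It does not call any Pocket TTS API; `split_for_tts` and `MAX_CHARS` are hypothetical names introduced for illustration, and the limit value is taken from this page.

```python
import re

MAX_CHARS = 1000  # per-request limit stated on this page


def split_for_tts(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Split text into chunks of at most max_chars, preferring sentence ends."""
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Hard-wrap a single sentence that is itself over the limit.
        # This may cut mid-word; it is a fallback, not the common path.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be sent as its own request and the resulting audio concatenated in order. Splitting at sentence boundaries is a design choice that tends to keep prosody natural at the joins, though that depends on the model.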

Pocket TTS AI Voices

Alba: English, Female
Azelma: English, Female
Cosette: English, Female
Eponine: English, Female
Fantine: English, Female
Fantine (French): French, Female
Javert: English, Male
Jean: English, Male
Jean (French): French, Male
Marius: English, Male

Best for

Lightweight deployment, CPU-only environments, quick voice cloning

Pocket TTS FAQ

Does Pocket TTS support voice cloning?

Yes. Pocket TTS does zero-shot voice cloning from a single reference sample (about 3 seconds), which is notable for a model this small.

Can Pocket TTS run on CPU?

Yes. At 100M parameters it runs efficiently on CPU and needs only about 1GB of VRAM if a GPU is used, making it well suited to edge and low-resource deployment.

Is Pocket TTS free to use?

Yes. Pocket TTS is MIT-licensed and available in the free tier. It supports English and French.