Pocket TTS

A compact 100M-parameter CPU model from Kyutai (makers of Moshi) with single-sample voice cloning.

Pocket TTS comes from Kyutai, the lab behind the Moshi speech model, and is built around a transformer paired with the Mimi codec. At just 100M parameters it runs efficiently on CPU, yet it still supports zero-shot voice cloning from a single audio sample, which is unusual at this size. It covers English and French and accepts up to 1,000 characters per request, with generation rated as fast (around 2 seconds). The small footprint, roughly 1GB of VRAM when a GPU is used, makes it a natural fit for edge deployment and for low-resource or CPU-only environments where quick voice cloning is needed.

At a glance

Developer: Kyutai
License: MIT
Tier: free
Speed: fast
Voice cloning: Yes
Languages: English, French
Max characters: 1000
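
The 1,000-character limit per request means longer text has to be split before synthesis. The following is a minimal sketch of one way to do that, not part of Pocket TTS itself: `split_for_tts` and `MAX_CHARS` are hypothetical names, and the sentence-boundary rule (a period, exclamation mark, or question mark followed by whitespace) is an assumption that will not suit every text.

```python
import re

MAX_CHARS = 1000  # per-request limit listed for Pocket TTS on this page


def split_for_tts(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Split text into chunks of at most max_chars, preferring sentence boundaries."""
    # Assumed boundary rule: ., ! or ? followed by whitespace ends a sentence.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # A single sentence longer than the limit is cut at fixed width.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate  # still fits, keep accumulating
        else:
            chunks.append(current)  # flush and start a new chunk
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be sent as its own request and the resulting audio concatenated in order. Splitting at sentence boundaries rather than at a fixed width keeps pauses and intonation from breaking mid-sentence.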

Pocket TTS AI Voices

Alba: English, Female, Free
Azelma: English, Female, Free
Cosette: English, Female, Free
Eponine: English, Female, Free
Fantine: English, Female, Free
Fantine (French): French, Female, Free
Javert: English, Male, Free
Jean: English, Male, Free
Jean (French): French, Male, Free
Marius: English, Male, Free

Best for

Lightweight deployment, CPU-only environments, quick voice cloning

Pocket TTS FAQ

Does Pocket TTS support voice cloning?
Yes. Pocket TTS does zero-shot voice cloning from a single reference sample (about 3 seconds), which is notable for a model this small.

Can Pocket TTS run on CPU?
Yes. At 100M parameters it runs efficiently on CPU and needs only about 1GB of VRAM if a GPU is used, making it well suited to edge and low-resource deployment.
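
As a rough sanity check on the footprint, the storage needed for the weights alone can be estimated from the parameter count. This is a back-of-the-envelope sketch: the numeric precision Pocket TTS actually ships in is not stated here, so the dtypes below are assumptions, and the roughly 1GB figure presumably also covers the codec, activations, and runtime overhead.

```python
PARAMS = 100_000_000  # parameter count stated on this page

# Assumed precisions; the precision Pocket TTS actually uses is not stated here.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_megabytes(params: int, dtype: str) -> float:
    """Return weight storage in MB (10**6 bytes) for the given dtype."""
    return params * BYTES_PER_PARAM[dtype] / 1e6


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_megabytes(PARAMS, dtype):.0f} MB")
# float32: 400 MB
# float16: 200 MB
# int8: 100 MB
```

Even at full 32-bit precision the weights stay well under 1GB, which is consistent with the model fitting on modest hardware.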

Is Pocket TTS free to use, and which languages does it support?
Yes. Pocket TTS is MIT-licensed and in the free tier. It supports English and French.