Pocket TTS

A compact 100M-parameter CPU model from Kyutai (makers of Moshi) with single-sample voice cloning.

Pocket TTS comes from Kyutai, the lab behind the Moshi speech model, and is built around a transformer paired with the Mimi codec. At just 100M parameters it runs efficiently on CPU, yet it still supports zero-shot voice cloning from a single audio sample, an unusual feature at this size. It covers English and French and accepts up to 1,000 characters per request, with fast generation (around 2 seconds). The small footprint (about 1 GB of VRAM when a GPU is used) makes it a natural fit for edge deployment and for low-resource or CPU-only environments where quick voice cloning is needed.
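Because each request is capped at 1,000 characters, longer scripts need to be split before synthesis. The sketch below shows one way to do that in plain Python, breaking at sentence boundaries where possible. The function name `split_for_tts` and the splitting rule are illustrative assumptions, not part of Pocket TTS; the code makes no calls to the model and only prepares the text.

```python
import re

MAX_CHARS = 1000  # per-request character limit stated on this page


def split_for_tts(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Split text into chunks of at most max_chars, preferring sentence ends.

    Hypothetical helper for illustration; not part of any Pocket TTS API.
    """
    # Break after sentence-ending punctuation that is followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # A single sentence longer than the limit is hard-wrapped,
        # which may cut in the middle of a word.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # Adding this sentence would exceed the limit: start a new chunk.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be sent as its own request and the resulting audio clips concatenated in order. Splitting at sentence ends rather than at a fixed offset is a design choice meant to avoid cutting a phrase across two clips.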

At a glance

Developer: Kyutai
License: MIT
Tier: free
Speed: fast
Voice cloning: Yes
Languages: English, French
Max characters: 1,000

Pocket TTS AI Voices

Alba: English, female (available)
Azelma: English, female (available)
Cosette: English, female (available)
Eponine: English, female (available)
Fantine: English, female (available)
Fantine (French): French, female (available)
Javert: English, male (available)
Jean: English, male (available)
Jean (French): French, male (available)
Marius: English, male (available)

Best for

Lightweight deployment, CPU-only environments, quick voice cloning

Pocket TTS FAQ

Does Pocket TTS support voice cloning?
Yes. Pocket TTS does zero-shot voice cloning from a single reference sample (about 3 seconds), which is notable for a model this small.

Can Pocket TTS run on CPU?
Yes. At 100M parameters it runs efficiently on CPU and needs only about 1 GB of VRAM if a GPU is used, making it well suited to edge and low-resource deployment.

Is Pocket TTS free to use?
Yes. Pocket TTS is MIT-licensed and in the free tier. It supports English and French.