Pocket TTS

A compact, CPU-friendly 100M-parameter model from Kyutai (makers of Moshi) with single-sample voice cloning.

Pocket TTS comes from Kyutai, the lab behind the Moshi speech model, and is built around a transformer paired with the Mimi codec. At just 100M parameters it runs efficiently on CPU, yet it still supports zero-shot voice cloning from a single audio sample, an unusual feature at this size. It covers English and French, accepts up to 1,000 characters per request, and is rated fast (~2s). The small footprint, with only about 1GB of VRAM needed if a GPU is used, makes it a natural fit for edge deployment and for low-resource or CPU-only environments where quick voice cloning is needed.
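Because each request is capped at 1,000 characters, longer scripts have to be split before synthesis. The following is a minimal sketch of one way to do that, assuming a plain character count is what the limit measures; the function name `chunk_text` and the sentence-boundary heuristic are illustrative choices, not part of Pocket TTS or any official client.

```python
import re

MAX_CHARS = 1000  # per-request limit stated on this page


def chunk_text(text: str, limit: int = MAX_CHARS) -> list[str]:
    """Split text into chunks of at most `limit` characters,
    preferring to break at sentence boundaries (hypothetical helper)."""
    if limit < 1:
        raise ValueError("limit must be positive")
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        # A single sentence longer than the limit is hard-split.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= limit:
            current = candidate
        else:
            # Adding this sentence would exceed the limit: start a new chunk.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk could then be submitted as its own request and the resulting audio concatenated in order. Breaking at sentence boundaries is a design choice intended to avoid cutting a request mid-phrase.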

At a glance

Developer: Kyutai
License: MIT
Tier: free
Speed: fast
Voice cloning: Yes
Languages: English, French
Max characters: 1000

Pocket TTS AI Voices

Alba: English, female, free
Azelma: English, female, free
Cosette: English, female, free
Eponine: English, female, free
Fantine: English, female, free
Fantine (French): French, female, free
Javert: English, male, free
Jean: English, male, free
Jean (French): French, male, free
Marius: English, male, free

Best for

Lightweight deployment, CPU-only environments, quick voice cloning

Pocket TTS FAQ

Does Pocket TTS support voice cloning?
Yes. Pocket TTS does zero-shot voice cloning from a single reference sample (about 3 seconds), which is notable for a model this small.

Can Pocket TTS run on CPU?
Yes. At 100M parameters it runs efficiently on CPU and needs only about 1GB of VRAM if a GPU is used, making it well suited to edge and low-resource deployment.

Is Pocket TTS free to use?
Yes. Pocket TTS is MIT-licensed and in the free tier. It supports English and French.