Sesame CSM

Sesame CSM TTS

A 1B-parameter conversational speech model that captures natural dialogue timing, turn-taking, and backchannel responses.

Sesame CSM (Conversational Speech Model) is a 1-billion-parameter model from Sesame designed specifically for the rhythms of human conversation. Built on a Llama backbone paired with an audio codec, it models turn-taking timing, backchannel responses (the small acknowledgements people make while listening), emotional reactions, and overall conversational flow. The result sounds less like text being read aloud and more like a real spoken exchange. It is a natural fit for AI assistants, chatbots, and conversational interfaces where the goal is speech that feels responsive and human. CSM is released under Apache 2.0, and access to the model on TTS.ai requires a Hugging Face token.

At a glance

Developer: Sesame
License: Apache 2.0
Tier: premium
Speed: slow
Voice cloning: No
Languages: English
Max characters: 500
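
The 500-character limit above means longer passages have to be split before synthesis. The following is a minimal sketch of one way to do that, assuming the limit applies per request and counts plain characters; the function name `split_for_tts` is hypothetical and not part of any TTS.ai or Sesame API. It prefers sentence boundaries and only hard-wraps when a single sentence exceeds the limit.

```python
import re

MAX_CHARS = 500  # "Max characters" from the listing; assumed to apply per request


def split_for_tts(text: str, limit: int = MAX_CHARS) -> list[str]:
    """Split text into chunks of at most `limit` characters.

    Hypothetical helper: breaks on sentence boundaries where possible and
    falls back to whitespace (or a hard cut) for over-long sentences.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # A single sentence longer than the limit is wrapped on whitespace.
        while len(sentence) > limit:
            cut = sentence.rfind(" ", 0, limit + 1)
            if cut <= 0:
                cut = limit  # no space available; cut mid-word
            piece, sentence = sentence[:cut].rstrip(), sentence[cut:].lstrip()
            if current:
                chunks.append(current)
                current = ""
            chunks.append(piece)
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be submitted as its own request. Because CSM is tuned for conversational timing, keeping whole sentences together is likely to matter more than packing every chunk to the limit.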

Sesame CSM AI Voices

Speaker 0

English
Premium Neutral

Speaker 1

English
Premium Neutral
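
Since the model offers exactly two voices, a two-party dialogue can be scripted as a sequence of turns tagged with speaker 0 or 1. The sketch below is illustrative only: the `Turn` type and `validate_script` function are hypothetical names, and it assumes the 500-character limit applies to each turn individually. It checks a script before any audio is requested.

```python
from dataclasses import dataclass

VALID_SPEAKERS = {0, 1}  # Speaker 0 and Speaker 1 from the voice list
MAX_CHARS = 500          # assumed to apply to each turn separately


@dataclass(frozen=True)
class Turn:
    speaker: int
    text: str


def validate_script(turns: list[Turn]) -> list[str]:
    """Return a list of problems found; an empty list means the script passed."""
    problems: list[str] = []
    for i, turn in enumerate(turns):
        if turn.speaker not in VALID_SPEAKERS:
            problems.append(f"turn {i}: unknown speaker {turn.speaker}")
        if not turn.text.strip():
            problems.append(f"turn {i}: empty text")
        elif len(turn.text) > MAX_CHARS:
            problems.append(
                f"turn {i}: {len(turn.text)} characters exceeds {MAX_CHARS}"
            )
    return problems
```

For example, `validate_script([Turn(0, "Did you see the report?"), Turn(1, "Mm-hm, I did.")])` returns an empty list, while a turn with speaker `2` or more than 500 characters is reported by index.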

Best for

AI assistants, chatbots, conversational AI applications

Sesame CSM TTS — FAQ

Sesame CSM specializes in conversational speech. It models the natural patterns of dialogue (turn-taking timing, backchannel responses, and emotional reactions) so that generated audio sounds like a real conversation rather than synthetic narration.

Architecturally, it is a 1-billion-parameter model built on a Llama backbone, with an audio codec for waveform generation.

It is best suited to AI assistants, chatbots, and other conversational applications where responsive, human-sounding speech matters more than long-form narration.