Sesame CSM TTS
A 1B conversational speech model that captures natural dialogue timing, turn-taking, and backchannel responses.
Sesame CSM (Conversational Speech Model) is a 1-billion-parameter model from Sesame designed specifically for the rhythms of human conversation. Built on a Llama backbone paired with an audio codec, it models turn-taking timing, backchannel responses (the small acknowledgements people make while listening), emotional reactions, and overall conversational flow. The result sounds less like text read aloud and more like a real spoken exchange. It is a natural fit for AI assistants, chatbots, and conversational interfaces where the goal is speech that feels responsive and human. CSM is released under Apache 2.0, and access on TTS.ai requires a Hugging Face token at the model level.
At a glance
- Developer: Sesame
- License: Apache 2.0
- Tier: premium
- Speed: slow
- Voice cloning: No
- Languages: English
- Max characters: 500
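Because of the 500-character maximum listed above, longer scripts have to be split before they are submitted. The following is a minimal sketch of one way to do that, assuming the limit applies per request and that breaking at sentence boundaries is acceptable. The function name `split_for_tts` and the constant `MAX_CHARS` are hypothetical helpers for illustration, not part of any Sesame or TTS.ai API.

```python
import re

# Assumption: the 500-character maximum from "At a glance" applies per request.
MAX_CHARS = 500


def split_for_tts(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Split text into chunks of at most max_chars, preferring sentence breaks."""
    # Break after sentence-ending punctuation that is followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # A single sentence longer than the limit is hard-wrapped,
        # which may cut it mid-word.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        # Grow the current chunk while it still fits; otherwise start a new one.
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be sent as its own request. Splitting at sentence boundaries keeps each chunk a complete utterance, which may matter for a model that is sensitive to conversational timing.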
Sesame CSM AI Voices
Best for
AI assistants, chatbots, conversational AI applications