MOSS-TTSD TTS
A 7B dialogue model that continues conversations from an audio prompt, with up to five speakers and up to 60 minutes of coherent audio.
MOSS-TTSD v1.0 from OpenMOSS is a 7-billion-parameter dialogue text-to-speech model that continues a conversation from a short audio prompt rather than reading isolated lines. It handles up to five speakers in a single dialogue, marked with [S1]/[S2]-style tags, clones voices zero-shot from 3-to-10-second reference clips, and generates coherent multi-turn dialogue up to 60 minutes long. It is distinct from the OpenMOSS MOSS-TTS model: the TTSD variant is specialized for podcast, audiobook, and dubbing workflows where long, consistent conversational audio is the goal. It is released under Apache 2.0 and, given its size, needs around 12GB of VRAM.
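As an illustration of the [S1]/[S2]-style tagging described above, the following minimal sketch assembles a tagged script from a list of turns and rejects speaker numbers beyond five. The helper name `build_script`, the 1-based speaker numbering, and the choice to concatenate turns without separators are assumptions made for this example; they are not taken from the MOSS-TTSD API, so check the model's own documentation for the exact input format.

```python
MAX_SPEAKERS = 5  # upper bound stated in the description above


def build_script(turns):
    """Join (speaker_number, text) pairs into one [S1]/[S2]-style tagged string.

    Hypothetical helper: the tag layout is an assumption, not the official format.
    """
    parts = []
    for speaker, text in turns:
        if not 1 <= speaker <= MAX_SPEAKERS:
            raise ValueError(f"speaker {speaker} is outside 1..{MAX_SPEAKERS}")
        # Collapse runs of whitespace so each turn stays on one logical line.
        cleaned = " ".join(text.split())
        if not cleaned:
            raise ValueError("empty turn text")
        parts.append(f"[S{speaker}]{cleaned}")
    return "".join(parts)


script = build_script([
    (1, "Welcome back to the show."),
    (2, "Thanks, it's great to be here."),
])
print(script)
```

Keeping the dialogue as structured turns until the last moment makes it easy to validate speaker counts before any audio is generated.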
At a glance
- Developer: OpenMOSS
- License: Apache 2.0
- Tier: standard
- Speed: medium
- Voice cloning: Yes
- Languages: English, Chinese
- Max characters: 5000
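A script longer than the 5000-character maximum has to be divided before submission. The sketch below packs whole speaker turns into chunks that each stay under the limit, assuming the limit applies per request and counts the tags themselves; both are assumptions, as is the helper name `split_script`. Voice consistency across separately generated chunks is not guaranteed by this approach.

```python
import re


def split_script(script, limit=5000):
    """Split a tagged script into chunks of at most `limit` characters.

    Hypothetical helper: chunks break only at [S<n>] turn boundaries.
    """
    # The zero-width lookahead splits just before each tag and keeps the tag
    # attached to its turn.
    turns = [t for t in re.split(r"(?=\[S\d+\])", script) if t]
    chunks, current = [], ""
    for turn in turns:
        if len(turn) > limit:
            raise ValueError("a single turn exceeds the character limit")
        if current and len(current) + len(turn) > limit:
            chunks.append(current)
            current = turn
        else:
            current += turn
    if current:
        chunks.append(current)
    return chunks


demo = "[S1]aaaa[S2]bbbb[S1]cc"
print(split_script(demo, limit=16))
```

Breaking only at turn boundaries keeps each speaker's line intact, at the cost of rejecting any single turn that is longer than the limit.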
Best for
Podcasts, audiobooks, dubbed dialogue, conversational content with multiple voices