MOSS-TTSD

MOSS-TTSD TTS

A 7B dialogue model that continues conversations from an audio prompt — up to five speakers and 60 minutes of coherent audio.

MOSS-TTSD v1.0 from OpenMOSS is a 7-billion-parameter dialogue text-to-speech model that continues a conversation from a short audio prompt rather than reading isolated lines. It supports up to five speakers in one conversation via [S1]/[S2]-style tags, zero-shot voice cloning from reference clips of 3 to 10 seconds, and coherent multi-turn dialogue up to 60 minutes long. It is distinct from the OpenMOSS MOSS-TTS model: the TTSD variant is specialized for podcast, audiobook, and dubbing workflows where long, consistent conversational audio is the goal. It is released under Apache 2.0 and needs around 12 GB of VRAM.
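As a rough illustration of the [S1]/[S2]-style tagging described above, the sketch below assembles a tagged script from a list of turns and rejects speaker numbers outside 1 to 5. The helper name `build_script` is hypothetical, and joining turns with no separator is an assumption; the exact input format the model expects should be checked against the official documentation.

```python
MAX_SPEAKERS = 5  # speaker limit stated in the description above


def build_script(turns):
    """Join (speaker_number, text) pairs into one [S<n>]-tagged string.

    Hypothetical helper: the tag style follows the listing, but the lack
    of a separator between turns is an assumption.
    """
    if not turns:
        raise ValueError("script has no turns")
    speakers = {speaker for speaker, _ in turns}
    # Restricting numbers to 1..MAX_SPEAKERS also caps the speaker count.
    if min(speakers) < 1 or max(speakers) > MAX_SPEAKERS:
        raise ValueError(
            f"speaker numbers must be between 1 and {MAX_SPEAKERS}"
        )
    return "".join(f"[S{speaker}]{text.strip()}" for speaker, text in turns)


if __name__ == "__main__":
    print(build_script([(1, "Welcome back to the show."),
                        (2, "Glad to be here.")]))
```

Keeping the turns as structured pairs until the last step makes it easy to validate speaker numbers before any text is sent for synthesis.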

At a glance

Developer: OpenMOSS
License: Apache 2.0
Tier: standard
Speed: medium
Voice cloning: Yes
Languages: English, Chinese
Max characters: 5000
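
A long script may exceed the 5000-character limit listed above. The sketch below is one possible way to split a tagged script into chunks that stay under the limit while breaking only at speaker-tag boundaries. The function name `split_script` is hypothetical, and it is an assumption that the limit applies per request and counts the tags; whether voices stay consistent across separately synthesized chunks is not stated in this listing.

```python
import re

MAX_CHARS = 5000  # "Max characters" value from the listing


def split_script(script, max_chars=MAX_CHARS):
    """Split a tagged script into chunks of at most max_chars characters.

    Breaks happen only immediately before a speaker tag such as [S1],
    so no turn is cut in half.
    """
    # Zero-width split before each tag keeps the tag attached to its turn.
    turns = [t for t in re.split(r"(?=\[S\d+\])", script) if t]
    chunks, current = [], ""
    for turn in turns:
        if len(turn) > max_chars:
            raise ValueError("a single turn exceeds the character limit")
        if current and len(current) + len(turn) > max_chars:
            chunks.append(current)
            current = turn
        else:
            current += turn
    if current:
        chunks.append(current)
    return chunks
```

Joining the returned chunks reproduces the original script exactly, which makes the split easy to verify before synthesis.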

MOSS-TTSD AI Voices

Default (Chinese)

Chinese
Standard Neutral

Default Speaker

English
Standard Neutral

Best for

Podcasts, audiobooks, dubbed dialogue, conversational content with multiple voices

MOSS-TTSD TTS — FAQ

MOSS-TTSD supports up to five simultaneous speakers, addressed via speaker tags like [S1] and [S2], with the ability to clone each voice from a short reference clip.

MOSS-TTSD can produce up to 60 minutes of coherent multi-turn dialogue, which is what makes it suited to full podcast episodes and audiobook chapters rather than short clips.

MOSS-TTSD is a dialogue-specialized variant that continues conversations from an audio prompt and targets podcast, audiobook, and dubbing workflows, whereas the base MOSS-TTS is a general single-voice synthesis model.