报告错误/功能要求

TTTS Arena-AI 声音模型板

与20+文本对语音模型相比,官方基准、社区评级和并肩比较。

签署自由

逐边比较

文本类型、选择两个模型并比较结果。自由级模型不需要记账。

模式A

模式B

免费的模特儿在没有账户的情况下工作。签名签名比较溢价模型。

模型板

#	型型	正式官方官方官方官方官方	社区社区	速度速度	级别
1	Kokoro Lightweight 82M parameter model delivering studio-quality speech with blazing-fast inference. 82M 1200h 2024	4.8 /5	5.0 /5 1 vote	fast	Free
2	CosyVoice 2 Alibaba's scalable streaming TTS with human-parity naturalness and near-zero latency. 300M 200000h 2024	4.26 /5	尚未表决	medium	Standard
3	Chatterbox State-of-the-art zero-shot voice cloning with emotion control from Resemble AI. 300M 2025	4.25 /5	尚未表决	medium	Premium
4	StyleTTS 2 Human-level text-to-speech through style diffusion and adversarial training. 100M 585h 2024	4.23 /5	尚未表决	medium	Premium
5	Piper A fast, local neural text to speech system optimized for Raspberry Pi and embedded devices. 15M 2023	4.15 /5	尚未表决	fast	Free
6	MeloTTS High-quality multilingual text-to-speech that runs on CPU with minimal latency. 25M 2024	4.13 /5	尚未表决	fast	Free
7	Dia TTS Multi-speaker dialog generation model that creates natural conversations between speakers. 1.6B 2024	4.09 /5	尚未表决	medium	Standard
8	VITS Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech. 25M 585h 2021	4.0 /5	尚未表决	fast	Free
9	Orpheus Human-level emotional TTS model trained on 100K hours of speech data. 3B 100000h 2025	4.0 /5	尚未表决	medium	Standard
10	OpenVoice Instant voice cloning with granular control over style, emotion, and accent. 300M 2024	4.0 /5	尚未表决	medium	Premium
11	IndexTTS-2 Zero-shot TTS with fine-grained emotion control and high expressiveness. 300M 2025	3.91 /5	尚未表决	medium	Standard
12	Spark TTS Voice cloning TTS with controllable emotion and speaking style via prompts. 500M 2025	3.9 /5	尚未表决	medium	Standard
13	Parler TTS Describe the voice you want in natural language and Parler generates matching speech. 880M 45000h 2024	3.83 /5	尚未表决	medium	Standard
14	Tortoise TTS Multi-voice text-to-speech focused on quality with autoregressive architecture. 400M 50000h 2022	3.7 /5	尚未表决	slow	Premium
15	Bark Transformer-based text-to-audio model that generates realistic speech, music, and sound effects. 350M 100000h 2023	3.57 /5	尚未表决	slow	Standard
16	Bark Small Lighter version of Bark with faster inference and lower memory usage. 150M 100000h 2023	—	尚未表决	medium	Standard
17	GPT-SoVITS Few-shot voice cloning TTS that replicates any voice from just 5 seconds of audio. 200M 2024	—	尚未表决	slow	Standard
18	Qwen3 TTS Alibaba's multilingual TTS with preset voices and voice design from text. 1.7B 2025	—	尚未表决	medium	Standard
19	VieNeu-TTS-v2 Vietnamese + English code-switching TTS with 7 preset voices and zero-shot voice cloning. CPU-only, no GPU required. 0.3B 10000h 2026	—	尚未表决	fast	Standard
20	Sesame CSM Conversational speech model generating natural dialogue with appropriate timing and emotion. 1B 2025	—	尚未表决	slow	Premium
21	Chatterbox Turbo Faster Chatterbox with sub-200ms latency and paralinguistic tags for laughs, coughs, and more. 350M 2025	—	尚未表决	fast	Standard
22	VoxCPM Tokenizer-free TTS producing 44.1kHz audio with context-aware paragraph consistency. 500M 1800000h 2025	—	尚未表决	fast	Standard
23	Kani TTS 2 Ultra-lightweight 400M English TTS model running in just 3GB VRAM. 400M 10000h 2026	—	尚未表决	fast	Free
24	OuteTTS LLM-based TTS that runs on CPU, GPU, or browser via llama.cpp and Transformers.js. 1B 5000h 2025	—	尚未表决	fast	Free
25	VibeVoice Microsoft's multi-speaker long-form TTS generating up to 90 minutes with 4 distinct speakers. 1.5B 100000h 2025	—	尚未表决	fast	Standard
26	Pocket TTS Lightweight 100M parameter model by Kyutai with voice cloning from a single sample. 100M 50000h 2025	—	尚未表决	fast	Free
27	Kitten TTS Ultra-lightweight TTS under 80MB. Runs on CPU without GPU. 80M 2025	—	尚未表决	fast	Free
28	CosyVoice3 Next-generation multilingual TTS with bi-streaming, emotion control, and zero-shot voice cloning. 500M 200000h 2025	—	尚未表决	fast	Standard
29	NAMAA Saudi TTS First open Saudi-Arabic TTS. Native Saudi dialect with Chatterbox-quality voice cloning. 300M 2026	—	尚未表决	medium	Standard
30	Darwin TTS Cross-modal Qwen3-TTS variant with FFN weights blended from the Qwen3-1.7B language model for sharper multilingual cloning. 2.1B 2026	—	尚未表决	medium	Standard
31	MOSS-TTSD Multi-speaker dialogue continuation model — generate podcast-style conversations with up to 5 speakers and 60 minutes of coherent audio. 7B 2026	—	尚未表决	medium	Standard
32	Ming-Omni TTS Compact 0.5B omni-modal speech model from inclusionAI with high-fidelity 44.1kHz output and zero-shot voice cloning. 500M 2026	—	尚未表决	medium	Free
33	MOSS-TTS Nano Tiny 100M MOSS-TTS variant — same architecture, 80x smaller, free-tier latency. 100M 500000h 2026	—	尚未表决	fast	Free