Free AI Text to Speech

82M parameters · Ultra-fast · Expressive voices · Multilingual · Streaming support

Lightweight 82M-parameter model delivering studio-quality speech with blazing-fast inference.

Fast · 1.5GB VRAM

Piper

CPU-friendly · Offline capable · 100+ voices · 30+ languages · SSML support

A fast, local neural text-to-speech system optimized for Raspberry Pi and embedded devices.

Fast · 0GB VRAM (CPU only)

VITS

End-to-end synthesis · Natural prosody · Fast inference · Multiple speakers

Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech.

Fast · 1GB VRAM

MeloTTS

CPU-optimized · Multilingual · Multiple accents · Production-ready · Low latency

High-quality multilingual text-to-speech that runs on CPU with minimal latency.

Fast · 0.5GB VRAM (GPU optional)

Bark

Sound effects · Laughing/sighing · Music generation · 100+ speakers · Multilingual

Transformer-based text-to-audio model that generates realistic speech, music, and sound effects.

Slow · 5GB VRAM

Bark Small

Lightweight · Faster than full Bark · Emotional speech · Multilingual

Lighter version of Bark with faster inference and lower memory usage.

Medium · 2GB VRAM

CosyVoice 2

Streaming · Zero-shot cloning · Cross-lingual · Emotion control · Human-parity

Alibaba's scalable streaming TTS with human-parity naturalness and near-zero latency.

Dia TTS

Multi-speaker · Dialog generation · Natural turn-taking · Emotional expression · 1.6B parameters

Multi-speaker dialog generation model that creates natural conversations between speakers.

Parler TTS

Voice description · Natural language control · Flexible voice creation · No preset voices needed

Describe the voice you want in natural language and Parler generates matching speech.

GLM-TTS

Lowest error rate · Voice cloning · Flow matching · Natural prosody

Achieves the lowest character error rate among open-source TTS models.

IndexTTS-2

Emotion control · Zero-shot · Emotion vectors · Expressive speech · Fine-grained control

Zero-shot TTS with fine-grained emotion control and high expressiveness.

Spark TTS

Voice cloning · Emotion control · Style control · Prompt-based · 5-second cloning

Voice cloning TTS with controllable emotion and speaking style via prompts.

GPT-SoVITS

5-second cloning · Singing voice · Few-shot learning · High fidelity · Cross-lingual

Few-shot voice cloning TTS that replicates any voice from just 5 seconds of audio.

Slow · 6GB VRAM

Orpheus

Human-level emotion · 100K hours training · Natural emphasis · Expressive speech

Human-level emotional TTS model trained on 100K hours of speech data.

Chatterbox

Zero-shot cloning · Emotion control · High fidelity · Style transfer · Single sample cloning

State-of-the-art zero-shot voice cloning with emotion control from Resemble AI.

Tortoise TTS

Highest quality · Multi-voice · DALL-E architecture · Voice cloning · Autoregressive

Multi-voice text-to-speech focused on quality with autoregressive architecture.

Slow · 8GB VRAM

StyleTTS 2

Human-level · Style diffusion · Adversarial training · Natural variation · High fidelity

Human-level text-to-speech through style diffusion and adversarial training.

OpenVoice

Instant cloning · Voice conversion · Emotion control · Accent control · Multilingual

Instant voice cloning with granular control over style, emotion, and accent.

Qwen3 TTS

Voice cloning · 9 preset voices · Voice design from text · Emotion control · 10 languages

Alibaba's multilingual TTS with voice cloning, preset voices, and voice design from text.

Medium · 7GB VRAM

Sesame CSM

Conversational · Natural timing · Turn-taking · Backchannel · 1B parameters

Conversational speech model generating natural dialogue with appropriate timing and emotion.

Slow · 8GB VRAM

Chatterbox Turbo

Sub-200ms latency · Paralinguistic tags · 6x real-time · Voice cloning · Watermarking

Faster Chatterbox with sub-200ms latency and paralinguistic tags for laughs, coughs, and more.

Fast · 2GB VRAM
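
To put the "6x real-time" figure in concrete terms, here is a minimal sketch of the arithmetic. It assumes the figure means audio is generated six times faster than it plays back, which the listing does not spell out. The function name `synthesis_seconds` is hypothetical and not part of any Chatterbox API, and the estimate ignores startup latency.

```python
def synthesis_seconds(audio_seconds: float, realtime_factor: float) -> float:
    """Estimate wall-clock synthesis time for a clip of the given length.

    realtime_factor is the speed multiple relative to playback
    (assumed meaning of "6x real-time": 6 seconds of audio per second).
    """
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_seconds / realtime_factor


# A 60-second clip at the listed 6x real-time speed
print(synthesis_seconds(60, 6))  # 10.0
```

Under that reading, a one-minute clip would take roughly ten seconds to generate on hardware that sustains the listed speed.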

Dia 2

Streaming output · Multi-speaker · Low latency · Paralinguistic cues · Up to 2 min output

Streaming-first conversational TTS with multi-speaker dialogue and paralinguistic cues.

VoxCPM

44.1kHz audio · Tokenizer-free · Cross-lingual cloning · Context-aware · LoRA fine-tuning

Tokenizer-free TTS producing 44.1kHz audio with context-aware paragraph consistency.

OuteTTS

CPU inference · Browser inference · Voice cloning · Multiple backends · Speaker profiles

LLM-based TTS that runs on CPU, GPU, or browser via llama.cpp and Transformers.js.

Fast · 2GB VRAM

TADA

Zero hallucinations · 5x faster than LLM TTS · Emotional expression · 700s audio context · Dual alignment

Zero-hallucination TTS with text-acoustic dual alignment, 5x faster than comparable LLM TTS.

Fast · 5GB VRAM

VibeVoice

Multi-speaker · Long-form (90 min) · Podcast generation · Dialogue · Low latency

Microsoft's multi-speaker long-form TTS generating up to 90 minutes with 4 distinct speakers.

Pocket TTS

100M parameters · CPU inference · Voice cloning · Single-sample cloning · Edge-ready

Lightweight 100M-parameter model by Kyutai with voice cloning from a single sample.

Fast · 1GB VRAM

Kitten TTS

CPU-only inference · Under 80MB model size · 8 built-in voices · Speed control · ONNX-based · 24kHz output

Ultra-lightweight TTS under 80MB. Runs on CPU without GPU.

Fast · 0GB VRAM

CosyVoice3

Bi-streaming · Emotion control · Voice cloning · Speed/volume control · Instruction following

Next-generation multilingual TTS with bi-streaming, emotion control, and zero-shot voice cloning.

MOSS-TTS

Ultra-long generation · 20 languages · Voice cloning · Duration control · Pronunciation control · Code-switching

Ultra-long 20-language TTS supporting up to 1 hour of continuous generation with phoneme-level control.

Medium · 16GB VRAM

MegaTTS3