Free AI Text to Speech

82M parameters · Ultra-fast · Expressive voices · Multilingual · Streaming support

Lightweight 82M parameter model delivering studio-quality speech with blazing-fast inference.

Fast · 1.5GB VRAM

Piper

CPU-friendly · Offline capable · 100+ voices · 30+ languages · SSML support

A fast, local neural text to speech system optimized for Raspberry Pi and embedded devices.

Fast · CPU only (no VRAM required)

VITS

End-to-end synthesis · Natural prosody · Fast inference · Multiple speakers

Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech.

Fast · 1GB VRAM

MeloTTS

CPU-optimized · Multilingual · Multiple accents · Production-ready · Low latency

High-quality multilingual text-to-speech that runs on CPU with minimal latency.

Fast · 0.5GB VRAM (GPU optional)

Bark

Sound effects · Laughing/sighing · Music generation · 100+ speakers · Multilingual

Transformer-based text-to-audio model that generates realistic speech, music, and sound effects.

Slow · 5GB VRAM

Bark Small

Lightweight · Faster than full Bark · Emotional speech · Multilingual

Lighter version of Bark with faster inference and lower memory usage.

Medium · 2GB VRAM

CosyVoice 2

Streaming · Zero-shot cloning · Cross-lingual · Emotion control · Human-parity

Alibaba's scalable streaming TTS with human-parity naturalness and near-zero latency.

Dia TTS

Multi-speaker · Dialog generation · Natural turn-taking · Emotional expression · 1.6B parameters

Multi-speaker dialog generation model that creates natural conversations between speakers.

Parler TTS

Voice description · Natural language control · Flexible voice creation · No preset voices needed

Describe the voice you want in natural language and Parler generates matching speech.

IndexTTS-2

Emotion control · Zero-shot · Emotion vectors · Expressive speech · Fine-grained control

Zero-shot TTS with fine-grained emotion control and high expressiveness.

Spark TTS

Voice cloning · Emotion control · Style control · Prompt-based · 5-second cloning

Voice cloning TTS with controllable emotion and speaking style via prompts.

GPT-SoVITS

5-second cloning · Singing voice · Few-shot learning · High fidelity · Cross-lingual

Few-shot voice cloning TTS that replicates any voice from just 5 seconds of audio.

Slow · 6GB VRAM

Orpheus

Human-level emotion · 100K hours training · Natural emphasis · Expressive speech

Human-level emotional TTS model trained on 100K hours of speech data.

Chatterbox

Zero-shot cloning · Emotion control · High fidelity · Style transfer · Single sample cloning

State-of-the-art zero-shot voice cloning with emotion control from Resemble AI.

Tortoise TTS

Highest quality · Multi-voice · DALL-E architecture · Voice cloning · Autoregressive

Multi-voice text-to-speech that prioritizes quality, built on an autoregressive architecture.

Slow · 8GB VRAM

StyleTTS 2

Human-level · Style diffusion · Adversarial training · Natural variation · High fidelity

Human-level text-to-speech through style diffusion and adversarial training.

OpenVoice

Instant cloning · Voice conversion · Emotion control · Accent control · Multilingual

Instant voice cloning with granular control over style, emotion, and accent.

Qwen3 TTS

9 preset voices · Voice design from text · Emotion control · 10 languages

Alibaba's multilingual TTS with preset voices and voice design from text.

Medium · 7GB VRAM

VieNeu-TTS-v2

7 preset voices (North + South accents) · En-Vi code-switching · Voice cloning (3-5s reference) · Podcast / multi-speaker support · CPU-only, no GPU required

Vietnamese + English code-switching TTS with 7 preset voices and zero-shot voice cloning. CPU-only, no GPU required.

Fast · CPU only (no VRAM required)

Sesame CSM

Conversational · Natural timing · Turn-taking · Backchannel · 1B parameters

Conversational speech model generating natural dialogue with appropriate timing and emotion.

Slow · 8GB VRAM

Chatterbox Turbo

Sub-200ms latency · Paralinguistic tags · 6x real-time · Voice cloning · Watermarking

Faster Chatterbox with sub-200ms latency and paralinguistic tags for laughs, coughs, and more.

Fast · 2GB VRAM

VoxCPM

44.1kHz audio · Tokenizer-free · Cross-lingual cloning · Context-aware · LoRA fine-tuning

Tokenizer-free TTS producing 44.1kHz audio with context-aware paragraph consistency.

Fast · 4GB VRAM

Kani TTS 2

3GB VRAM · Ultra-fast · Lightweight · NanoCodec · Free

Ultra-lightweight 400M-parameter English TTS model that runs in just 3GB of VRAM.

Fast · 3GB VRAM

OuteTTS

CPU inference · Browser inference · Voice cloning · Multiple backends · Speaker profiles

LLM-based TTS that runs on CPU, GPU, or browser via llama.cpp and Transformers.js.

Fast · 2GB VRAM

VibeVoice

Multi-speaker · Long-form (90 min) · Podcast generation · Dialogue · Low latency

Microsoft's multi-speaker long-form TTS, generating up to 90 minutes of audio with 4 distinct speakers.

Fast · 4GB VRAM

Pocket TTS

100M parameters · CPU inference · Voice cloning · Single-sample cloning · Edge-ready

Lightweight 100M parameter model by Kyutai with voice cloning from a single sample.

Fast · 1GB VRAM

Kitten TTS

CPU-only inference · Under 80MB model size · 8 built-in voices · Speed control · ONNX-based · 24kHz output

Ultra-lightweight TTS under 80MB. Runs on CPU without GPU.

Fast · 0GB VRAM

CosyVoice3

Bi-streaming · Emotion control · Voice cloning · Speed/volume control · Instruction following

Next-generation multilingual TTS with bi-streaming, emotion control, and zero-shot voice cloning.

Fast · 4GB VRAM

NAMAA Saudi TTS

Saudi Arabic dialect · Modern Standard Arabic · Zero-shot voice cloning · Emotion control · Native pronunciation

First open Saudi-Arabic TTS. Native Saudi dialect with Chatterbox-quality voice cloning.

Medium · 6GB VRAM

Darwin TTS

Voice cloning · Cross-lingual · FFN-blended · 4 core languages · Qwen3 backbone

Cross-modal Qwen3-TTS variant with FFN weights blended from the Qwen3-1.7B language model for sharper multilingual cloning.

Medium · 7GB VRAM

MOSS-TTSD

Multi-speaker dialogue · Up to 5 speakers · 60min coherent audio · Voice cloning · Podcast-optimised

Multi-speaker dialogue continuation model that generates podcast-style conversations with up to 5 speakers and 60 minutes of coherent audio.

Medium · 12GB VRAM

Ming-Omni TTS

44.1kHz output · Voice cloning · Emotion control · Dialect control · BGM generation · Compact 0.5B

Compact 0.5B omni-modal speech model from inclusionAI with high-fidelity 44.1kHz output and zero-shot voice cloning.

Medium · 3GB VRAM

MOSS-TTS Nano