AI Text to Speech
Convert text to natural-sounding speech using open-source AI models. Free to use, no account required.
Wrap your text in SSML tags for precise control:
<speak><prosody rate="slow">Slow speech</prosody></speak>
Add style options to influence the output (support varies by model):
Built-in pronunciation handling: numbers, currencies, and units are expanded automatically.
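The SSML shown above can also be assembled programmatically. A minimal sketch in Python using only standard SSML elements (`prosody`, `break`); which elements a given model actually honors varies, so treat this as illustrative:

```python
from xml.sax.saxutils import escape

def to_ssml(sentences, rate="medium", pause_ms=400):
    """Wrap plain sentences in a simple SSML document.

    `rate` maps to <prosody rate>, and a <break> is inserted
    between sentences to force an explicit pause.
    """
    brk = f'<break time="{pause_ms}ms"/>'
    body = brk.join(escape(s) for s in sentences)
    return f'<speak><prosody rate="{rate}">{body}</prosody></speak>'

ssml = to_ssml(["Hello there.", "Welcome to the demo."], rate="slow")
```

Escaping the raw text (`&`, `<`, `>`) before wrapping it keeps the SSML well-formed even when the input contains markup-like characters.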
Kitten TTS
Kitten TTS by KittenML is an ultra-lightweight text-to-speech model built on ONNX. With variants from 15M to 80M parameters (25-80 MB on disk), it delivers high-quality voice synthesis on CPU without requiring a GPU. Features 8 built-in voices, adjustable speech speed, and built-in text preprocessing for numbers, currencies, and units. Ideal for edge deployment and low-latency applications.
| Developer: | KittenML |
| License: | Apache 2.0 |
| Speed: | Fast |
| Tier: | Free |
| Languages: | 1 (English) |
| VRAM: | 0GB |
| Voice cloning: | Not supported |
| Cost: | Free |
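The card above mentions built-in text preprocessing for numbers, currencies, and units. A toy illustration of what that kind of normalization does; KittenML's actual rules are internal to the model and will differ:

```python
import re

# Toy lookup; a real normalizer handles arbitrary numbers, dates, etc.
SMALL_NUMBERS = {"1": "one", "2": "two", "3": "three", "4": "four",
                 "5": "five", "10": "ten"}

def spell(n):
    return SMALL_NUMBERS.get(n, n)

def normalize_text(text):
    """Expand currency and unit patterns into speakable words."""
    text = re.sub(r"\$(\d+)", lambda m: f"{spell(m.group(1))} dollars", text)
    text = re.sub(r"(\d+)\s?km\b", lambda m: f"{spell(m.group(1))} kilometers", text)
    return text
```

Running raw text through a pass like this is what lets a TTS model read "$5" as "five dollars" instead of spelling out the symbol.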
Tips for better results
- Use punctuation to guide pauses and intonation
- Spell out tricky words and mark pauses explicitly for clearer synthesis
- Add commas to create pauses between phrases
- Use ellipses (...) for longer pauses
- Use Kokoro or CosyVoice 2 for the most natural results
- Use Dia for multi-speaker dialogue and podcast-style content
Character pricing
| Tier | Rate per 1K characters |
|---|---|
| Free | 1:1 (free) |
| Standard | 2x characters |
| Premium | 4x characters |
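The multipliers above can be applied directly when budgeting a job. A small helper, assuming Free bills 1:1 while Standard and Premium bill 2x and 4x the input length, as in the table:

```python
# Characters billed per input character, by tier (per the pricing table).
MULTIPLIER = {"free": 1, "standard": 2, "premium": 4}

def billed_characters(text, tier):
    """Characters deducted from your quota for one generation."""
    return len(text) * MULTIPLIER[tier.lower()]

billed_characters("Hello world", "standard")  # 11 chars * 2
```

With a 15,000-character allowance, that would cover 15,000 input characters on Free models, 7,500 on Standard, or 3,750 on Premium.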
How to convert text to speech with AI
Create a polished voiceover in three simple steps. No technical expertise required.
Add your text
Type, paste, or upload the text you want to convert to speech. Supports up to 5,000 characters per generation for signed-in users. Use plain text, or add SSML tags for fine control over emphasis, pauses, and pronunciation.
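Longer documents need to be split to fit within the 5,000-character limit mentioned above. A sketch that splits on sentence boundaries (the limit is from the step above; the splitting strategy is our own and a sentence longer than the limit is kept whole):

```python
import re

LIMIT = 5000  # characters per generation for signed-in users

def chunk_text(text, limit=LIMIT):
    """Split text into chunks under `limit`, breaking at sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + 1 + len(s) > limit:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip() if current else s
    if current:
        chunks.append(current)
    return chunks
```

Breaking at sentence ends rather than at a hard character offset keeps each generated clip prosodically natural, so the chunks can be concatenated without mid-sentence seams.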
Choose a voice and model
Choose from 20+ AI models across three tiers. Pick a voice that suits your content, select your target language, adjust speech speed from 0.5x to 2.0x, and pick your output format (MP3, WAV, OGG, or FLAC).
Generate and download
Click Generate and your audio is ready in seconds. Preview it in the built-in player, download it in your chosen format, or copy a shareable link. Use the API for batch processing and integration into your own workflows.
Text-to-speech use cases
AI-powered text to speech is changing how people create, consume, and collaborate on audio content across many industries.
Every text-to-speech model, compared
Complete specifications for every AI model available on TTS.ai. Compare quality, speed, language support, and features to find the right model for your project.
Kokoro
Free
Kokoro is an 82 million parameter text-to-speech model that punches well above its weight class. Despite its tiny size, it produces remarkably natural and expressive speech. Kokoro supports multiple languages including English, Japanese, Chinese, and Korean with a variety of expressive voices. It runs incredibly fast, generating audio nearly 100x faster than real-time on a GPU.
| Developer: | Hexgrad |
| License: | Apache 2.0 |
| Speed: | Fast |
| Languages: | en, ja, zh, ko, fr, de, it, pt, es, hi, ru |
| VRAM: | 1.5GB |
| Voice cloning: | No |
| Cost: | Free |
Piper
Free
Piper is a lightweight text-to-speech engine developed by Rhasspy that uses VITS and larynx architectures. It runs entirely on CPU, making it ideal for edge devices, home automation, and applications requiring offline TTS. With over 100 voices across 30+ languages, Piper delivers natural-sounding speech at real-time speeds even on a Raspberry Pi 4.
| Developer: | Rhasspy |
| License: | MIT |
| Speed: | Fast |
| Languages: | en, de, fr, es, it, pt, nl, pl, ru, zh, ja, ko, ar, cs, da, fi, el, hu, is, ka, kk, ne, no, ro, sk, sr, sv, sw, tr, uk, vi |
| VRAM: | 0 (CPU only) |
| Voice cloning: | No |
| Cost: | Free |
VITS
Free
VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) is a parallel end-to-end TTS method that generates more natural sounding audio than current two-stage models. It adopts variational inference augmented with normalizing flows and an adversarial training process, achieving a significant improvement in naturalness.
| Developer: | Jaehyeon Kim et al. |
| License: | MIT |
| Speed: | Fast |
| Languages: | en, zh, ja, ko |
| VRAM: | 1GB |
| Voice cloning: | No |
| Cost: | Free |
MeloTTS
Free
MeloTTS by MyShell.ai is a multilingual TTS library supporting English (American, British, Indian, Australian), Spanish, French, Chinese, Japanese, and Korean. It is extremely fast, processing text at near real-time speed on CPU alone. MeloTTS is designed for production use and supports both CPU and GPU inference.
| Developer: | MyShell.ai |
| License: | MIT |
| Speed: | Fast |
| Languages: | en, es, fr, zh, ja, ko |
| VRAM: | 0.5GB (GPU optional) |
| Voice cloning: | No |
| Cost: | Free |
Bark
Standard
Bark by Suno is a transformer-based text-to-audio model that can generate highly realistic, multilingual speech as well as other audio like music, background noise, and sound effects. It can produce nonverbal communications like laughing, sighing, and crying. Bark supports over 100 speaker presets and 13+ languages.
| Developer: | Suno |
| License: | MIT |
| Speed: | Slow |
| Languages: | en, zh, fr, de, hi, it, ja, ko, pl, pt, ru, es, tr |
| VRAM: | 5GB |
| Voice cloning: | No |
| Cost: | 2x |
Bark Small
Standard
Bark Small is a distilled version of the Bark model that trades some audio quality for significantly faster inference speeds and lower memory requirements. It retains Bark's ability to generate speech with emotions, laughter, and multiple languages.
| Developer: | Suno |
| License: | MIT |
| Speed: | Medium |
| Languages: | en, zh, fr, de, hi, it, ja, ko, pl, pt, ru, es, tr |
| VRAM: | 2GB |
| Voice cloning: | No |
| Cost: | 2x |
CosyVoice 2
Standard
CosyVoice 2 by Alibaba's Tongyi Lab achieves human-comparable speech quality with extremely low latency, making it ideal for real-time applications. It uses a finite scalar quantization approach for streaming synthesis and supports zero-shot voice cloning, cross-lingual synthesis, and fine-grained emotion control. It outperforms many commercial TTS systems in subjective evaluations.
| Developer: | Alibaba (Tongyi Lab) |
| License: | Apache 2.0 |
| Speed: | Medium |
| Languages: | en, zh, ja, ko, fr, de, it, es |
| VRAM: | 4GB |
| Voice cloning: | Yes |
| Cost: | 2x |
Dia TTS
Standard
Dia by Nari Labs is a 1.6B parameter text-to-speech model designed specifically for generating multi-speaker dialogue. It can produce natural-sounding conversations between two speakers with appropriate turn-taking, prosody, and emotional expression. Dia is perfect for creating podcast-style content, audiobook dialogues, and interactive conversational AI.
| Developer: | Nari Labs |
| License: | Apache 2.0 |
| Speed: | Medium |
| Languages: | en |
| VRAM: | 4GB |
| Voice cloning: | No |
| Cost: | 2x |
Parler TTS
Standard
Parler TTS is a text-to-speech model that uses natural language voice descriptions to control the generated speech. Instead of selecting from preset voices, you describe the voice you want (e.g., "a warm female voice with a slight British accent, speaking slowly and clearly") and Parler generates speech matching that description. This makes it uniquely flexible for creative applications.
| Developer: | Hugging Face |
| License: | Apache 2.0 |
| Speed: | Medium |
| Languages: | en |
| VRAM: | 4GB |
| Voice cloning: | No |
| Cost: | 2x |
GLM-TTS
Standard
GLM-TTS by Zhipu AI is a text-to-speech system built on the Llama architecture with flow matching. It achieves the lowest character error rate among open-source TTS models, meaning it produces the most accurate pronunciation. GLM-TTS supports English and Chinese with voice cloning from 3-10 second audio samples.
| Developer: | Zhipu AI |
| License: | GLM-4 License |
| Speed: | Medium |
| Languages: | en, zh |
| VRAM: | 4GB |
| Voice cloning: | Yes |
| Cost: | 2x |
IndexTTS-2
Standard
IndexTTS-2 is an advanced text-to-speech system that excels at zero-shot voice synthesis with fine-grained emotion control. It can generate speech with specific emotional tones like happy, sad, angry, or fearful without requiring emotion-specific training data. The model uses emotion vectors to precisely control the emotional expression of generated speech.
| Developer: | Index Team |
| License: | Bilibili Model License |
| Speed: | Medium |
| Languages: | en, zh |
| VRAM: | 4GB |
| Voice cloning: | Yes |
| Cost: | 2x |
Spark TTS
Standard
Spark TTS by SparkAudio is a text-to-speech model that combines voice cloning with controllable emotion and speaking style. Using just 5 seconds of reference audio, it can clone a voice and then generate speech with different emotions, speeds, and styles while maintaining the cloned voice identity. Spark TTS uses a prompt-based control system.
| Developer: | SparkAudio |
| License: | CC BY-NC-SA 4.0 |
| Speed: | Medium |
| Languages: | en, zh |
| VRAM: | 4GB |
| Voice cloning: | Yes |
| Cost: | 2x |
GPT-SoVITS
Standard
GPT-SoVITS combines GPT-style language modeling with SoVITS (Singing Voice Inference via Translation and Synthesis) for powerful few-shot voice cloning. With as little as 5 seconds of reference audio, it can accurately clone a voice and generate new speech while preserving the speaker's unique characteristics. It excels at both speaking and singing voice synthesis.
| Developer: | RVC-Boss |
| License: | MIT |
| Speed: | Slow |
| Languages: | en, zh, ja, ko |
| VRAM: | 6GB |
| Voice cloning: | Yes |
| Cost: | 2x |
Orpheus
Standard
Orpheus is a large-scale text-to-speech model that achieves human-level emotional expression. Trained on over 100,000 hours of diverse speech data, it excels at generating speech with natural emotions, emphasis, and speaking styles. Orpheus can produce speech that is virtually indistinguishable from human recordings.
| Developer: | Canopy Labs |
| License: | Llama 3.2 Community |
| Speed: | Medium |
| Languages: | en |
| VRAM: | 4GB |
| Voice cloning: | No |
| Cost: | 2x |
Chatterbox
Premium
Chatterbox by Resemble AI is a cutting-edge zero-shot voice cloning model. It can clone any voice from a single short audio sample with high fidelity, preserving not just timbre but also speaking style and intonation. Chatterbox also offers expressiveness controls, letting you adjust how much emotion comes through in the generated voice.
| Developer: | Resemble AI |
| License: | MIT |
| Speed: | Medium |
| Languages: | en |
| VRAM: | 4GB |
| Voice cloning: | Yes |
| Cost: | 4x |
Tortoise TTS
Premium
Tortoise TTS is a text-to-speech system that prioritizes quality over speed. It uses a DALL-E-inspired architecture to generate speech with excellent prosody and close speaker likeness. Although inference is slow, Tortoise produces some of the most natural-sounding speech in the open-source ecosystem.
| Developer: | James Betker |
| License: | Apache 2.0 |
| Speed: | Slow |
| Languages: | en |
| VRAM: | 8GB |
| Voice cloning: | Yes |
| Cost: | 4x |
StyleTTS 2
Premium
StyleTTS 2 achieves human-level TTS by combining style diffusion with adversarial training using large speech language models. It delivers speech that rivals human recordings in naturalness across a range of speaker styles. StyleTTS 2 uses diffusion-based style modeling to capture the full diversity of human speech.
| Developer: | Columbia University |
| License: | MIT |
| Speed: | Medium |
| Languages: | en |
| VRAM: | 4GB |
| Voice cloning: | No |
| Cost: | 4x |
OpenVoice
Premium
OpenVoice by MyShell.ai enables instant voice cloning from a short reference clip, with fine-grained control over style attributes such as emotion, accent, rhythm, pauses, and intonation. It also supports cross-lingual voice cloning, applying a cloned voice to languages not present in the reference audio.
| Developer: | MyShell.ai / MIT |
| License: | MIT |
| Speed: | Medium |
| Languages: | en, zh, ja, ko, fr, de, es, it |
| VRAM: | 4GB |
| Voice cloning: | Yes |
| Cost: | 4x |
Qwen3 TTS
Standard
Qwen3-TTS is a 1.7 billion parameter text-to-speech model from Alibaba's Qwen team. It supports three modes: preset voices with emotion control (9 speakers), voice cloning from just 3 seconds of audio, and a unique voice design mode where you describe the voice you want in natural language. It covers 10 languages with high expressiveness and natural prosody.
| Developer: | Alibaba (Qwen) |
| License: | Apache 2.0 |
| Speed: | Medium |
| Languages: | en, zh, ja, ko, de, fr, ru, pt, es, it |
| VRAM: | 7GB |
| Voice cloning: | Yes |
| Cost: | 2x |
Sesame CSM
Premium
Sesame CSM (Conversational Speech Model) is a 1 billion parameter model designed specifically for generating conversational speech. It models the dynamics of human conversation, including turn-taking, backchannel responses, emotion, and intonation, to produce synthetic speech that sounds like a natural human conversation.
| Developer: | Sesame |
| License: | Apache 2.0 |
| Speed: | Slow |
| Languages: | en |
| VRAM: | 8GB |
| Voice cloning: | No |
| Cost: | 4x |
Model Comparison Table
| Model | Developer | Tier | Speed | Languages | VRAM | Voice cloning | License | Cost per 1K characters |
|---|---|---|---|---|---|---|---|---|
| Kokoro | Hexgrad | Free | Fast | 11 | 1.5GB | No | Apache 2.0 | Free |
| Piper | Rhasspy | Free | Fast | 31 | 0 (CPU only) | No | MIT | Free |
| VITS | Jaehyeon Kim et al. | Free | Fast | 4 | 1GB | No | MIT | Free |
| MeloTTS | MyShell.ai | Free | Fast | 6 | 0.5GB (GPU optional) | No | MIT | Free |
| Bark | Suno | Standard | Slow | 13 | 5GB | No | MIT | 2x |
| Bark Small | Suno | Standard | Medium | 13 | 2GB | No | MIT | 2x |
| CosyVoice 2 | Alibaba (Tongyi Lab) | Standard | Medium | 8 | 4GB | Yes | Apache 2.0 | 2x |
| Dia TTS | Nari Labs | Standard | Medium | 1 | 4GB | No | Apache 2.0 | 2x |
| Parler TTS | Hugging Face | Standard | Medium | 1 | 4GB | No | Apache 2.0 | 2x |
| GLM-TTS | Zhipu AI | Standard | Medium | 2 | 4GB | Yes | GLM-4 License | 2x |
| IndexTTS-2 | Index Team | Standard | Medium | 2 | 4GB | Yes | Bilibili Model License | 2x |
| Spark TTS | SparkAudio | Standard | Medium | 2 | 4GB | Yes | CC BY-NC-SA 4.0 | 2x |
| GPT-SoVITS | RVC-Boss | Standard | Slow | 4 | 6GB | Yes | MIT | 2x |
| Orpheus | Canopy Labs | Standard | Medium | 1 | 4GB | No | Llama 3.2 Community | 2x |
| Chatterbox | Resemble AI | Premium | Medium | 1 | 4GB | Yes | MIT | 4x |
| Tortoise TTS | James Betker | Premium | Slow | 1 | 8GB | Yes | Apache 2.0 | 4x |
| StyleTTS 2 | Columbia University | Premium | Medium | 1 | 4GB | No | MIT | 4x |
| OpenVoice | MyShell.ai / MIT | Premium | Medium | 8 | 4GB | Yes | MIT | 4x |
| Qwen3 TTS | Alibaba (Qwen) | Standard | Medium | 10 | 7GB | Yes | Apache 2.0 | 2x |
| Sesame CSM | Sesame | Premium | Slow | 1 | 8GB | No | Apache 2.0 | 4x |
| Kitten TTS | KittenML | Free | Fast | 1 | 0GB | No | Apache 2.0 | Free |
Text-to-speech platform features
Why choose TTS.ai for text to speech?
TTS.ai brings the world's best open-source speech engines together in one easy-to-use platform. Unlike services that lock you into a single speech engine, TTS.ai gives you access to 20+ models from leading research labs including Coqui, MyShell, Amphion, NVIDIA, Suno, HuggingFace, Tsinghua University, and more.
Every model is open source under MIT, Apache 2.0, or a similarly permissive license, giving you full rights to use the generated audio in your projects. Whether you need fast synthesis for real-time applications or studio-quality narration for audiobooks and podcasts, TTS.ai has the right model for the job.
Free Models, No Account Required
Get started instantly with three free TTS models: Piper (ultra-fast, lightweight), VITS (high-quality neural synthesis), and MeloTTS (multi-language support). No registration, no credit card, no limits on generations. The free models support English and a variety of other languages with natural-sounding output suitable for most use cases.
GPU-Accelerated Performance
All synthesis runs on NVIDIA GPUs for consistently fast generation times. Free models typically finish in under 2 seconds. Standard models such as CosyVoice 2 and Bark average 3-5 seconds. Premium models take 5-15 seconds depending on text length.
30+ Languages
Generate speech in over 30 languages including English, Spanish, French, German, Italian, Portuguese, Chinese, Japanese, Korean, Arabic, Hindi, Russian, and many more. Several models support cross-lingual synthesis, meaning you can generate speech in a language the original voice never spoke. CosyVoice 2 and GPT-SoVITS excel at cross-language voice cloning.
Developer-Ready API
Integrate TTS.ai into your applications through an OpenAI-compatible REST API. One endpoint for all 20+ models. Python, JavaScript, cURL, and Go SDKs. Streaming support for real-time applications. Batch processing for large-scale content generation. Webhooks for async notifications. Available on Pro and Enterprise plans.
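An OpenAI-compatible API usually means a `/v1/audio/speech`-style request. A hedged sketch using only the Python standard library; the base URL, model name, and auth header below are placeholders for illustration, not documented values:

```python
import json
import urllib.request

API_BASE = "https://api.example.com/v1"  # placeholder; substitute the real TTS.ai endpoint
API_KEY = "YOUR_API_KEY"                 # placeholder

def build_speech_request(text, model="kokoro", voice="default", fmt="mp3", speed=1.0):
    """Assemble an OpenAI-style /audio/speech payload."""
    return {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": fmt,  # mp3, wav, ogg, or flac
        "speed": speed,          # 0.5-2.0 speech speed
    }

def synthesize(text, **kwargs):
    """POST the payload and return raw audio bytes."""
    payload = build_speech_request(text, **kwargs)
    req = urllib.request.Request(
        f"{API_BASE}/audio/speech",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return resp.read()

# Usage (performs a real HTTP call):
#   audio = synthesize("Hello from TTS.ai", model="kokoro", fmt="wav")
#   open("out.wav", "wb").write(audio)
```

Because the payload shape follows the OpenAI audio convention, existing OpenAI SDKs pointed at a custom base URL should also work against such an endpoint.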
Ajụjụ ndị a na-ajụkarị
What could we do better? Your feedback helps us fix problems.
Start converting text to speech now
Join the millions of creators using TTS.ai. Get 15,000 free characters with a new account. Free models are available without signup.