What Is Text to Speech (TTS)?

Text-to-speech is a technology that converts written text into spoken audio using artificial intelligence. From the early robotic synthesizers to today's neural networks that sound nearly indistinguishable from a human, TTS has changed how we interact with technology, consume content, and make information accessible.


Key Text-to-Speech Concepts

Understanding the building blocks of modern speech synthesis

What TTS Stands For

TTS stands for Text-to-Speech: technology that converts written text into spoken audio using a computer-generated voice.

How Neural TTS Works

Modern TTS uses deep neural networks to analyze text, predict speech patterns, and generate audio waveforms that sound as if a human produced them.

History of Speech Synthesis

From the rule-based systems of the 1960s to the concatenative synthesis of the 1990s to today's neural models: how TTS evolved over more than five decades.

Key AI Models

Current models such as Kokoro, Bark, and CosyVoice 2 use transformers, diffusion, and attention-based techniques to achieve high speech quality.

Common Applications

TTS powers screen readers, GPS navigation, virtual assistants, audiobooks, customer service bots, e-learning platforms, and content creation.

Open Source vs. Commercial

Open-source models (MIT, Apache 2.0) offer free, self-hostable TTS, while commercial services provide managed APIs with SLAs and support.

TTS Models Available on TTS.ai

From fast, lightweight synthesis to studio-quality neural audio

Kokoro

Free

Lightweight 82M parameter model delivering studio-quality speech with blazing-fast inference.

Fast 5/5

Best for: A compact modern model that shows how far neural TTS has come

Try Kokoro

Bark

Standard

Transformer-based text-to-audio model that generates realistic speech, music, and sound effects.

Slow 4/5

Best for: A transformer-based model that demonstrates audio generation beyond speech

Try Bark

CosyVoice 2

Standard

Alibaba's scalable streaming TTS with human-parity naturalness and near-zero latency.

Medium 5/5 Voice Cloning

Best for: Streaming TTS with human-parity quality and zero-shot cloning

Try CosyVoice 2

Chatterbox

Premium

State-of-the-art zero-shot voice cloning with emotion control from Resemble AI.

Medium 5/5 Voice Cloning

Best for: Zero-shot voice cloning that shows the current frontier of synthesized speech

Try Chatterbox

Tortoise TTS

Premium

Multi-voice text-to-speech focused on quality with autoregressive architecture.

Slow 5/5 Voice Cloning

Best for: An autoregressive architecture that prioritizes the highest audio quality

Try Tortoise TTS

How Neural TTS Works

The modern speech synthesis process in four steps

1. Understand the Basics

TTS converts written text into spoken audio. Modern systems use neural networks trained on thousands of hours of recorded human speech.

2. Explore Different Models

Each TTS model uses a different architecture (transformer, diffusion, and others), with its own strengths in speed, quality, and features.

3. Try It Yourself

The best way to understand TTS is to use it. Try our free models above: paste in any text and hear it spoken within moments.

4. Add It to Your Project

Once you find a model you like, use our API to integrate TTS into your apps, products, or content creation workflows.

A Brief History of Text to Speech

From hand-operated talking machines to neural networks

The Early Days (1950s-1980s)

The first computer-generated speech dates to 1961, when IBM ...

Notable systems: Votrax (1970s), DECtalk (1984, used by Stephen Hawking), Apple ...

Concatenative Synthesis (1990s-2000s)

Concatenative TTS records a real human speaker producing thousands of phoneme combinations, then stitches the best-matching units together at runtime. This produces natural-sounding speech but requires large databases (typically 10-20 hours of recordings per voice). Quality depends heavily on finding smooth joins between units.
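The search for smooth joins can be sketched as a small dynamic-programming problem. The example below is a simplified illustration, not how any named product worked: the unit database, the pitch-only representation of a unit, and the function names are all assumptions made for this sketch, and a real system would also score how well each unit matches the target context.

```python
from typing import Dict, List, Tuple

# Hypothetical unit: only the pitch (Hz) at the start and end of a recording.
Unit = Tuple[float, float]


def join_cost(left: Unit, right: Unit) -> float:
    """Pitch mismatch at the boundary; smaller means a smoother join."""
    return abs(left[1] - right[0])


def select_units(phonemes: List[str], db: Dict[str, List[Unit]]) -> List[Unit]:
    """Pick one candidate per phoneme so that the total join cost is minimal."""
    # best[i][j] = (lowest cost of a path ending in candidate j, back-pointer)
    best: List[List[Tuple[float, int]]] = [[(0.0, -1) for _ in db[phonemes[0]]]]
    for i in range(1, len(phonemes)):
        prev_units, cur_units = db[phonemes[i - 1]], db[phonemes[i]]
        row = []
        for cur in cur_units:
            costs = [best[i - 1][k][0] + join_cost(prev, cur)
                     for k, prev in enumerate(prev_units)]
            k_min = min(range(len(costs)), key=costs.__getitem__)
            row.append((costs[k_min], k_min))
        best.append(row)
    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda idx: best[-1][idx][0])
    path = []
    for i in range(len(phonemes) - 1, -1, -1):
        path.append(db[phonemes[i]][j])
        j = best[i][j][1]
    return path[::-1]


# Toy database with two recorded candidates per phoneme (illustrative values).
db = {
    "h": [(100.0, 120.0), (110.0, 150.0)],
    "e": [(118.0, 130.0), (160.0, 140.0)],
    "y": [(131.0, 125.0), (200.0, 180.0)],
}
chosen = select_units(["h", "e", "y"], db)
print(chosen)
```

With more recordings per phoneme, the search has more chances to find a low-cost path, which is one way to see why these systems needed many hours of audio per voice.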

Used by: AT&T Natural Voices, Nuance Vocalizer, Google Translate TTS.

Statistical/Parametric (2000s-2010s)

HMMs (Hidden Markov Models) and deep neural networks predicted speech parameters (pitch, duration, spectral features) that were fed into a vocoder. This allowed full vocabulary coverage and smoother speech, but the vocoder step often produced ...

Key models: HTS, Merlin, early DNN-based systems.

Neural TTS (2016-present)

The modern era began with WaveNet (DeepMind, 2016), which generated audio sample by sample using deep neural networks. It was followed by Tacotron (Google, 2017), which learned to map text directly to spectrograms. Today ...

Key milestones: WaveNet, Tacotron, FastSpeech, VITS, Bark, Kokoro.

How Modern Neural TTS Works

The internal architecture behind natural-sounding AI speech

Text Processing

The raw text is cleaned and normalized: for example, numbers are converted to words.
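As a rough illustration of this normalization step, here is a minimal Python sketch. The abbreviation table, the 0-999 limit, and the function names are assumptions chosen for the example; they do not describe the front end of any particular TTS engine, which would also handle dates, currency, decimals, and much more.

```python
import re

# Illustrative abbreviation table; real systems use far larger lexicons.
ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street", "etc.": "et cetera"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer from 0 to 999."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" " + number_to_words(rest) if rest else "")


def normalize(text: str) -> str:
    """Expand known abbreviations, spell out short integers, tidy whitespace."""
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    # Only standalone integers of up to three digits are spelled out here;
    # longer numbers and decimals are deliberately not handled in this sketch.
    text = re.sub(r"\b\d{1,3}\b", lambda m: number_to_words(int(m.group())), text)
    return re.sub(r"\s+", " ", text).strip()


print(normalize("Dr. Smith  lives at 42 Main St. with 3 cats"))
```

The normalized string is what later stages would convert to phonemes.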

Acoustic Model (Text to Spectrogram)

The acoustic model (typically a Transformer or another sequence-to-sequence network) takes the phoneme sequence and predicts a mel spectrogram: a time-frequency picture of how the speech should sound.
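The "mel" in mel spectrogram refers to a perceptual frequency scale that spaces bands closely at low frequencies and widely at high ones. The sketch below uses one common conversion formula (the 2595 and 700 constants); the choice of 80 bands and a 12 kHz upper limit is an assumption for illustration, since each model picks its own settings.

```python
import math


def hz_to_mel(hz: float) -> float:
    """Convert a frequency in Hz to mels using a widely used formula."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_centers(n_mels: int, f_min: float, f_max: float) -> list:
    """Center frequencies (Hz) of n_mels bands spaced evenly on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_mels)]


# Assumed settings for illustration: 80 bands covering 0 Hz to 12 kHz.
centers = mel_band_centers(80, 0.0, 12_000.0)
print(f"first bands: {centers[0]:.1f} Hz, {centers[1]:.1f} Hz")
print(f"last bands:  {centers[-2]:.1f} Hz, {centers[-1]:.1f} Hz")
```

Each column of a mel spectrogram holds the energy in these bands for one short frame of audio, so the acoustic model predicts one such vector per frame.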

Vocoder (Spectrogram to Audio)

The vocoder converts the mel spectrogram into an audio waveform. Early vocoders such as Griffin-Lim produced robotic artifacts. Modern neural vocoders (HiFi-GAN, BigVGAN, Vocos) generate high-fidelity 24 kHz or 44.1 kHz audio that captures the fine details of natural speech, including breath sounds and subtle lip noises.
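To get a feel for the amount of audio a vocoder has to produce, the arithmetic below relates spectrogram frames to output samples. The hop length of 256 samples per frame is an assumption for illustration; actual models use different values.

```python
def samples_for_frames(n_frames: int, hop_length: int) -> int:
    """Number of audio samples the vocoder emits for n_frames spectrogram frames."""
    return n_frames * hop_length


def duration_seconds(n_frames: int, hop_length: int, sample_rate: int) -> float:
    """Length of the generated audio in seconds."""
    return samples_for_frames(n_frames, hop_length) / sample_rate


HOP_LENGTH = 256  # assumed hop size, not taken from any specific model

for sample_rate in (24_000, 44_100):
    frames_per_second = sample_rate / HOP_LENGTH
    print(f"{sample_rate} Hz: about {frames_per_second:.2f} frames per second of audio")

# 375 frames at 24 kHz with a hop of 256 samples
print(duration_seconds(375, HOP_LENGTH, 24_000), "seconds")
```

Under these assumptions every frame expands into hundreds of samples, which is why the vocoder is often a significant part of the total synthesis cost.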

End-to-End Models

Newer models such as VITS, Kokoro, and Bark skip the multi-stage pipeline entirely. They go directly from text to audio in a single neural network, producing more natural results with fewer artifacts. Some models (such as Bark) can also generate non-speech sounds, laughter, and music alongside speech.

TTS Approaches Compared

An overview of the synthesis techniques in use

Approach | Description | Era | Data Required
Formant Synthesis | Rule-based modeling of vocal resonances | 1960s-1990s | None
Concatenative | Stitched-together recorded audio units | 1990s-2010s | 10-20+ hours
Parametric (HMM/DNN) | Statistical speech models | 2000s-2016 | 1-5 hours
Neural End-to-End | Deep learning (VITS, Kokoro, Bark) | 2016-present | Seconds to hours

Common TTS Applications

Where text-to-speech is used today

Accessibility

Screen readers, assistive devices, and tools for people with visual impairments or reading disabilities rely on TTS to make digital content accessible to everyone.

Content Creation

YouTubers, podcasters, and social media creators use TTS for voiceovers, narration, and automated content production at scale.

Virtual Assistants

Siri, Alexa, Google Assistant, and customer service chatbots all use TTS to respond naturally to users.

Frequently Asked Questions

Common questions about text-to-speech technology

TTS stands for Text-to-Speech. It refers to technology that converts written text into spoken words using synthesized or AI-generated voices. The term is used interchangeably with "speech synthesis" in technical literature.

Modern TTS systems work in three stages: text analysis (tokenization, normalization, phoneme conversion), prosody modeling (determining rhythm, pitch, stress, and pauses), and audio synthesis (generating the audio waveform). Neural models learn all three stages from training data.

Concatenative TTS stitches together pre-recorded speech segments, which can sound choppy at the joins. Neural TTS generates speech from scratch using deep learning, producing smoother, more natural audio with better prosody and emotion.

SSML (Speech Synthesis Markup Language) is an XML-based markup language that lets you control how a TTS system speaks text. You can specify pauses, emphasis, pronunciation, pitch changes, and speaking rate using SSML tags within your text input.
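For a concrete picture, the sketch below builds a small SSML document and checks that it is well-formed XML using Python's standard library. The sentence content is made up, and which elements and attribute values a given engine honors varies, so treat the markup as an illustration of the W3C element names rather than a guarantee for any specific service.

```python
import xml.etree.ElementTree as ET

# A small SSML document using standard elements: break, emphasis,
# prosody, and say-as. Engine support for each one differs.
ssml = """<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  Welcome back.
  <break time="500ms"/>
  <emphasis level="strong">Please</emphasis> listen carefully.
  <prosody rate="slow" pitch="+2st">This part is slower and slightly higher.</prosody>
  Your code is <say-as interpret-as="characters">A1B2</say-as>.
</speak>"""

# Parsing fails with ParseError if the markup is not well-formed.
root = ET.fromstring(ssml)

# List the elements in document order, without the namespace prefix.
NAMESPACE = "{http://www.w3.org/2001/10/synthesis}"
tags = [element.tag.replace(NAMESPACE, "") for element in root.iter()]
print(tags)
```

Checking well-formedness before sending SSML to a synthesis service catches unclosed tags early, although it does not confirm that the engine supports every element.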

TTS is used for accessibility (screen readers for visually impaired users), virtual assistants (Siri, Alexa, Google Assistant), audiobook production, e-learning, GPS navigation, IVR customer service systems, content creation, and language learning applications.

TTS evolved from robotic rule-based systems in the 1960s, to concatenative synthesis in the 1990s, to statistical parametric synthesis in the 2000s, to neural TTS with WaveNet in 2016, to today's transformer and diffusion models that reach human-level quality.

Natural-sounding TTS requires accurate prosody (rhythm, stress, intonation), appropriate pacing, smooth transitions between phonemes, and a consistent voice character. Neural models learn these traits from large datasets of recorded human speech.

Voice cloning models such as Chatterbox and CosyVoice 2 can clone a target voice from 5-30 seconds of reference audio. The cloned voice captures timbre, accent, and speaking style, although ethical and legal considerations apply when cloning someone else's voice.

Modern TTS models collectively support more than 30 languages. Some models focus on specific languages while others are multilingual. English has the most models and voices available, but Chinese, Japanese, Korean, Spanish, and European languages are well supported.

TTS is a subset of AI voice generation. TTS specifically converts text input into speech output. AI voice generation is a broader category that includes voice cloning, voice conversion, speech-to-speech, and sound effect generation.

It depends on your needs. Kokoro offers a good balance of speed and quality for general use. Chatterbox leads in voice cloning. Orpheus excels at emotional expressiveness. StyleTTS 2 produces highly realistic single-speaker speech. There is no single "best" model for every use case.

Yes. All models on TTS.ai are open-source and can be self-hosted. CPU-only models such as Piper run on any computer. GPU models such as Kokoro and Bark need an NVIDIA GPU with 2-8GB of VRAM. Our platform provides hosted access so you do not have to manage infrastructure.
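When sizing hardware for self-hosting, a quick back-of-the-envelope estimate of the space taken by model weights can help. The sketch below counts weights only, under assumed numeric precisions; activations, audio buffers, and framework overhead come on top, so real memory use is noticeably higher than these figures. The 82M parameter count is the one quoted for Kokoro in the model list; the helper name is hypothetical.

```python
# Bytes needed to store one parameter at common numeric precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weights_megabytes(n_params: int, dtype: str = "fp16") -> float:
    """Approximate size of the model weights alone, in megabytes (10^6 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1_000_000


kokoro_params = 82_000_000  # parameter count quoted for Kokoro above

for dtype in ("fp32", "fp16", "int8"):
    size = weights_megabytes(kokoro_params, dtype)
    print(f"{dtype}: about {size:.0f} MB of weights")
```

Larger models scale the same way, which gives a lower bound to compare against the VRAM range mentioned above.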

Experience Modern TTS

Try 20+ state-of-the-art AI voice models at no cost. See how far text-to-speech has come.