What is Text to Speech (TTS)?

Text to Speech is technology that converts written text into spoken audio using artificial intelligence. From the robotic synthesizers of the past to today's neural networks that sound indistinguishable from a human, TTS has transformed how we interact with technology, consume content, and make information accessible.


From written text to spoken words

Understanding the building blocks of voice synthesis

What TTS stands for

TTS stands for Text-to-Speech: technology that converts written text into spoken sound using computer-generated voices.

How neural TTS works

Modern TTS uses deep neural networks to analyze text, predict speech patterns, and generate audio waveforms that sound like a human.

A history of speech synthesis

From rule-based systems in the 1960s, through concatenative synthesis in the 1990s, to today's neural models: how TTS has evolved over six decades.

Modern AI models

Current models such as Kokoro, Bark, and CosyVoice 2 use transformers, diffusion, and variance prediction to reach human-level voice quality.

Common applications

TTS powers screen readers, GPS navigation, virtual assistants, audiobooks, customer-service bots, e-learning platforms, and content creation.

Open source vs. commercial

Open-source models (MIT, Apache 2.0) offer free, customizable TTS, while commercial services provide managed APIs with SLAs and support.

TTS models available on TTS.ai

From fast and lightweight to studio-quality neural voices

Kokoro

Free

Lightweight 82M parameter model delivering studio-quality speech with blazing-fast inference.

Fast 5/5

Best for: a compact state-of-the-art model that shows how far lightweight neural TTS has come

Try Kokoro

Bark

Standard

Transformer-based text-to-audio model that generates realistic speech, music, and sound effects.

Slow 4/5

Best for: a transformer-based model that demonstrates audio generation beyond plain speech

Try Bark

CosyVoice 2

Standard

Alibaba's scalable streaming TTS with human-parity naturalness and near-zero latency.

Medium 5/5 Voice cloning

Best for: streaming TTS with human-parity quality and zero-shot voice cloning

Try CosyVoice 2

Chatterbox

Premium

State-of-the-art zero-shot voice cloning with emotion control from Resemble AI.

Medium 5/5 Voice cloning

Best for: zero-shot voice cloning that shows the current frontier of voice synthesis

Try Chatterbox

Tortoise TTS

Premium

Multi-voice text-to-speech focused on quality with autoregressive architecture.

Slow 5/5 Voice cloning

Best for: an autoregressive architecture that delivers exceptional audio quality

Try Tortoise TTS

How neural TTS works

A step-by-step look inside a modern voice synthesizer

1

Learn the basics

TTS converts written text into spoken audio. Modern systems use neural networks trained on hundreds of hours of human speech.

2

Explore different models

Each TTS model uses a different architecture (transformer, diffusion, and others) with different trade-offs in speed, quality, and features.

3

Try it yourself

The best way to understand TTS is to use it. Try our free models below: click any model card and hear it speak within minutes.

4

Integrate it into your projects

Once you find a model you like, use our API to add TTS to your apps, products, or content pipelines.
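As a rough sketch, integrating a hosted TTS API usually comes down to a single authenticated POST request carrying the text and model name. The endpoint URL, header, and JSON field names below are illustrative placeholders, not TTS.ai's documented API; check the provider's reference for the real contract.

```python
import json
import urllib.request

# Hypothetical request builder for a hosted TTS endpoint.
# Every URL and field name here is a placeholder assumption.
def build_tts_request(text: str, model: str = "kokoro",
                      voice: str = "default",
                      api_key: str = "YOUR_API_KEY") -> urllib.request.Request:
    payload = json.dumps({"model": model, "voice": voice, "text": text})
    return urllib.request.Request(
        "https://api.example.com/v1/tts",  # placeholder endpoint
        data=payload.encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )

req = build_tts_request("Hello, world!")
print(req.get_method(), req.full_url)
```

Actually sending the request (for example with `urllib.request.urlopen`) would return audio bytes or a JSON payload, depending on how the service is designed.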

A brief history of Text to Speech

From mechanical talking machines to neural networks

The early days (1950s-1980s)

Computer speech synthesis began in 1961, when an IBM computer at Bell Labs was programmed to sing "Daisy Bell".

Notable systems: Votrax (1970s), DECtalk (1984, used by Stephen Hawking), and Apple's MacInTalk (1984).

Concatenative synthesis (1990s-2000s)

Concatenative TTS recorded a human speaker producing thousands of phoneme combinations, then stitched the matching segments together at runtime. The approach produced intelligible audio but required large databases (typically 10-20 hours of recordings per voice). Quality depended heavily on finding clean joins between segments.

Used by: AT&T Natural Voices, Nuance Vocalizer, early Google Translate TTS.
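The core join operation behind concatenative synthesis can be sketched in a few lines of Python: a short linear crossfade hides the seam between two recorded units. The "units" below are toy constant samples, not a real diphone database.

```python
# Toy concatenative join: two recorded "units" are merged with a
# short linear crossfade so the seam is not audible as a click.
def crossfade_join(a, b, overlap):
    """Join two sample lists, linearly crossfading `overlap` samples."""
    if overlap <= 0 or not a or not b:
        return a + b
    head, tail = a[:-overlap], a[-overlap:]
    mixed = [
        ta * (1 - i / overlap) + tb * (i / overlap)
        for i, (ta, tb) in enumerate(zip(tail, b[:overlap]))
    ]
    return head + mixed + b[overlap:]

# Stand-ins for phoneme recordings (real units come from a database
# of a speaker reading thousands of phoneme combinations).
unit_a = [0.5] * 8
unit_b = [-0.5] * 8
joined = crossfade_join(unit_a, unit_b, overlap=4)
print(len(joined))  # 12: 8 + 8 - 4 overlapped samples
```

Real unit-selection systems also search a database for the candidate units whose joins will be least audible, which is where most of the engineering effort went.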

Statistical parametric synthesis (2000s-2010s)

Instead of stitching recordings together, parametric models learned a statistical representation of the voice. Hidden Markov models (HMMs) and later deep neural networks generated voice parameters (pitch, duration, spectral features) that a vocoder rendered into audio. This allowed much smaller footprints and easy voice modification, but the vocoder step often gave the output a muffled, buzzy quality.

Key systems: HTS, Merlin, early DNN-based systems.

Neural TTS (2016-present)

The modern era began with WaveNet (DeepMind, 2016), which generated audio one sample at a time using deep neural networks. It was followed by Tacotron (Google, 2017), which learned to map text directly to spectrograms. Today's models reach near-human naturalness, often in real time.

Key systems: WaveNet, Tacotron, FastSpeech, VITS, Bark, Kokoro.

How modern neural TTS works

Building a natural-sounding AI voice

Text analysis and normalization

Raw text is first cleaned and normalized: numbers are expanded into words, abbreviations are spelled out, and the text is converted into a sequence of phonemes, the basic units of speech sound.
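A toy version of that normalization step might look like this; a real TTS front end also handles currencies, dates, abbreviations, and much more:

```python
import re

# Tiny illustrative text normalizer: expands standalone numbers
# 0-99 into words, the way a TTS front end must before synthesis.
ONES = ["zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"]
TEENS = ["ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
         "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def number_to_words(n: int) -> str:
    """Spell out 0-99; enough to show the idea."""
    if n < 10:
        return ONES[n]
    if n < 20:
        return TEENS[n - 10]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")

def normalize(text: str) -> str:
    """Expand every standalone one- or two-digit number into words."""
    return re.sub(r"\b\d{1,2}\b",
                  lambda m: number_to_words(int(m.group())), text)

print(normalize("Gate 42 opens in 5 minutes"))
# Gate forty-two opens in five minutes
```

Production systems treat this as a sequence-labeling problem in its own right, since "1984" might be a year, a count, or a title depending on context.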

The acoustic model (text to spectrogram)

The acoustic model (often a Transformer or a recurrent network) takes the phoneme sequence and predicts a mel spectrogram: a time-frequency representation of how the audio should sound.
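The "mel" in mel spectrogram is a perceptual frequency scale. A small sketch, assuming the common 2595 * log10(1 + f/700) formula (one of several conventions in use), shows why mel bands are narrow at low frequencies and wide at high ones:

```python
import math

# Mel <-> Hz conversion using a widely used mel-scale formula.
def hz_to_mel(hz: float) -> float:
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel: float) -> float:
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# 80 mel bands between 0 Hz and 8 kHz, a typical configuration for
# the mel spectrograms that neural acoustic models predict.
n_mels, f_max = 80, 8000.0
step = hz_to_mel(f_max) / (n_mels + 1)
edges_hz = [mel_to_hz(i * step) for i in range(n_mels + 2)]

low_band = edges_hz[1] - edges_hz[0]     # band width near 0 Hz
high_band = edges_hz[-1] - edges_hz[-2]  # band width near 8 kHz
print(round(low_band, 1), round(high_band, 1))
```

Because the bands mirror human hearing, an 80-bin mel spectrogram is a far more compact target for the acoustic model than a raw waveform or linear spectrogram.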

The vocoder (spectrogram to audio)

The vocoder converts the mel spectrogram into an audio waveform. Older vocoders such as Griffin-Lim produced robotic artifacts. Modern neural vocoders (HiFi-GAN, BigVGAN, Vocos) generate high-fidelity 24 kHz or 44.1 kHz audio that captures the fine details of natural speech, including breaths and subtle mouth sounds.

End-to-end models

Recent models such as VITS, Kokoro, and Bark skip the two-stage pipeline entirely. They go directly from text to audio in a single neural network, often producing more natural results with fewer artifacts. Some models (such as Bark) can also generate non-speech sounds, laughter, and music alongside speech.

TTS approaches compared

How four generations of TTS technology stack up

Approach | How it works | Era | Training data required
Formant synthesis | Rule-based modeling of the vocal tract | 1960s-1990s | None
Concatenative | Stitched-together recorded speech segments | 1990s-2010s | 10-20+ hours
Statistical parametric (HMM/DNN) | Statistical models of speech parameters | 2000s-2016 | 1-5 hours
Neural end-to-end | Deep learning (VITS, Kokoro, Bark) | 2016-present | Minutes to hours

Common applications of TTS

Where text-to-speech is used today

Accessibility

Screen readers and assistive tools for people with visual or reading impairments use TTS to make digital content accessible to everyone.

Content creation

YouTube creators, podcasters, and social media producers use TTS to narrate, dub, and automate content production.

Virtual assistants

Siri, Alexa, Google Assistant, and customer-service chatbots all use TTS to respond to users with natural-sounding speech.

Frequently asked questions

Common questions about text-to-speech technology

TTS stands for Text-to-Speech. It refers to technology that converts written text into audible spoken words using synthesized or AI-generated voices. The term is used interchangeably with "speech synthesis" in technical literature.

Modern TTS systems work in three stages: text analysis (tokenization, normalization, phoneme conversion), prosody prediction (determining rhythm, pitch, stress, and pauses), and audio generation (producing the final sound waveform). Neural models learn all three stages from training data.

Concatenative TTS stitches together pre-recorded speech fragments, which can sound choppy at the transitions. Neural TTS generates speech from scratch using deep learning, producing smoother, more natural-sounding audio with better prosody and emotion.

SSML (Speech Synthesis Markup Language) is an XML-based markup language that lets you control how TTS systems read your text aloud. You can specify pauses, emphasis, pronunciation, pitch changes, and speaking rate using SSML tags inside your text input.
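For example, a minimal SSML snippet adds a pause and a slower, lower-pitched span. Tag support varies between engines, so treat this as a generic illustration rather than any specific vendor's dialect:

```python
import xml.etree.ElementTree as ET

# A minimal SSML document: a 500 ms pause, then a slowed,
# lower-pitched phrase.
ssml = (
    '<speak>'
    'Welcome.<break time="500ms"/>'
    '<prosody rate="slow" pitch="-2st">This part is read slowly.</prosody>'
    '</speak>'
)

# SSML is plain XML, so it can be checked for well-formedness
# before being sent to a TTS engine.
root = ET.fromstring(ssml)
print(root.tag, [child.tag for child in root])  # speak ['break', 'prosody']
```

Most cloud TTS APIs accept a string like this in place of plain text, switching interpretation based on the enclosing `<speak>` element.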

TTS is used for accessibility (screen readers for visually impaired users), virtual assistants (Siri, Alexa, Google Assistant), audiobook production, e-learning, GPS navigation, customer-service IVR systems, content creation, and language-learning apps.

TTS evolved from robotic rule-based systems in the 1960s, through concatenative synthesis in the 1990s and statistical parametric methods in the 2000s, to neural TTS with WaveNet in 2016, and on to today's transformer and diffusion models that reach human-level quality.

Natural-sounding TTS requires accurate prosody (rhythm, stress, intonation), correct pronunciation, smooth transitions between phonemes, and a consistent voice identity. Neural models learn these characteristics from large datasets of natural human speech.

Voice-cloning systems such as Chatterbox and CosyVoice 2 can reproduce a specific voice from as little as 5-30 seconds of reference audio. The cloned voice captures timbre, accent, and speaking style, though ethical and legal considerations apply when cloning someone else's voice.

Modern TTS models collectively support more than 30 languages. Some are specialized for particular languages, while others are multilingual. English has the most models and voices available, but Chinese, German, Korean, and the major European languages are also well supported.

TTS is one part of AI voice generation. TTS specifically converts text input into speech output. AI voice generation is a broader term that also covers voice cloning, voice conversion, speech-to-speech translation, and sound-effect generation.

It depends on your needs. Kokoro offers an excellent balance of speed and quality for general use. Chatterbox leads in voice cloning. Orpheus excels at emotional expression. StyleTTS 2 produces especially natural speaker delivery. There is no single "best model" for every use case.

Yes. All models on TTS.ai are open source and can be self-hosted. CPU-only models such as Piper run on any computer. GPU models such as Kokoro and Bark need an NVIDIA GPU with 2-8 GB of VRAM. Our platform also offers hosted access, so you don't have to run anything yourself.


Experience modern TTS for yourself

Try more than 20 state-of-the-art AI speech models for free. Hear your text come to life as natural speech.