Virtuoso: Massive Multilingual Speech-Text Joint Semi-Supervised Learning for Text-to-Speech
Published in: | ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) pp. 1 - 5 |
---|---|
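Main Authors: | Takaaki Saeki, Heiga Zen, Zhehuai Chen, Nobuyuki Morioka, Gary Wang, Yu Zhang, Ankur Bapna, Andrew Rosenberg, Bhuvana Ramabhadran |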
Format: | Conference Proceeding |
Language: | English |
Published: | IEEE, 2023-06-04 |
Abstract | This paper proposes Virtuoso, a massively multilingual speech-text joint semi-supervised learning framework for text-to-speech synthesis (TTS) models. Existing multilingual TTS systems typically support tens of languages, a small fraction of the thousands of languages in the world. One difficulty in scaling multilingual TTS to hundreds of languages is collecting high-quality paired speech-text data in low-resource languages. This study extends Maestro, a speech-text joint pretraining framework for automatic speech recognition (ASR), to speech generation tasks. To train a TTS model on various types of speech and text data, different training schemes are designed to handle supervised (paired TTS and ASR data) and unsupervised (untranscribed speech and unspoken text) datasets. Experimental evaluation shows that 1) multilingual TTS models trained with Virtuoso achieve significantly better naturalness and intelligibility than baseline models in seen languages, and 2) they can synthesize reasonably intelligible and natural-sounding speech for unseen languages where no high-quality paired TTS data is available. |
---|---|
Author | Ramabhadran, Bhuvana; Morioka, Nobuyuki; Saeki, Takaaki; Zhang, Yu; Bapna, Ankur; Wang, Gary; Zen, Heiga; Chen, Zhehuai; Rosenberg, Andrew |
Author_xml | 1. Saeki, Takaaki (Google, Japan; takaaki_saeki@ipc.i.u-tokyo.ac.jp); 2. Zen, Heiga (Google, Japan; heigazen@google.com); 3. Chen, Zhehuai (Google, USA; zhehuai@google.com); 4. Morioka, Nobuyuki (Google, Japan); 5. Wang, Gary (Google, USA); 6. Zhang, Yu (Google, USA); 7. Bapna, Ankur (Google, USA); 8. Rosenberg, Andrew (Google, USA); 9. Ramabhadran, Bhuvana (Google, USA) |
ContentType | Conference Proceeding |
DOI | 10.1109/ICASSP49357.2023.10095702 |
DatabaseName | IEEE Electronic Library (IEL) Conference Proceedings; IEEE Proceedings Order Plan (POP) 1998-present by volume; IEEE Xplore All Conference Proceedings; IEEE Electronic Library Online; IEEE Proceedings Order Plans (POP) 1998-present |
Discipline | Engineering |
EISBN | 1728163277 9781728163277 |
EISSN | 2379-190X |
EndPage | 5 |
ExternalDocumentID | 10095702 |
Genre | orig-research |
IsDoiOpenAccess | false |
IsOpenAccess | true |
IsPeerReviewed | false |
IsScholarly | true |
Language | English |
OpenAccessLink | https://doi.org/10.1109/icassp49357.2023.10095702 |
PageCount | 5 |
PublicationCentury | 2000 |
PublicationDate | 2023-June-4 |
PublicationDateYYYYMMDD | 2023-06-04 |
PublicationDate_xml | – month: 06 year: 2023 text: 2023-June-4 day: 04 |
PublicationDecade | 2020 |
PublicationTitle | ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) |
PublicationTitleAbbrev | ICASSP |
PublicationYear | 2023 |
Publisher | IEEE |
Publisher_xml | – name: IEEE |
SourceID | ieee |
SourceType | Publisher |
StartPage | 1 |
SubjectTerms | Acoustics; Data models; massive multilingual pretraining; Multilingual text-to-speech synthesis; Semisupervised learning; Signal processing; Speech synthesis; speech-text semi-supervised joint learning; Task analysis; Training |
Title | Virtuoso: Massive Multilingual Speech-Text Joint Semi-Supervised Learning for Text-to-Speech |
URI | https://ieeexplore.ieee.org/document/10095702 |