Virtuoso: Massive Multilingual Speech-Text Joint Semi-Supervised Learning for Text-to-Speech

This paper proposes Virtuoso, a massively multilingual speech-text joint semi-supervised learning framework for text-to-speech synthesis (TTS) models. Existing multilingual TTS typically supports tens of languages, which are a small fraction of the thousands of languages in the world. One difficulty...

Bibliographic Details
Published in:	ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1-5
Main Authors:	Saeki, Takaaki; Zen, Heiga; Chen, Zhehuai; Morioka, Nobuyuki; Wang, Gary; Zhang, Yu; Bapna, Ankur; Rosenberg, Andrew; Ramabhadran, Bhuvana
Format:	Conference Proceeding
Language:	English
Published:	IEEE, 4 June 2023
Subjects:	Acoustics; Data models; massive multilingual pretraining; Multilingual text-to-speech synthesis; Semisupervised learning; Signal processing; Speech synthesis; speech-text semi-supervised joint learning; Task analysis; Training

Abstract	This paper proposes Virtuoso, a massively multilingual speech-text joint semi-supervised learning framework for text-to-speech synthesis (TTS) models. Existing multilingual TTS typically supports tens of languages, which is a small fraction of the thousands of languages in the world. One difficulty in scaling multilingual TTS to hundreds of languages is collecting high-quality speech-text paired data in low-resource languages. This study extends Maestro, a speech-text joint pretraining framework for automatic speech recognition (ASR), to speech generation tasks. To train a TTS model from various types of speech and text data, different training schemes are designed to handle supervised (paired TTS and ASR data) and unsupervised (untranscribed speech and unspoken text) datasets. Experimental evaluation shows that 1) multilingual TTS models trained on Virtuoso can achieve significantly better naturalness and intelligibility than baseline ones in seen languages, and 2) they can synthesize reasonably intelligible and natural-sounding speech for unseen languages where no high-quality paired TTS data is available.
Author	Ramabhadran, Bhuvana; Morioka, Nobuyuki; Saeki, Takaaki; Zhang, Yu; Bapna, Ankur; Wang, Gary; Zen, Heiga; Chen, Zhehuai; Rosenberg, Andrew
Author_xml	– sequence: 1 givenname: Takaaki surname: Saeki fullname: Saeki, Takaaki email: takaaki_saeki@ipc.i.u-tokyo.ac.jp organization: Google, Japan
	– sequence: 2 givenname: Heiga surname: Zen fullname: Zen, Heiga email: heigazen@google.com organization: Google, Japan
	– sequence: 3 givenname: Zhehuai surname: Chen fullname: Chen, Zhehuai email: zhehuai@google.com organization: Google, USA
	– sequence: 4 givenname: Nobuyuki surname: Morioka fullname: Morioka, Nobuyuki organization: Google, Japan
	– sequence: 5 givenname: Gary surname: Wang fullname: Wang, Gary organization: Google, USA
	– sequence: 6 givenname: Yu surname: Zhang fullname: Zhang, Yu organization: Google, USA
	– sequence: 7 givenname: Ankur surname: Bapna fullname: Bapna, Ankur organization: Google, USA
	– sequence: 8 givenname: Andrew surname: Rosenberg fullname: Rosenberg, Andrew organization: Google, USA
	– sequence: 9 givenname: Bhuvana surname: Ramabhadran fullname: Ramabhadran, Bhuvana organization: Google, USA
ContentType	Conference Proceeding
DBID	6IE 6IH CBEJK RIE RIO
DOI	10.1109/ICASSP49357.2023.10095702
DatabaseName	IEEE Electronic Library (IEL) Conference Proceedings IEEE Proceedings Order Plan (POP) 1998-present by volume IEEE Xplore All Conference Proceedings IEEE Electronic Library Online IEEE Proceedings Order Plans (POP) 1998-present
Database_xml	– sequence: 1 dbid: RIE name: IEEE Electronic Library Online url: http://ieeexplore.ieee.org/Xplore/DynWel.jsp sourceTypes: Publisher
DeliveryMethod	fulltext_linktorsrc
Discipline	Engineering
EISBN	1728163277 9781728163277
EISSN	2379-190X
EndPage	5
ExternalDocumentID	10095702
Genre	orig-research
GroupedDBID	23M 6IE 6IF 6IH 6IK 6IL 6IM 6IN AAJGR ABLEC ACGFS ADZIZ ALMA_UNASSIGNED_HOLDINGS BEFXN BFFAM BGNUA BKEBE BPEOZ CBEJK CHZPO IEGSK IJVOP IPLJI JC5 M43 OCL RIE RIL RIO RNS
ID	FETCH-LOGICAL-i1702-aa3c7edf6886f225a063dfdd1ed1f342207488c9ef3da927a4d90855056e5de03
IEDL.DBID	RIE
IngestDate	Wed Jun 26 19:24:05 EDT 2024
IsDoiOpenAccess	false
IsOpenAccess	true
IsPeerReviewed	false
IsScholarly	true
Language	English
LinkModel	DirectLink
MergedId	FETCHMERGED-LOGICAL-i1702-aa3c7edf6886f225a063dfdd1ed1f342207488c9ef3da927a4d90855056e5de03
OpenAccessLink	https://doi.org/10.1109/icassp49357.2023.10095702
PageCount	5
ParticipantIDs	ieee_primary_10095702
PublicationCentury	2000
PublicationDate	2023-June-4
PublicationDateYYYYMMDD	2023-06-04
PublicationDate_xml	– month: 06 year: 2023 text: 2023-June-4 day: 04
PublicationDecade	2020
PublicationTitle	ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
PublicationTitleAbbrev	ICASSP
PublicationYear	2023
Publisher	IEEE
Publisher_xml	– name: IEEE
SSID	ssj0008748
Score	2.2874856
SourceID	ieee
SourceType	Publisher
StartPage	1
SubjectTerms	Acoustics; Data models; massive multilingual pretraining; Multilingual text-to-speech synthesis; Semisupervised learning; Signal processing; Speech synthesis; speech-text semi-supervised joint learning; Task analysis; Training
Title	Virtuoso: Massive Multilingual Speech-Text Joint Semi-Supervised Learning for Text-to-Speech
URI	https://ieeexplore.ieee.org/document/10095702
hasFullText	1
inHoldings	1