Virtuoso: Massive Multilingual Speech-Text Joint Semi-Supervised Learning for Text-to-Speech

Bibliographic Details
Published in: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1-5
Main Authors: Saeki, Takaaki; Zen, Heiga; Chen, Zhehuai; Morioka, Nobuyuki; Wang, Gary; Zhang, Yu; Bapna, Ankur; Rosenberg, Andrew; Ramabhadran, Bhuvana
Format: Conference Proceeding
Language: English
Published: IEEE, 4 June 2023
Abstract: This paper proposes Virtuoso, a massively multilingual speech-text joint semi-supervised learning framework for text-to-speech synthesis (TTS) models. Existing multilingual TTS systems typically support tens of languages, a small fraction of the thousands of languages in the world. One difficulty in scaling multilingual TTS to hundreds of languages is collecting high-quality speech-text paired data in low-resource languages. This study extends Maestro, a speech-text joint pretraining framework for automatic speech recognition (ASR), to speech generation tasks. To train a TTS model from various types of speech and text data, different training schemes are designed to handle supervised (paired TTS and ASR data) and unsupervised (untranscribed speech and unspoken text) datasets. Experimental evaluation shows that 1) multilingual TTS models trained with Virtuoso can achieve significantly better naturalness and intelligibility than baseline models in seen languages, and 2) they can synthesize reasonably intelligible and natural-sounding speech for unseen languages for which no high-quality paired TTS data is available.
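The abstract names four kinds of training data, split into supervised and unsupervised groups. The sketch below only illustrates that grouping. It is not the authors' implementation, and the abstract does not describe the training schemes themselves. The names `Sample`, `DataKind`, `classify`, `is_supervised`, and the `tts_quality` flag are hypothetical, introduced here for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class DataKind(Enum):
    # The four data types named in the abstract.
    PAIRED_TTS = "paired TTS"
    PAIRED_ASR = "paired ASR"
    UNTRANSCRIBED_SPEECH = "untranscribed speech"
    UNSPOKEN_TEXT = "unspoken text"


@dataclass
class Sample:
    speech: Optional[str] = None  # hypothetical: path to an audio file
    text: Optional[str] = None    # hypothetical: transcript or raw text
    tts_quality: bool = False     # hypothetical: marks high-quality TTS recordings


def classify(sample: Sample) -> DataKind:
    """Assign a sample to one of the four data types (illustrative rule only)."""
    if sample.speech is not None and sample.text is not None:
        return DataKind.PAIRED_TTS if sample.tts_quality else DataKind.PAIRED_ASR
    if sample.speech is not None:
        return DataKind.UNTRANSCRIBED_SPEECH
    if sample.text is not None:
        return DataKind.UNSPOKEN_TEXT
    raise ValueError("sample has neither speech nor text")


def is_supervised(kind: DataKind) -> bool:
    """Paired data is supervised; speech-only and text-only data is unsupervised."""
    return kind in (DataKind.PAIRED_TTS, DataKind.PAIRED_ASR)
```

A training loop built on this grouping would presumably dispatch each batch to a scheme chosen by its `DataKind`; how each scheme works is left to the full paper.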
Author_xml – sequence: 1
  givenname: Takaaki
  surname: Saeki
  fullname: Saeki, Takaaki
  email: takaaki_saeki@ipc.i.u-tokyo.ac.jp
  organization: Google,Japan
– sequence: 2
  givenname: Heiga
  surname: Zen
  fullname: Zen, Heiga
  email: heigazen@google.com
  organization: Google,Japan
– sequence: 3
  givenname: Zhehuai
  surname: Chen
  fullname: Chen, Zhehuai
  email: zhehuai@google.com
  organization: Google,USA
– sequence: 4
  givenname: Nobuyuki
  surname: Morioka
  fullname: Morioka, Nobuyuki
  organization: Google,Japan
– sequence: 5
  givenname: Gary
  surname: Wang
  fullname: Wang, Gary
  organization: Google,USA
– sequence: 6
  givenname: Yu
  surname: Zhang
  fullname: Zhang, Yu
  organization: Google,USA
– sequence: 7
  givenname: Ankur
  surname: Bapna
  fullname: Bapna, Ankur
  organization: Google,USA
– sequence: 8
  givenname: Andrew
  surname: Rosenberg
  fullname: Rosenberg, Andrew
  organization: Google,USA
– sequence: 9
  givenname: Bhuvana
  surname: Ramabhadran
  fullname: Ramabhadran, Bhuvana
  organization: Google,USA
ContentType Conference Proceeding
DOI 10.1109/ICASSP49357.2023.10095702
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Proceedings Order Plan (POP) 1998-present by volume
IEEE Xplore All Conference Proceedings
IEEE Electronic Library Online
IEEE Proceedings Order Plans (POP) 1998-present
Discipline Engineering
EISBN 1728163277
9781728163277
EISSN 2379-190X
EndPage 5
ExternalDocumentID 10095702
Genre orig-research
IsDoiOpenAccess false
IsOpenAccess true
IsPeerReviewed false
IsScholarly true
Language English
OpenAccessLink https://doi.org/10.1109/icassp49357.2023.10095702
PageCount 5
PublicationCentury 2000
PublicationDate 2023-June-4
PublicationDateYYYYMMDD 2023-06-04
PublicationDate_xml – month: 06
  year: 2023
  text: 2023-June-4
  day: 04
PublicationDecade 2020
PublicationTitle ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
PublicationTitleAbbrev ICASSP
PublicationYear 2023
Publisher IEEE
Publisher_xml – name: IEEE
SourceID ieee
SourceType Publisher
StartPage 1
SubjectTerms Acoustics
Data models
massive multilingual pretraining
Multilingual text-to-speech synthesis
Semisupervised learning
Signal processing
Speech synthesis
speech-text semi-supervised joint learning
Task analysis
Training
Title Virtuoso: Massive Multilingual Speech-Text Joint Semi-Supervised Learning for Text-to-Speech
URI https://ieeexplore.ieee.org/document/10095702