Emilia: a speech corpus for Argentine Spanish text to speech synthesis

Bibliographic Details
Published in:	Language Resources and Evaluation, Vol. 53, no. 3, pp. 419-447
Main Authors:	Torres, Humberto M., Gurlekian, Jorge A., Evin, Diego A., Mercado, Christian G. Cossio
Format:	Journal Article
Language:	English
Published:	Dordrecht: Springer Netherlands (Springer Nature B.V.), 01-09-2019
Subjects:	Computational Linguistics; Computer Science; Corpus linguistics; Diphones; Language and Literature; Linguistics; Original Paper; Phonetics; Prosody; Sentences; Social Sciences; Spanish language; Speech perception; Speech recognition; Speech synthesis; Stress; Syllables; Syntactic complexity; Syntactic structures; Syntax phonology relationship; Synthesis; Text-to-speech; Voice simulation; Buenos Aires Argentina; Argentina; Phonetic corpus; Argentine Spanish; Speech corpus design; Phonetic transcription

Description
Summary:	This paper introduces Emilia, a speech corpus created to build a female voice in Spanish spoken in Buenos Aires for the Aromo text-to-speech system. Aromo is a unit selection text-to-speech system, which employs diphones as units of synthesis. The key requirements and design criteria for Emilia were to synthesize any text in Spanish into high-quality speech with a minimum corpus size. The text corpus was designed to guarantee phonetic and prosodic coverage. A three-stage strategy was used: in the first stage, 741 sentences were designed with all of the syllables of Spanish spoken in Argentina, with and without stress, and in all positions within the word; in the second stage, 852 sentences were added to balance out the distribution of the diphones; and in the third and final stage, after a perceptual evaluation of the quality of synthesized speech, 625 sentences were added to achieve the specified unit coverage and to introduce sentences with more complex syntactic and prosodic structures. Issues from all three corpus building stages are reported. The paper also presents the results from the perceptual quality evaluations of the synthesized voice. Emilia has a duration of three hours and 15 minutes; its speech quality synthesized with the Aromo system is similar to the level obtained with commercial systems, with a real-time ratio less than one.
ISSN:	1574-020X 1572-8412 1574-0218
DOI:	10.1007/s10579-019-09447-7