Towards a multilingual prosody model for text-to-speech

Bibliographic Details
Published in:	2002 IEEE International Conference on Acoustics, Speech, and Signal Processing Vol. 1; pp. I-421 - I-424
Main Authors:	Jokisch, Oliver; Ding, Hongwei; Kruschke, Hans
Format:	Conference Proceeding
Language:	English
Published:	IEEE 01-05-2002
Subjects:	Shape; Speech; Training

Description
Summary:	The generation of prosodic parameters such as F0 contour, duration and intensity remains an important issue for natural-sounding text-to-speech (TTS), although recently developed TTS systems have achieved considerable progress. Several suitable but language-specific rule-based, statistical or data-driven prosody models have been successfully realized in many systems. Such language- and parameter-dependent models lead to a more complex and inefficient TTS system design. In earlier work the authors proposed a hybrid data-driven and rule-based model, which can be adjusted to different voices or speaking styles by learning and predicting prosodic parameters. The current paper discusses the multilingual generalization of the model and the design of appropriate prosodic databases. Two languages, German and Mandarin Chinese, are examined as examples. Prediction results and perceptual evaluation with respect to F0 contours and duration values are presented. Since the perceptual results for both languages are comparable and quite satisfactory, the model is suitable for multilingual prosody control. Resynthesis stimuli obtained from modified prosodic parameters partly achieve near-natural mean opinion scores (MOS) above 4.0. The introduced hybrid data-driven and rule-based model is comparatively simple and enables multilingual prosody control in TTS.
ISBN:	9780780374027; 0780374029
ISSN:	1520-6149; 2379-190X
DOI:	10.1109/ICASSP.2002.5743744