FastPitchFormant: Source-filter based Decomposed Modeling for Speech Synthesis

Bibliographic Details
Main Authors: Bak, Taejun, Bae, Jae-Sung, Bae, Hanbin, Kim, Young-Ik, Cho, Hoon-Young
Format: Journal Article
Language:English
Published: 29-06-2021
Description
Summary: Methods for modeling and controlling prosody with acoustic features have been proposed for neural text-to-speech (TTS) models. Prosodic speech can be generated by conditioning on acoustic features. However, speech synthesized with a large pitch-shift scale suffers from audio quality degradation and deformation of speaker characteristics. To address this problem, we propose a feed-forward Transformer-based TTS model designed according to the source-filter theory. This model, called FastPitchFormant, has a unique structure that handles text and acoustic features in parallel. By modeling each feature separately, the model's tendency to learn an unwanted coupling between the two features can be mitigated.
DOI:10.48550/arxiv.2106.15123
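The summary describes the core idea only at a high level: two parallel branches, one driven by text (the filter/formant side) and one driven by pitch (the source/excitation side), built from feed-forward Transformer blocks. The sketch below illustrates one way such a decomposed decoder could be wired up. All class names, dimensions, and the summation-based combination of the two branches are illustrative assumptions, not the paper's actual architecture:

# Minimal sketch of a parallel, source-filter-decomposed decoder, assuming
# PyTorch. Module names, sizes, and the final combination step are hypothetical
# and only meant to illustrate the idea of modeling the two features separately.
import torch
import torch.nn as nn


class FeedForwardTransformerBlock(nn.Module):
    """One feed-forward Transformer (FFT) block: self-attention + conv feed-forward."""

    def __init__(self, dim: int = 256, heads: int = 2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Conv1d(dim, dim * 4, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(dim * 4, dim, kernel_size=3, padding=1),
        )
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):  # x: (batch, frames, dim)
        a, _ = self.attn(x, x, x)
        x = self.norm1(x + a)
        y = self.ffn(x.transpose(1, 2)).transpose(1, 2)
        return self.norm2(x + y)


class DecomposedSourceFilterDecoder(nn.Module):
    """Hypothetical parallel-branch decoder: a text-driven (filter) branch and a
    pitch-driven (source) branch, whose outputs are summed into a mel estimate."""

    def __init__(self, dim: int = 256, n_mels: int = 80):
        super().__init__()
        self.pitch_embed = nn.Linear(1, dim)          # frame-level F0 -> hidden
        self.filter_branch = FeedForwardTransformerBlock(dim)
        self.source_branch = FeedForwardTransformerBlock(dim)
        self.to_mel_filter = nn.Linear(dim, n_mels)
        self.to_mel_source = nn.Linear(dim, n_mels)

    def forward(self, text_hidden, f0):
        # text_hidden: (batch, frames, dim) -- phoneme encodings already expanded
        # to frame length (duration regulation is omitted in this sketch).
        # f0: (batch, frames, 1) -- frame-level pitch contour.
        pitch_hidden = self.pitch_embed(f0)
        formant = self.filter_branch(text_hidden)                      # filter component
        excitation = self.source_branch(text_hidden + pitch_hidden)    # source component
        return self.to_mel_filter(formant) + self.to_mel_source(excitation)


if __name__ == "__main__":
    decoder = DecomposedSourceFilterDecoder()
    text_hidden = torch.randn(2, 120, 256)
    f0 = torch.rand(2, 120, 1)
    mel = decoder(text_hidden, f0)
    print(mel.shape)  # torch.Size([2, 120, 80])

Because the pitch contour only enters the source branch in this sketch, shifting F0 leaves the text-driven branch untouched, which is the kind of decoupling the abstract argues helps preserve audio quality and speaker characteristics under large pitch shifts.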