FastPitchFormant: Source-filter based Decomposed Modeling for Speech Synthesis
Format: Journal Article
Language: English
Published: 29-06-2021
Summary: Methods for modeling and controlling prosody with acoustic features have been proposed for neural text-to-speech (TTS) models. Prosodic speech can be generated by conditioning on acoustic features. However, speech synthesized with a large pitch-shift scale suffers from audio quality degradation and deformation of speaker characteristics. To address this problem, we propose a feed-forward Transformer based TTS model designed according to the source-filter theory. This model, called FastPitchFormant, has a unique structure that handles text and acoustic features in parallel. By modeling each feature separately, the model's tendency to learn the relationship between the two features can be mitigated.
DOI: 10.48550/arxiv.2106.15123
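The abstract does not give architectural details, but the decomposed, source-filter style modeling it describes can be illustrated with a minimal sketch: a filter-like branch driven only by the text representation and a source-like branch driven only by the pitch contour, combined at the output rather than mixing the pitch embedding into the text hidden states. The module names (FormantBranch, ExcitationBranch), layer sizes, and the plain feed-forward layers standing in for the paper's feed-forward Transformer blocks are illustrative assumptions, not the authors' implementation.

```python
# Minimal PyTorch sketch of decomposed (source-filter style) mel prediction.
# Names, sizes, and layer choices are assumptions for illustration only.
import torch
import torch.nn as nn


class FormantBranch(nn.Module):
    """Filter-like branch: driven only by the upsampled text representation."""
    def __init__(self, hidden: int = 256, n_mels: int = 80):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_mels))

    def forward(self, text_hidden: torch.Tensor) -> torch.Tensor:
        return self.net(text_hidden)


class ExcitationBranch(nn.Module):
    """Source-like branch: driven only by the frame-level pitch contour."""
    def __init__(self, hidden: int = 256, n_mels: int = 80):
        super().__init__()
        self.pitch_embed = nn.Linear(1, hidden)
        self.net = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_mels))

    def forward(self, pitch: torch.Tensor) -> torch.Tensor:
        # pitch: (batch, frames, 1) frame-level F0 values
        return self.net(self.pitch_embed(pitch))


class DecomposedDecoder(nn.Module):
    """Combine the two branches instead of adding pitch into the text hidden states."""
    def __init__(self, hidden: int = 256, n_mels: int = 80):
        super().__init__()
        self.formant = FormantBranch(hidden, n_mels)
        self.excitation = ExcitationBranch(hidden, n_mels)

    def forward(self, text_hidden: torch.Tensor, pitch: torch.Tensor) -> torch.Tensor:
        # Each feature is modeled separately, so text (formant) information is
        # less likely to become entangled with the pitch (excitation) signal.
        return self.formant(text_hidden) + self.excitation(pitch)


if __name__ == "__main__":
    decoder = DecomposedDecoder()
    text_hidden = torch.randn(2, 120, 256)  # upsampled text encoding, 120 frames
    pitch = torch.randn(2, 120, 1)          # frame-level pitch contour
    mel = decoder(text_hidden, pitch)
    print(mel.shape)                        # torch.Size([2, 120, 80])
```

Because only the excitation branch sees the pitch contour in this sketch, shifting the pitch at inference time changes that branch alone, which is the intuition behind the abstract's claim that large pitch shifts degrade quality less when the two features are modeled separately.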