Improving Mandarin Tone Recognition Based on DNN by Combining Acoustic and Articulatory Features Using Extended Recognition Networks
Published in: Journal of Signal Processing Systems, Vol. 90, No. 7, pp. 1077-1087
Main Authors:
Format: Journal Article
Language: English
Published: New York: Springer US, 01-07-2018 (Springer Nature B.V.)
Summary: In this paper, we investigate the effectiveness of articulatory information for Mandarin tone modeling and recognition in a deep neural network – hidden Markov model (DNN-HMM) framework. Whereas conventional approaches build tone classifiers from prosodic evidence (e.g., F0, duration, and energy), we propose performance-enhancement techniques in three areas: (i) adding articulatory features (AFs) and acoustic features, such as MFCCs (Mel-frequency cepstral coefficients), to the tone-modeling features; (ii) adopting phone-dependent tone modeling; and (iii) using a tone-based extended recognition network (ERN) to reduce the tone search space. The first technique is feature-related: it explicitly employs the AFs as a form of tonal features and is implemented through a multi-stage procedure. The second is model-related: it extends the units to phone-dependent tone models, so that each modeling unit (e.g., a tonal phone) carries not only tone information but also phone/articulatory information. The third is search-related: it constrains decoding with a phone-dependent, tone-based extended recognition network. A series of comprehensive experiments is conducted with different input feature sets. The results demonstrate that (i) tone recognition accuracy is boosted by incorporating articulatory information, and (ii) the ERN attains the lowest tone error rate, 7.17%, a 56% relative error reduction from the 16.36% error of the prosody-only baseline system.
ISSN: 1939-8018, 1939-8115
DOI: 10.1007/s11265-018-1334-2
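
The feature-level combination described in the summary (prosodic cues, MFCCs, and articulatory features fed to a DNN tone classifier) can be illustrated with a minimal sketch. The feature dimensions, layer sizes, and five-way tone target below are illustrative assumptions for the sketch only, not the configuration reported in the paper.

```python
# Minimal sketch of frame-level feature combination for DNN tone
# classification: prosodic features (F0, energy), MFCCs, and
# articulatory-feature (AF) posteriors are concatenated per frame and
# passed to a feed-forward network. Dimensions and layer sizes are
# assumptions, not those used in the paper.
import torch
import torch.nn as nn

N_PROSODIC = 3     # e.g., F0, delta-F0, frame energy (assumed)
N_MFCC = 39        # e.g., 13 MFCCs + deltas + delta-deltas (assumed)
N_AF = 24          # e.g., posteriors over articulatory classes (assumed)
N_TONES = 5        # Mandarin tones 1-4 plus the neutral tone

class ToneDNN(nn.Module):
    """Feed-forward classifier over concatenated frame features."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_PROSODIC + N_MFCC + N_AF, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, N_TONES),  # per-frame tone logits
        )

    def forward(self, prosodic, mfcc, af):
        # Combine the three feature streams along the feature axis.
        x = torch.cat([prosodic, mfcc, af], dim=-1)
        return self.net(x)

if __name__ == "__main__":
    model = ToneDNN()
    batch = 8  # frames in a toy mini-batch
    logits = model(torch.randn(batch, N_PROSODIC),
                   torch.randn(batch, N_MFCC),
                   torch.randn(batch, N_AF))
    print(logits.shape)  # torch.Size([8, 5])
```

In a DNN-HMM setup such as the one the summary describes, the per-frame tone (or tonal-phone) posteriors from a network like this would be converted to scaled likelihoods and decoded against HMM state sequences; the extended recognition network would then restrict which tonal-phone sequences the decoder may consider.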