Integration of Optimized Modulation Filter Sets Into Deep Neural Networks for Automatic Speech Recognition
Published in: | IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 24, no. 12, pp. 2439-2452 |
---|---|
Main Authors: | , , |
Format: | Journal Article |
Language: | English |
Published: | Piscataway, IEEE, 01-12-2016, The Institute of Electrical and Electronics Engineers, Inc. (IEEE) |
Subjects: | |
Summary: | Inspired by physiological studies on the human auditory system and by results from psychoacoustics, an amplitude modulation filter bank (AMFB) has been developed and successfully applied to feature extraction for automatic speech recognition (ASR) in earlier work. Here, we address the question as to which amplitude modulation (AM) frequency decomposition leads to optimal ASR performance by proposing a parameterized functional relationship between modulation center frequency and modulation bandwidth. Word error rates (WERs) of ASR experiments with 1551 different AMFBs are systematically evaluated and compared, resulting in the identification of a comparatively narrow range of optimal modulation frequency to modulation bandwidth characteristics. To integrate modulation processing with deep neural network (DNN) acoustic modeling, we propose merging of modulation filter coefficients with DNN weights prior to a final training step and an improved mean-variance normalization scheme for AMFBs. These modifications are shown to result in further reduction of WERs and are indicative of the proposed system's improved generalization ability, when compared across corpora of 100-960 h of data with mismatched training and test conditions. Analysis of DNN-learned temporal AM filtering properties is carried out and implications for the relevance of different modulation regions as well as the relation to psychoacoustic findings are discussed. ASR experiments with the proposed system demonstrate a high degree of robustness against extrinsic acoustic distortions, resulting in, e.g., an average WER of 9.79% on the Aurora-4 task. |
---|---|
ISSN: | 2329-9290 2329-9304 |
DOI: | 10.1109/TASLP.2016.2615239 |
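
The summary mentions merging modulation filter coefficients with DNN weights before a final training step. The record does not give the authors' procedure, so the following is only a minimal sketch of the general idea, assuming the modulation filtering is a linear map applied directly before the first affine DNN layer. Under that assumption the two stages can be folded into a single weight matrix. The function name `merge_filters_into_layer`, the array shapes, and the random data are all hypothetical and chosen for illustration.

```python
import numpy as np


def merge_filters_into_layer(filters, weights, bias):
    """Fold a linear filter stage into the affine layer that follows it.

    filters: (n_filters, n_inputs) matrix applying the filter bank (assumed linear)
    weights: (n_hidden, n_filters) first-layer weights acting on filter outputs
    bias:    (n_hidden,) first-layer bias
    Returns weights and bias that act directly on the unfiltered input.
    """
    merged_weights = weights @ filters  # composition of the two linear maps
    return merged_weights, bias.copy()  # the bias is unaffected by the merge


# Hypothetical sizes, not taken from the paper.
rng = np.random.default_rng(0)
n_inputs, n_filters, n_hidden = 31, 9, 16
filters = rng.standard_normal((n_filters, n_inputs))
weights = rng.standard_normal((n_hidden, n_filters))
bias = rng.standard_normal(n_hidden)
x = rng.standard_normal(n_inputs)

# Two-stage computation: filter first, then apply the layer.
two_stage = weights @ (filters @ x) + bias

# One-stage computation with the merged parameters.
merged_weights, merged_bias = merge_filters_into_layer(filters, weights, bias)
one_stage = merged_weights @ x + merged_bias

print(np.allclose(two_stage, one_stage))
```

After such a merge, the filter coefficients are no longer fixed but become part of the trainable first-layer weights, which is consistent with the summary's description of a final training step following the merge.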