Environmentally robust ASR front-end for deep neural network acoustic models

Bibliographic Details
Published in: Computer Speech & Language, Vol. 31, No. 1, pp. 65-86
Main Authors: Yoshioka, T.; Gales, M.J.F.
Format: Journal Article
Language: English
Published: Elsevier Ltd, 01-05-2015
Description
Summary:

•Effects of various front-end schemes are examined using DNN acoustic models.
•Meeting transcription experiments are conducted using a single distant microphone.
•Both speaker-independent and speaker-adaptive configurations are considered.
•A pipeline is proposed to integrate different classes of front-end schemes.
•The pipeline is used to analyse the way in which different schemes interact.

This paper examines the individual and combined impacts of various front-end approaches on the performance of deep neural network (DNN) based speech recognition systems in distant-talking situations, where acoustic environmental distortion degrades recognition performance. Training of a DNN-based acoustic model consists of generating state alignments and then learning the network parameters. This paper first shows that the network parameters are more sensitive to speech quality than the alignments are, and that this stage therefore requires improvement. Various front-end robustness approaches to this problem are then categorised by functionality, and the degree to which each class of approaches affects the performance of DNN-based acoustic models is examined experimentally. Based on the results, a front-end processing pipeline is proposed for efficiently combining different classes of approaches. Using this front-end, the combined effects of the different classes are further evaluated on a single-distant-microphone meeting transcription task with both speaker independent (SI) and speaker adaptive training (SAT) set-ups. By combining multiple speech enhancement results, multiple types of features, and feature transformation, the front-end yields relative performance gains of 7.24% and 9.83% in the SI and SAT scenarios, respectively, over competitive DNN-based systems using log mel-filter bank features.
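To make the kind of pipeline the abstract describes concrete, the following is a minimal Python sketch (not the authors' implementation): the single-distant-microphone signal is passed through more than one speech enhancement method, log filter-bank style features are extracted from each output, a simple feature transformation (per-utterance mean/variance normalisation) is applied, and the streams are stacked frame by frame into the input a DNN acoustic model would consume. All function names, parameters, and the toy feature extractor are hypothetical placeholders, not the schemes evaluated in the paper.

import numpy as np

def enhance_dereverb(wav):
    """Stand-in for a dereverberation front-end; returns the signal unchanged."""
    return wav

def enhance_denoise(wav):
    """Stand-in for a noise-suppression front-end; returns the signal unchanged."""
    return wav

def logfbank(wav, n_bands=24, frame_len=400, hop=160):
    """Crude log filter-bank features: framed FFT magnitudes pooled into bands."""
    n_frames = 1 + max(0, (len(wav) - frame_len) // hop)
    feats = np.zeros((n_frames, n_bands))
    window = np.hanning(frame_len)
    for t in range(n_frames):
        frame = wav[t * hop : t * hop + frame_len]
        spec = np.abs(np.fft.rfft(frame * window))
        bands = np.array_split(spec, n_bands)
        feats[t] = np.log(np.array([b.sum() for b in bands]) + 1e-8)
    return feats

def mean_var_norm(feats):
    """Per-utterance mean/variance normalisation as a simple feature transform."""
    return (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-8)

def front_end(wav):
    """Combine several enhancement outputs and feature streams into one input."""
    streams = []
    for enhance in (enhance_dereverb, enhance_denoise):
        enhanced = enhance(wav)
        streams.append(mean_var_norm(logfbank(enhanced)))
    # Stack the streams frame by frame; a DNN acoustic model would consume this.
    return np.concatenate(streams, axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    wav = rng.standard_normal(16000)   # one second of synthetic 16 kHz audio
    feats = front_end(wav)
    print(feats.shape)                 # (frames, 2 * 24)

In an actual system the placeholder enhancers would be replaced by real dereverberation and noise-suppression algorithms, and the stacked features would be fed to the DNN training stages (alignment generation followed by parameter learning) discussed above.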
ISSN: 0885-2308; 1095-8363
DOI: 10.1016/j.csl.2014.11.008