Performance monitoring for automatic speech recognition in noisy multi-channel environments


Bibliographic Details
Published in: 2016 IEEE Spoken Language Technology Workshop (SLT), pp. 50-56
Main Authors: Meyer, Bernd T., Mallidi, Sri Harish, Castro Martinez, Angel Mario, Paya-Vaya, Guillermo, Kayser, Hendrik, Hermansky, Hynek
Format: Conference Proceeding
Language: English
Published: IEEE 01-12-2016
Description
Summary: In many applications of machine listening it is useful to know how well an automatic speech recognition system will do before the actual recognition is performed. In this study we investigate different performance measures with the aim of predicting word error rates (WERs) in spatial acoustic scenes in which the type of noise, the signal-to-noise ratio, parameters for spatial filtering, and the amount of reverberation are varied. All measures under consideration are based on phoneme posteriorgrams obtained from a deep neural net. While frame-wise entropy exhibits only medium predictive power for factors other than additive noise, we found the mean temporal distance between posterior vectors (M-Measure) as well as matched phoneme filters (MaP) to exhibit excellent correlations with WER across all conditions. Since our results were obtained with simulated behind-the-ear hearing aid signals, we discuss possible applications for speech-aware hearing devices.
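The two posteriorgram-based quantities named in the summary can be sketched numerically. The snippet below is a minimal illustration, not the authors' implementation: it computes the mean per-frame entropy of a posteriorgram and a simple M-Measure-style mean temporal distance, here using the symmetric Kullback-Leibler divergence between posterior vectors at increasing frame lags (the paper's exact divergence and lag range may differ). The toy posteriorgram, its size, and the `max_lag` parameter are assumptions for the example.

```python
import numpy as np

def framewise_entropy(post):
    """Mean per-frame entropy of a phoneme posteriorgram.
    post: array of shape (T, K); each row is a posterior distribution."""
    eps = 1e-12  # guard against log(0)
    return float(-np.sum(post * np.log(post + eps), axis=1).mean())

def m_measure(post, max_lag=50):
    """Mean temporal distance between posterior vectors, averaged over
    frame lags 1..max_lag. Symmetric KL divergence is used here as the
    distance between two posterior vectors."""
    eps = 1e-12
    logp = np.log(post + eps)
    per_lag = []
    for lag in range(1, max_lag + 1):
        p, q = post[:-lag], post[lag:]
        lp, lq = logp[:-lag], logp[lag:]
        # symmetric KL: sum_k (p_k - q_k) * (log p_k - log q_k)
        skl = np.sum((p - q) * (lp - lq), axis=1)
        per_lag.append(skl.mean())
    return float(np.mean(per_lag))

# Toy posteriorgram: 200 frames over 40 phoneme classes (assumed sizes).
rng = np.random.default_rng(0)
post = rng.dirichlet(np.ones(40) * 0.5, size=200)
H = framewise_entropy(post)
M = m_measure(post, max_lag=10)
```

The intuition matching the summary: confident (peaky) posteriors yield low frame-wise entropy, while slowly varying or smeared posteriors, as produced by noise and reverberation, shrink the temporal distance between nearby frames and thus lower the M-Measure.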
DOI: 10.1109/SLT.2016.7846244