Optimization of Speaker-Aware Multichannel Speech Extraction with ASR Criterion

Bibliographic Details
Published in: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6702 - 6706
Main Authors: Zmolikova, Katerina, Delcroix, Marc, Kinoshita, Keisuke, Higuchi, Takuya, Nakatani, Tomohiro, Cernocky, Jan
Format: Conference Proceeding
Language: English
Published: IEEE, 01-04-2018
Description
Summary: This paper addresses the problem of recognizing speech corrupted by overlapping speakers in a multichannel setting. To extract a target speaker from the mixture, we use a neural network based beamformer, which uses masks estimated by a neural network to compute statistically optimal spatial filters. Following our previous work, we inform the neural network about the target speaker using information extracted from an adaptation utterance, enabling the network to track the target speaker. While in the previous work this method was used to extract the speaker separately and then pass the preprocessed speech to a speech recognition system, here we explore training both systems jointly with a common speech recognition criterion. We show that integrating the two systems and training for the final objective improves the performance. In addition, the integration enables further sharing of information between the acoustic model and the speaker extraction system by making use of the predicted HMM-state posteriors to refine the masks used for beamforming.
ISSN: 2379-190X
DOI: 10.1109/ICASSP.2018.8461533
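
The summary above centers on mask-based beamforming: a neural network estimates time-frequency masks, from which statistically optimal spatial filters are computed. Below is a minimal sketch of that step, assuming an MVDR formulation (one common choice for mask-based beamformers); it is not the authors' implementation. The speaker-aware mask network and the joint ASR training loop are out of scope here, so `target_mask` and `noise_mask` are hypothetical stand-ins for the network's outputs.

```python
# Minimal sketch of mask-based MVDR beamforming (illustrative, not the
# paper's code): masks weight the mixture STFT to estimate per-frequency
# spatial covariance matrices, and an MVDR filter extracts the target.
import numpy as np

def spatial_covariance(Y, mask):
    """Mask-weighted spatial covariance, one (C x C) matrix per frequency.

    Y:    mixture STFT, shape (channels, frames, freqs), complex
    mask: time-frequency mask, shape (frames, freqs), values in [0, 1]
    """
    # Phi[f] = sum_t mask(t,f) * y(t,f) y(t,f)^H, normalized by mask mass
    Phi = np.einsum('tf,ctf,dtf->fcd', mask, Y, Y.conj())
    norm = np.maximum(mask.sum(axis=0), 1e-8)
    return Phi / norm[:, None, None]

def mvdr_weights(Phi_s, Phi_n, ref_ch=0):
    """MVDR filter w(f) = (Phi_n^{-1} Phi_s) e_ref / tr(Phi_n^{-1} Phi_s)."""
    F, C, _ = Phi_s.shape
    w = np.zeros((F, C), dtype=complex)
    for f in range(F):
        G = np.linalg.solve(Phi_n[f], Phi_s[f])  # Phi_n^{-1} Phi_s
        w[f] = G[:, ref_ch] / np.trace(G)
    return w

def extract(Y, target_mask, noise_mask):
    """Beamform the multichannel mixture toward the masked target speaker."""
    w = mvdr_weights(spatial_covariance(Y, target_mask),
                     spatial_covariance(Y, noise_mask))
    # X_hat(t,f) = w(f)^H y(t,f), shape (frames, freqs)
    return np.einsum('fc,ctf->tf', w.conj(), Y)
```

In the paper's joint setup, the masks would come from the speaker-aware network conditioned on the adaptation utterance and, after integration with the acoustic model, be refined using predicted HMM-state posteriors; in this sketch they are simply inputs.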