A Multichannel MMSE-Based Framework for Speech Source Separation and Noise Reduction
We propose a new framework for joint multichannel speech source separation and acoustic noise reduction. In this framework, we start by formulating the minimum-mean-square error (MMSE)-based solution in the context of multiple simultaneous speakers and background noise, and outline the importance of...
Saved in:
Published in: | IEEE transactions on audio, speech, and language processing Vol. 21; no. 9; pp. 1913 - 1928 |
---|---|
Main Authors: | , , , , |
Format: | Journal Article |
Language: | English |
Published: |
Piscataway, NJ
IEEE
01-09-2013
Institute of Electrical and Electronics Engineers |
Subjects: | |
Online Access: | Get full text |
Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
Summary: | We propose a new framework for joint multichannel speech source separation and acoustic noise reduction. In this framework, we start by formulating the minimum-mean-square error (MMSE)-based solution in the context of multiple simultaneous speakers and background noise, and outline the importance of the estimation of the activities of the speakers. The latter is accurately achieved by introducing a latent variable that takes N+1 possible discrete states for a mixture of N speech signals plus additive noise. Each state characterizes the dominance of one of the N+1 signals. We determine the posterior probability of this latent variable, and show how it plays a twofold role in the MMSE-based speech enhancement. First, it allows the extraction of the second order statistics of the noise and each of the speech signals from the noisy data. These statistics are needed to formulate the multichannel Wiener-based filters (including the minimum variance distortionless response). Second, it weighs the outputs of these linear filters to shape the spectral contents of the signals' estimates following the associated target speakers' activities. We use the spatial and spectral cues contained in the multichannel recordings of the sound mixtures to compute the posterior probability of this latent variable. The spatial cue is acquired by using the normalized observation vector whose distribution is well approximated by a Gaussian-mixture-like model, while the spectral cue can be captured by using a pre-trained Gaussian mixture model for the log-spectra of speech. The parameters of the investigated models and the speakers' activities (posterior probabilities of the different states of the latent variable) are estimated via expectation maximization. Experimental results including comparisons with the well-known independent component analysis and masking are provided to demonstrate the efficiency of the proposed framework. |
---|---|
Bibliography: | ObjectType-Article-1 SourceType-Scholarly Journals-1 ObjectType-Feature-2 content type line 23 |
ISSN: | 1558-7916 1558-7924 |
DOI: | 10.1109/TASL.2013.2263137 |