Multichannel KHMF for speech separation with enthalpy based DOA and score based CNN (SCNN)

Bibliographic Details
Published in: Evolving Systems, Vol. 14, No. 3, pp. 501–518
Main Authors: Koteswararao, Yannam Vasantha, Rao, C. B. Rama
Format: Journal Article
Language: English
Published: Berlin/Heidelberg: Springer Berlin Heidelberg, 01-06-2023
Description
Summary: Multi-channel speech separation (SS) refers to extracting an individual speaker's speech from the overlapping audio of simultaneous speakers recorded at several sensors. The use of visual modalities for multi-channel speech separation has so far shown great potential; here, the separation of multiple signals from their superposition recorded at several sensors is addressed. To overcome existing drawbacks, this article proposes an effective method: a novel hybrid combining enthalpy-based direction-of-arrival (DOA) estimation and krill-herd-based matrix factorization (KHMF) to segment multi-channel speech signals, together with score-based convolutional neural network (SCNN) mask estimation. First, the short-time Fourier transform (STFT) of the input signal is computed. The tracking branch then computes the enthalpy of the analyzed signal, where enthalpy denotes the DOA-based spatial energy in each time frame. A Gaussian mixture model (GMM), which estimates the enthalpy function at each time frame, converts the spatial energy histogram into DOA measurements. Based on the signal tracker's output, an enthalpy-based spatial covariance matrix model with DOA parameters is determined. Multi-channel KHMF is then used to estimate the spatial behavior of the source over time and its spectral model from the tracked direction. Next, effective features such as directivity and spatial features are extracted according to the target speaker's spatial direction, and score-based convolutional neural network (SCNN) relation masking is applied. Finally, the inverse STFT (iSTFT) converts the resulting speech spectrogram back into the extracted output signal. Experimental results show that the proposed approach achieves the highest SDR improvement at −5 dB, 8.1, comparable to CTF-MINT, which achieves 8.05; CTF-MPDR and CTF-BP give lower SDR improvements of 7.71 and 7.4, while the unprocessed signal (Unproc) is lowest at 5.71.
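The mask-and-resynthesize step of the pipeline described above (STFT → time-frequency mask → iSTFT) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `separate_with_mask` is a hypothetical helper, and the energy-threshold mask is a crude stand-in for the SCNN-estimated relation mask.

```python
import numpy as np
from scipy.signal import stft, istft

def separate_with_mask(mixture, mask_fn, fs=16000, nperseg=512):
    """Apply a time-frequency mask to a mixture and resynthesize with iSTFT.

    mask_fn maps a complex spectrogram to a real-valued mask in [0, 1]
    (a stand-in here for the SCNN-estimated mask in the paper).
    """
    _, _, Z = stft(mixture, fs=fs, nperseg=nperseg)
    mask = mask_fn(Z)
    _, x_hat = istft(Z * mask, fs=fs, nperseg=nperseg)
    return x_hat[: len(mixture)]

# Toy usage: a pure tone buried in broadband noise; the mask keeps only
# the highest-energy time-frequency bins (hypothetical placeholder mask).
rng = np.random.default_rng(0)
fs = 16000
n = fs  # one second of audio
target = np.sin(2 * np.pi * 440 * np.arange(n) / fs)
mix = target + 0.5 * rng.standard_normal(n)

def energy_mask(Z, q=0.8):
    mag = np.abs(Z)
    return (mag > np.quantile(mag, q)).astype(float)

est = separate_with_mask(mix, energy_mask, fs=fs)
```

Because the tone's energy is concentrated in a few spectrogram bins while the noise is spread across all of them, zeroing the low-energy bins removes most of the noise energy and the resynthesized signal is closer to the clean target than the mixture was.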
ISSN: 1868-6478
1868-6486
DOI: 10.1007/s12530-022-09473-x