Combining Conformer and Dual-Path-Transformer Networks for Single Channel Noisy Reverberant Speech Separation

Bibliographic Details
Published in: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 11491-11495
Main Authors: Ravenscroft, William; Goetze, Stefan; Hain, Thomas
Format: Conference Proceeding
Language: English
Published: IEEE 14-04-2024
Description
Summary: Separation of overlapping speakers remains an active area of speech technology research. Many deep neural network (DNN) separation models propose modelling local and global temporal context separately using alternating DNN layers. Two such models are SepFormer and TD-Conformer. The largest configurations of each have comparable computational cost and similar performance, with SepFormer performing better on anechoic data and TD-Conformer yielding better results on noisy reverberant data. This work combines these two model types to gain insights into how their computational characteristics affect their performance. The generalization benefits of the larger model size of the conformer layers are demonstrated on both WHAMR and the out-of-domain far-field evaluation set MC-WSJ-AV across a number of evaluation metrics. The proposed model achieves 22.1 dB and 14.7 dB average scale-invariant signal-to-distortion ratio (SISDR) improvement when trained and evaluated on WSJ0-2Mix and WHAMR, respectively. The model trained on WHAMR achieves 4.3 dB average SISDR improvement on the out-of-domain MC-WSJ-AV dataset.
ISSN: 2379-190X
DOI: 10.1109/ICASSP48485.2024.10447644
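
Note on the metric: the summary reports results as SISDR improvement in dB. As a rough illustration of what that quantity measures, the sketch below computes it using the commonly used definition of scale-invariant SDR (project the estimate onto the reference, then compare target energy with residual energy). It is not taken from the paper: the function names, the mean removal step, the `eps` stabiliser and the toy signals are all assumptions made for illustration.

```python
import math


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant SDR in dB between two equal-length 1-D signals.

    Assumption: the widely used definition with mean removal; the paper's
    exact evaluation code may differ.
    """
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")
    # Remove the mean so that a DC offset does not affect the score.
    est_mean = sum(estimate) / len(estimate)
    ref_mean = sum(reference) / len(reference)
    est = [x - est_mean for x in estimate]
    ref = [x - ref_mean for x in reference]
    # Project the estimate onto the reference to get the scaled target.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref) + eps
    scale = dot / ref_energy
    target = [scale * r for r in ref]
    # Everything in the estimate that is not the scaled target is distortion.
    residual = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(n * n for n in residual) + eps
    return 10.0 * math.log10(target_energy / residual_energy + eps)


def si_sdr_improvement(estimate, mixture, reference):
    """SI-SDR of the separated estimate minus SI-SDR of the unprocessed mixture."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)


# Toy signals (hypothetical): a target tone plus an equally loud interfering tone.
n = 64
reference = [math.sin(2 * math.pi * 3 * k / n) for k in range(n)]
interferer = [math.sin(2 * math.pi * 5 * k / n) for k in range(n)]
mixture = [r + i for r, i in zip(reference, interferer)]
# Pretend a separator attenuated the interferer to a tenth of its amplitude.
estimate = [r + 0.1 * i for r, i in zip(reference, interferer)]

improvement_db = si_sdr_improvement(estimate, mixture, reference)
```

In this toy case the interferer's amplitude drops by a factor of ten, so the improvement comes out at roughly 20 dB. Rescaling `estimate` by any non-zero constant leaves the score unchanged, which is the sense in which the metric is scale-invariant.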