Combining Conformer and Dual-Path-Transformer Networks for Single Channel Noisy Reverberant Speech Separation
Published in: | ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) pp. 11491 - 11495 |
---|---|
Main Authors: | , , |
Format: | Conference Proceeding |
Language: | English |
Published: | IEEE, 14-04-2024 |
Subjects: | |
Summary: | Separation of overlapping speakers remains an active area of speech technology research. Many deep neural network (DNN) separation models propose modelling local and global temporal context separately using alternating DNN layers. Two such models are SepFormer and TD-Conformer. The largest configurations of each have comparable computational cost and similar performance, with SepFormer performing better on anechoic data and TD-Conformer yielding better results on noisy reverberant data. This work combines these two model types to gain insight into how their computational characteristics affect their performance. The generalization benefits of the larger model size of the conformer layers are demonstrated on both WHAMR and the out-of-domain far-field evaluation set MC-WSJ-AV across a number of evaluation metrics. The proposed model achieves 22.1 dB and 14.7 dB average scale-invariant signal-to-distortion ratio (SISDR) improvement when trained and evaluated on WSJ0-2Mix and WHAMR, respectively. The model trained on WHAMR achieves 4.3 dB average SISDR improvement on the out-of-domain MC-WSJ-AV dataset. |
---|---|
ISSN: | 2379-190X |
DOI: | 10.1109/ICASSP48485.2024.10447644 |
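
The summary reports results as average SISDR improvement. As a rough illustration of what that metric measures, the sketch below computes scale-invariant signal-to-distortion ratio for a single signal pair using the commonly used definition (zero-mean signals, projection of the estimate onto the reference). It is not taken from the paper: the function names `si_sdr` and `si_sdr_improvement`, the `eps` stabiliser, and the toy sinusoid signals are all assumptions made for this example, and the paper's own evaluation code may differ in detail.

```python
import math


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant SDR in dB for two equal-length sample sequences.

    Illustrative helper (hypothetical name), following the common
    definition: remove the means, project the estimate onto the
    reference, and compare target energy with residual energy.
    """
    if len(estimate) != len(reference):
        raise ValueError("estimate and reference must have the same length")
    n = len(reference)
    est_mean = sum(estimate) / n
    ref_mean = sum(reference) / n
    est = [x - est_mean for x in estimate]
    ref = [x - ref_mean for x in reference]

    # Optimal scaling of the reference that best explains the estimate.
    ref_energy = sum(r * r for r in ref) + eps
    scale = sum(e * r for e, r in zip(est, ref)) / ref_energy

    target = [scale * r for r in ref]
    residual = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(x * x for x in residual)
    # eps (an assumption of this sketch) avoids division by zero and log(0).
    return 10.0 * math.log10((target_energy + eps) / (residual_energy + eps))


def si_sdr_improvement(estimate, mixture, reference):
    """SI-SDR of the separated estimate minus SI-SDR of the unprocessed mixture."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)


# Toy signals (assumed for illustration): a target tone plus an interfering tone.
N = 1000
reference = [math.sin(2 * math.pi * 5 * t / N) for t in range(N)]
interferer = [math.sin(2 * math.pi * 13 * t / N) for t in range(N)]
mixture = [r + i for r, i in zip(reference, interferer)]
# A hypothetical separator output that leaves 10% of the interferer amplitude.
estimate = [r + 0.1 * i for r, i in zip(reference, interferer)]

if __name__ == "__main__":
    print(round(si_sdr_improvement(estimate, mixture, reference), 2))
```

In this toy case the interferer is attenuated by a factor of 10 in amplitude, so the improvement comes out close to 20 dB. The figures in the summary are averages of this kind of per-utterance improvement over an evaluation set.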