Multi-Channel Speech Separation Using Spatially Selective Deep Non-Linear Filters
Published in: | IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 32, pp. 542-553 |
Main Authors: | Tesch, Kristina; Gerkmann, Timo |
Format: | Journal Article |
Language: | English |
Published: | Piscataway: IEEE (The Institute of Electrical and Electronics Engineers, Inc.), 2024 |
Abstract | In a multi-channel separation task with multiple speakers, we aim to recover all individual speech signals from the mixture. In contrast to single-channel approaches, which rely on the different spectro-temporal characteristics of the speech signals, multi-channel approaches should additionally utilize the different spatial locations of the sources for a more powerful separation especially when the number of sources increases. To enhance the spatial processing in a multi-channel source separation scenario, in this work, we propose a deep neural network (DNN) based spatially selective filter (SSF) that can be spatially steered to extract the speaker of interest by initializing a recurrent neural network layer with the target direction. We compare the proposed SSF with a common end-to-end direct separation (DS) approach trained using utterance-wise permutation invariant training (PIT), which only implicitly learns to perform spatial filtering. We show that the SSF has a clear advantage over a DS approach with the same underlying network architecture when there are more than two speakers in the mixture, which can be attributed to a better use of the spatial information. Furthermore, we find that the SSF generalizes much better to additional noise sources that were not seen during training and to scenarios with speakers positioned at a similar angle. |
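The abstract contrasts two training setups: a direct separation (DS) network trained with utterance-wise permutation invariant training (PIT), and a spatially selective filter (SSF) that is steered toward the speaker of interest by initializing a recurrent layer with the target direction. The sketch below is not the authors' implementation; it is a minimal PyTorch-style illustration of those two ideas, with hypothetical tensor shapes, layer sizes, and a (cos, sin) encoding of the steering angle chosen purely for the example.

```python
# Minimal illustrative sketch (not the paper's code) of utterance-wise PIT and
# of steering a recurrent layer via its initial hidden state.
# Assumed shapes: estimates/targets are (batch, speakers, samples);
# features are (batch, frames, feat_dim); the steering angle is in radians.

from itertools import permutations

import torch
import torch.nn as nn


def utterance_pit_loss(estimates: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Utterance-wise PIT: try every assignment of network outputs to reference
    speakers and keep, per utterance, the assignment with the smallest loss."""
    _, n_spk, _ = estimates.shape
    losses = []
    for perm in permutations(range(n_spk)):
        # mean squared error per batch element for this output-to-speaker assignment
        err = ((estimates[:, list(perm), :] - targets) ** 2).mean(dim=(1, 2))
        losses.append(err)
    # (num_permutations, batch) -> best permutation per utterance, averaged over the batch
    return torch.stack(losses, dim=0).min(dim=0).values.mean()


class SteeredExtractor(nn.Module):
    """Sketch of a spatially steered extractor: the GRU's initial hidden state is
    derived from the target direction, so the same network can be 'pointed' at
    different speakers (layer sizes here are hypothetical)."""

    def __init__(self, feat_dim: int = 257, hidden: int = 256):
        super().__init__()
        self.dir_embed = nn.Linear(2, hidden)   # maps (cos, sin) of the steering angle
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True)
        self.mask = nn.Sequential(nn.Linear(hidden, feat_dim), nn.Sigmoid())

    def forward(self, feats: torch.Tensor, angle: torch.Tensor) -> torch.Tensor:
        # feats: (batch, frames, feat_dim); angle: (batch,) target direction in radians
        h0 = torch.tanh(self.dir_embed(torch.stack([angle.cos(), angle.sin()], dim=-1)))
        out, _ = self.gru(feats, h0.unsqueeze(0))   # (1, batch, hidden) initial state
        return self.mask(out)                        # per-frame mask for the steered speaker
```

In this sketch the spatial steering enters only through the initial recurrent state, mirroring the mechanism named in the abstract; the paper's actual network architecture, input features, and loss functions are described in the full text.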
Author | Tesch, Kristina (Signal Processing Group, Department of Informatics, Universität Hamburg, Hamburg, Germany; kristina.tesch@uni-hamburg.de; ORCID 0000-0002-6458-8128); Gerkmann, Timo (Signal Processing Group, Department of Informatics, Universität Hamburg, Hamburg, Germany; timo.gerkmann@uni-hamburg.de; ORCID 0000-0002-8678-4699)
CODEN | ITASFA |
Copyright | Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2024 |
DOI | 10.1109/TASLP.2023.3334101 |
Discipline | Engineering |
EISSN | 2329-9304 |
EndPage | 553 |
Genre | orig-research |
GrantInformation | Deutsche Forschungsgemeinschaft, Grant 508337379 (funder ID 10.13039/501100001659)
ISSN | 2329-9290 |
IsPeerReviewed | true |
IsScholarly | true |
PageCount | 12 |
PublicationTitleAbbrev | TASLP |
StartPage | 542 |
SubjectTerms | Artificial neural networks; DNN-based; Filtering; Maximum likelihood detection; Mixtures; multi-channel; Neural networks; Nonlinear filters; Permutations; Recurrent neural networks; Separation; Spatial data; Spatial filtering; Spatial filters; spatially selective filter (SSF); Speech; Speech processing; speech separation; Task analysis; Training
URI | https://ieeexplore.ieee.org/document/10321676 https://www.proquest.com/docview/2899224392 |
Volume | 32 |