Multi-Channel Speech Separation Using Spatially Selective Deep Non-Linear Filters

Bibliographic Details
Published in: IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 32, pp. 542-553
Main Authors: Tesch, Kristina; Gerkmann, Timo
Format: Journal Article
Language: English
Published: Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2024
Abstract In a multi-channel separation task with multiple speakers, we aim to recover all individual speech signals from the mixture. In contrast to single-channel approaches, which rely on the different spectro-temporal characteristics of the speech signals, multi-channel approaches should additionally utilize the different spatial locations of the sources for a more powerful separation especially when the number of sources increases. To enhance the spatial processing in a multi-channel source separation scenario, in this work, we propose a deep neural network (DNN) based spatially selective filter (SSF) that can be spatially steered to extract the speaker of interest by initializing a recurrent neural network layer with the target direction. We compare the proposed SSF with a common end-to-end direct separation (DS) approach trained using utterance-wise permutation invariant training (PIT), which only implicitly learns to perform spatial filtering. We show that the SSF has a clear advantage over a DS approach with the same underlying network architecture when there are more than two speakers in the mixture, which can be attributed to a better use of the spatial information. Furthermore, we find that the SSF generalizes much better to additional noise sources that were not seen during training and to scenarios with speakers positioned at a similar angle.
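The utterance-wise permutation invariant training (PIT) objective used for the DS baseline can be illustrated with a minimal sketch. It is not the authors' implementation: this record does not state which per-source loss the paper uses, so the mean squared error below is an assumed stand-in, and the names mse and upit_loss are hypothetical. The sketch evaluates every assignment of network outputs to reference speakers over the whole utterance and keeps the cheapest one.

```python
from itertools import permutations


def mse(estimate, reference):
    """Mean squared error between two equal-length sample sequences (assumed loss)."""
    return sum((e - r) ** 2 for e, r in zip(estimate, reference)) / len(reference)


def upit_loss(estimates, references, pair_loss=mse):
    """Utterance-wise PIT: search all output-to-speaker assignments.

    Returns (best_loss, best_perm), where best_perm[i] is the index of the
    reference signal assigned to estimates[i]. The assignment is fixed for
    the whole utterance rather than chosen per frame.
    """
    n = len(references)
    best_loss, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        # Average the per-source loss under this candidate assignment.
        loss = sum(pair_loss(estimates[i], references[perm[i]]) for i in range(n)) / n
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Toy example: the two outputs are the references in swapped order.
references = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
estimates = [[0.0, 1.0, 0.0, 1.0], [1.0, 0.0, 1.0, 0.0]]
loss, perm = upit_loss(estimates, references)
print(loss, perm)
```

The search visits all N! permutations, which is cheap for the small speaker counts typical of such tasks. Note that the objective says nothing about where a speaker is located, which is consistent with the abstract's remark that the DS approach learns spatial filtering only implicitly, whereas the SSF is told the target direction explicitly.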
Author Tesch, Kristina
Gerkmann, Timo
Author_xml – sequence: 1
  givenname: Kristina
  orcidid: 0000-0002-6458-8128
  surname: Tesch
  fullname: Tesch, Kristina
  email: kristina.tesch@uni-hamburg.de
  organization: Signal Processing Group, Department of Informatics, Universität Hamburg, Hamburg, Germany
– sequence: 2
  givenname: Timo
  orcidid: 0000-0002-8678-4699
  surname: Gerkmann
  fullname: Gerkmann, Timo
  email: timo.gerkmann@uni-hamburg.de
  organization: Signal Processing Group, Department of Informatics, Universität Hamburg, Hamburg, Germany
CODEN ITASFA
ContentType Journal Article
Copyright Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2024
DOI 10.1109/TASLP.2023.3334101
Discipline Engineering
EISSN 2329-9304
EndPage 553
ExternalDocumentID 10_1109_TASLP_2023_3334101
10321676
Genre orig-research
GrantInformation_xml – fundername: Deutsche Forschungsgemeinschaft
  grantid: 508337379
  funderid: 10.13039/501100001659
ISSN 2329-9290
IsPeerReviewed true
IsScholarly true
Language English
PageCount 12
PublicationTitle IEEE/ACM transactions on audio, speech, and language processing
PublicationTitleAbbrev TASLP
PublicationYear 2024
StartPage 542
SubjectTerms Artificial neural networks
DNN-based
Filtering
Maximum likelihood detection
Mixtures
multi-channel
Neural networks
Nonlinear filters
Permutations
Recurrent neural networks
Separation
Spatial data
Spatial filtering
Spatial filters
spatially selective filter (SSF)
Speech
Speech processing
speech separation
Task analysis
Training
URI https://ieeexplore.ieee.org/document/10321676
https://www.proquest.com/docview/2899224392
Volume 32