Speaker activity driven neural speech extraction
Main Authors:
Format: Journal Article
Language: English
Published: 14-01-2021
Subjects:
Summary: Target speech extraction, which extracts the speech of a target speaker from a mixture given auxiliary speaker clues, has recently received increased interest. Various clues have been investigated, such as pre-recorded enrollment utterances, direction information, or video of the target speaker. In this paper, we explore the use of speaker activity information as an auxiliary clue for single-channel neural network-based speech extraction. We propose a speaker activity driven speech extraction neural network (ADEnet) and show that it can achieve performance levels competitive with enrollment-based approaches, without the need for pre-recordings. We further demonstrate the potential of the proposed approach for processing meeting-like recordings, where the speaker activity is obtained from a diarization system. We show that this simple yet practical approach can successfully extract speakers after diarization, which results in improved ASR performance, especially in highly overlapping conditions, with a relative word error rate reduction of up to 25%.
DOI: 10.48550/arxiv.2101.05516
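
To make the idea in the summary concrete, the following PyTorch sketch conditions a simple mask-based extraction network on a frame-level speaker-activity clue (e.g. diarization output) instead of an enrollment embedding. This is only an illustrative sketch under assumptions of my own (BLSTM mask estimator, conditioning by concatenating the activity as an extra feature channel, the layer sizes); it is not the authors' ADEnet implementation.

```python
# Minimal sketch of activity-conditioned target speech extraction.
# NOT the paper's ADEnet: architecture and conditioning scheme are assumptions.
import torch
import torch.nn as nn


class ActivityDrivenExtractor(nn.Module):
    """Estimate a time-frequency mask for the target speaker from the mixture
    spectrogram and a frame-level activity clue (1 = target speaker active)."""

    def __init__(self, n_freq=257, hidden=256):
        super().__init__()
        # Mixture features plus one extra channel carrying the activity clue.
        self.blstm = nn.LSTM(n_freq + 1, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.mask = nn.Sequential(nn.Linear(2 * hidden, n_freq), nn.Sigmoid())

    def forward(self, mix_mag, activity):
        # mix_mag:  (batch, frames, n_freq) magnitude spectrogram of the mixture
        # activity: (batch, frames) binary or soft target-speaker activity
        x = torch.cat([mix_mag, activity.unsqueeze(-1)], dim=-1)
        h, _ = self.blstm(x)
        m = self.mask(h)            # mask values in [0, 1]
        return m * mix_mag          # masked (extracted) magnitude


if __name__ == "__main__":
    net = ActivityDrivenExtractor()
    mix = torch.rand(1, 100, 257)                  # dummy 100-frame mixture
    act = (torch.rand(1, 100) > 0.5).float()       # e.g. diarization output
    est = net(mix, act)
    print(est.shape)                               # torch.Size([1, 100, 257])
```

In a meeting-processing pipeline as described in the summary, the activity input would come from a diarization system, and the extracted signal would then be passed to an ASR system.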