Supplementary Material: AVA-ActiveSpeaker: An Audio-Visual Dataset for Active Speaker Detection

Bibliographic Details
Published in:	2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pp. 3718-3722
Main Authors:	Roth, Joseph; Chaudhuri, Sourish; Klejch, Ondrej; Marvin, Radhika; Gallagher, Andrew; Kaver, Liat; Ramaswamy, Sharadh; Stopczynski, Arkadiusz; Schmid, Cordelia; Xi, Zhonghua; Pantofaru, Caroline
Format:	Conference Proceeding
Language:	English
Published:	IEEE 01-10-2019
Subjects:	active speaker; audiovisual; dataset; joint audiovisual; Labeling; machine learning; modeling; Music; neural networks; Predictive models; Speech; Speech recognition; Synchronization; Visualization

Description
Summary:	Active speaker detection is an important component in video analysis algorithms for applications such as speaker diarization, video re-targeting for meetings, speech enhancement, and human-robot interaction. The absence of a large, carefully labeled audio-visual active speaker dataset has limited algorithm evaluation in terms of data diversity, environments, and accuracy. In this paper, we present the AVA Active Speaker detection dataset (AVA-ActiveSpeaker), which has been publicly released to facilitate algorithm development and comparison. It contains temporally labeled face tracks in video, where each face instance is labeled as speaking or not, and whether the speech is audible. This dataset contains about 3.65 million human-labeled frames spanning 38.5 hours. We also introduce a state-of-the-art approach for real-time active speaker detection and compare several variants. This evaluation demonstrates a significant gain due to audio-visual modeling and temporal integration over multiple frames.
ISSN:	2473-9944
DOI:	10.1109/ICCVW.2019.00460