13 years of speaker recognition research at BUT, with longitudinal analysis of NIST SRE

Bibliographic Details
Published in:	Computer speech & language Vol. 63; p. 101035
Main Authors:	Matějka, Pavel; Plchot, Oldřich; Glembek, Ondřej; Burget, Lukáš; Rohdin, Johan; Zeinali, Hossein; Mošner, Ladislav; Silnova, Anna; Novotný, Ondřej; Diez, Mireia; Černocký, Jan “Honza”
Format:	Journal Article
Language:	English
Published:	Elsevier Ltd 01-09-2020
Subjects:	DNN Embedding; Eigen-channel compensation; Evaluations; GMM; I-vectors; JFA; NIST; Speaker recognition; X-vectors

Description
Summary:	Highlights:
•We present a “longitudinal study” of all important milestone techniques used in speaker recognition by evaluating on multiple NIST SREs.
•We provide an analysis of the difficulty of individual NIST SREs.
•We investigate the impact of the amount of training data on the performance of particular speaker recognition methods.
•We also evaluate the milestone techniques on the Speakers In The Wild (SITW) and VOiCES challenge datasets, as the amount of, and interest in, user-contributed audiovisual content grows.
Abstract: In this paper, we present a brief history and a “longitudinal study” of all important milestone modelling techniques used in text-independent speaker recognition since Brno University of Technology (BUT) first participated in the NIST Speaker Recognition Evaluation (SRE) in 2006: GMM MAP, GMM MAP with eigen-channel adaptation, Joint Factor Analysis (JFA), i-vector and DNN embedding (x-vector). To emphasize the historical context, the techniques are evaluated on all NIST SRE sets since 2004 on a time-machine principle, i.e. a system is always trained using only the data available up to the year of the evaluation. Moreover, as user-contributed audiovisual content dominates today's Internet, we include the Speakers In The Wild (SITW) and VOiCES challenge datasets as representative examples in the evaluation of our systems. We not only compare the modelling techniques but also show the effect of sampling frequency.
ISSN:	0885-2308, 1095-8363
DOI:	10.1016/j.csl.2019.101035