Histogram equalization using a reduced feature set of background speakers' utterances for speaker recognition

Bibliographic Details
Published in:	Frontiers of information technology & electronic engineering Vol. 18; no. 5; pp. 738-750
Main Authors:	Kim, Myung-jae, Yang, Il-ho, Kim, Min-seok, Yu, Ha-jin
Format:	Journal Article
Language:	English
Published:	Hangzhou: Zhejiang University Press; Springer Nature B.V., 01-05-2017
Subjects:	Algorithms Clustering Codec Communications Engineering Computer Hardware Computer Science Computer Systems Organization and Communication Networks Discriminant analysis Distribution functions Electrical Engineering Electronics and Microelectronics Equalization Greedy algorithms Histograms Instrumentation Mean Networks Performance enhancement Speech recognition Telecommunications TN912.34 i-vector Speaker recognition Histogram equalization

Description
Summary:	We propose a method for histogram equalization using supplement sets to improve the performance of speaker recognition when the training and test utterances are very short. The supplement sets are derived from the background speakers' utterances using the outputs of selection or clustering algorithms. The proposed approach is used as a feature normalization method for building histograms when there are insufficient input utterance samples. In addition, the proposed method is used as an i-vector normalization method in an i-vector-based probabilistic linear discriminant analysis (PLDA) system, which is the current state of the art for speaker verification. The ranks of sample values for histogram equalization are estimated in ascending order from both the input utterances and the supplement set. New ranks are obtained by summing these different kinds of ranks. Subsequently, the proposed method determines the cumulative distribution function of the test utterance using the newly defined ranks. The proposed method is compared with conventional feature normalization methods, such as cepstral mean normalization (CMN), cepstral mean and variance normalization (MVN), histogram equalization (HEQ), and the European Telecommunications Standards Institute (ETSI) advanced front-end methods. In addition, performance is compared when the supplement set is derived with the greedy selection algorithm and with the fuzzy C-means and K-means clustering algorithms. The YOHO and Electronics and Telecommunications Research Institute (ETRI) databases are used for evaluation in the feature space, and the test sets are simulated using the Opus VoIP codec. We also use the 2008 National Institute of Standards and Technology (NIST) speaker recognition evaluation (SRE) corpus for the i-vector system. The experimental results demonstrate that the proposed method improves average system performance compared with the conventional feature normalization methods.
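The summary describes the supplement-set idea only at the level of ranks. The Python sketch below illustrates one possible reading of it, under assumptions that are not taken from the paper: the combined rank of an utterance sample is its ascending rank within the utterance plus the number of supplement-set values below it; the CDF is estimated at the midpoint (r - 0.5)/(N + M), where N and M are the utterance and supplement-set sizes; the target distribution is a standard normal; and the reduced supplement set is formed from K-means centroids of background frames (the greedy selection and fuzzy C-means variants are not shown). The function names `heq`, `heq_frames`, and `kmeans_supplement` are hypothetical.

```python
from bisect import bisect_left
from statistics import NormalDist

_STD_NORMAL = NormalDist(mu=0.0, sigma=1.0)  # assumed reference distribution


def _ascending_ranks(values):
    """1-based ranks of `values` in ascending order (ties broken by position)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


def heq(values, supplement=()):
    """Rank-based HEQ of one feature dimension (sketch, not the paper's code).

    values:     this dimension's values over the frames of one (short) utterance
    supplement: supplement-set values for the same dimension; leave empty for
                conventional HEQ, which only sees len(values) quantile levels
    """
    supp = sorted(supplement)
    total = len(values) + len(supp)
    out = []
    for x, r_in in zip(values, _ascending_ranks(values)):
        # Assumed combined rank: rank within the utterance plus the number of
        # supplement values lying below x, i.e. x's rank in the pooled sample.
        r = r_in + bisect_left(supp, x)
        cdf = (r - 0.5) / total  # midpoint estimate, strictly inside (0, 1)
        out.append(_STD_NORMAL.inv_cdf(cdf))
    return out


def heq_frames(frames, supplement_frames=()):
    """Apply heq() independently to every dimension of a list of frames."""
    dims = len(frames[0])
    cols = [heq([f[d] for f in frames], [s[d] for s in supplement_frames])
            for d in range(dims)]
    return [list(row) for row in zip(*cols)]


def kmeans_supplement(frames, k, iters=20):
    """Hypothetical reduction step: summarize background frames by k centroids.

    Assumes k <= len(frames); uses a naive deterministic initialization.
    """
    centroids = [list(f) for f in frames[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for f in frames:
            nearest = min(range(k), key=lambda c: sum(
                (a - b) ** 2 for a, b in zip(f, centroids[c])))
            clusters[nearest].append(f)
        for j, members in enumerate(clusters):
            if members:  # keep the previous centroid if a cluster empties
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


if __name__ == "__main__":
    # Made-up numbers: a 3-frame, 2-dimensional "utterance" and synthetic
    # background frames standing in for background speakers' features.
    utterance = [[0.3, -1.2], [1.5, 0.4], [-0.7, 0.9]]
    background = [[0.1 * i, -0.05 * i] for i in range(-10, 11)]
    supplement = kmeans_supplement(background, k=4)
    print(heq_frames(utterance))              # conventional HEQ
    print(heq_frames(utterance, supplement))  # HEQ with a supplement set
```

Under the assumed rank rule, summing the two ranks is the same as ranking each utterance sample within the pooled set of utterance and supplement values, so a 3-frame utterance is placed on N + M quantile levels instead of only 3, while only the utterance's own samples are transformed.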
Bibliography:	Speaker recognition; Histogram equalization; i-vector; 33-1389/TP
ISSN:	2095-9184; 2095-9230
DOI:	10.1631/FITEE.1500380