Deep Neural Network Model for Recognition of Speaker's Emotion

Bibliographic Details
Published in: 2020 IEEE International Conference on Problems of Infocommunications. Science and Technology (PIC S&T), pp. 172-176
Main Authors: Toliupa, Sergey; Tereikovskyi, Ihor; Tereikovska, Liudmyla; Mussiraliyeva, Shynar; Bagitova, Kalamkas
Format: Conference Proceeding
Language: English
Published: IEEE, 06-10-2020
Summary: The article is devoted to the development of neural network tools for recognizing a speaker's emotions. It is determined that a deep neural network of the multi-layer perceptron type is the most effective for recognizing emotions in fixed fragments of a speech signal. The expediency of training the network on examples from the TESS database, where each record corresponds to one of seven basic emotions, is demonstrated. The architectural parameters of the neural network model are calculated for this speech corpus: the output neurons correspond to the 7 emotions, there are 2 hidden layers, and each hidden layer contains 200 neurons. The input neurons correspond to the Mel-frequency cepstral coefficients (MFCC) of each quasi-stationary fragment of the speech signal, and an expression is developed to calculate the number of input neurons as a function of the number of MFCCs. Computer experiments showed that one quasi-stationary fragment is adequately described by 20 MFCCs. At an acceptable level of resource intensity, the developed neural network model achieves an emotion recognition accuracy of about 0.94, which is comparable to known tools of similar purpose. The need for further research toward a method of neural network emotion recognition based on convolutional neural networks (CNN) is substantiated.
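For illustration, below is a minimal sketch of the classifier the abstract describes, assuming a Keras MLP fed with librosa MFCC features. Only the values stated in the abstract (20 MFCCs per fragment, 2 hidden layers of 200 neurons, 7 output classes) are taken from the source; the number of fragments per fixed-length input, the activation functions, the optimizer, and the padding scheme are assumptions.

# Minimal sketch (not the authors' implementation): a Keras MLP over
# librosa MFCC features, following the architecture stated in the abstract.
# N_FRAMES, the activations, the optimizer, and the padding scheme are
# assumptions; the abstract does not specify them.
import numpy as np
import librosa
from tensorflow import keras

N_MFCC = 20       # 20 MFCCs per quasi-stationary fragment (from the abstract)
N_EMOTIONS = 7    # seven basic emotions, as in the TESS corpus
N_FRAMES = 100    # assumed number of fragments per fixed-length input

def extract_features(wav_path: str) -> np.ndarray:
    # Each input neuron corresponds to one MFCC of one quasi-stationary
    # fragment, so the input dimension is N_FRAMES * N_MFCC.
    signal, sr = librosa.load(wav_path, sr=None)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=N_MFCC)  # (N_MFCC, t)
    mfcc = mfcc[:, :N_FRAMES]                      # truncate long recordings
    if mfcc.shape[1] < N_FRAMES:                   # zero-pad short ones
        mfcc = np.pad(mfcc, ((0, 0), (0, N_FRAMES - mfcc.shape[1])))
    return mfcc.T.flatten()                        # (N_FRAMES * N_MFCC,)

model = keras.Sequential([
    keras.layers.Input(shape=(N_FRAMES * N_MFCC,)),
    keras.layers.Dense(200, activation="relu"),            # hidden layer 1
    keras.layers.Dense(200, activation="relu"),            # hidden layer 2
    keras.layers.Dense(N_EMOTIONS, activation="softmax"),  # 7 emotion classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

With integer emotion labels y and a feature matrix X built with extract_features, model.fit(X, y) would train the classifier; reproducing the reported accuracy of about 0.94 would depend on preprocessing details the abstract does not give.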
DOI: 10.1109/PICST51311.2020.9468017