Fusion of mel and gammatone frequency cepstral coefficients for speech emotion recognition using deep C-RNN

Bibliographic Details
Published in:International Journal of Speech Technology, Vol. 24, No. 2, pp. 303–314
Main Authors: Kumaran, U., Radha Rammohan, S., Nagarajan, Senthil Murugan, Prathik, A.
Format: Journal Article
Language:English
Published: New York: Springer US, 01-06-2021
Springer Nature B.V.
Description
Summary:Emotions play a significant role in human life, and recognizing them from speech requires identifying the emotional features carried by speech signals. Speech Emotion Recognition (SER) has applications in education, health, forensics, defense, robotics, and scientific research, but it is limited by data labeling, misinterpretation of speech, audio annotation effort, and time complexity. This work evaluates SER using features extracted with Mel Frequency Cepstral Coefficients (MFCC) and Gammatone Frequency Cepstral Coefficients (GFCC) to study emotions across different versions of audio signals. In the feature extraction stage, the sound signals are segmented and each frame is parametrized using MFCC, GFCC, and combined (M-GFCC) features. Building on recent advances in deep learning, the paper proposes a Deep Convolutional-Recurrent Neural Network (Deep C-RNN) for the classification stage to assess how effectively the model learns emotion variations. A fusion of Mel and Gammatone filters in the convolutional layers first extracts high-level spectral features; recurrent layers then learn the long-term temporal context from those features. The proposed work also differentiates emotions from neutral speech using binary tree diagrammatic illustrations. The methodology is applied to the large Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) dataset. Finally, the proposed approach, which achieves accuracy above 80% with low loss, is compared with state-of-the-art approaches, and the experimental results provide evidence that the fused features outperform them in recognizing emotions from speech signals.
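
For readers who want a concrete starting point, the sketch below shows, under assumptions not stated in this record, how fused M-GFCC features and a convolutional-recurrent classifier of the kind the abstract describes could be wired up in Python with librosa and Keras. The extract_gfcc() helper is a hypothetical stand-in for a gammatone-cepstral implementation, and the coefficient counts, frame length, layer sizes, and eight RAVDESS emotion classes are illustrative choices, not the authors' exact configuration.

# Hedged sketch: M-GFCC feature fusion + Conv-RNN classifier (illustrative, not the paper's setup).
import numpy as np
import librosa
from tensorflow.keras import layers, models

N_CEPS = 13      # cepstral coefficients per frame (assumed)
N_FRAMES = 200   # fixed number of frames per utterance (assumed)
N_CLASSES = 8    # RAVDESS emotion categories

def extract_gfcc(signal, sr, n_ceps=N_CEPS):
    """Hypothetical GFCC extractor; in practice use a gammatone-cepstral
    library or a gammatone filterbank followed by log compression and a DCT."""
    raise NotImplementedError

def m_gfcc_features(path):
    """Load audio, compute MFCC and GFCC per frame, and fuse them along the coefficient axis."""
    signal, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=N_CEPS)   # (n_ceps, frames)
    gfcc = extract_gfcc(signal, sr)                               # (n_ceps, frames)
    fused = np.concatenate([mfcc, gfcc], axis=0)                  # (2*n_ceps, frames)
    # Pad or trim to a fixed number of frames so utterances can be batched.
    fused = librosa.util.fix_length(fused, size=N_FRAMES, axis=1)
    return fused.T[..., np.newaxis]                               # (frames, 2*n_ceps, 1)

def build_deep_crnn():
    """Convolutional layers learn spectral patterns from the fused features;
    an LSTM then models the long-term temporal context, as the abstract describes."""
    inp = layers.Input(shape=(N_FRAMES, 2 * N_CEPS, 1))
    x = layers.Conv2D(32, (3, 3), padding="same", activation="relu")(inp)
    x = layers.MaxPooling2D((2, 2))(x)
    x = layers.Conv2D(64, (3, 3), padding="same", activation="relu")(x)
    x = layers.MaxPooling2D((2, 2))(x)
    # Collapse the frequency axis so each time step becomes one feature vector.
    x = layers.Reshape((x.shape[1], x.shape[2] * x.shape[3]))(x)
    x = layers.LSTM(64)(x)
    out = layers.Dense(N_CLASSES, activation="softmax")(x)
    model = models.Model(inp, out)
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

A batch of m_gfcc_features() outputs stacked with np.stack, together with one-hot emotion labels, could then be passed to model.fit(); the accuracy above 80% reported in the abstract should not be expected from this illustrative configuration.
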
ISSN:1381-2416
1572-8110
DOI:10.1007/s10772-020-09792-x