Multimodal Emotion Recognition via Convolutional Neural Networks: Comparison of different strategies on two multimodal datasets

Bibliographic Details
Published in:	Engineering Applications of Artificial Intelligence, Vol. 130, p. 107708
Main Authors:	Bilotti, U., Bisogni, C., De Marsico, M., Tramonte, S.
Format:	Journal Article
Language:	English
Published:	Elsevier Ltd 01-04-2024
Subjects:	Biometrics; Deep learning; Emotion recognition; Multi-input model; Multimodal emotion recognition

Description
Summary:	The aim of this paper is to investigate emotion recognition using a multimodal approach that exploits convolutional neural networks (CNNs) with multiple inputs. Multimodal approaches let different modalities cooperate and generally achieve better performance, because different features are extracted from different pieces of information. In this work, the facial frames, the optical flow computed from consecutive facial frames, and the Mel spectrograms (named after the word melody) are extracted from videos and combined in different ways to understand which combination of modalities works best. Experiments are first run on one modality at a time, so that each single-input model reaches good accuracy. Afterward, the models are concatenated to create a final model that accepts multiple inputs. The datasets used for the experiments are BAUM-1 (Bahçeşehir University Multimodal Affective Database - 1) and RAVDESS (Ryerson Audio–Visual Database of Emotional Speech and Song). Both collect two distinct sets of videos based on the intensity of the expression, that is, acted/strong or spontaneous/normal, and provide representations of the emotional states considered here: angry, disgust, fearful, happy and sad. The performance of the proposed models is shown through accuracy results and some confusion matrices, demonstrating higher accuracy than the proposals from the literature used for comparison. The best accuracy achieved on the BAUM-1 dataset is about 95%, while on RAVDESS it is about 95.5%.
• Emotion recognition through multimodal architectures.
• Comparison of early to late fusion of 1-input models.
• Synchronization of video and audio channels.
ISSN:	0952-1976 1873-6769
DOI:	10.1016/j.engappai.2023.107708
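
The summary describes concatenating single-input models into one multi-input model. The sketch below illustrates that general idea only: it joins one feature vector per modality and applies a single linear layer with softmax over the five emotion classes. It is not the authors' architecture. The feature sizes (128, 64, 32), the random weights, and the function name late_fusion_predict are assumptions made for illustration, and the random arrays stand in for the outputs of trained CNN branches.

```python
import numpy as np

# The five emotional states named in the summary.
EMOTIONS = ["angry", "disgust", "fearful", "happy", "sad"]


def softmax(z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def late_fusion_predict(features, weights, bias):
    """Concatenate per-modality feature vectors, then apply linear + softmax.

    `features` is a list of arrays shaped (batch, d_i), one per modality.
    The name and signature are hypothetical, chosen for this sketch.
    """
    fused = np.concatenate(features, axis=-1)  # shape: (batch, sum of d_i)
    return softmax(fused @ weights + bias)     # shape: (batch, n_classes)


rng = np.random.default_rng(0)
batch = 4

# Placeholder features standing in for three single-input branches
# (facial frames, optical flow, Mel spectrogram). Sizes are assumed.
face = rng.normal(size=(batch, 128))
flow = rng.normal(size=(batch, 64))
mel = rng.normal(size=(batch, 32))

# Untrained, randomly initialised classification head (illustration only).
W = rng.normal(scale=0.01, size=(128 + 64 + 32, len(EMOTIONS)))
b = np.zeros(len(EMOTIONS))

probs = late_fusion_predict([face, flow, mel], W, b)
labels = [EMOTIONS[i] for i in probs.argmax(axis=1)]
print(probs.shape, labels)
```

Dropping one array from the list and shrinking W to match gives a two-modality variant, which is one simple way to compare modality combinations as the summary describes.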