An improved sentiment classification model based on data quality and word embeddings

Bibliographic Details
Published in:	The Journal of supercomputing Vol. 79; no. 11; pp. 11871 - 11894
Main Authors:	Siagh, Asma, Laallam, Fatima Zohra, Kazar, Okba, Salem, Hajer
Format:	Journal Article
Language:	English
Published:	New York: Springer US, 01-07-2023; Springer Nature B.V
Subjects:	Big Data; Compilers; Computer Science; Data analysis; Data mining; Digital media; Interpreters; Processor Architectures; Programming Languages; Sentiment analysis; Social networks; User generated content; Deep learning; Imbalanced data; Natural language processing; Transfer learning; Word representation

Description
Summary:	User-generated content on social media platforms has reached big data levels. Sentiment analysis of this data provides opportunities to gain valuable insights into any domain. However, real-world data is often class-imbalanced, which can harm a model's ability to generalize because the model overfits the majority class. An efficient model that handles any scenario of imbalanced data is therefore needed in practice. To that end, this work proposes several models based on a study of how data quality and transfer learning through pre-trained embeddings affect minority class detection. The proposed models are tested on imbalanced datasets related to social media and education. The experimental results highlight the effectiveness of Word2vec, GloVe, and fastText embeddings with preprocessed data. In contrast, BERT embeddings give better results with non-preprocessed data. Furthermore, compared with other methods, the best-performing model from this study achieves notable improvements.
ISSN:	0920-8542 1573-0484
DOI:	10.1007/s11227-023-05099-1
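
The summary's concern with majority class overfitting can be illustrated with a small sketch. This is not the authors' method: the toy labels, the function names `balanced_class_weights` and `per_class_f1`, and the inverse-frequency weighting are all assumptions chosen for illustration. The sketch shows why plain accuracy can look good while the minority class is never detected, and how class weights might be derived to counter the imbalance.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count).

    Hypothetical helper; a common heuristic, not taken from the paper.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * m) for c, m in counts.items()}


def per_class_f1(y_true, y_pred):
    """F1 score computed separately for each class."""
    scores = {}
    for c in set(y_true) | set(y_pred):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


# Toy imbalanced data (made up): 9 positive reviews, 1 negative.
y_true = ["pos"] * 9 + ["neg"]
# A degenerate classifier that always predicts the majority class.
y_pred = ["pos"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1 = per_class_f1(y_true, y_pred)
weights = balanced_class_weights(y_true)

print("accuracy:", accuracy)
print("per-class F1:", f1)
print("class weights:", weights)
```

On this toy data the always-majority classifier reaches high accuracy while its F1 on the minority class is zero, which is why minority-class metrics are the more informative measure on imbalanced datasets. The weights give the rare class a proportionally larger contribution and could be passed to a weighted loss.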