An empirical study to address the problem of Unbalanced Data Sets in sentiment classification

Bibliographic Details
Published in:	2012 IEEE International Conference on Systems, Man, and Cybernetics (SMC) pp. 3298 - 3303
Main Authors:	Mountassir, A., Benbrahim, H., Berrada, I.
Format:	Conference Proceeding
Language:	English
Published:	IEEE 01-10-2012
Subjects:	Accuracy; Labeling; Machine Learning; Natural Language Processing; Niobium; Opinion Mining; Radio frequency; Sampling methods; Sentiment Analysis; Support vector machines; Text Classification; Training; Unbalanced Data sets

Description
Summary:	With the emergence of Web 2.0, Sentiment Analysis is receiving more and more attention. Several interesting works have addressed different issues in Sentiment Analysis. Nevertheless, the problem of Unbalanced Data Sets has not been sufficiently addressed within this research area. This paper presents a study of unbalanced data sets in supervised sentiment classification in a multilingual context. We propose three methods to under-sample the majority-class documents: Remove Similar, Remove Farthest and Remove by Clustering. Our goal is to compare the effectiveness of the proposed methods with that of the common random under-sampling. We also aim to evaluate how classifiers behave at different under-sampling rates. We use three common classifiers, namely Naïve Bayes, Support Vector Machines and k-Nearest Neighbors. The experiments are carried out on two Arabic data sets and one English data set. We show that the four under-sampling methods are typically competitive with one another. Naïve Bayes appears insensitive to unbalanced data sets, whereas Support Vector Machines seem highly sensitive to them; k-Nearest Neighbors shows only a slight sensitivity to imbalance in comparison with Support Vector Machines.
ISBN:	9781467317139, 1467317136
ISSN:	1062-922X, 2577-1655
DOI:	10.1109/ICSMC.2012.6378300
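
The summary compares its proposed methods against common random under-sampling applied at different rates. As a rough illustration of that baseline only, the sketch below drops a randomly chosen fraction of the majority-class documents. This record does not say how the paper defines an under-sampling rate, so the `rate` parameter (the fraction of majority documents removed), the function name `random_undersample` and the toy data are all assumptions made for this example; the sketch does not implement Remove Similar, Remove Farthest or Remove by Clustering.

```python
import random
from collections import Counter


def random_undersample(docs, labels, majority_label, rate, seed=0):
    """Randomly remove a fraction `rate` of the majority-class documents.

    Assumption for this sketch: `rate` lies in [0, 1] and is the share of
    majority-class documents to discard (0 keeps all, 1 removes all).
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    majority_idx = [i for i, y in enumerate(labels) if y == majority_label]
    n_remove = int(round(rate * len(majority_idx)))
    removed = set(rng.sample(majority_idx, n_remove))
    # Keep the original document order for everything that was not removed.
    kept = [i for i in range(len(labels)) if i not in removed]
    return [docs[i] for i in kept], [labels[i] for i in kept]


# Hypothetical toy corpus: 8 negative and 2 positive documents.
toy_docs = [f"doc{i}" for i in range(10)]
toy_labels = ["neg"] * 8 + ["pos"] * 2
sampled_docs, sampled_labels = random_undersample(
    toy_docs, toy_labels, majority_label="neg", rate=0.5
)
print(Counter(sampled_labels))
```

With `rate=0.5`, half of the eight majority documents are discarded while the minority class is left untouched. Sweeping `rate` over several values is one plausible way to reproduce the kind of comparison across under-sampling rates that the summary mentions.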