Analysis Comparison of FastText and Word2vec for Detecting Offensive Language

Bibliographic Details
Published in: 2022 IEEE International Conference of Computer Science and Information Technology (ICOSNIKOM), pp. 1-8
Main Authors: Lumbantoruan, Rosni; Siregar, Rifka Uli; Manik, Indah; Tambunan, Nadya; Simanjuntak, Humasak
Format: Conference Proceeding
Language: English
Published: IEEE 19-10-2022
Description
Summary: Twitter is one of the most popular platforms for sharing opinions, ideas, feelings, and information. Tweets may contain language that a group or individual considers offensive. One problem that offensive language brings about is cyberbullying, which can encourage people to ask questions online and to use strong language to express hate. As a result, many users who interact online, including on social media, risk being ridiculed or harassed with abusive language that can affect them mentally. Identifying offensive language is therefore a necessary and useful task, especially on social media platforms. Offensive language can be classified as irony, sarcasm, or figurative language. Much current research on offensive language detection considers only irony or only sarcasm. However, offensive language detection can be a multi-class problem: the figurative class, for example, consists of both the irony and sarcasm labels. Here, we propose categorizing tweets into four categories: irony, sarcasm, figurative, or not offensive at all (regular). Specifically, we first capture the relationships between words using Word2vec and FastText word embeddings, each with the Continuous Bag of Words (CBOW) and Skip-gram architectures. We then classify the offensive language label using CNN-BiLSTM, a combination of two deep learning approaches, Convolutional Neural Networks (CNN) and Bidirectional Long Short-Term Memory (Bi-LSTM), after first examining the impact of hyper-parameters on language classification. Experiments on the Kaggle dataset indicate that CNN-BiLSTM with Word2vec using the CBOW architecture outperforms CNN-BiLSTM with FastText.
DOI: 10.1109/ICOSNIKOM56551.2022.10034886
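
The embedding step named in the summary can be illustrated with a small sketch. It is not the authors' code: the function names (`cbow_examples`, `skipgram_examples`, `char_ngrams`), the window size, and the sample tweet are assumptions made for illustration. It shows how CBOW and Skip-gram form training examples from the same token sequence, and how FastText differs from Word2vec by also representing a word through its character n-grams. Real implementations add details omitted here, such as frequent-word subsampling, randomized window sizes, and (in FastText) keeping the whole wrapped word as an extra unit.

```python
from typing import List, Tuple


def cbow_examples(tokens: List[str], window: int = 2) -> List[Tuple[List[str], str]]:
    """CBOW: predict the center word from its surrounding context words."""
    examples = []
    for i, target in enumerate(tokens):
        # Context = up to `window` words on each side of the target.
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            examples.append((context, target))
    return examples


def skipgram_examples(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Skip-gram: predict each context word from the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def char_ngrams(word: str, min_n: int = 3, max_n: int = 6) -> List[str]:
    """FastText-style subwords: character n-grams of the word wrapped in < and >."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for k in range(len(wrapped) - n + 1):
            grams.append(wrapped[k:k + n])
    return grams


if __name__ == "__main__":
    tweet = ["oh", "great", "another", "monday"]  # hypothetical sample tweet
    print(cbow_examples(tweet, window=1))
    print(skipgram_examples(tweet, window=1))
    print(char_ngrams("where", min_n=3, max_n=3))
```

Because FastText composes a word vector from its n-gram vectors, it can produce embeddings for misspelled or unseen words, which are common in tweets; Word2vec assigns one vector per vocabulary word. In the gensim library, both `Word2Vec` and `FastText` select between the two architectures with the `sg` parameter (`sg=0` for CBOW, `sg=1` for Skip-gram).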