Analysis Comparison of FastText and Word2vec for Detecting Offensive Language
Published in: | 2022 IEEE International Conference of Computer Science and Information Technology (ICOSNIKOM) pp. 1 - 8 |
---|---|
Format: | Conference Proceeding |
Language: | English |
Published: | IEEE, 19-10-2022 |
Summary: | Twitter is one of the most popular platforms for sharing opinions, ideas, feelings, and information. Tweets may contain language that a group or individual considers offensive. One problem caused by offensive language is cyberbullying, which can encourage someone to ask questions online and use strong language to discuss hate. As a result, many users who interact online, including on social media, run the risk of being made fun of or harassed with abusive language that can affect them mentally. Identifying offensive language is therefore both a necessary and a useful task, especially on social media platforms. Offensive language can be classified as irony, sarcasm, or figurative. Currently, much research on offensive language detection focuses on only one of irony or sarcasm. However, offensive language detection may be a multi-class problem, with classes such as figurative, which combines both the irony and sarcasm labels. Here, we propose categorizing tweets into four classes: irony, sarcasm, figurative, or not offensive at all (regular). Specifically, we first capture the relationships between words using Word2vec and FastText word embeddings with the Continuous Bag of Words (CBOW) and Skip-gram architectures. We then classify the offensive language label using CNN-BiLSTM, a combination of two deep learning approaches, Convolutional Neural Networks (CNN) and Bidirectional Long Short-Term Memory (Bi-LSTM), first examining the impact of hyper-parameters on language classification. Experiments on the Kaggle dataset indicate that CNN-BiLSTM with Word2vec using the CBOW architecture outperforms CNN-BiLSTM with FastText. |
---|---|
DOI: | 10.1109/ICOSNIKOM56551.2022.10034886 |
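
The summary contrasts the CBOW and Skip-gram architectures and the Word2vec and FastText embeddings. As an illustration only, and not the authors' implementation, the following minimal sketch in plain Python shows how the two architectures frame training examples from a tokenized tweet, and how FastText-style character n-grams split a word into subwords. The function names, the window size, and the sample tweet are all hypothetical; the n-gram range of 3 to 6 is assumed here as a typical default.

```python
from typing import List, Tuple


def cbow_pairs(tokens: List[str], window: int = 2) -> List[Tuple[List[str], str]]:
    """CBOW framing: predict the center word from its surrounding context words."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:  # skip words that have no neighbours at all
            pairs.append((context, target))
    return pairs


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Skip-gram framing: predict each context word from the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def char_ngrams(word: str, n_min: int = 3, n_max: int = 6) -> List[str]:
    """FastText-style subwords: character n-grams of the word wrapped in < and >."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(wrapped) - n + 1)]


tweet = ["what", "a", "great", "day"]  # hypothetical tokenized tweet
print(cbow_pairs(tweet, window=1))
print(skipgram_pairs(tweet, window=1))
print(char_ngrams("day", n_min=3, n_max=3))
```

The subword step is what distinguishes FastText from Word2vec in this sketch: a word unseen during training can still be represented through its character n-grams, whereas Word2vec treats each word as an indivisible unit.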