Introducing the Gab Hate Corpus: defining and applying hate-based rhetoric to social media posts at scale

Bibliographic Details
Published in:	Language Resources and Evaluation, Vol. 56, no. 1, pp. 79-108
Main Authors:	Kennedy, Brendan; Atari, Mohammad; Davani, Aida Mostafazadeh; Yeh, Leigh; Omrani, Ali; Kim, Yehsong; Coombs, Kris; Havaldar, Shreya; Portillo-Wightman, Gwenyth; Gonzalez, Elaine; Hoover, Joe; Azatian, Aida; Hussain, Alyzeh; Lara, Austin; Cardenas, Gabriel; Omary, Adam; Park, Christina; Wang, Xin; Wijaya, Clarisa; Zhang, Yong; Meyerowitz, Beth; Dehghani, Morteza
Format:	Journal Article
Language:	English
Published:	Dordrecht: Springer Netherlands, 01-03-2022; Springer Nature B.V.
Subjects:	Coding; Computational Linguistics; Computer mediated communication; Computer Science; Corpus analysis; Corpus linguistics; Digital media; Hate speech; Language and Literature; Linguistics; Original Paper; Psychology; Rhetoric; Social media; Social networks; Social Sciences; Sociology; Text classification

Description
Summary:	We present the Gab Hate Corpus (GHC), consisting of 27,665 posts from the social network service gab.com, each annotated for the presence of “hate-based rhetoric” by a minimum of three annotators. Posts were labeled according to a coding typology derived from a synthesis of hate speech definitions across legal precedent, previous hate speech coding typologies, and definitions from psychology and sociology, comprising hierarchical labels indicating dehumanizing and violent speech as well as indicators of targeted groups and rhetorical framing. We provide inter-annotator agreement statistics and perform a classification analysis in order to validate the corpus and establish performance baselines. The GHC complements existing hate speech datasets in its theoretical grounding and by providing a large, representative sample of richly annotated social media posts.
ISSN:	1574-020X, 1574-0218
DOI:	10.1007/s10579-021-09569-x