The Risk of Racial Bias in Hate Speech Detection

Maarten Sap, Dallas Card, Saadia Gabriel, Yejin Choi, Noah A. Smith

We investigate how annotators’ insensitivity to differences in dialect can lead to racial bias in automatic hate speech detection models, potentially amplifying harm against minority populations. […] Finally, we propose dialect and race priming as ways to reduce the racial bias in annotation, showing that when annotators are made explicitly aware of an AAE tweet’s dialect, they are significantly less likely to label the tweet as offensive.

Intersectional Bias in Hate Speech and Abusive Language Datasets

This study provides the first systematic evidence on intersectional bias in datasets of hate speech and abusive language in social media using a publicly available annotated Twitter dataset.

Demoting Racial Bias in Hate Speech Detection

Experimental results suggest that the adversarial training method used in this paper is able to substantially reduce the false positive rate for AAE text while only minimally affecting the performance of hate speech classification.

Mitigating Racial Bias in Social Media Hate Speech Detection

It is established that bias against users of African American English (AAE) exists in hate speech detection models; a literature review of current approaches to reducing such bias is provided, and lexical and syntactic alterations are proposed to remove protected attributes of AAE before training.

Analyzing Hate Speech Data along Racial, Gender and Intersectional Axes

This work identifies strong bias against African American English, masculine, and AAE+Masculine tweets, which are annotated as disproportionately more hateful and offensive than tweets from other demographics, and shows that balancing the training data for these protected attributes can lead to fairer models with regard to gender, but not race.

Examining Racial Bias in an Online Abuse Corpus with Structural Topic Modeling

It is found that certain topics are disproportionately racialized and considered abusive in social media posts, and the work examines how the prevalence of different topics relates to both abusiveness annotation and dialect prediction.

Hate speech detection and racial bias mitigation in social media based on BERT model

A transfer learning approach for hate speech detection is proposed, based on the pre-trained language model BERT (Bidirectional Encoder Representations from Transformers), along with a bias alleviation mechanism to mitigate the effect of training-set bias during fine-tuning of the model.

Differential Tweetment: Mitigating Racial Dialect Bias in Harmful Tweet Detection

It is found that when bias mitigation is employed, a high degree of predictive accuracy is maintained relative to baseline, and in many cases bias against AAE in harmful tweet predictions is reduced. However, the specific effects of these interventions on bias and performance vary widely between dataset contexts.

Contextualizing Hate Speech Classifiers with Post-hoc Explanation

This work extracts post-hoc explanations from fine-tuned BERT classifiers to detect bias towards identity terms and proposes a novel regularization technique based on these explanations that encourages models to learn from the context of group identifiers in addition to the identifiers themselves.

Detection of Social Biases in Hate Speech and Offensive Text

This report summarises existing hate speech and offensive text detection models, examines why hate speech models struggle to generalise, and reviews existing attempts at addressing the main obstacles.

Mitigating Biases in Toxic Language Detection through Invariant Rationalization

InvRat is proposed, a game-theoretic framework consisting of a rationale generator and a predictor, to rule out the spurious correlation of certain syntactic patterns to toxicity labels; it yields a lower false positive rate on both lexical and dialectal attributes than previous debiasing methods.

Reducing Gender Bias in Abusive Language Detection

Three mitigation methods, including debiased word embeddings, gender swap data augmentation, and fine-tuning with a larger corpus, can effectively reduce model bias by 90-98% and can be extended to correct model bias in other scenarios.
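As an illustrative sketch only (not the paper's implementation), gender-swap data augmentation can be as simple as duplicating each training example with gendered terms exchanged, so the model sees both variants with the same label; the word-pair list and helper names below are assumptions for illustration.

```python
# Toy gender-swap augmentation: duplicate each example with gendered
# words exchanged. Real systems use a much larger pair list and handle
# ambiguous mappings (e.g. "her" -> "him"/"his") with POS information.
GENDER_PAIRS = {
    "he": "she", "she": "he",
    "him": "her", "her": "him",
    "man": "woman", "woman": "man",
    "boy": "girl", "girl": "boy",
}

def gender_swap(sentence: str) -> str:
    """Swap gendered words in a whitespace-tokenized, lowercased sentence."""
    tokens = sentence.lower().split()
    return " ".join(GENDER_PAIRS.get(tok, tok) for tok in tokens)

def augment(dataset):
    """Return the original (text, label) pairs plus gender-swapped copies."""
    augmented = list(dataset)
    for text, label in dataset:
        augmented.append((gender_swap(text), label))
    return augmented
```

Because each swapped copy keeps its original label, the classifier is discouraged from associating either set of gendered terms with abusiveness.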

Locate the Hate: Detecting Tweets against Blacks

A supervised machine learning approach is applied, employing inexpensively acquired labeled data from diverse Twitter accounts to learn a binary classifier for the labels “racist” and “nonracist”, which has a 76% average accuracy on individual tweets, suggesting that with further improvements, this work can contribute data on the sources of anti-black hate speech.

Hateful Symbols or Hateful People? Predictive Features for Hate Speech Detection on Twitter

A list of criteria founded in critical race theory is provided, and these are used to annotate a publicly available corpus of more than 16k tweets and to present a dictionary based on the most indicative words in the data.

Demographic Dialectal Variation in Social Media: A Case Study of African-American English

A case study of dialectal language in online conversational text by investigating African-American English (AAE) on Twitter and proposes a distantly supervised model to identify AAE-like language from demographics associated with geo-located messages, and verifies that this language follows well-known AAE linguistic phenomena.

Comparative Studies of Detecting Abusive Language on Twitter

This paper conducts the first comparative study of various learning models on Hate and Abusive Speech on Twitter, and shows that a bidirectional GRU network trained on word-level features, with a Latent Topic Clustering module, is the most accurate model.

Automated Hate Speech Detection and the Problem of Offensive Language

This work used a crowd-sourced hate speech lexicon to collect tweets containing hate speech keywords and labeled a sample of these tweets into three categories: those containing hate speech, those with only offensive language, and those with neither.

Cyber Hate Classification: 'Othering' Language And Paragraph Embedding

A novel 'Othering Lexicon' is proposed to identify subtle language use, such as references to immigration or job prosperity in a hateful context, and is incorporated with embedding learning for feature extraction and subsequent classification using a neural network approach.

Characterizing and Detecting Hateful Users on Twitter

This work develops and employs a robust methodology to collect and annotate hateful users that does not depend directly on a lexicon and in which users are annotated given their entire profile, and frames hate speech detection as a task of semi-supervised learning over a graph, exploiting the network of connections on Twitter.

Examining a hate speech corpus for hate speech detection and popularity prediction

A critical look is taken at the training corpus in order to understand its biases, and the corpus is also used to venture beyond hate speech detection and investigate whether it can shed light on other facets of research, such as the popularity of hate tweets.

Measuring and Mitigating Unintended Bias in Text Classification

A new approach to measuring and mitigating unintended bias in machine learning models is introduced, using a set of common demographic identity terms as the subset of input features on which to measure bias.
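A minimal sketch of this style of measurement, under assumed definitions rather than the paper's actual metric suite: compare the false positive rate of a classifier on non-toxic examples mentioning each identity term against its overall false positive rate. The function names and toy predictor below are hypothetical.

```python
# Measure unintended bias via per-identity-term false positive rates.
# examples: list of (text, label) pairs, where label 1 = toxic.
def false_positive_rate(examples, predict):
    """Fraction of non-toxic examples the model wrongly flags as toxic."""
    negatives = [(t, y) for t, y in examples if y == 0]
    if not negatives:
        return 0.0
    false_pos = sum(1 for t, _ in negatives if predict(t) == 1)
    return false_pos / len(negatives)

def per_term_fpr(examples, predict, identity_terms):
    """FPR restricted to examples containing each identity term."""
    return {
        term: false_positive_rate(
            [(t, y) for t, y in examples if term in t.lower()], predict
        )
        for term in identity_terms
    }
```

A per-term FPR far above the overall FPR signals that benign mentions of that identity group are being disproportionately flagged, which is the kind of unintended bias the measurement targets.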