Challenges in Automated Debiasing for Toxic Language Detection

Xuhui Zhou, Maarten Sap, Swabha Swayamdipta, Noah A. Smith, Yejin Choi
Biased associations have been a challenge in the development of classifiers for detecting toxic language, hindering both fairness and accuracy. As potential solutions, we investigate recently introduced debiasing methods for text classification datasets and models, as applied to toxic language detection. Our focus is on lexical (e.g., swear words, slurs, identity mentions) and dialectal markers (specifically African American English). Our comprehensive experiments establish that existing… 

Mitigating Biases in Toxic Language Detection through Invariant Rationalization

InvRat, a game-theoretic framework consisting of a rationale generator and a predictor, is proposed to rule out spurious correlations between certain syntactic patterns and toxicity labels; it yields lower false positive rates on both lexical and dialectal attributes than previous debiasing methods.

Improving Generalizability in Implicitly Abusive Language Detection with Concept Activation Vectors

It is shown that general abusive language classifiers tend to be fairly reliable in detecting out-of-domain explicitly abusive utterances but fail to detect new types of more subtle, implicit abuse.

Bias Mitigation for Toxicity Detection via Sequential Decisions

This work studies debiasing toxicity detection with two aims: to examine whether different biases tend to correlate with each other; and to investigate how to jointly mitigate these correlated biases in an interactive manner to minimize the total amount of bias.

Cross-geographic Bias Detection in Toxicity Modeling

A weakly supervised method to robustly detect lexical biases in broader geocultural contexts is introduced and it is demonstrated that these groupings reflect human judgments of offensive and inoffensive language in those geographic contexts.

ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection

ToxiGen, a new large-scale, machine-generated dataset of 274k toxic and benign statements about 13 minority groups, is created, and it is demonstrated that fine-tuning a toxicity classifier on this data substantially improves its performance on human-written data.

Challenges in Detoxifying Language Models

It is demonstrated that while basic intervention strategies can effectively optimize previously established automatic metrics on the REALTOXICITYPROMPTS dataset, this comes at the cost of reduced LM coverage for both texts about, and dialects of, marginalized groups.

M-BAD: A Multilabel Dataset for Detecting Aggressive Texts and Their Targets

A novel multilabel Bengali dataset (named M-BAD) containing 15,650 texts for detecting aggressive texts and their targets is introduced; the dataset highlights the difficulty of identifying context-dependent aggression.

Toward Understanding Bias Correlations for Mitigation in NLP

Natural Language Processing (NLP) models have been found to discriminate against groups with different social identities, such as gender and race. With the negative consequences of these undesired…

Handling Bias in Toxic Speech Detection: A Survey

The massive growth of social media usage has witnessed a tsunami of online toxicity in the form of hate speech, abusive posts, cyberbullying, etc. Detecting online toxicity is challenging due to its…

Reducing Gender Bias in Abusive Language Detection

Three mitigation methods, including debiased word embeddings, gender swap data augmentation, and fine-tuning with a larger corpus, can effectively reduce model bias by 90-98% and can be extended to correct model bias in other scenarios.
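
One of the mitigation methods named above, gender-swap data augmentation, is simple enough to sketch. The following is an illustrative, minimal implementation under assumed conventions (a hand-picked word-pair list and whitespace tokenization); the paper's actual pair list and preprocessing are not shown here.

```python
# Minimal sketch of gender-swap data augmentation (illustrative word
# pairs only; a real system would use a curated list and a tokenizer).
GENDER_PAIRS = {
    "he": "she", "she": "he",
    "him": "her", "her": "him",
    "man": "woman", "woman": "man",
    "men": "women", "women": "men",
}

def gender_swap(text: str) -> str:
    """Return a copy of `text` with gendered tokens swapped."""
    tokens = text.split()
    return " ".join(GENDER_PAIRS.get(t.lower(), t) for t in tokens)

def augment(dataset):
    """Pair each (text, label) example with its gender-swapped copy,
    keeping the original label, so the model sees both variants."""
    return dataset + [(gender_swap(text), label) for text, label in dataset]
```

Training on the augmented set discourages the classifier from associating toxicity with gendered terms themselves rather than with the surrounding content.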

Measuring and Mitigating Unintended Bias in Text Classification

A new approach to measuring and mitigating unintended bias in machine learning models is introduced, using a set of common demographic identity terms as the subset of input features on which to measure bias.
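
The measurement idea described above can be sketched concretely: compare a classifier's false-positive rate on non-toxic examples mentioning each identity term against its overall false-positive rate. The term list and `predict` function below are hypothetical stand-ins, not the paper's actual setup.

```python
# Illustrative identity-term bias measurement: per-term false-positive
# rates on non-toxic examples. Terms and classifier are stand-ins.
IDENTITY_TERMS = ["gay", "muslim", "black", "white", "female"]

def false_positive_rate(examples, predict):
    """FPR over (text, label) pairs, where label 1 means toxic."""
    negatives = [(text, y) for text, y in examples if y == 0]
    if not negatives:
        return 0.0
    false_positives = sum(1 for text, _ in negatives if predict(text) == 1)
    return false_positives / len(negatives)

def per_term_fpr(examples, predict):
    """FPR restricted to examples mentioning each identity term.
    A term whose FPR far exceeds the overall FPR signals unintended bias."""
    return {
        term: false_positive_rate(
            [(text, y) for text, y in examples if term in text.lower()],
            predict,
        )
        for term in IDENTITY_TERMS
    }
```

Comparing each term's rate to the overall rate localizes which identity mentions the model has spuriously associated with toxicity.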

Empirical Analysis of Multi-Task Learning for Reducing Model Bias in Toxic Comment Detection

A multi-task learning model with an attention layer is proposed that jointly learns to predict the toxicity of a comment as well as the identities present in it, in order to reduce model bias against commonly attacked identity groups.

RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models

It is found that pretrained LMs can degenerate into toxic text even from seemingly innocuous prompts; an empirical assessment of several controllable generation methods finds that while data- or compute-intensive methods are more effective at steering away from toxicity than simpler solutions, no current method is failsafe against neural toxic degeneration.

Hate speech detection and racial bias mitigation in social media based on BERT model

A transfer learning approach for hate speech detection is proposed, based on the pre-trained language model BERT (Bidirectional Encoder Representations from Transformers), together with a bias alleviation mechanism that mitigates the effect of training-set bias during fine-tuning of the proposed model.

Racial Bias in Hate Speech and Abusive Language Detection Datasets

Evidence of systematic racial bias is found in five different sets of Twitter data annotated for hate speech and abusive language: classifiers trained on them tend to predict that tweets written in African-American English are abusive at substantially higher rates.

Unlearn Dataset Bias in Natural Language Inference by Fitting the Residual

This work formalizes the concept of dataset bias under the framework of distribution shift and presents a simple debiasing algorithm based on residual fitting, called DRiFt, for designing learning algorithms that guard against known dataset biases.

The Risk of Racial Bias in Hate Speech Detection

This work proposes *dialect* and *race priming* as ways to reduce the racial bias in annotation, showing that when annotators are made explicitly aware of an AAE tweet’s dialect they are significantly less likely to label the tweet as offensive.

Demoting Racial Bias in Hate Speech Detection

Experimental results suggest that the adversarial training method used in this paper is able to substantially reduce the false positive rate for AAE text while only minimally affecting the performance of hate speech classification.

Intersectional Bias in Hate Speech and Abusive Language Datasets

This study provides the first systematic evidence on intersectional bias in datasets of hate speech and abusive language in social media using a publicly available annotated Twitter dataset.