The Risk of Racial Bias in Hate Speech Detection

@inproceedings{Sap2019TheRO,
  title={The Risk of Racial Bias in Hate Speech Detection},
  author={Maarten Sap and Dallas Card and Saadia Gabriel and Yejin Choi and Noah A. Smith},
  booktitle={ACL},
  year={2019}
}
We investigate how annotators’ insensitivity to differences in dialect can lead to racial bias in automatic hate speech detection models, potentially amplifying harm against minority populations. [...] Finally, we propose dialect and race priming as ways to reduce the racial bias in annotation, showing that when annotators are made explicitly aware of an AAE tweet’s dialect, they are significantly less likely to label the tweet as offensive.
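The core measurement behind this finding can be illustrated with a short sketch: compute an offensiveness classifier's false positive rate separately for AAE and non-AAE text and compare the two. The prediction triples below are fabricated purely for illustration; a real analysis would use actual model outputs together with dialect estimates from a demographic dialect model.

```python
from collections import defaultdict

# (dialect, gold_label, predicted_label); 1 = offensive, 0 = not offensive.
# These triples are made up purely to demonstrate the computation.
predictions = [
    ("aae", 0, 1), ("aae", 0, 0), ("aae", 0, 1), ("aae", 1, 1),
    ("white-aligned", 0, 0), ("white-aligned", 0, 0),
    ("white-aligned", 0, 1), ("white-aligned", 1, 1),
]

by_dialect = defaultdict(list)
for dialect, gold, pred in predictions:
    by_dialect[dialect].append((gold, pred))

for dialect, rows in sorted(by_dialect.items()):
    negatives = [pred for gold, pred in rows if gold == 0]
    fpr = sum(negatives) / len(negatives)  # share of non-offensive tweets flagged
    print(f"{dialect}: false positive rate = {fpr:.2f}")
```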
Citations

Demoting Racial Bias in Hate Speech Detection
TLDR
Experimental results suggest that the adversarial training method used in this paper is able to substantially reduce the false positive rate for AAE text while only minimally affecting the performance of hate speech classification.
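A common way to implement this kind of adversarial training is a gradient-reversal layer: a dialect adversary is trained on the shared representation while reversed gradients push the encoder to discard dialect information. The sketch below is a minimal PyTorch illustration with toy dimensions and a bag-of-embeddings encoder, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Flip the gradient flowing back into the encoder.
        return -ctx.lam * grad_output, None

class DebiasedClassifier(nn.Module):
    def __init__(self, vocab=5000, dim=64):
        super().__init__()
        self.encoder = nn.EmbeddingBag(vocab, dim)   # toy bag-of-embeddings encoder
        self.hate_head = nn.Linear(dim, 2)           # main task: hateful vs. not
        self.dialect_head = nn.Linear(dim, 2)        # adversary: AAE vs. not

    def forward(self, tokens, lam=1.0):
        h = self.encoder(tokens)
        # Reversed gradients from the adversary push h to hide dialect.
        return self.hate_head(h), self.dialect_head(GradReverse.apply(h, lam))

model = DebiasedClassifier()
tokens = torch.randint(0, 5000, (8, 20))             # a batch of token ids
hate_logits, dialect_logits = model(tokens)
loss = (nn.functional.cross_entropy(hate_logits, torch.randint(0, 2, (8,)))
        + nn.functional.cross_entropy(dialect_logits, torch.randint(0, 2, (8,))))
loss.backward()
```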
Intersectional Bias in Hate Speech and Abusive Language Datasets
TLDR
This study provides the first systematic evidence on intersectional bias in datasets of hate speech and abusive language in social media using a publicly available annotated Twitter dataset.
Examining Racial Bias in an Online Abuse Corpus with Structural Topic Modeling
TLDR
Certain topics are found to be disproportionately racialized and considered abusive in social media posts, and the study examines how the prevalence of different topics relates to both abusiveness annotation and dialect prediction.
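The paper's analysis relies on structural topic modeling (typically the R stm package); a rough Python stand-in is to fit plain LDA and compare per-topic prevalence between posts annotated abusive and the rest, as sketched below with a fabricated four-document corpus.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "you people need to go back home",                    # annotated abusive
    "great game last night by the home team",             # annotated benign
    "immigrants are ruining this country",                # annotated abusive
    "the new immigration museum exhibit is fascinating",  # annotated benign
]
abusive = np.array([1, 0, 1, 0])

counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(counts)  # per-document topic proportions

# Which topics are more prevalent in posts annotated as abusive?
for k in range(theta.shape[1]):
    diff = theta[abusive == 1, k].mean() - theta[abusive == 0, k].mean()
    print(f"topic {k}: abusive-minus-benign prevalence {diff:+.2f}")
```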
Mitigating Biases in Toxic Language Detection through Invariant Rationalization
TLDR
InvRat, a game-theoretic framework consisting of a rationale generator and a predictor, is proposed to rule out spurious correlations between certain syntactic patterns and toxicity labels; it yields lower false positive rates on both lexical and dialectal attributes than previous debiasing methods.
Reconsidering Annotator Disagreement about Racist Language: Noise or Signal?
TLDR
It is shown that White and non-White annotators exhibit significant differences in ratings when reading tweets with a high prevalence of particular racially charged topics, and that future methodological work can draw on these results and further incorporate social science theory into analyses.
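The paper's central comparison can be sketched as a simple significance test on per-group ratings. The two rating arrays below are fabricated; the study works with real offensiveness annotations grouped by annotator race.

```python
from scipy import stats

# Hypothetical 1-5 offensiveness ratings on the same set of tweets,
# grouped by annotator race (fabricated for illustration).
white_ratings = [4, 5, 3, 4, 5, 4, 3, 5]
non_white_ratings = [2, 3, 2, 4, 3, 2, 3, 3]

# Welch's t-test: do the two annotator groups rate differently on average?
t, p = stats.ttest_ind(white_ratings, non_white_ratings, equal_var=False)
print(f"Welch's t = {t:.2f}, p = {p:.3f}")
```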
Detecting East Asian Prejudice on Social Media
TLDR
A new dataset and a machine learning classifier are reported that categorize social media posts from Twitter into four classes: Hostility against East Asia, Criticism of East Asia, Meta-discussions of East Asian prejudice, and a neutral class.
Fine-Grained Fairness Analysis of Abusive Language Detection Systems with CheckList
Current abusive language detection systems have demonstrated unintended bias towards sensitive features such as nationality or gender. This is a crucial issue, which may harm minorities and [...]
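A CheckList-style fairness test typically fills a template with different group terms and checks that the model's prediction is invariant across them. Below is a minimal sketch in that spirit; predict_abusive is a hypothetical placeholder for the system under test.

```python
TEMPLATE = "I can't stand {group} people"
GROUPS = ["French", "Nigerian", "Chinese", "Mexican"]

def predict_abusive(text: str) -> bool:
    # Placeholder: a real test would call the detection system here.
    return "stand" in text

# Invariance test: the prediction should not change with the group term.
baseline = predict_abusive(TEMPLATE.format(group=GROUPS[0]))
failures = [g for g in GROUPS[1:]
            if predict_abusive(TEMPLATE.format(group=g)) != baseline]
print(f"invariance failures: {failures or 'none'}")
```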
Identifying and Measuring Annotator Bias Based on Annotators’ Demographic Characteristics
TLDR
This work investigates annotator bias using classification models trained on data from demographically distinct annotator groups, and shows that demographic features, such as first language, age, and education, correlate with significant performance differences.
Multilingual Twitter Corpus and Baselines for Evaluating Demographic Bias in Hate Speech Recognition
TLDR
This work assembles and publishes a multilingual Twitter corpus for the task of hate speech detection with four inferred author demographic factors: age, country, gender, and race/ethnicity. It measures the performance of four popular document classifiers and evaluates the fairness and bias of the baseline classifiers on the author-level demographic attributes.
Cross-geographic Bias Detection in Toxicity Modeling
TLDR
A weakly supervised method to robustly detect lexical biases in broader geocultural contexts is introduced, and it is demonstrated that these groupings reflect human judgments of offensive and inoffensive language in those geographic contexts.

References

Showing 1–10 of 42 references
Reducing Gender Bias in Abusive Language Detection
TLDR
Three mitigation methods, including debiased word embeddings, gender swap data augmentation, and fine-tuning with a larger corpus, can effectively reduce model bias by 90-98% and can be extended to correct model bias in other scenarios.
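Of the three methods, gender-swap augmentation is the simplest to sketch: each training example is duplicated with gendered terms exchanged so that both variants carry the same label. The swap list below is deliberately tiny; the paper uses a fuller set of gendered word pairs.

```python
# Toy swap list; the paper's version covers many more gendered word pairs.
SWAPS = {"he": "she", "she": "he", "his": "her", "her": "his",
         "man": "woman", "woman": "man"}

def gender_swap(sentence: str) -> str:
    # Token-level swap; a fuller version would handle casing and morphology.
    return " ".join(SWAPS.get(tok, tok) for tok in sentence.split())

train = [("she is such a terrible driver", 1)]
augmented = train + [(gender_swap(s), y) for s, y in train]
print(augmented)  # both gendered variants now share the same label
```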
Locate the Hate: Detecting Tweets against Blacks
TLDR
A supervised machine learning approach is applied, employing inexpensively acquired labeled data from diverse Twitter accounts to learn a binary classifier for the labels "racist" and "nonracist", suggesting that with further improvements, this work can contribute data on the sources of anti-black hate speech.
Hateful Symbols or Hateful People? Predictive Features for Hate Speech Detection on Twitter
TLDR
A list of criteria founded in critical race theory is provided and used to annotate a publicly available corpus of more than 16k tweets, and a dictionary based on the most indicative words in the data is presented.
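One simple way to build such a dictionary of indicative words is to score each word by a smoothed log-odds ratio between class-conditional frequencies, as sketched below on a fabricated four-document corpus; the paper's exact scoring may differ.

```python
import math
from collections import Counter

# Fabricated two-class corpus for illustration only.
hate_docs = ["those people are vermin", "send them all back"]
other_docs = ["lovely weather today", "people enjoying the park"]

hate_counts = Counter(w for d in hate_docs for w in d.split())
other_counts = Counter(w for d in other_docs for w in d.split())
vocab = sorted(set(hate_counts) | set(other_counts))
alpha = 0.5  # add-alpha smoothing so unseen words do not blow up the ratio

def log_odds(word):
    p = (hate_counts[word] + alpha) / (sum(hate_counts.values()) + alpha * len(vocab))
    q = (other_counts[word] + alpha) / (sum(other_counts.values()) + alpha * len(vocab))
    return math.log(p / q)

# The top-scoring words form the dictionary of hate-indicative terms.
for word in sorted(vocab, key=log_odds, reverse=True)[:5]:
    print(word, round(log_odds(word), 2))
```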
Demographic Dialectal Variation in Social Media: A Case Study of African-American English
TLDR
A case study of dialectal language in online conversational text by investigating African-American English (AAE) on Twitter proposes a distantly supervised model to identify AAE-like language from demographics associated with geo-located messages, and verifies that this language follows well-known AAE linguistic phenomena.
Bridging the Gaps: Multi Task Learning for Domain Transfer of Hate Speech Detection
TLDR
This paper investigates methods for bridging differences in annotation and data collection of abusive language tweets, such as different annotation schemes, labels, or geographic and cultural influences from data sampling, and considers three distinct sets of annotations.
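The standard architecture for this kind of multi-task setup is a shared encoder with one classification head per annotation scheme, so differently labeled corpora can be trained jointly. The PyTorch sketch below uses toy dimensions and a bag-of-embeddings encoder as stand-ins for the paper's components.

```python
import torch
import torch.nn as nn

class MultiTaskAbuseModel(nn.Module):
    def __init__(self, vocab=5000, dim=64, n_labels=(2, 3, 2)):
        super().__init__()
        self.encoder = nn.EmbeddingBag(vocab, dim)   # shared across all tasks
        # One head per dataset / annotation scheme, with its own label set.
        self.heads = nn.ModuleList(nn.Linear(dim, n) for n in n_labels)

    def forward(self, tokens, task_id):
        return self.heads[task_id](self.encoder(tokens))

model = MultiTaskAbuseModel()
batch = torch.randint(0, 5000, (4, 12))              # token ids from dataset 1
labels = torch.randint(0, 3, (4,))                   # dataset 1 has 3 labels
loss = nn.functional.cross_entropy(model(batch, task_id=1), labels)
loss.backward()  # updates head 1 and the shared encoder
```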
Comparative Studies of Detecting Abusive Language on Twitter
TLDR
This paper conducts the first comparative study of various learning models on Hate and Abusive Speech on Twitter, and shows that a bidirectional GRU network trained on word-level features, with a Latent Topic Clustering module, is the most accurate model.
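The backbone of the winning model can be sketched as a bidirectional GRU over word embeddings whose final hidden states feed a linear classifier; the Latent Topic Clustering module is omitted here, and all sizes are illustrative.

```python
import torch
import torch.nn as nn

class BiGRUClassifier(nn.Module):
    def __init__(self, vocab=5000, dim=64, hidden=64, n_classes=3):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.gru = nn.GRU(dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, tokens):
        # h holds the final hidden state of each direction: (2, batch, hidden).
        _, h = self.gru(self.embed(tokens))
        return self.out(torch.cat([h[0], h[1]], dim=-1))

model = BiGRUClassifier()
logits = model(torch.randint(0, 5000, (4, 20)))
print(logits.shape)  # torch.Size([4, 3])
```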
Automated Hate Speech Detection and the Problem of Offensive Language
TLDR
This work used a crowd-sourced hate speech lexicon to collect tweets containing hate speech keywords and labeled a sample of these tweets into three categories: those containing hate speech, those with only offensive language, and those with neither.
Cyber Hate Classification: 'Othering' Language And Paragraph Embedding
TLDR
A novel 'Othering Lexicon' is proposed to identify subtle language use, such as references to immigration or job prosperity in a hateful context, and is incorporated with embedding learning for feature extraction and subsequent classification using a neural network approach.
Characterizing and Detecting Hateful Users on Twitter
TLDR
This work develops and employs a robust methodology to collect and annotate hateful users that does not depend directly on a lexicon and in which users are annotated given their entire profile, and it frames hate speech detection as a task of semi-supervised learning over a graph, exploiting the network of connections on Twitter.
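A minimal stand-in for the graph formulation is plain label propagation: hatefulness scores of a few annotated users are clamped and iteratively averaged over their neighborhoods in a (here fabricated) retweet network. The paper itself uses a more sophisticated graph neural approach, so this only conveys the semi-supervised idea.

```python
# user -> neighbors in a toy retweet network (fabricated for illustration)
edges = {
    "a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"],
}
labels = {"a": 1.0, "d": 0.0}  # annotated: 1 = hateful, 0 = not hateful
scores = {u: labels.get(u, 0.5) for u in edges}

# Propagate: unlabeled users take the mean score of their neighbors,
# while annotated users stay clamped to their labels.
for _ in range(20):
    for u, nbrs in edges.items():
        if u not in labels:
            scores[u] = sum(scores[v] for v in nbrs) / len(nbrs)

print({u: round(s, 2) for u, s in scores.items()})
```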
Examining a hate speech corpus for hate speech detection and popularity prediction
TLDR
A critical look is taken at the training corpus in order to understand its biases, and the corpus is also used to venture beyond hate speech detection, investigating whether it can shed light on other facets of research, such as the popularity of hate tweets.