HateCheck: Functional Tests for Hate Speech Detection Models

@inproceedings{Rttger2021HateCheckFT,
  title={HateCheck: Functional Tests for Hate Speech Detection Models},
  author={Paul R{\"o}ttger and Bertram Vidgen and Dong Nguyen and Zeerak Waseem and Helen Z. Margetts and Janet B. Pierrehumbert},
  booktitle={ACL/IJCNLP},
  year={2021}
}
Detecting online hate is a difficult task that even state-of-the-art models struggle with. In previous research, hate speech detection models are typically evaluated by measuring their performance on held-out test data using metrics such as accuracy and F1 score. However, this approach makes it difficult to identify specific model weak points. It also risks overestimating generalisable model quality due to increasingly well-evidenced systematic gaps and biases in hate speech datasets. To enable…
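The paper's approach of targeted functional tests, rather than a single held-out score, can be illustrated with a small harness. The following is a minimal sketch assuming a generic classify(text) callable that returns "hateful" or "non-hateful"; the functionality names and test cases are illustrative placeholders, not the actual HateCheck suite.

# Minimal sketch of functional-test evaluation in the spirit of HateCheck.
# classify, the functionality names, and the test cases are illustrative
# placeholders, not the published HateCheck test suite.
from collections import defaultdict

def run_functional_tests(classify, test_cases):
    """Report per-functionality accuracy for a binary hate speech classifier.

    classify:   callable mapping a string to "hateful" or "non-hateful"
    test_cases: list of (functionality, text, expected_label) tuples
    """
    totals, correct = defaultdict(int), defaultdict(int)
    for functionality, text, expected in test_cases:
        totals[functionality] += 1
        if classify(text) == expected:
            correct[functionality] += 1
    return {f: correct[f] / totals[f] for f in totals}

# Hypothetical test cases grouped by functionality.
example_cases = [
    ("negation_of_hate", "I absolutely do not hate trans people.", "non-hateful"),
    ("counter_speech", "Saying 'immigrants are scum' is disgusting.", "non-hateful"),
    ("derogation", "Immigrants are scum.", "hateful"),
]

if __name__ == "__main__":
    # Toy keyword matcher standing in for a real model; it passes the derogation
    # case but fails the counter-speech case, the kind of weakness that
    # per-functionality reporting is meant to surface.
    classify = lambda text: "hateful" if "scum" in text.lower() else "non-hateful"
    print(run_functional_tests(classify, example_cases))

Per-functionality accuracies of this kind make failure modes visible that a single aggregate accuracy or F1 score would hide.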

AAA: Fair Evaluation for Abuse Detection Systems Wanted
TLDR
This work introduces Adversarial Attacks against Abuse (AAA), a new evaluation strategy and associated metric that better captures a model’s performance on certain classes of hard-to-classify microposts and penalises systems that are biased towards low-level lexical features.
Hatemoji: A Test Suite and Adversarially-Generated Dataset for Benchmarking and Detecting Emoji-based Hate
TLDR
This work presents HATEMOJICHECK, a test suite of 3,930 short-form statements that allows us to evaluate performance on hateful language expressed with emoji, and creates the HATEMOJITRAIN dataset using a human-and-model-in-the-loop approach to address weaknesses in existing hate detection models.
Learning from the Worst: Dynamically Generated Datasets to Improve Online Hate Detection
TLDR
It is shown that model performance is substantially improved using this approach, and models trained on later rounds of data collection perform better on test sets and are harder for annotators to trick.
Multilingual Offensive Language Identification for Low-resource Languages
TLDR
Results for all languages confirm the robustness of cross-lingual contextual embeddings and transfer learning for this task; predictions are also projected on comparable data in Arabic, Bengali, Danish, Greek, Hindi, Spanish, and Turkish.
An Information Retrieval Approach to Building Datasets for Hate Speech Detection
TLDR
The key insight is that the rarity and subjectivity of hate speech are akin to those of relevance in information retrieval (IR), and this connection suggests that well-established methodologies for creating IR test collections might also be usefully applied to create better benchmark datasets for hate speech detection.
Anticipating Safety Issues in E2E Conversational AI: Framework and Tooling
TLDR
This paper surveys the problem landscape for safety for end-to-end conversational AI, highlights tensions between values, potential positive impact and potential harms, and provides a framework for making decisions about whether and how to release these models, following the tenets of value-sensitive design.
BBQ: A Hand-Built Bias Benchmark for Question Answering
TLDR
This work introduces the Bias Benchmark for QA (BBQ), a dataset of question sets constructed by the authors that highlight attested social biases against people belonging to protected classes along nine different social dimensions relevant for U.S. English-speaking contexts.
Confronting Abusive Language Online: A Survey from the Ethical and Human Rights Perspective
TLDR
Several opportunities for rights-respecting, socio-technical solutions to detect and confront online abuse are identified, including ‘nudging’, ‘quarantining’, value-sensitive design, counter-narratives, style transfer, and AI-driven public education applications.
Dynabench: Rethinking Benchmarking in NLP
TLDR
It is argued that Dynabench addresses a critical need in the community: contemporary models quickly achieve outstanding performance on benchmark tasks but nonetheless fail on simple challenge examples and falter in real-world scenarios.

References

Showing 1-10 of 110 references
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
TLDR
This work proposes a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can be fine-tuned with good performance on a wide range of tasks like its larger counterparts, and introduces a triple loss combining language modeling, distillation and cosine-distance losses.
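The triple loss mentioned here can be illustrated with a short PyTorch sketch. This is not DistilBERT's actual training code; the loss weights, argument names, and tensor shapes are assumptions made for illustration.

import torch
import torch.nn.functional as F

def triple_loss(student_logits, teacher_logits, student_hidden, teacher_hidden,
                mlm_labels, temperature=2.0, w_kd=5.0, w_mlm=2.0, w_cos=1.0):
    # Illustrative combination of the three objectives; the weights are assumed,
    # not the values used to train DistilBERT.
    # 1) Distillation: KL divergence between softened teacher and student
    #    output distributions over the vocabulary.
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # 2) Masked language modelling: cross-entropy on masked positions
    #    (unmasked positions carry the label -100 and are ignored).
    mlm = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        mlm_labels.view(-1),
        ignore_index=-100,
    )
    # 3) Cosine-distance: align student and teacher hidden states.
    flat_s = student_hidden.view(-1, student_hidden.size(-1))
    flat_t = teacher_hidden.view(-1, teacher_hidden.size(-1))
    target = torch.ones(flat_s.size(0), device=flat_s.device)
    cos = F.cosine_embedding_loss(flat_s, flat_t, target)
    return w_kd * kd + w_mlm * mlm + w_cos * cos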
Evaluating Models’ Local Decision Boundaries via Contrast Sets
TLDR
This work proposes a more rigorous annotation paradigm for NLP that helps to close systematic gaps in the test data, recommending that dataset authors manually perturb the test instances in small but meaningful ways that (typically) change the gold label, creating contrast sets.
Measuring and Mitigating Unintended Bias in Text Classification
TLDR
A new approach to measuring and mitigating unintended bias in machine learning models is introduced, using a set of common demographic identity terms as the subset of input features on which to measure bias.
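In a similar spirit, bias can be probed by slicing a classifier's false positive rate on non-toxic texts according to which identity terms they mention. The sketch below is a simplified illustration rather than the specific metrics from this paper; classifier, examples, and the identity-term list are hypothetical placeholders.

from collections import defaultdict

def identity_term_fpr(classifier, examples, identity_terms):
    # classifier: callable mapping text -> 1 (toxic) or 0 (non-toxic)
    # examples:   list of (text, gold_label) pairs using the same 0/1 convention
    # Returns the false positive rate on non-toxic examples, per identity term.
    counts = defaultdict(lambda: [0, 0])  # term -> [false positives, non-toxic total]
    for text, gold in examples:
        if gold != 0:
            continue  # only non-toxic examples can yield false positives
        pred = classifier(text)
        for term in identity_terms:
            if term in text.lower():
                counts[term][1] += 1
                if pred == 1:
                    counts[term][0] += 1
    return {term: fp / total for term, (fp, total) in counts.items() if total > 0}

Large gaps between per-term rates, for example a much higher false positive rate on texts mentioning one identity group than another, indicate the kind of unintended bias the paper targets.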
Automated Hate Speech Detection and the Problem of Offensive Language
TLDR
This work uses a crowd-sourced hate speech lexicon to collect tweets containing hate speech keywords and labels a sample of these tweets into three categories: those containing hate speech, those with only offensive language, and those with neither.
A Unified Taxonomy of Harmful Content
TLDR
The most common types of abuse described by industry, policy, community and health experts are synthesized into a unified typology of harmful content, with detailed criteria and exceptions for each type of abuse.
BLiMP: A Benchmark of Linguistic Minimal Pairs for English
TLDR
The Benchmark of Linguistic Minimal Pairs, a challenge set for evaluating the linguistic knowledge of language models (LMs) on major grammatical phenomena in English, finds that state-of-the-art models identify morphological contrasts related to agreement reliably, but they struggle with some subtle semantic and syntactic phenomena.
Beyond Accuracy: Behavioral Testing of NLP Models with CheckList
Although measuring held-out accuracy has been the primary approach to evaluate generalization, it often overestimates the performance of NLP models, while alternative approaches for evaluating models…
COLD: Annotation Scheme and Evaluation Data Set for Complex Offensive Language in English
  JLCL, 2020.
HABERTOR: An Efficient and Effective Deep Hatespeech Detector
TLDR
The generalizability analysis shows that HABERTOR transfers well to other unseen hate speech datasets and is a more efficient and effective alternative to BERT for hate speech classification.
Towards a Comprehensive Taxonomy and Large-Scale Annotated Corpus for Online Slur Usage
TLDR
This work provides an annotation guide that outlines 4 main categories of online slur usage, which are further divided into a total of 12 sub-categories, and presents a publicly available corpus based on this taxonomy, allowing researchers to evaluate classifiers on a wider range of speech containing slurs.