Contextualizing Hate Speech Classifiers with Post-hoc Explanation

@inproceedings{Kennedy2020ContextualizingHS,
  title={Contextualizing Hate Speech Classifiers with Post-hoc Explanation},
  author={Brendan Kennedy and Xisen Jin and Aida Mostafazadeh Davani and Morteza Dehghani and Xiang Ren},
  booktitle={ACL},
  year={2020}
}
Hate speech classifiers trained on imbalanced datasets struggle to determine if group identifiers like “gay” or “black” are used in offensive or prejudiced ways. Such biases manifest in false positives when these identifiers are present, due to models’ inability to learn the contexts which constitute a hateful usage of identifiers. We extract post-hoc explanations from fine-tuned BERT classifiers to detect bias towards identity terms. Then, we propose a novel regularization technique based on…
A Survey of Race, Racism, and Anti-Racism in NLP
AAA: Fair Evaluation for Abuse Detection Systems Wanted
HONEST: Measuring Hurtful Sentence Completion in Language Models
Refining Neural Networks with Compositional Explanations
Towards generalisable hate speech detection: a review on obstacles and solutions
Bots and online hate during the COVID-19 pandemic: case studies in the United States and the Philippines
