Towards WinoQueer: Developing a Benchmark for Anti-Queer Bias in Large Language Models

Virginia K. Felkner, Ho-Chun Herbert Chang, Eugene Jang, Jonathan May
This paper presents exploratory work on whether and to what extent biases against queer and trans people are encoded in large language models (LLMs) such as BERT. We also propose a method for reducing these biases in downstream tasks: finetuning the models on data written by and/or about queer people. To measure anti-queer bias, we introduce a new benchmark dataset, WinoQueer, modeled after other bias-detection benchmarks but addressing homophobic and transphobic biases. We found that BERT shows… 

The Risk of Racial Bias in Hate Speech Detection

This work proposes *dialect* and *race priming* as ways to reduce the racial bias in annotation, showing that when annotators are made explicitly aware of an AAE tweet’s dialect they are significantly less likely to label the tweet as offensive.

CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models

CrowS-Pairs, a benchmark for measuring some forms of social bias in language models against protected demographic groups in the US, is introduced; all three of the widely used masked language models the authors evaluate substantially favor sentences that express stereotypes in every category.
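The comparison step behind this style of benchmark can be sketched as follows: a masked language model assigns log-probabilities to the tokens shared between a stereotyping sentence and its minimally different counterpart, and the sentence whose shared tokens score higher is the one the model "prefers." The function names and numbers below are illustrative placeholders, not the paper's actual code; the dummy log-probabilities stand in for real BERT scores.

```python
def pseudo_log_likelihood(shared_token_logprobs):
    """Sum of log-probabilities over the tokens shared by both sentences."""
    return sum(shared_token_logprobs)

def prefers_stereotype(stereo_logprobs, antistereo_logprobs):
    """True if the model scores the stereotyping sentence higher."""
    return pseudo_log_likelihood(stereo_logprobs) > pseudo_log_likelihood(antistereo_logprobs)

# Dummy per-token scores for the shared tokens of one sentence pair:
stereo = [-1.2, -0.8, -2.1]      # stereotyping variant
antistereo = [-1.5, -1.1, -2.4]  # same shared tokens, different target group

print(prefers_stereotype(stereo, antistereo))  # True: model favors the stereotype
```

Aggregated over a dataset, the bias metric is then the fraction of pairs on which the model prefers the stereotyping sentence, with 50% indicating no measured preference.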

StereoSet: Measuring stereotypical bias in pretrained language models

StereoSet, a large-scale natural English dataset for measuring stereotypical biases in four domains (gender, profession, race, and religion), is presented, and popular models like BERT, GPT-2, RoBERTa, and XLNet are shown to exhibit strong stereotypical biases.

On Measuring Social Biases in Sentence Encoders

The Word Embedding Association Test is extended to measure bias in sentence encoders, yielding mixed results, including suspicious patterns of sensitivity that suggest the test's assumptions may not hold in general.
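The association measure at the core of WEAT-style tests is the difference in mean cosine similarity between a target vector and two attribute sets. A minimal sketch with toy 3-dimensional vectors (the vectors and set labels are invented for illustration, not drawn from the paper):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def association(w, A, B):
    """Mean cosine with attribute set A minus mean cosine with attribute set B."""
    return (sum(cosine(w, a) for a in A) / len(A)
            - sum(cosine(w, b) for b in B) / len(B))

w = np.array([1.0, 0.0, 0.0])      # target word/sentence vector
A = [np.array([1.0, 0.1, 0.0])]    # attribute set A (e.g. "pleasant")
B = [np.array([0.0, 1.0, 0.0])]    # attribute set B (e.g. "unpleasant")

print(association(w, A, B) > 0)    # True: w is closer to A than to B
```

In the sentence-encoder extension, `w`, `A`, and `B` would be encoder outputs for template sentences rather than word vectors.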

Racial Bias in Hate Speech and Abusive Language Detection Datasets

Systematic racial bias is documented in five different sets of Twitter data annotated for hate speech and abusive language: classifiers trained on them tend to predict that tweets written in African-American English are abusive at substantially higher rates.

Toward Gender-Inclusive Coreference Resolution

Through studies conducted on English text, it is shown that coreference systems built without acknowledging the complexity of gender lead to many potential harms.

Like Trainer, Like Bot? Inheritance of Bias in Algorithmic Content Moderation

This paper provides some exploratory methods by which the normative biases of algorithmic content moderation systems can be measured, by way of a case study using an existing dataset of comments labelled for offence.

Language (Technology) is Power: A Critical Survey of “Bias” in NLP

A greater recognition of the relationships between language and social hierarchies is urged, encouraging researchers and practitioners to articulate their conceptualizations of “bias” and to center work around the lived experiences of members of communities affected by NLP systems.

Mind the GAP: A Balanced Corpus of Gendered Ambiguous Pronouns

GAP, a gender-balanced labeled corpus of 8,908 ambiguous pronoun–name pairs sampled to provide diverse coverage of challenges posed by real-world text, is presented and released, and it is shown that syntactic structure and continuous neural models provide promising, complementary cues for approaching the challenge.

Measuring and Mitigating Unintended Bias in Text Classification

A new approach to measuring and mitigating unintended bias in machine learning models is introduced, using a set of common demographic identity terms as the subset of input features on which to measure bias.
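The identity-term approach described above can be sketched as substituting each term into neutral templates and comparing a classifier's scores across terms. The templates, term list, and deliberately biased dummy scorer below are illustrative placeholders, not the paper's actual data or model:

```python
# Hypothetical neutral templates and identity terms for illustration.
TEMPLATES = ["I am a {} person.", "My friend is {}."]
IDENTITY_TERMS = ["gay", "straight", "transgender"]

def dummy_toxicity_score(sentence):
    """Stand-in for a real classifier; biased on purpose to show the metric."""
    return 0.9 if ("gay" in sentence or "transgender" in sentence) else 0.1

def per_term_mean_score(term, scorer):
    """Mean classifier score over all templates filled with this term."""
    sentences = [t.format(term) for t in TEMPLATES]
    return sum(scorer(s) for s in sentences) / len(sentences)

scores = {term: per_term_mean_score(term, dummy_toxicity_score)
          for term in IDENTITY_TERMS}

# A large spread across terms on these neutral sentences indicates
# unintended identity-term bias.
print(scores)
```

Since every template is neutral, an unbiased classifier should score all terms roughly equally; the gap between per-term means is the bias signal.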