Corpus ID: 237513578

Challenges in Detoxifying Language Models

Johannes Welbl, Amelia Glaese, Jonathan Uesato, Sumanth Dathathri, John F. J. Mellor, Lisa Anne Hendricks, Kirsty Anderson, Pushmeet Kohli, Ben Coppin, Po-Sen Huang
Large language models (LMs) generate remarkably fluent text and can be efficiently adapted across NLP tasks. Measuring and guaranteeing the quality of generated text in terms of safety is imperative for deploying LMs in the real world; to this end, prior work often relies on automatic evaluation of LM toxicity. We critically discuss this approach, evaluate several toxicity mitigation strategies with respect to both automatic and human evaluation, and analyze consequences of toxicity mitigation…


RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models
It is found that pretrained LMs can degenerate into toxic text even from seemingly innocuous prompts. An empirical assessment of several controllable generation methods finds that, while data- or compute-intensive methods are more effective at steering away from toxicity than simpler solutions, no current method is failsafe against neural toxic degeneration.
Detoxifying Language Models Risks Marginalizing Minority Voices
It is found that detoxification makes LMs more brittle to distribution shift, especially on language used by marginalized groups, and the tension between the controllability and distributional robustness of LMs is highlighted.
Challenges in Automated Debiasing for Toxic Language Detection
The findings show that debiasing a model trained on biased toxic language data is not as effective as simply relabeling the data to remove existing biases; an automatic, dialect-aware data correction method is proposed as a proof of concept.
Civil Rephrases Of Toxic Texts With Self-Supervised Transformers
Inspired by recent progress in unpaired sequence-to-sequence tasks, a self-supervised learning model is introduced, called CAE-T5, which generates sentences that are more fluent and better at preserving the initial content compared to earlier text style transfer systems.
BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language Generation
To systematically study and benchmark social biases in open-ended language generation, the Bias in Open-Ended Language Generation Dataset (BOLD) is introduced, a large-scale dataset that consists of 23,679 English text generation prompts for bias benchmarking across five domains: profession, gender, race, religion, and political ideology.
Language Models are Unsupervised Multitask Learners
It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
Self-Diagnosis and Self-Debiasing: A Proposal for Reducing Corpus-Based Bias in NLP
This paper investigates whether pretrained language models at least know when they exhibit some undesirable bias or produce toxic content and proposes a decoding algorithm that reduces the probability of a model producing problematic text given only a textual description of the undesired behavior.
The Curious Case of Neural Text Degeneration
By sampling text from the dynamic nucleus of the probability distribution, which allows for diversity while effectively truncating the less reliable tail of the distribution, the resulting text better demonstrates the quality of human text, yielding enhanced diversity without sacrificing fluency and coherence.
The LAMBADA dataset: Word prediction requiring a broad discourse context
It is shown that LAMBADA exemplifies a wide range of linguistic phenomena, and that none of several state-of-the-art language models reaches accuracy above 1% on this novel benchmark.
Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks
It is consistently found that multi-phase adaptive pretraining offers large gains in task performance, and it is shown that adapting to a task corpus augmented using simple data selection strategies is an effective alternative, especially when resources for domain-adaptive pretraining might be unavailable.
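
The summary of "The Curious Case of Neural Text Degeneration" above describes sampling from the "dynamic nucleus" of the next-token distribution. As a rough illustration only, and not the authors' implementation, the idea can be sketched as follows: keep the smallest set of most-probable tokens whose cumulative probability reaches a threshold, renormalize, and sample from that set. The function names `nucleus_filter` and `nucleus_sample` and the parameter name `top_p` are assumptions made for this sketch.

```python
import random


def nucleus_filter(probs, top_p):
    """Keep the smallest set of highest-probability tokens whose
    cumulative mass reaches top_p, then renormalize.

    probs: list of next-token probabilities (assumed to sum to 1).
    Returns a dict mapping token index -> renormalized probability.
    """
    # Token indices sorted from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = []
    cumulative = 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break  # the low-probability tail beyond this point is discarded
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


def nucleus_sample(probs, top_p, rng=random):
    """Draw one token index from the truncated, renormalized distribution."""
    filtered = nucleus_filter(probs, top_p)
    tokens = list(filtered)
    weights = [filtered[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical four-token vocabulary, for illustration only.
example_probs = [0.5, 0.3, 0.15, 0.05]
example_nucleus = nucleus_filter(example_probs, top_p=0.75)
example_token = nucleus_sample(example_probs, top_p=0.75)
```

Because the cutoff depends on the shape of the distribution, the number of retained tokens varies from step to step: a peaked distribution keeps only a few candidates, while a flat one keeps many.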