RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models

  title={RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models},
  author={Samuel Gehman and Suchin Gururangan and Maarten Sap and Yejin Choi and Noah A. Smith},
Pretrained neural language models (LMs) are prone to generating racist, sexist, or otherwise toxic language which hinders their safe deployment. We investigate the extent to which pretrained LMs can be prompted to generate toxic language, and the effectiveness of controllable text generation algorithms at preventing such toxic degeneration. We create and release RealToxicityPrompts, a dataset of 100K naturally occurring, sentence-level prompts derived from a large corpus of English web text… 

Reward Modeling for Mitigating Toxicity in Transformer-based Language Models

The experiments demonstrate that the Reinforce-Detoxify method for language model detoxi-cation outperforms existing detoxification approaches in automatic evaluation metrics, indicating that the approach is less prone to unintended bias toward social identities in generated content.

Simple Text Detoxification by Identifying a Linear Toxic Subspace in Language Model Embeddings

It is demonstrated empirically that the subspace found using the proposed method to generalize toxic directions in the latent space generalizes to multiple toxicity corpora, indicating the existence of a low-dimensional toxic subspace.

ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection

ToxiGen, a new large-scale and machine-generated dataset of 274k toxic and benign statements about 13 minority groups, is created and it is demonstrated that finetuning a toxicity classifier on data improves its performance on human-written data substantially.

Leashing the Inner Demons: Self-Detoxification for Language Models

This paper proposes a simple yet effective unsupervised method for language models to ``detoxify'' themselves without an additional large corpus or external discriminator, and shows better toxicity reduction with good generation quality in the generated content under multiple settings.

Exploring the Limits of Domain-Adaptive Training for Detoxifying Large-Scale Language Models

This work systematically explore domain-adaptive training to reduce the toxicity of language models and demonstrates that adding and training adapter-only layers in LMs not only saves a lot of parameters but also achieves a better trade-off between toxicity and perplexity than whole model adaptation for the large-scale models.

Challenges in Detoxifying Language Models

It is demonstrated that while basic intervention strategies can effectively optimize previously established automatic metrics on the REALTOXICITYPROMPTS dataset, this comes at the cost of reduced LM coverage for both texts about, and dialects of, marginalized groups.

Mitigating Toxic Degeneration with Empathetic Data: Exploring the Relationship Between Toxicity and Empathy

Using empathetic data, this work is able to dramatically reduce the size of fine-tuning data to 7.5-30k samples while at the same time making significant improvements over state-of-the-art toxicity mitigation of up to 3.4% absolute reduction from the original work on 2.3m samples.

Challenges in Automated Debiasing for Toxic Language Detection

The findings show that debiasing a model trained on biased toxic language data is not as effective as simply relabeling the data to remove existing biases, and proposes an automatic, dialect-aware data correction method, as a proof-of-concept.

Probing Toxic Content in Large Pre-Trained Language Models

A method based on logistic regression classifiers is proposed to probe English, French, and Arabic PTLMs and quantify the potentially harmful content that they convey with respect to a set of templates to assess and mitigate the toxicity transmitted by PTL Ms.

Detoxifying Language Models with a Toxic Corpus

The result shows that toxic corpus can indeed help to reduce the toxicity of the language generation process substantially, complementing the existing debiasing methods.



The Curious Case of Neural Text Degeneration

By sampling text from the dynamic nucleus of the probability distribution, which allows for diversity while effectively truncating the less reliable tail of the distribution, the resulting text better demonstrates the quality of human text, yielding enhanced diversity without sacrificing fluency and coherence.

Universal Adversarial Triggers for Attacking and Analyzing NLP

Adversarial examples highlight model vulnerabilities and are useful for evaluation and interpretation. We define universal adversarial triggers: input-agnostic sequences of tokens that trigger a

The Radicalization Risks of GPT-3 and Advanced Neural Language Models

GPT-3 demonstrates significant improvement over its predecessor, GPT-2, in generating extremist texts and its strength in generating text that accurately emulates interactive, informational, and influential content that could be utilized for radicalizing individuals into violent far-right extremist ideologies and behaviors.

Defending Against Neural Fake News

A model for controllable text generation called Grover, found that best current discriminators can classify neural fake news from real, human-written, news with 73% accuracy, assuming access to a moderate level of training data, and the best defense against Grover turns out to be Grover itself, with 92% accuracy.

Neural Text Generation with Unlikelihood Training

It is shown that the likelihood objective itself is at fault, resulting in a model that assigns too much probability to sequences containing repeats and frequent words, unlike those from the human training distribution, thus providing a strong alternative to existing techniques.

The Secret Sharer: Evaluating and Testing Unintended Memorization in Neural Networks

This paper describes a testing methodology for quantitatively assessing the risk that rare or unique training-data sequences are unintentionally memorized by generative sequence models---a common type of machine-learning model, and describes new, efficient procedures that can extract unique, secret sequences, such as credit card numbers.

PowerTransformer: Unsupervised Controllable Revision for Biased Language Correction

Unconscious biases continue to be prevalent in modern text and media, calling for algorithms that can assist writers with bias correction. For example, a female character in a story is often

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.

Fine-Tuning Language Models from Human Preferences

This paper builds on advances in generative pretraining of language models to apply reward learning to four natural language tasks: continuing text with positive sentiment or physically descriptive language, and summarization tasks on the TL;DR and CNN/Daily Mail datasets.

Plug and Play Language Models: A Simple Approach to Controlled Text Generation

The Plug and Play Language Model (PPLM) for controllable language generation is proposed, which combines a pretrained LM with one or more simple attribute classifiers that guide text generation without any further training of the LM.