RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models

@article{Gehman2020RealToxicityPromptsEN,
  title={RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models},
  author={Samuel Gehman and Suchin Gururangan and Maarten Sap and Yejin Choi and Noah A. Smith},
  journal={ArXiv},
  year={2020},
  volume={abs/2009.11462}
}
Pretrained neural language models (LMs) are prone to generating racist, sexist, or otherwise toxic language which hinders their safe deployment. We investigate the extent to which pretrained LMs can be prompted to generate toxic language, and the effectiveness of controllable text generation algorithms at preventing such toxic degeneration. We create and release RealToxicityPrompts, a dataset of 100K naturally occurring, sentence-level prompts derived from a large corpus of English web text… 
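To make the evaluation protocol concrete: the paper scores LM continuations of each prompt with the Perspective API and reports, among other metrics, the expected maximum toxicity over repeated generations per prompt. Below is a minimal sketch of that metric; generate and toxicity are placeholder callables standing in for any sampler and any toxicity scorer, not the paper's exact implementation.

import statistics
from typing import Callable, Iterable

def expected_max_toxicity(
    prompts: Iterable[str],
    generate: Callable[[str], str],    # placeholder: returns one sampled continuation
    toxicity: Callable[[str], float],  # placeholder: returns a score in [0, 1]
    samples_per_prompt: int = 25,
) -> float:
    """Mean over prompts of the maximum toxicity among sampled continuations."""
    max_scores = []
    for prompt in prompts:
        scores = [toxicity(generate(prompt)) for _ in range(samples_per_prompt)]
        max_scores.append(max(scores))
    return statistics.mean(max_scores)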
Simple Text Detoxification by Identifying a Linear Toxic Subspace in Language Model Embeddings
TLDR
It is demonstrated empirically that the subspace found using the proposed method generalizes to multiple toxicity corpora, indicating the existence of a low-dimensional toxic subspace in language model embeddings.
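A minimal sketch of the idea described in this summary (not the authors' exact procedure): estimate a low-dimensional toxic subspace from the principal directions separating toxic from non-toxic sentence embeddings, then project that subspace out of an embedding.

import numpy as np

def toxic_subspace(toxic_embs: np.ndarray, clean_embs: np.ndarray, k: int = 10) -> np.ndarray:
    """Top-k principal directions of the shift from clean to toxic embeddings.
    Returns an orthonormal (k, d) basis of the estimated toxic subspace."""
    diffs = toxic_embs - clean_embs.mean(axis=0)  # center toxic points on the clean mean
    _, _, vt = np.linalg.svd(diffs - diffs.mean(axis=0), full_matrices=False)
    return vt[:k]                                 # rows are orthonormal directions

def project_out(emb: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Remove the component of emb lying in the estimated toxic subspace."""
    return emb - basis.T @ (basis @ emb)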
Leashing the Inner Demons: Self-Detoxification for Language Models
TLDR
This paper proposes a simple yet effective unsupervised method for language models to "detoxify" themselves without an additional large corpus or external discriminator, and shows better toxicity reduction with good generation quality under multiple settings.
Exploring the Limits of Domain-Adaptive Training for Detoxifying Large-Scale Language Models
TLDR
This work systematically explores domain-adaptive training to reduce the toxicity of language models and demonstrates that adding and training adapter-only layers in LMs not only saves a large number of parameters but also achieves a better trade-off between toxicity and perplexity than whole-model adaptation for large-scale models.
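For reference, a bottleneck adapter of the kind this summary refers to is a small residual module inserted into each transformer layer; the base LM stays frozen and only the adapters are trained on the detoxifying corpus. A minimal PyTorch sketch, with illustrative dimensions:

import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual add."""
    def __init__(self, hidden: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden, bottleneck)
        self.up = nn.Linear(bottleneck, hidden)
        nn.init.zeros_(self.up.weight)  # start as an identity map
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(torch.relu(self.down(x)))

Zero-initializing the up-projection makes each adapter start as an identity, so training begins from the frozen base model's behavior.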
Challenges in Detoxifying Language Models
TLDR
It is demonstrated that while basic intervention strategies can effectively optimize previously established automatic metrics on the RealToxicityPrompts dataset, this comes at the cost of reduced LM coverage for both texts about, and dialects of, marginalized groups.
Challenges in Automated Debiasing for Toxic Language Detection
TLDR
The findings show that debiasing a model trained on biased toxic language data is not as effective as simply relabeling the data to remove existing biases, and proposes an automatic, dialect-aware data correction method, as a proof-of-concept.
Self-Diagnosis and Self-Debiasing: A Proposal for Reducing Corpus-Based Bias in NLP
TLDR
This paper demonstrates a surprising finding: pretrained language models recognize, to a considerable degree, their undesirable biases and the toxicity of the content they produce, and it proposes a decoding algorithm, termed self-debiasing, that reduces the probability of a language model producing problematic text.
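The decoding rule is simple enough to sketch: compute next-token probabilities once with the plain prompt and once with a prefix describing the unwanted attribute (e.g., "The following text is rude: ..."), then suppress tokens that the biased prefix makes more likely. Roughly, with an illustrative decay constant:

import numpy as np

def self_debias(p_plain: np.ndarray, p_biased: np.ndarray, decay: float = 50.0) -> np.ndarray:
    """Rescale the next-token distribution, suppressing tokens that become MORE
    likely when the prompt is prefixed with a description of the unwanted attribute.
    p_plain:  p(w | x)           -- ordinary next-token probabilities
    p_biased: p(w | prefix + x)  -- probabilities under the self-debiasing prefix
    """
    delta = p_plain - p_biased  # negative => token encouraged by the bias prefix
    scale = np.where(delta >= 0, 1.0, np.exp(decay * delta))
    p = p_plain * scale
    return p / p.sum()          # renormalize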
Robust Conversational Agents against Imperceptible Toxicity Triggers
TLDR
This work proposes attacks against conversational agents that are imperceptible, i.e., they fit the conversation in terms of coherency, relevancy, and fluency while remaining effective and scalable, and it establishes the generalizability of the proposed defense mechanism to language generation models beyond conversational agents.
TempLM: Distilling Language Models into Template-Based Generators
TLDR
On the E2E and SynthBio data-to-text datasets, it is shown that TempLM is more faithful than the original PLM and is more fluent than prior template systems.
Detoxifying Language Models Risks Marginalizing Minority Voices
TLDR
It is found that detoxification makes LMs more brittle to distribution shift, especially on language used by marginalized groups, and the tension between the controllability and distributional robustness of LMs is highlighted.
Factuality Enhanced Language Models for Open-Ended Text Generation
TLDR
This work measures and improves the factual accuracy of large-scale LMs for open-ended text generation, and proposes a factuality-enhanced training method that uses TOPICPREFIX for better awareness of facts and sentence completion as the training objective, which can vastly reduce factual errors.
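At the data level, the TOPICPREFIX idea reduces to prepending the document's topic to each training sentence so that pronouns and fragments stay grounded to the right entity; a one-line sketch (the paper's full recipe also changes the training objective):

def add_topic_prefix(topic: str, sentences: list[str]) -> list[str]:
    """Prepend the document topic (e.g., a Wikipedia title) to each sentence so
    sentence-level training examples keep their factual grounding."""
    return [f"{topic}: {s}" for s in sentences]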
...

References

Showing 1-10 of 89 references
Defending Against Neural Fake News
TLDR
A model for controllable text generation called Grover is presented; the best current discriminators can classify neural fake news from real, human-written news with 73% accuracy, assuming access to a moderate level of training data, and the best defense against Grover turns out to be Grover itself, with 92% accuracy.
Neural Text Generation with Unlikelihood Training
TLDR
It is shown that the likelihood objective itself is at fault, resulting in a model that assigns too much probability to sequences containing repeats and frequent words, unlike those from the human training distribution; the proposed unlikelihood training objective, which forces such generations to receive lower probability, thus provides a strong alternative to existing techniques.
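The unlikelihood objective itself is compact: alongside the usual NLL term, penalize probability mass assigned to "negative candidates", which in the token-level variant are tokens already seen earlier in the sequence. A PyTorch sketch:

import torch
import torch.nn.functional as F

def unlikelihood_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Token-level unlikelihood: penalize probability assigned to tokens already
    seen earlier in the sequence, pushing the model away from degenerate repetition.
    logits: (T, V), targets: (T,)
    """
    probs = F.softmax(logits, dim=-1)
    loss = logits.new_zeros(())
    for t in range(1, targets.size(0)):
        negatives = targets[:t].unique()
        negatives = negatives[negatives != targets[t]]  # don't penalize the true token
        if negatives.numel():
            loss = loss - torch.log(1.0 - probs[t, negatives] + 1e-8).sum()
    return loss / targets.size(0)

In training, this term is added to the standard likelihood loss with a mixing coefficient.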
The Secret Sharer: Evaluating and Testing Unintended Memorization in Neural Networks
TLDR
This paper describes a testing methodology for quantitatively assessing the risk that rare or unique training-data sequences are unintentionally memorized by generative sequence models (a common type of machine-learning model), and describes new, efficient procedures that can extract unique, secret sequences, such as credit card numbers.
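The paper's exposure metric can be stated in a few lines: insert a random canary into the training data, then rank the true canary against alternative candidates by model loss; exposure is log2 of the candidate count minus log2 of the canary's rank. A sketch:

import math

def exposure(canary_loss: float, reference_losses: list[float]) -> float:
    """Exposure of an inserted canary: log2 of the candidate-space size minus
    log2 of the canary's rank when all candidates are sorted by model loss
    (lower loss = more likely). High exposure means the model ranks the true
    canary far ahead of random fill-ins, i.e. it has memorized it."""
    rank = 1 + sum(loss < canary_loss for loss in reference_losses)
    total = 1 + len(reference_losses)
    return math.log2(total) - math.log2(rank)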
PowerTransformer: Unsupervised Controllable Revision for Biased Language Correction
Unconscious biases continue to be prevalent in modern text and media, calling for algorithms that can assist writers with bias correction. For example, a female character in a story is often…
Language Models are Unsupervised Multitask Learners
TLDR
It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
Fine-Tuning Language Models from Human Preferences
TLDR
This paper builds on advances in generative pretraining of language models to apply reward learning to four natural language tasks: continuing text with positive sentiment or physically descriptive language, and summarization tasks on the TL;DR and CNN/Daily Mail datasets.
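The reward-learning step can be sketched with a pairwise comparison loss; the paper actually collects 4-way human comparisons, but the pairwise Bradley-Terry form shown here is the common simplification. The learned reward then drives RL fine-tuning of the LM:

import torch
import torch.nn.functional as F

def preference_loss(r_preferred: torch.Tensor, r_other: torch.Tensor) -> torch.Tensor:
    """Comparison loss for reward-model training: the scalar reward of the
    human-preferred continuation should exceed that of the alternative."""
    return -F.logsigmoid(r_preferred - r_other).mean()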
Plug and Play Language Models: A Simple Approach to Controlled Text Generation
TLDR
The Plug and Play Language Model (PPLM) for controllable language generation is proposed, which combines a pretrained LM with one or more simple attribute classifiers that guide text generation without any further training of the LM.
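PPLM's core move is a gradient step on the LM's activations rather than its weights: ascend the attribute classifier's log-likelihood with respect to the hidden state, then decode from the perturbed state. The sketch below simplifies the method (the full version perturbs the key-value history and adds KL and norm-matching terms); attr_head and lm_head are assumed modules mapping hidden states to attribute logits and vocabulary logits:

import torch

def pplm_step(hidden: torch.Tensor, attr_head: torch.nn.Module, lm_head: torch.nn.Module,
              target: int, step_size: float = 0.03, n_steps: int = 3) -> torch.Tensor:
    """Nudge the LM's last hidden state toward an attribute before decoding:
    ascend the attribute classifier's log-probability w.r.t. the hidden state,
    then read next-token logits off the shifted state."""
    delta = torch.zeros_like(hidden, requires_grad=True)
    for _ in range(n_steps):
        logp = torch.log_softmax(attr_head(hidden + delta), dim=-1)[..., target].sum()
        grad, = torch.autograd.grad(logp, delta)
        delta = (delta + step_size * grad).detach().requires_grad_(True)
    return lm_head(hidden + delta.detach())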
CTRL: A Conditional Transformer Language Model for Controllable Generation
TLDR
CTRL is released, a 1.63 billion-parameter conditional transformer language model, trained to condition on control codes that govern style, content, and task-specific behavior, providing more explicit control over text generation.
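Control codes are ordinary leading tokens that the model was trained to associate with a style or domain, so using CTRL is a matter of prepending one. A sketch with the Hugging Face transformers classes (checkpoint name as hosted on the public hub; the 1.63B weights are large):

from transformers import CTRLLMHeadModel, CTRLTokenizer

tok = CTRLTokenizer.from_pretrained("Salesforce/ctrl")
model = CTRLLMHeadModel.from_pretrained("Salesforce/ctrl")

# Swapping the leading control code changes the style of the whole generation.
for code in ("Wikipedia", "Horror"):
    ids = tok(f"{code} A knife", return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=40, do_sample=True, top_k=50)
    print(tok.decode(out[0]))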
Overcoming catastrophic forgetting in neural networks
TLDR
It is shown that it is possible to overcome this limitation of connectionist models and train networks that can maintain expertise on tasks they have not experienced for a long time, by selectively slowing down learning on the weights important for previous tasks.
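The EWC mechanism this summary describes is a quadratic penalty anchoring each weight to its old-task value, scaled by that weight's estimated (diagonal) Fisher information; a PyTorch sketch:

import torch

def ewc_penalty(model: torch.nn.Module,
                old_params: dict, fisher: dict, lam: float = 1000.0) -> torch.Tensor:
    """Elastic Weight Consolidation: quadratic pull toward the previous task's
    weights, scaled per-parameter by Fisher information, so weights important
    to the old task resist change.  L_total = L_new + ewc_penalty(...)."""
    penalty = sum(
        (fisher[n] * (p - old_params[n]).pow(2)).sum()
        for n, p in model.named_parameters() if n in fisher
    )
    return 0.5 * lam * penalty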
Discovering and Categorising Language Biases in Reddit
TLDR
A data-driven approach using word embeddings to discover and categorise language biases on the discussion platform Reddit, which successfully discovers gender bias, religion bias, and ethnic bias in different Reddit communities.
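The association measure behind this kind of bias discovery can be sketched as a WEAT-style score over community-specific word embeddings (the paper additionally ranks and clusters the most biased words); emb is assumed to map words to vectors trained on one community's text:

import numpy as np

def bias_score(emb: dict, targets_a: list, targets_b: list, attributes: list) -> float:
    """How much closer, on average, the attribute words sit to target set A than
    to target set B (e.g., {'he','man'} vs {'she','woman'}) in the embedding space."""
    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    def mean_diff(word):
        a = np.mean([cos(emb[word], emb[t]) for t in targets_a])
        b = np.mean([cos(emb[word], emb[t]) for t in targets_b])
        return a - b
    return float(np.mean([mean_diff(w) for w in attributes]))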
...