Corpus ID: 237142520

Mitigating harm in language models with conditional-likelihood filtration

@article{ngo2021mitigating,
  title={Mitigating harm in language models with conditional-likelihood filtration},
  author={Helen Ngo and Cooper D. Raterink and João M. de Araújo and Ivan Zhang and Carol Chen and Adrien Morisot and Nick Frosst},
  year={2021}
}
Language models trained on large-scale unfiltered datasets curated from the open web acquire systemic biases, prejudices, and harmful views from their training data. We present a methodology for programmatically identifying and removing harmful text from web-scale datasets. A pretrained language model is used to assess the log-likelihood of researcher-written trigger phrases conditioned on a specific document, and this score is used to identify and filter documents from the dataset. We demonstrate that… 
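The filtering step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `lm_logprob` interface and the toy scoring function stand in for a real pretrained language model's conditional log-probability, and all names and thresholds here are hypothetical.

```python
def trigger_loglikelihood(document, trigger, lm_logprob):
    """Score log P(trigger | document) under a language model.

    `lm_logprob(context, continuation)` is a hypothetical stand-in for a
    pretrained LM's conditional log-probability of `continuation` given
    `context`.
    """
    return lm_logprob(document, trigger)

def filter_dataset(documents, triggers, lm_logprob, threshold):
    """Keep only documents for which no trigger phrase receives a
    conditional log-likelihood at or above `threshold`."""
    kept = []
    for doc in documents:
        scores = [trigger_loglikelihood(doc, t, lm_logprob) for t in triggers]
        if max(scores) < threshold:
            kept.append(doc)
    return kept

# Toy stand-in "LM": assigns a trigger higher likelihood when the document
# shares words with it. Purely illustrative; a real system would use a
# pretrained model's token-level log-probabilities instead.
def toy_logprob(context, continuation):
    ctx_words = set(context.lower().split())
    overlap = sum(w in ctx_words for w in continuation.lower().split())
    return -10.0 + 2.0 * overlap

docs = ["a friendly recipe for bread", "hateful slur example text"]
triggers = ["hateful slur"]
print(filter_dataset(docs, triggers, toy_logprob, threshold=-8.0))
# → ['a friendly recipe for bread']
```

The design choice mirrored here is that the filter scores documents by how strongly they *condition* the model toward harmful continuations, rather than by keyword matching on the documents themselves.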


A General Language Assistant as a Laboratory for Alignment

A ‘preference model pre-training’ stage is studied, with the goal of improving sample efficiency when fine-tuning on human preferences, and scaling trends are investigated for several training objectives relevant to alignment.

No News is Good News: A Critique of the One Billion Word Benchmark

It is suggested that the temporal nature of news and its distribution shift over time makes it poorly suited for measuring language modeling ability, and potential impact and considerations for researchers building language models and evaluation datasets are discussed.

Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model

The infrastructure as well as the 3D parallelism methodology used to train the largest monolithic transformer-based language model, Megatron-Turing NLG 530B (MT-NLG), with 530 billion parameters, is presented.

SafetyKit: First Aid for Measuring Safety in Open-domain Conversational Systems

This position paper surveys the problem of safety for end-to-end conversational AI, introducing a taxonomy of three observed phenomena: the Instigator, Yea-Sayer, and Impostor effects. It empirically assesses the extent to which current tools can measure these effects and the degree to which current systems display them.

From plane crashes to algorithmic harm: applicability of safety engineering frameworks for responsible ML

Inappropriate design and deployment of machine learning (ML) systems leads to negative downstream social and ethical impact – described here as social and ethical risks – for users, society, and the environment.

Language Models are Unsupervised Multitask Learners

It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.

Process for Adapting Language Models to Society (PALMS) with Values-Targeted Datasets

A Process for Adapting Language Models to Society (PALMS) with Values-Targeted Datasets is proposed: an iterative process to significantly change model behavior by crafting and fine-tuning on a dataset that reflects a predetermined set of target values.

Universal Adversarial Triggers for Attacking and Analyzing NLP

Adversarial examples highlight model vulnerabilities and are useful for evaluation and interpretation. We define universal adversarial triggers: input-agnostic sequences of tokens that trigger a model to produce a specific prediction when concatenated to any input from a dataset.

Censorship of Online Encyclopedias: Implications for NLP Models

It is shown how government repression, censorship, and self-censorship may impact training data and the applications that draw from them.

One billion word benchmark for measuring progress in statistical language modeling

A new benchmark corpus for measuring progress in statistical language modeling is proposed, with almost one billion words of training data; it is useful for quickly evaluating novel language-modeling techniques and for comparing their contribution when combined with other advanced techniques.

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.

On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜

Recommendations are provided, including weighing the environmental and financial costs first, investing resources into curating and carefully documenting datasets rather than ingesting everything on the web, and carrying out pre-development exercises that evaluate how the planned approach fits into research and development goals and supports stakeholder values.

Pointer Sentinel Mixture Models

The pointer sentinel-LSTM model achieves state-of-the-art language modeling performance on the Penn Treebank while using far fewer parameters than a standard softmax LSTM, and the freely available WikiText corpus is introduced.

Documenting the English Colossal Clean Crawled Corpus

This work provides some of the first documentation of the English Colossal Clean Crawled Corpus (C4), one of the largest corpora of text available, and hosts an indexed version of C4 at https://c4-search.allenai.org/, allowing anyone to search it.

The Risk of Racial Bias in Hate Speech Detection

This work proposes *dialect* and *race priming* as ways to reduce the racial bias in annotation, showing that when annotators are made explicitly aware of an AAE tweet’s dialect they are significantly less likely to label the tweet as offensive.