Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

We apply preference modeling and reinforcement learning from human feedback (RLHF) to finetune language models to act as helpful and harmless assistants. We find this alignment training improves performance on almost all NLP evaluations, and is fully compatible with training for specialized skills such as Python coding and summarization. We explore an iterated online mode of training, where preference models and RL policies are updated on a weekly cadence with fresh human feedback data…
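The preference modeling this abstract describes is typically trained with a pairwise Bradley–Terry loss on human comparisons between two responses; a minimal sketch under that assumption (the function name and toy scores are illustrative, not taken from the paper):

```python
import math

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise Bradley-Terry preference loss: -log sigmoid(r_chosen - r_rejected).
    The loss shrinks as the reward model scores the human-preferred
    response higher than the rejected one."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# Correctly ranked pairs incur little loss; reversed rankings incur a lot
low = preference_loss(2.0, -1.0)   # chosen outscores rejected
high = preference_loss(-1.0, 2.0)  # rejected outscores chosen
```

A reward model trained this way then supplies the scalar reward signal that the RL policy is optimized against.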
Actionable Guidance for High-Consequence AI Risk Management: Towards Standards Addressing AI Catastrophic Risks
This document provides detailed, actionable guidance focused on identifying and managing risks of events with very high or catastrophic consequences, intended as a risk-management resource for NIST's AI Risk Management Framework (AI RMF) version 1.0 (scheduled for release in early 2023).
RL with KL penalties is better viewed as Bayesian inference
This paper analyzes the challenges of treating a language model as an RL policy, argues that avoiding them requires moving beyond the RL paradigm, and shows that KL-regularised RL is equivalent to variational inference: approximating a Bayesian posterior that specifies how to update a prior LM to conform with evidence provided by the reward function.
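The stated equivalence can be made concrete. For sequences $x$, reward $r$, prior LM $\pi_0$, and KL coefficient $\beta$, the KL-penalized objective has a closed-form maximizer (a standard derivation consistent with the summary, with notation chosen here for illustration):

```latex
J(\pi) = \mathbb{E}_{x \sim \pi}\big[r(x)\big] - \beta\,\mathrm{KL}\big(\pi \,\|\, \pi_0\big),
\qquad
\pi^*(x) = \frac{1}{Z}\,\pi_0(x)\,\exp\!\big(r(x)/\beta\big)
```

Here $\pi^*$ is precisely a Bayesian posterior with prior $\pi_0$ and likelihood proportional to $\exp(r(x)/\beta)$, which is what licenses the variational-inference view.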
X-Risk Analysis for AI Research
Time-tested concepts from hazard analysis and systems safety, designed to steer large processes in safer directions, are reviewed to discuss how AI researchers can realistically have long-term impacts on the safety of AI systems.
Beyond Tabula Rasa: Reincarnating Reinforcement Learning
This work argues for an alternate approach to RL research that could significantly improve real-world RL adoption and help democratize it further, focusing on the specific setting of efficiently transferring an existing sub-optimal policy to a standalone value-based RL agent.
Adversarial vulnerability of powerful near out-of-distribution detection
This work shows a severe adversarial vulnerability of even the strongest current OOD detection techniques, studying the adversarial robustness of several post-processing techniques, including the simple Maximum of Softmax Probabilities (MSP) baseline, the Mahalanobis distance, and the newly proposed Relative Mahalanobis distance.
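The MSP baseline mentioned here is simple enough to sketch: score an input by the largest softmax probability of the classifier's logits, and treat low scores as out-of-distribution. A minimal stdlib-only illustration (function name and thresholds are illustrative, not from the paper):

```python
import math

def msp_score(logits: list[float]) -> float:
    """Maximum of Softmax Probabilities (MSP) OOD score: higher means the
    input looks more in-distribution. Adversarial perturbations can push
    OOD inputs toward high-confidence predictions, which is the
    vulnerability studied in this line of work."""
    m = max(logits)  # subtract the max for a numerically stable softmax
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return max(e / total for e in exps)

# A peaked logit vector scores near 1; a flat one scores near 1/num_classes
confident = msp_score([8.0, 0.0, 0.0])
uncertain = msp_score([1.0, 1.0, 1.0])
```

Post-processing detectors like this operate on a trained classifier's outputs, which is exactly why an adversary who controls the input can also manipulate the score.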
Teaching language models to support answers with verified quotes
This work uses reinforcement learning from human preferences to train “open-book” QA models that generate answers whilst also citing specific evidence for their claims, which aids in the appraisal of correctness.
Predictability and Surprise in Large Generative Models
This paper highlights a counterintuitive property of large-scale generative models: a paradoxical combination of predictable loss on a broad training distribution and unpredictable specific capabilities, inputs, and outputs. It analyzes how these conflicting properties give model developers various motivations for deploying these models, as well as challenges that can hinder deployment.
Red Teaming Language Models with Language Models
This work automatically finds cases where a target LM behaves in a harmful way by generating test cases ("red teaming") with another LM, and evaluates the target LM's replies to the generated test questions using a classifier trained to detect offensive content.
Training language models to follow instructions with human feedback
The results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent, yielding improvements in truthfulness and reductions in toxic output generation while incurring minimal performance regressions on public NLP datasets.
LaMDA: Language Models for Dialog Applications
It is demonstrated that fine-tuning with annotated data and enabling the model to consult external knowledge sources can lead to significant improvements towards the two key challenges of safety and factual grounding.
The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models
An anomaly detection task for flagging aberrant policies is proposed, along with several baseline detectors, to address phase transitions: capability thresholds at which the agent's behavior qualitatively shifts, leading to a sharp decrease in the true reward.
Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets
This paper argues that small algorithmically generated datasets provide a fertile ground for studying a poorly understood aspect of deep learning: generalization of overparametrized neural networks beyond memorization of the finite training dataset.
Scaling Language Models: Methods, Analysis & Insights from Training Gopher
This paper presents an analysis of Transformer-based language model performance across a wide range of model scales -- from models with tens of millions of parameters up to a 280 billion parameter model called Gopher.
WebGPT: Browser-assisted question-answering with human feedback
GPT-3 is fine-tuned to answer long-form questions using a text-based web-browsing environment that lets the model search and navigate the web; the best model is obtained by behavior cloning followed by rejection sampling against a reward model trained to predict human preferences.
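The rejection-sampling step described here (often called best-of-n) is easy to sketch: draw several candidate answers from the behavior-cloned policy and keep the one the reward model scores highest. A toy illustration with stand-in callables (none of these names are the actual WebGPT components):

```python
import random

def best_of_n(prompt: str, generate, reward_model, n: int = 4) -> str:
    """Best-of-n rejection sampling: sample n candidate answers from the
    policy (`generate`) and return the one the reward model scores
    highest. `generate` and `reward_model` are hypothetical stand-ins."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=reward_model)

# Toy stand-ins: a sampler over canned answers and answer length as "reward"
answers = ["short", "a longer answer", "the longest answer of all"]
pick = best_of_n("q", lambda p: random.choice(answers), len, n=50)
```

No policy gradients are needed at inference time; the reward model simply re-ranks samples, which is why this recipe combines cleanly with plain behavior cloning.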