RL with KL penalties is better viewed as Bayesian inference

Tomasz Korbak, Ethan Perez, Christopher L. Buckley
Reinforcement learning (RL) is frequently employed in fine-tuning large language models (LMs), such as GPT-3, to penalize them for undesirable features of generated sequences, such as offensiveness, social bias, harmfulness or falsehood. The RL formulation involves treating the LM as a policy and updating it to maximize the expected value of a reward function which captures human preferences, such as non-offensiveness. In this paper, we analyze challenges associated with treating a language…
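The objective the abstract describes can be written compactly. In the notation assumed here (not spelled out in the abstract), $\pi$ is the fine-tuned policy, $\pi_0$ the pretrained LM, $r$ the reward, and $\beta$ the KL penalty coefficient:

```latex
J(\pi) \;=\; \mathbb{E}_{x \sim \pi}\!\big[r(x)\big] \;-\; \beta\,\mathrm{KL}\big(\pi \,\|\, \pi_0\big)
```

The policy maximizing this objective is $\pi^*(x) \propto \pi_0(x)\,\exp\!\big(r(x)/\beta\big)$, which is exactly a Bayesian posterior with prior $\pi_0$ and likelihood term $\exp(r(x)/\beta)$; this is the sense in which the paper argues KL-penalized RL is better viewed as Bayesian inference.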


Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog
This work develops a novel class of off-policy batch RL algorithms that learn effectively offline, without exploration, from a fixed batch of human interaction data; it uses models pre-trained on that data as a strong prior and applies KL-control to penalize divergence from this prior during RL training.
Fine-Tuning Language Models from Human Preferences
This paper builds on advances in generative pretraining of language models to apply reward learning to four natural language tasks: continuing text with positive sentiment or physically descriptive language, and summarization tasks on the TL;DR and CNN/Daily Mail datasets.
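This line of work shapes the task reward with a KL penalty toward the original LM. The sketch below is illustrative only: the function name, argument names, and the default coefficient are assumptions, and the KL term is the standard per-token log-ratio estimate rather than any particular paper's exact implementation.

```python
def kl_shaped_reward(task_reward, policy_logprobs, ref_logprobs, beta=0.1):
    """Subtract a KL penalty from the task reward, as in KL-regularized
    RL fine-tuning of LMs.

    task_reward:     scalar reward for the generated sequence
    policy_logprobs: per-token log-probs under the fine-tuned policy
    ref_logprobs:    per-token log-probs under the original (reference) LM
    beta:            KL penalty coefficient (illustrative default)
    """
    # Single-sample KL estimate: sum over tokens of log pi(y_t) - log pi0(y_t).
    kl_estimate = sum(p - q for p, q in zip(policy_logprobs, ref_logprobs))
    return task_reward - beta * kl_estimate
```

A larger `beta` keeps the policy closer to the pretrained LM; `beta=0` recovers unregularized reward maximization.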
Sequence Tutor: Conservative Fine-Tuning of Sequence Generation Models with KL-control
The proposed method improves the desired properties and structure of the generated sequences, while maintaining information originally learned from data, as well as sample diversity.
On the Weaknesses of Reinforcement Learning for Neural Machine Translation
This work proves that one of the most common RL methods for MT does not optimize the expected reward, and shows that other methods take an infeasibly long time to converge.
Learning to summarize from human feedback
This work shows that it is possible to significantly improve summary quality by training a model to optimize for human preferences, and establishes that the reward model generalizes to new datasets, and that optimizing the authors' reward model results in better summaries than optimizing ROUGE according to humans.
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
This work describes an iterated online mode of training, in which preference models and RL policies are updated on a weekly cadence with fresh human feedback data, and identifies a roughly linear relation between the RL reward and the square root of the KL divergence between the policy and its initialization.
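The scaling relation identified in that summary can be stated as follows, with $\alpha$ an empirically fitted coefficient (the symbol is assumed here, not taken from the paper):

```latex
r \;\approx\; \alpha \sqrt{\mathrm{KL}\big(\pi \,\|\, \pi_{\mathrm{init}}\big)}
```

This gives a simple diagnostic during RLHF training: reward gains that outpace this square-root growth in KL suggest genuine improvement, while reward gains bought with disproportionate KL suggest drift from the initialization.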
Action and Perception as Divergence Minimization
A unified objective for action and perception of intelligent agents is introduced; interpreting the target distribution as a latent variable model suggests powerful world models as a path toward highly adaptive agents that seek large niches in their environments, rendering task rewards optional.
A Distributional Approach to Controlled Text Generation
This approach permits specifying, in a single formal framework, both “pointwise” and “distributional” constraints over the target LM while minimizing KL divergence from the initial LM distribution.
Red Teaming Language Models with Language Models
This work automatically finds cases where a target LM behaves in a harmful way, by generating test cases (“red teaming”) using another LM, and evaluates the target LM’s replies to generated test questions using a classifier trained to detect offensive content.
On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜
Recommendations are provided, including weighing the environmental and financial costs first, investing resources into curating and carefully documenting datasets rather than ingesting everything on the web, and carrying out pre-development exercises that evaluate how the planned approach fits into research and development goals and supports stakeholder values.