On Reinforcement Learning and Distribution Matching for Fine-Tuning Language Models with no Catastrophic Forgetting

Tomasz Korbak, Hady ElSahar, Germán Kruszewski, Marc Dymetman

The availability of large pre-trained models is changing the landscape of Machine Learning research and practice, moving from a “training from scratch” to a “fine-tuning” paradigm. While in some applications the goal is to “nudge” the pre-trained distribution towards preferred outputs, in others it is to steer it towards a different distribution over the sample space. Two main paradigms have emerged to tackle this challenge: Reward Maximization (RM) and, more recently, Distribution Matching (DM)…

RL with KL penalties is better viewed as Bayesian inference

Challenges associated with treating language models as RL policies are analyzed, and it is argued that RL is not a good formal framework for thinking about fine-tuning LMs.
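The equivalence behind this paper's argument rests on a standard closed form: maximizing expected reward under a KL penalty toward the pretrained LM recovers a reweighted version of that LM. A sketch, with the notation assumed here (sequence-level reward r(x), KL coefficient β, pretrained distribution π₀) rather than taken verbatim from the entry:

```latex
J(\pi) = \mathbb{E}_{x \sim \pi}\,[r(x)] - \beta\, \mathrm{KL}(\pi \,\|\, \pi_0)
\quad\Longrightarrow\quad
\pi^*(x) = \frac{1}{Z}\, \pi_0(x)\, \exp\!\big(r(x)/\beta\big),
\qquad Z = \sum_x \pi_0(x)\, \exp\!\big(r(x)/\beta\big).
```

The optimal policy is thus the Bayesian posterior obtained by updating π₀ with the "likelihood" exp(r(x)/β), which is what motivates viewing KL-penalized RL as inference rather than reward maximization.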



Plug and Play Language Models: A Simple Approach to Controlled Text Generation

The Plug and Play Language Model (PPLM) for controllable language generation is proposed, which combines a pretrained LM with one or more simple attribute classifiers that guide text generation without any further training of the LM.

A Distributional Approach to Controlled Text Generation

This approach permits specifying, in a single formal framework, both “pointwise” and “distributional” constraints over the target LM while minimizing KL divergence from the initial LM distribution.

Reinforcement Learning with Deep Energy-Based Policies

A method is proposed for learning expressive energy-based policies for continuous states and actions, previously feasible only in tabular domains, via a new algorithm, called soft Q-learning, that expresses the optimal policy as a Boltzmann distribution over Q-values.
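The Boltzmann (softmax-over-Q) policy at the heart of soft Q-learning is easy to state concretely. A minimal sketch for the discrete-action case, using an assumed temperature parameter `alpha` (the paper's continuous-action setting additionally requires amortized sampling, which is omitted here):

```python
import math

def boltzmann_policy(q_values, alpha=1.0):
    """Softmax over Q-values with temperature alpha: pi(a) proportional to exp(Q(a)/alpha).

    As alpha -> 0 this approaches the greedy policy; larger alpha yields a
    more exploratory, maximum-entropy policy.
    """
    # Subtract the max Q-value before exponentiating for numerical stability.
    m = max(q_values)
    exps = [math.exp((q - m) / alpha) for q in q_values]
    z = sum(exps)  # partition function
    return [e / z for e in exps]
```

The stability trick (subtracting the max) leaves the distribution unchanged because the constant cancels in the normalization.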

Sequence Level Training with Recurrent Neural Networks

This work proposes a novel sequence level training algorithm that directly optimizes the metric used at test time, such as BLEU or ROUGE, and outperforms several strong baselines for greedy generation.

Reinforcement Learning: An Introduction

This book provides a clear and simple account of the key ideas and algorithms of reinforcement learning, ranging from the field's intellectual foundations to the most recent developments and applications.

Policy Gradient Methods for Reinforcement Learning with Function Approximation

This paper proves for the first time that a version of policy iteration with arbitrary differentiable function approximation converges to a locally optimal policy.

Simple statistical gradient-following algorithms for connectionist reinforcement learning

This article presents a general class of associative reinforcement learning algorithms for connectionist networks containing stochastic units. These algorithms are shown to make weight adjustments in a direction along the gradient of expected reinforcement, in both immediate-reinforcement tasks and certain limited forms of delayed-reinforcement tasks, without explicitly computing gradient estimates.
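The gradient-following update this entry describes is the REINFORCE estimator. A minimal sketch on a two-armed bandit with a Bernoulli-sigmoid policy (the bandit setup and parameter names are illustrative assumptions, not from the paper): the update `theta += lr * r * d/dtheta log pi(a)` is an unbiased estimate of the gradient of expected reward.

```python
import math
import random

def reinforce_bernoulli(reward_fn, steps=2000, lr=0.1, seed=0):
    """REINFORCE on a two-armed bandit.

    The policy samples action 1 with probability sigmoid(theta). For this
    policy, d/dtheta log pi(a | theta) = a - p, so each update moves theta
    along a stochastic estimate of the gradient of expected reward.
    Returns the final probability of choosing action 1.
    """
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + math.exp(-theta))  # P(action = 1)
        a = 1 if rng.random() < p else 0
        r = reward_fn(a)
        grad_logp = a - p  # score function of the Bernoulli-sigmoid policy
        theta += lr * r * grad_logp
    return 1.0 / (1.0 + math.exp(-theta))
```

With a reward of 1 for arm 1 and 0 for arm 0, the learned probability of pulling arm 1 climbs toward 1, without the algorithm ever computing the expected-reward gradient explicitly.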

Importance Sampling. In Monte Carlo theory, methods and examples (chapter), 2013.
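The core identity of importance sampling is worth stating concretely: an expectation under a target distribution p can be estimated from samples of a proposal q by reweighting each draw with w(x) = p(x)/q(x). A minimal sketch with an illustrative discrete toy example (all names and the example distribution are assumptions for illustration):

```python
import random

def importance_sampling_mean(f, p, q, sample_q, n=100_000, seed=0):
    """Estimate E_{x~p}[f(x)] from n samples of the proposal q.

    Each draw x ~ q is reweighted by w(x) = p(x) / q(x), giving the
    unbiased estimator (1/n) * sum_i w(x_i) * f(x_i).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = sample_q(rng)
        total += (p(x) / q(x)) * f(x)
    return total / n

# Toy check: the target p puts mass (0.1, 0.3, 0.6) on {0, 1, 2} and the
# proposal q is uniform, so the true mean is 0.3 + 1.2 = 1.5.
```

The estimator is unbiased whenever q assigns nonzero probability everywhere p does; its variance depends on how well q covers the regions where p(x)·f(x) is large.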

Ethical and social risks of harm from Language Models

This paper aims to help structure the risk landscape associated with large-scale Language Models (LMs) by analyzing a wide range of established and anticipated risks, drawing on multidisciplinary literature from computer science, linguistics, and social sciences.

Efficient Exploration via State Marginal Matching

This work recasts exploration as a problem of State Marginal Matching (SMM) and demonstrates that agents directly optimizing the SMM objective explore faster and adapt more quickly to new tasks than prior exploration methods.