Corpus ID: 202660943

Fine-Tuning Language Models from Human Preferences

@article{Ziegler2019FineTuningLM,
  title={Fine-Tuning Language Models from Human Preferences},
  author={Daniel M. Ziegler and Nisan Stiennon and Jeff Wu and Tom B. Brown and Alec Radford and Dario Amodei and Paul Christiano and Geoffrey Irving},
  journal={ArXiv},
  year={2019},
  volume={abs/1909.08593}
}
Reward learning enables the application of reinforcement learning (RL) to tasks where reward is defined by human judgment, building a model of reward by asking humans questions. [...] Key Result: For summarization, models trained with 60,000 comparisons copy whole sentences from the input but skip irrelevant preamble; this leads to reasonable ROUGE scores and very good performance according to our human labelers, but may be exploiting the fact that labelers rely on simple heuristics.
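The abstract describes two ingredients: a scalar reward model fit to human comparisons between model samples, and RL fine-tuning of the language model against that learned reward with a penalty that keeps the policy close to the pretrained model. The following is a minimal illustrative sketch of those two pieces, not the authors' code; the pairwise loss, all names and shapes, and the coefficient beta are assumptions made for illustration (the paper collects comparisons among several samples and optimizes the policy with PPO, which is omitted here).

# Minimal sketch (not the authors' code) of the two pieces described above:
# (1) a scalar reward model trained on human comparisons between continuations,
# (2) a KL-shaped reward for the RL fine-tuning step.
# All names, shapes, and the coefficient `beta` are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardHead(nn.Module):
    """Scalar reward head on top of a pooled language-model encoding (encoder omitted)."""
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: [batch, hidden_size] pooled encoding of (prompt, continuation)
        return self.score(features).squeeze(-1)

def preference_loss(r_preferred: torch.Tensor, r_other: torch.Tensor) -> torch.Tensor:
    """Comparison loss: the human-preferred continuation should score higher.
    (The paper compares several samples at once; the pairwise case is shown for brevity.)"""
    return -F.logsigmoid(r_preferred - r_other).mean()

def shaped_reward(r: torch.Tensor, logp_policy: torch.Tensor,
                  logp_pretrained: torch.Tensor, beta: float = 0.1) -> torch.Tensor:
    """Reward used during RL: learned reward minus a KL-style penalty that keeps
    the fine-tuned policy close to the pretrained language model."""
    return r - beta * (logp_policy - logp_pretrained)

# Toy usage with random features standing in for the LM encodings.
reward_model = RewardHead()
feats_a, feats_b = torch.randn(4, 768), torch.randn(4, 768)
loss = preference_loss(reward_model(feats_a), reward_model(feats_b))
loss.backward()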
Recursively Summarizing Books with Human Feedback
TLDR
This method combines learning from human feedback with recursive task decomposition: it uses models trained on smaller parts of the task to assist humans in giving feedback on the broader task, and generates sensible summaries of entire books.
Human-centric Dialog Training via Offline Reinforcement Learning
TLDR
This work identifies implicit conversational cues, including language similarity, elicitation of laughter, and sentiment, that indicate positive human feedback, embeds them in multiple reward functions, and develops a novel class of offline RL algorithms.
Generative Conversational Networks
TLDR
This work shows that the approach generalises from seed data and performs well in limited-data and limited-computation settings, with significant gains for intent detection and slot tagging across multiple datasets: ATIS, TOD, SNIPS, and Restaurants8k.
Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog
TLDR
This work develops a novel class of off-policy batch RL algorithms that learn effectively offline, without exploring, from a fixed batch of human interaction data; it uses models pre-trained on data as a strong prior and KL-control to penalize divergence from this prior during RL training (a small illustrative sketch of such a KL penalty appears after this citation list).
Neural Language Generation: Formulation, Methods, and Evaluation
TLDR
There is no standard way to assess the quality of text produced by these generative models, which constitutes a serious bottleneck for the field's progress; this survey provides an informative overview of formulations, methods, and assessments of neural natural language generation.
Skill Preferences: Learning to Extract and Execute Robotic Skills from Human Feedback
TLDR
Skill Preferences (SkiP), an algorithm that learns a model over human preferences and uses it to extract human-aligned skills from offline data, substantially outperforms prior leading RL algorithms with human preferences as well as leading skill extraction algorithms without human preferences.
Plug and Play Language Models: A Simple Approach to Controlled Text Generation
TLDR
This work proposes the Plug and Play Language Model (PPLM) for controllable language generation, which combines a pretrained LM with one or more simple attribute classifiers that guide text generation without any further training of the LM.
Unsupervised Contextual Paraphrase Generation using Lexical Control and Reinforcement Learning
TLDR
This work proposes an unsupervised framework for generating contextual paraphrases with autoregressive models, along with an automated metric based on semantic similarity, textual entailment, expression diversity, and fluency to evaluate the quality of contextual paraphrases, and demonstrates performance improvement with Reinforcement Learning (RL) fine-tuning.
OptAGAN: Entropy-based finetuning on text VAE-GAN
TLDR
This work combines the training of GANs in the latent space with the finetuning of the Optimus decoder for single-word generation, and finetunes with reinforcement learning (RL) by exploiting the structure of GPT2 and adding entropy-based, intrinsically motivated rewards to balance quality and diversity.
SideControl: Controlled Open-domain Dialogue Generation via Additive Side Networks
  • Wanyu Du, Yangfeng Ji
  • Computer Science
  • ArXiv
  • 2021
TLDR
This work proposes the SIDECONTROL framework, a novel approach to controlling the generation of Transformer-based pretrained language models; it leverages a novel control-attributes loss to incorporate useful control signals and is shown to perform well with very limited training samples.
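Since several entries above ("Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog" in particular) rely on KL-control against a pretrained prior, here is a small illustrative sketch of that penalty; the shapes, names, and coefficient are assumptions, not the cited paper's implementation.

# Illustrative sketch of KL-control as described above: penalize the policy's
# per-token divergence from a pretrained prior during RL training.
# Shapes, names, and the coefficient are assumptions, not the cited paper's code.
import torch
import torch.nn.functional as F

def kl_from_prior(policy_logits: torch.Tensor, prior_logits: torch.Tensor) -> torch.Tensor:
    """KL(policy || prior) per token, given logits of shape [batch, seq_len, vocab]."""
    logp = F.log_softmax(policy_logits, dim=-1)
    logq = F.log_softmax(prior_logits, dim=-1)
    return (logp.exp() * (logp - logq)).sum(dim=-1)  # [batch, seq_len]

# Toy usage: the penalty is scaled by a small coefficient and subtracted from the return.
policy_logits, prior_logits = torch.randn(2, 6, 100), torch.randn(2, 6, 100)
penalty = 0.05 * kl_from_prior(policy_logits, prior_logits).sum(dim=-1)  # [batch]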

References

Showing 1–10 of 49 references
A Survey of Reinforcement Learning Informed by Natural Language
TLDR
This survey argues that the time is right to investigate a tight integration of natural language understanding into reinforcement learning, and reviews the state of the field, including work on instruction following, text games, and learning from textual domain knowledge.
Better Rewards Yield Better Summaries: Learning to Summarise Without References
TLDR
This work learns a reward function from human ratings on 2,500 summaries that can be used to train RL-based summarisation systems without any reference summaries, and shows that the learned rewards have significantly higher correlation with human ratings than previous approaches.
Language Models are Unsupervised Multitask Learners
TLDR
It is demonstrated that language models begin to learn language processing tasks without any explicit supervision when trained on WebText, a new dataset of millions of webpages, suggesting a promising path towards building language processing systems that learn to perform tasks from their naturally occurring demonstrations.
Reward Learning for Efficient Reinforcement Learning in Extractive Document Summarisation
TLDR
This work proposes RELIS, a novel RL paradigm that learns a reward function with Learning-to-Rank (L2R) algorithms at training time and uses it to train an input-specific RL policy at test time; RELIS is proved to generate near-optimal summaries given appropriate L2R and RL algorithms.
Learning to Understand Goal Specifications by Modelling Reward
TLDR
This work presents a framework in which instruction-conditional RL agents are trained using rewards obtained not from the environment but from reward models jointly trained on expert examples, allowing an agent to adapt to changes in the environment without requiring new expert examples.
Improving Language Understanding by Generative Pre-Training
TLDR
The general task-agnostic model outperforms discriminatively trained models that use architectures specifically crafted for each task, significantly improving upon the state of the art in 9 out of the 12 tasks studied.
Reliability and Learnability of Human Bandit Feedback for Sequence-to-Sequence Reinforcement Learning
TLDR
Improvements of over 1 BLEU can be obtained by integrating into RL for NMT a regression-based reward estimator trained on cardinal feedback for 800 translations, showing that RL is possible even with small amounts of fairly reliable human feedback and pointing to great potential for larger-scale applications.
Reinforcement Learning for Bandit Neural Machine Translation with Simulated Human Feedback
TLDR
A reinforcement learning algorithm that improves neural machine translation systems from simulated human feedback; it combines the advantage actor-critic algorithm with the attention-based neural encoder-decoder architecture and effectively optimizes traditional corpus-level machine translation metrics.
Reward learning from human preferences and demonstrations in Atari
To solve complex real-world problems with reinforcement learning, we cannot rely on manually specified reward functions. Instead, we can have humans communicate an objective to the agent directly. [...]
Controllable Neural Story Generation via Reinforcement Learning
TLDR
A human subject evaluation shows that stories generated by the introduced policy gradient reinforcement learning approach to open story generation are perceived to have significantly more plausible event ordering and higher plot coherence than a baseline language modeling technique, without perceived degradation of overall quality, enjoyability, or local causality.