Offline RL for Natural Language Generation with Implicit Language Q Learning

Charles Burton Snell, Ilya Kostrikov, Yi Su, Mengjiao Yang, Sergey Levine
Large language models distill broad knowledge from text corpora. However, they can be inconsistent when it comes to completing user-specified tasks. This issue can be addressed by finetuning such models via supervised learning on curated datasets, or via reinforcement learning. In this work, we propose a novel offline-RL-motivated method, implicit language Q-learning (ILQL), designed for use on language models, that combines both the flexible utility optimization framework of traditional RL…
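ILQL builds on implicit Q-learning, whose core idea is an asymmetric (expectile) regression that lets a value function approximate an upper expectile of Q without querying out-of-distribution actions. The following is an illustrative sketch of that expectile loss and a per-token Bellman target, not the paper's actual implementation; the toy rewards and values are made up for demonstration.

```python
import numpy as np

def expectile_loss(q, v, tau=0.7):
    """Asymmetric L2 (expectile) loss from implicit Q-learning:
    with tau > 0.5, cases where Q exceeds V are weighted more heavily,
    so the fitted V approaches an upper expectile of Q over actions."""
    diff = q - v
    weight = np.where(diff > 0, tau, 1.0 - tau)
    return np.mean(weight * diff ** 2)

# Toy per-token setup for a 3-token utterance: reward arrives at the end.
gamma = 0.99
rewards = np.array([0.0, 0.0, 1.0])
v_next = np.array([0.5, 0.8, 0.0])    # V(s_{t+1}); zero at the terminal token
q_target = rewards + gamma * v_next   # Bellman target the Q-head regresses to
```

Treating each generated token as an action in this way is what lets an offline method like ILQL assign credit within a sequence from a single end-of-utterance reward.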



Training language models to follow instructions with human feedback

The results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent, yielding improvements in truthfulness and reductions in toxic output generation while incurring minimal performance regressions on public NLP datasets.

Fine-Tuning Language Models from Human Preferences

This paper builds on advances in generative pretraining of language models to apply reward learning to four natural language tasks: continuing text with positive sentiment or physically descriptive language, and summarization tasks on the TL;DR and CNN/Daily Mail datasets.

Learning to summarize from human feedback

This work shows that summary quality can be significantly improved by training a model to optimize for human preferences; it also establishes that the reward model generalizes to new datasets, and that, according to human judges, optimizing the reward model yields better summaries than optimizing ROUGE.

Human-centric Dialog Training via Offline Reinforcement Learning

This work identifies implicit conversational cues that indicate positive human feedback, including language similarity, elicitation of laughter, and sentiment, embeds these cues in multiple reward functions, and develops a novel class of offline RL algorithms.

CTRL: A Conditional Transformer Language Model for Controllable Generation

The paper releases CTRL, a 1.63-billion-parameter conditional transformer language model trained to condition on control codes that govern style, content, and task-specific behavior, providing more explicit control over text generation.

GeDi: Generative Discriminator Guided Sequence Generation

GeDi is proposed as an efficient method for using smaller LMs as generative discriminators to guide generation from large LMs, making them safer and more controllable; GeDi is found to give stronger controllability than the state-of-the-art method while achieving generation speeds more than 30 times faster.
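The discriminator-guided idea above amounts to a Bayes-rule reweighting: two class-conditional next-token distributions from a small LM yield a posterior over the desired attribute, which is then used to bias the large LM's logits. A minimal sketch, assuming a binary desired/undesired split and a tunable strength `omega` (function names here are illustrative, not GeDi's actual API):

```python
import numpy as np

def class_posterior(logp_pos, logp_neg, prior_pos=0.5):
    """Bayes rule per candidate token: P(desired | token) from the two
    class-conditional next-token log-probs of a small discriminator LM."""
    joint_pos = logp_pos + np.log(prior_pos)
    joint_neg = logp_neg + np.log(1.0 - prior_pos)
    return np.exp(joint_pos - np.logaddexp(joint_pos, joint_neg))

def guided_logits(lm_logits, posterior, omega=2.0):
    """Bias the base LM's next-token logits toward tokens the
    discriminator judges likely under the desired class."""
    return lm_logits + omega * np.log(posterior + 1e-12)
```

With equal class-conditional probabilities the posterior is uniform and the base distribution is unchanged; larger `omega` trades fluency for stronger attribute control.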

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

BERT is a new language representation model designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; it can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

Context-Aware Language Modeling for Goal-Oriented Dialogue Systems

This work frames goal-oriented dialogue as a partially observed Markov decision process, interpreting the language model as a representation of both the dynamics and the policy, which allows a simple and effective method for finetuning language models in a goal-aware way and leads to significantly improved task performance.

Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing

The survey describes a unified set of mathematical notations that covers a wide variety of existing work and organizes that work along several dimensions, e.g., the choice of pre-trained language models, prompts, and tuning strategies.

Generating More Interesting Responses in Neural Conversation Models with Distributional Constraints

This work proposes a simple yet effective approach for incorporating side information in the form of distributional constraints over the generated responses, producing responses that are much less generic without sacrificing plausibility.