SURF: Semi-supervised Reward Learning with Data Augmentation for Feedback-efficient Preference-based Reinforcement Learning

Jongjin Park, Younggyo Seo, Jinwoo Shin, Honglak Lee, Pieter Abbeel, Kimin Lee

Preference-based reinforcement learning (RL) has shown promise for teaching agents to perform target tasks without a costly, pre-defined reward function, by learning a reward from a supervisor’s preferences between pairs of agent behaviors. However, preference-based learning often requires a large amount of human feedback, making it difficult to apply to many applications. This data-efficiency problem, on the other hand, has typically been addressed by using unlabeled…
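The core semi-supervised idea in SURF can be sketched briefly: a reward model predicts preferences on unlabeled segment pairs, and only its confident predictions are kept as pseudo-labels for further training. The following is a minimal NumPy sketch, assuming a Bradley-Terry preference model over summed predicted rewards; the function names and the confidence threshold are illustrative, not from the paper.

```python
import numpy as np

def preference_prob(r_hat_a, r_hat_b):
    """Bradley-Terry probability that segment A is preferred over B,
    given per-step predicted rewards for each segment."""
    logit = np.sum(r_hat_a) - np.sum(r_hat_b)
    return 1.0 / (1.0 + np.exp(-logit))

def pseudo_label(unlabeled_pairs, confidence=0.9):
    """Keep only pairs on which the current reward model is confident,
    assigning the predicted preference as a pseudo-label."""
    labeled = []
    for r_a, r_b in unlabeled_pairs:
        p = preference_prob(r_a, r_b)
        if p >= confidence:
            labeled.append((r_a, r_b, 1))   # A pseudo-preferred
        elif p <= 1.0 - confidence:
            labeled.append((r_a, r_b, 0))   # B pseudo-preferred
    return labeled                          # ambiguous pairs are dropped
```

Ambiguous pairs (probability near 0.5) are filtered out, so pseudo-labeling only reinforces what the model already predicts with high confidence.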

Reinforcement Learning from Diverse Human Preferences

The proposed method is tested on a variety of tasks in DMControl and Meta-World and has shown consistent and significant improvements over existing preference-based RL algorithms when learning from diverse feedback, paving the way for real-world applications of RL methods.

Few-Shot Preference Learning for Human-in-the-Loop RL

Motivated by the success of meta-learning, this work pre-trains preference models on prior task data and quickly adapts them to new tasks using only a handful of queries, reducing the amount of online feedback needed to train manipulation policies in Meta-World by 20×, and demonstrates the method’s effectiveness on a real Franka Panda robot.

Rewards Encoding Environment Dynamics Improves Preference-based Reinforcement Learning

It is demonstrated that encoding environment dynamics in the reward function (REED) dramatically reduces the number of preference labels required in state-of-the-art preference-based RL frameworks; it is hypothesized that REED-based methods better partition the state-action space and facilitate generalization to state-action pairs not included in the preference dataset.

Efficient Meta Reinforcement Learning for Preference-based Fast Adaptation

A meta-RL algorithm is developed that enables fast policy adaptation from preference-based feedback, adapting to new tasks by querying a human’s preference between behavior trajectories instead of using per-step numeric rewards.

Symbol Guided Hindsight Priors for Reward Learning from Human Preferences

This work presents the PRIor Over Rewards (PRIOR) framework, which incorporates priors about the structure of the reward function and the preference feedback into the reward learning process, and demonstrates that using an abstract state space for the computation of the priors further improves reward learning and the agent’s performance.

Advances in Preference-based Reinforcement Learning: A Review

A unified PbRL framework is presented to include the newly emerging approaches that improve the scalability and efficiency of PbRL.

Reinforcement Learning with Action-Free Pre-Training from Videos

A framework is introduced that learns representations useful for understanding dynamics via generative pre-training on videos, improving both the performance and sample-efficiency of vision-based RL in a variety of manipulation and locomotion tasks.

PrefRec: Preference-based Recommender Systems for Reinforcing Long-term User Engagement

This work proposes a novel paradigm, Preference-based Recommender systems (PrefRec), which allows RL recommender systems to learn from preferences about users’ historical behaviors rather than explicitly defined rewards, and designs an effective optimization method for PrefRec that uses an additional value function, expectile regression, and reward-model pre-training to improve performance.
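The expectile regression mentioned above replaces the symmetric squared error with an asymmetric one, so that over- and under-estimation errors are weighted differently. A minimal sketch of the loss, assuming `tau` is the expectile level (the exact formulation in PrefRec may differ):

```python
import numpy as np

def expectile_loss(diff, tau=0.7):
    """Asymmetric squared loss: residuals where diff > 0 (target above
    prediction) are weighted by tau, the rest by 1 - tau."""
    weight = np.where(diff > 0, tau, 1.0 - tau)
    return weight * diff ** 2
```

With `tau > 0.5` the loss penalizes under-estimation more heavily, which biases the learned value toward an upper expectile of its targets.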

Relative Behavioral Attributes: Filling the Gap between Symbolic Goal Specification and Reward Learning from Human Preferences

This work proposes two practical methods that can learn to model any kind of behavioral attribute from ordered behavior clips, and demonstrates their effectiveness on four tasks with nine different behavioral attributes, showing that once the attributes are learned, end users can produce desirable agent behaviors relatively effortlessly by providing only around ten instances of feedback.

Task Decoupling in Preference-based Reinforcement Learning for Personalized Human-Robot Interaction

  • Mingjiang Liu, Chunlin Chen
  • 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2022
This work decouples the task from preference in human-robot interaction, utilizing a sketchy task reward derived from task priors to instruct robots to conduct more effective task exploration, and incorporates prior knowledge of the task into preference-based RL.

PEBBLE: Feedback-Efficient Interactive Reinforcement Learning via Relabeling Experience and Unsupervised Pre-training

This work presents an off-policy, interactive RL algorithm that capitalizes on the strengths of both feedback and off-policy learning, and is able to utilize real-time human feedback to effectively prevent reward exploitation and learn new behaviors that are difficult to specify with standard reward functions.

B-Pref: Benchmarking Preference-Based Reinforcement Learning

This paper introduces B-Pref, a benchmark specially designed for preference-based RL, and showcases its utility by using it to analyze algorithmic design choices, such as selecting informative queries, for state-of-the-art preference-based RL algorithms.

Batch Active Preference-Based Learning of Reward Functions

This paper develops a new algorithm, batch active preference-based learning, that enables efficient learning of reward functions using as few data samples as possible while still having short query generation times.

Active Preference-Based Learning of Reward Functions

This work builds on work in label ranking and proposes to learn from preferences (or comparisons) instead: the person provides the system a relative preference between two trajectories, and the system takes an active learning approach, deciding which preference queries to make.
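One common heuristic for deciding which preference query to make (not necessarily the exact criterion of the paper above, which optimizes the query's expected information about the reward) is to ask about the trajectory pair on which an ensemble of reward models disagrees most. A minimal sketch under that assumption, where each ensemble member maps a trajectory segment to a scalar return estimate:

```python
import numpy as np

def select_query(candidate_pairs, ensemble):
    """Pick the index of the trajectory pair whose predicted preference the
    reward-model ensemble disagrees on most (variance of the Bradley-Terry
    preference probabilities across ensemble members)."""
    def prob(model, pair):
        seg_a, seg_b = pair
        logit = model(seg_a) - model(seg_b)
        return 1.0 / (1.0 + np.exp(-logit))

    disagreement = [np.var([prob(m, pair) for m in ensemble])
                    for pair in candidate_pairs]
    return int(np.argmax(disagreement))
```

Pairs on which all models agree carry little information, so querying the human about the most contested pair tends to reduce uncertainty about the reward fastest.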

Reward learning from human preferences and demonstrations in Atari

This work trains a deep neural network to model the reward function and uses its predicted reward to train a DQN-based deep reinforcement learning agent on 9 Atari games, achieving strictly superhuman performance on 2 games without using game rewards.

Deep Reinforcement Learning from Human Preferences

This work explores goals defined in terms of (non-expert) human preferences between pairs of trajectory segments in order to effectively solve complex RL tasks without access to the reward function, including Atari games and simulated robot locomotion.
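The reward-learning objective underlying this line of work is a cross-entropy loss on a Bradley-Terry model: the probability that a human prefers one segment is a softmax over the summed predicted rewards of the two segments. A minimal NumPy sketch (the original uses the same model, though trained with a deep network and minibatch optimization):

```python
import numpy as np

def preference_loss(r_hat_a, r_hat_b, pref):
    """Cross-entropy loss on the Bradley-Terry preference model.
    r_hat_a, r_hat_b: per-step predicted rewards for segments A and B.
    pref = 1 means the human preferred A, 0 means B."""
    logit = np.sum(r_hat_a) - np.sum(r_hat_b)
    p_a = 1.0 / (1.0 + np.exp(-logit))        # P(A preferred over B)
    eps = 1e-12                               # numerical safety
    return -(pref * np.log(p_a + eps) + (1 - pref) * np.log(1 - p_a + eps))
```

Minimizing this loss over a dataset of labeled segment pairs shapes the reward model so that preferred segments accumulate higher predicted return.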

Reinforcement Learning with Augmented Data

It is shown that augmentations such as random translate, crop, color jitter, patch cutout, random convolutions, and amplitude scale can enable simple RL algorithms to outperform complex state-of-the-art methods across common benchmarks.
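Of the augmentations listed, random translate is representative: pad the observation with zeros and crop back to the original size at a random offset. A minimal NumPy sketch for an HxWxC image (pad size and helper name are illustrative):

```python
import numpy as np

def random_translate(img, pad=4, rng=None):
    """Zero-pad an HxWxC image by `pad` pixels on each side, then crop
    back to the original size at a random offset."""
    if rng is None:
        rng = np.random.default_rng()
    h, w, c = img.shape
    padded = np.zeros((h + 2 * pad, w + 2 * pad, c), dtype=img.dtype)
    padded[pad:pad + h, pad:pad + w] = img
    top = rng.integers(0, 2 * pad + 1)
    left = rng.integers(0, 2 * pad + 1)
    return padded[top:top + h, left:left + w]
```

Applied to observations before they reach the policy and critic, such augmentations act as a regularizer without changing the underlying RL algorithm.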

Quantifying Generalization in Reinforcement Learning

It is shown that deeper convolutional architectures improve generalization, as do methods traditionally found in supervised learning, including L2 regularization, dropout, data augmentation and batch normalization.

Scalable agent alignment via reward modeling: a research direction

This work outlines a high-level research direction to solve the agent alignment problem centered around reward modeling: learning a reward function from interaction with the user and optimizing the learned reward function with reinforcement learning.

Unsupervised Data Augmentation for Consistency Training

A new perspective on how to effectively noise unlabeled examples is presented and it is argued that the quality of noising, specifically those produced by advanced data augmentation methods, plays a crucial role in semi-supervised learning.
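The consistency-training objective at the heart of this approach penalizes divergence between the model's prediction on a clean unlabeled example and on its augmented version. A minimal sketch using KL divergence over class-probability vectors (UDA additionally stops gradients through the clean prediction and sharpens it, which is omitted here):

```python
import numpy as np

def consistency_loss(p_clean, p_aug):
    """KL divergence between the prediction on a clean unlabeled example
    and the prediction on its augmented version."""
    eps = 1e-12  # numerical safety for log of small probabilities
    return np.sum(p_clean * (np.log(p_clean + eps) - np.log(p_aug + eps)))
```

The loss is zero when the two predictions agree and grows as augmentation changes the model's output, pushing the model toward augmentation-invariant predictions on unlabeled data.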