• Corpus ID: 252780451

A Behavior Regularized Implicit Policy for Offline Reinforcement Learning

Shentao Yang, Zhendong Wang, Huangjie Zheng, Yihao Feng, Mingyuan Zhou
Offline reinforcement learning enables learning from a fixed dataset, without further interactions with the environment. The lack of environmental interactions makes the policy training vulnerable to state-action pairs far from the training dataset and prone to missing rewarding actions. For training more effective agents, we propose a framework that supports learning a flexible yet well-regularized fully-implicit policy. We further propose a simple modification to the classical policy-matching… 
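
The abstract mentions a policy-matching regularizer that keeps the learned policy close to the behavior distribution. As a generic, sample-based sketch of such a regularizer (an RBF-kernel MMD between policy actions and dataset actions; this is an assumption for illustration, not necessarily the paper's exact modification), one could write:

```python
import numpy as np

def rbf(x, y, sigma=1.0):
    """RBF kernel matrix between two sets of action samples."""
    d = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d / (2.0 * sigma ** 2))

def mmd2(policy_actions, data_actions, sigma=1.0):
    """Squared MMD between policy samples and dataset actions: a generic
    sample-based policy-matching penalty (hypothetical sketch, not the
    paper's objective). Zero when the two sample sets match."""
    kxx = rbf(policy_actions, policy_actions, sigma).mean()
    kyy = rbf(data_actions, data_actions, sigma).mean()
    kxy = rbf(policy_actions, data_actions, sigma).mean()
    return kxx + kyy - 2.0 * kxy
```

Adding such a term to the actor loss penalizes action distributions that drift away from the dataset's support, which is the broad idea behind behavior regularization.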

Conservative Q-Learning for Offline Reinforcement Learning

Conservative Q-learning (CQL) is proposed, which addresses limitations of offline RL methods by learning a conservative Q-function such that the expected value of a policy under this Q-function lower-bounds its true value.
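
The conservative term in CQL pushes Q-values down on broadly sampled actions while pushing them up on dataset actions. A one-sample, numpy-level sketch of that regularizer (function name and the unweighted log-sum-exp are illustrative assumptions, not the full CQL objective):

```python
import numpy as np

def cql_penalty(q_sampled, q_data):
    """Simplified CQL-style regularizer for one state: log-sum-exp over
    Q-values of sampled actions (pushed down) minus the Q-value of the
    dataset action (pushed up). Minimizing this makes Q conservative."""
    lse = np.log(np.sum(np.exp(q_sampled)))  # soft-maximum over actions
    return lse - q_data
```

In practice this penalty is added to the standard Bellman error with a trade-off coefficient, so the learned Q-function lower-bounds the policy's value.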

Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction

A practical algorithm, bootstrapping error accumulation reduction (BEAR), is proposed and it is demonstrated that BEAR is able to learn robustly from different off-policy distributions, including random and suboptimal demonstrations, on a range of continuous control tasks.

Addressing Function Approximation Error in Actor-Critic Methods

This paper builds on Double Q-learning, by taking the minimum value between a pair of critics to limit overestimation, and draws the connection between target networks and overestimation bias.
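
The clipped double-Q idea above is simple to state in code: the TD target uses the minimum of the two critics' next-state estimates. A minimal sketch (function name and scalar inputs are assumptions for illustration):

```python
import numpy as np

def clipped_double_q_target(r, q1_next, q2_next, gamma=0.99, done=False):
    """TD target that takes the minimum of two critic estimates at the
    next state, limiting the overestimation bias of a single critic."""
    q_min = np.minimum(q1_next, q2_next)
    return r + gamma * (1.0 - float(done)) * q_min

# Example: reward 1.0, next-state critic values 5.0 and 4.0 -> uses 4.0
target = clipped_double_q_target(1.0, 5.0, 4.0)
```

Both critics regress toward this shared, pessimistic target, while the actor is updated against only one of them.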

Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks

This work introduces a class of CNNs called deep convolutional generative adversarial networks (DCGANs), that have certain architectural constraints, and demonstrates that they are a strong candidate for unsupervised learning.

Offline Reinforcement Learning with Implicit Q-Learning

This work proposes implicit Q-learning (IQL), a new offline RL method that never needs to evaluate actions outside of the dataset, yet still enables the learned policy to improve substantially over the best behavior in the data through generalization.

Uncertainty-Based Offline Reinforcement Learning with Diversified Q-Ensemble

This work proposes an uncertainty-based offline RL method that accounts for the confidence of the Q-value prediction and requires no estimation or sampling of the data distribution. It shows that clipped Q-learning, a technique widely used in online RL, can be leveraged to penalize out-of-distribution (OOD) data points with high prediction uncertainty.
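
Extending clipped Q-learning from a pair of critics to an ensemble of N critics, the pessimistic value is the elementwise minimum across the ensemble: actions where the critics disagree (typically OOD actions) automatically receive a low value. A minimal sketch (function name is an assumption; this is the min-based variant, not a confidence-bound one):

```python
import numpy as np

def pessimistic_q(q_ensemble):
    """Elementwise minimum over an ensemble of Q-estimates, one row per
    critic. High ensemble disagreement drags the minimum down, acting as
    an implicit uncertainty penalty on OOD actions."""
    q = np.asarray(q_ensemble)
    return q.min(axis=0)
```

A mean-minus-std lower confidence bound is a common alternative to the hard minimum when finer control over the pessimism level is desired.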

OptiDICE: Offline Policy Optimization via Stationary Distribution Correction Estimation

This paper presents an offline RL algorithm, OptiDICE, that directly estimates the stationary distribution corrections of the optimal policy and, unlike previous offline RL algorithms, does not rely on policy gradients.

A Minimalist Approach to Offline Reinforcement Learning

It is shown that the performance of state-of-the-art offline RL algorithms can be matched by simply adding a behavior cloning term to the policy update of an online RL algorithm and normalizing the data; the resulting algorithm is a baseline that is simple to implement and tune.
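
The "minimalist" recipe above (TD3+BC) amounts to one extra mean-squared-error term in the actor loss, with the Q-term rescaled by its own magnitude so a single coefficient works across tasks. A rough numpy sketch (function name and the exact normalization epsilon are assumptions):

```python
import numpy as np

def td3_bc_actor_loss(q_values, pi_actions, data_actions, alpha=2.5):
    """Actor loss of the minimalist TD3+BC recipe: maximize Q on the
    policy's actions while regressing them toward the dataset actions.
    lam rescales the Q-term by its average magnitude."""
    lam = alpha / (np.abs(q_values).mean() + 1e-8)
    bc = np.mean((pi_actions - data_actions) ** 2)  # behavior cloning term
    return -lam * q_values.mean() + bc
```

Because the behavior-cloning term vanishes when the policy reproduces the dataset actions, the loss smoothly trades off imitation against value maximization.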

Offline Reinforcement Learning with Fisher Divergence Critic Regularization

This work parameterizes the critic as the log of the behavior policy that generated the offline data, plus a state-action value offset term that can be learned with a neural network. The resulting algorithm, termed Fisher-BRC (Behavior Regularized Critic), achieves both improved performance and faster convergence over existing state-of-the-art methods.

Implicit Distributional Reinforcement Learning

An implicit distributional actor-critic is proposed that consists of a distributional critic, built on two deep generator networks, and a semi-implicit actor (SIA) powered by a flexible policy distribution, improving the sample efficiency of policy-gradient-based reinforcement learning algorithms.