Corpus ID: 237213267

Provable Benefits of Actor-Critic Methods for Offline Reinforcement Learning

Andrea Zanette, Martin J. Wainwright, Emma Brunskill
Actor-critic methods are widely used in offline reinforcement learning practice, but are not so well-understood theoretically. We propose a new offline actor-critic algorithm that naturally incorporates the pessimism principle, leading to several key advantages compared to the state of the art. The algorithm can operate when the Bellman evaluation operator is closed with respect to the action value function of the actor's policies; this is a more general setting than the low-rank MDP model…
Representation Learning for Online and Offline RL in Low-rank MDPs
Proposes REP-UCB (Upper Confidence Bound driven REPresentation learning for RL), which significantly improves the sample complexity and is simpler than FLAMBE, as it directly balances the interplay between representation learning, exploration, and exploitation.
Towards Instance-Optimal Offline Reinforcement Learning with Pessimism
  • Ming Yin, Yu-Xiang Wang
  • Computer Science, Mathematics
  • 2021
We study the offline reinforcement learning (offline RL) problem, where the goal is to learn a reward-maximizing policy in an unknown Markov Decision Process (MDP) using the data coming from a policy…


Uncertainty Weighted Actor-Critic for Offline Reinforcement Learning
Proposes Uncertainty Weighted Actor-Critic (UWAC), an algorithm that detects OOD state-action pairs and down-weights their contribution to the training objectives accordingly; UWAC is observed to substantially improve model stability during training.
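The down-weighting idea can be sketched in a few lines. The helper below is a hypothetical illustration (the function name and the inverse-variance weighting rule are assumptions, not the paper's code): samples whose Q-value estimate is uncertain contribute less to the critic loss.

```python
import numpy as np

def uncertainty_weighted_loss(td_errors, q_variances, beta=1.0):
    """Down-weight squared TD errors by estimated epistemic uncertainty.

    Hypothetical sketch of the UWAC idea: state-action pairs whose
    Q-value estimate has high variance (likely OOD) get small weights.
    """
    weights = beta / (q_variances + 1e-8)   # inverse-variance weighting
    weights = weights / weights.sum()       # normalize over the batch
    return float(np.sum(weights * td_errors ** 2))

# Toy batch: the third sample is highly uncertain, so it barely
# contributes even though its TD error is the largest.
loss = uncertainty_weighted_loss(
    td_errors=np.array([1.0, 1.0, 3.0]),
    q_variances=np.array([0.1, 0.1, 10.0]),
)
```

Equal weighting of this batch would give a mean squared error of about 3.67; the uncertainty weighting pulls the loss close to 1, since the large-error sample is almost entirely discounted.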
Doubly Robust Off-policy Value Evaluation for Reinforcement Learning
This work extends the doubly robust estimator for bandits to sequential decision-making problems, getting the best of both worlds: it is guaranteed to be unbiased and can have much lower variance than the popular importance sampling estimators.
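The sequential doubly robust estimator admits a compact backward recursion. The sketch below is a minimal illustration (the trajectory format and the `q_hat`/`v_hat` model interface are assumptions for exposition):

```python
def doubly_robust_value(trajectory, q_hat, v_hat, gamma=1.0):
    """Recursive doubly robust off-policy value estimate for one trajectory.

    trajectory: list of (state, action, reward, rho), where rho is the
    per-step importance ratio pi_e(a|s) / pi_b(a|s).
    q_hat(s, a) and v_hat(s): a possibly inaccurate value model; the
    estimator remains unbiased even when this model is wrong.
    """
    v_dr = 0.0
    # Backward recursion: V_DR = V_hat(s) + rho * (r + gamma * V_DR' - Q_hat(s, a))
    for state, action, reward, rho in reversed(trajectory):
        v_dr = v_hat(state) + rho * (reward + gamma * v_dr - q_hat(state, action))
    return v_dr

# With an all-zero model the estimate reduces to plain importance sampling:
# here, the undiscounted return 1 + 1 = 2 of a two-step trajectory.
traj = [("s0", "a0", 1.0, 1.0), ("s1", "a1", 1.0, 1.0)]
is_estimate = doubly_robust_value(traj, lambda s, a: 0.0, lambda s: 0.0)
```

Plugging in a more accurate value model leaves the expectation unchanged but shrinks the variance, which is the "best of both worlds" the abstract refers to.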
Provably Good Batch Reinforcement Learning Without Great Exploration
It is shown that a small modification to the Bellman optimality and evaluation backups, taking a more conservative update, can yield much stronger guarantees on the performance of the output policy; in certain settings, the method finds the approximately best policy within the state-action space explored by the batch data, without requiring a priori concentrability assumptions.
Is Pessimism Provably Efficient for Offline RL?
Proposes a pessimistic variant of the value iteration algorithm (PEVI), which incorporates an uncertainty quantifier as a penalty function, and establishes a data-dependent upper bound on the suboptimality of PEVI for general Markov decision processes (MDPs).
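In tabular form, the penalty idea amounts to subtracting a count-based uncertainty bonus from the Bellman backup. The snippet below is a simplified sketch (the b(s,a) = beta / sqrt(n(s,a)) penalty shape and all names are illustrative assumptions, not PEVI's exact quantifier):

```python
import numpy as np

def pessimistic_value_iteration(P, R, counts, H, beta=1.0):
    """Finite-horizon value iteration with an uncertainty penalty.

    P: (S, A, S) estimated transitions, R: (S, A) empirical rewards,
    counts: (S, A) visit counts in the offline dataset.
    Subtracting b(s, a) = beta / sqrt(n(s, a)) makes poorly covered
    actions look unattractive, implementing pessimism.
    """
    S, A = R.shape
    V = np.zeros(S)
    bonus = beta / np.sqrt(np.maximum(counts, 1))
    for _ in range(H):
        Q = R + P @ V - bonus          # penalized Bellman backup
        Q = np.clip(Q, 0.0, None)      # keep values in a valid range
        V = Q.max(axis=1)
    return Q

# One state, two actions: the rarely seen action has higher empirical
# reward, but pessimism prefers the well-covered one.
P = np.ones((1, 2, 1))
Q = pessimistic_value_iteration(P, R=np.array([[0.5, 0.6]]),
                                counts=np.array([[100, 1]]), H=1)
```

The well-covered action (100 visits) keeps most of its value, while the single-visit action is penalized below it, so a greedy policy avoids the poorly supported region of the data.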
Behavior Regularized Offline Reinforcement Learning
A general framework, behavior regularized actor critic (BRAC), is introduced to empirically evaluate recently proposed methods as well as a number of simple baselines across a variety of offline continuous control tasks.
Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
This paper proposes soft actor-critic, an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework, which achieves state-of-the-art performance on a range of continuous control benchmark tasks, outperforming prior on-policy and off-policy methods.
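The maximum-entropy idea shows up most clearly in the critic target, where an entropy bonus augments the next-state value. The scalar helper below is an illustrative sketch under that framing, not the paper's implementation:

```python
def soft_bellman_target(reward, next_q, next_log_prob, gamma=0.99, alpha=0.2):
    """SAC-style entropy-regularized critic target.

    next_q: Q(s', a') for a sampled action a' ~ pi(.|s');
    next_log_prob: log pi(a'|s'). The -alpha * log pi term is the
    entropy bonus that keeps the policy stochastic ("soft").
    """
    return reward + gamma * (next_q - alpha * next_log_prob)

# An unlikely next action (very negative log-prob) earns a larger
# entropy bonus than a near-deterministic one.
target = soft_bellman_target(1.0, 2.0, -1.0, gamma=1.0, alpha=0.5)
```

Setting alpha to zero recovers the standard Bellman target, so the temperature directly trades off reward maximization against policy entropy.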
AlgaeDICE: Policy Gradient from Arbitrary Experience
A new formulation of max-return optimization that allows the problem to be re-expressed as an expectation over an arbitrary behavior-agnostic, off-policy data distribution; it is shown that, if the auxiliary dual variables of the objective are optimized, the gradient of the off-policy objective is exactly the on-policy policy gradient, without any use of importance weighting.
Policy Gradient Methods for Reinforcement Learning with Function Approximation
This paper proves for the first time that a version of policy iteration with arbitrary differentiable function approximation is convergent to a locally optimal policy.
Provably Efficient Reinforcement Learning with Linear Function Approximation
This paper proves that an optimistic modification of Least-Squares Value Iteration (LSVI) achieves regret on the order of the square root of T with polynomial dependence on d and H, where d is the ambient dimension of the feature space, H is the length of each episode, and T is the total number of steps; the bound is independent of the number of states and actions.
Bellman-consistent Pessimism for Offline Reinforcement Learning
The notion of Bellman-consistent pessimism for general function approximation is introduced: instead of calculating a point-wise lower bound for the value function, pessimism is implemented at the initial state over the set of functions consistent with the Bellman equations.