Corpus ID: 246608284

Exploiting Action Impact Regularity and Exogenous State Variables for Offline Reinforcement Learning

@inproceedings{Liu2021ExploitingAI,
  title={Exploiting Action Impact Regularity and Exogenous State Variables for Offline Reinforcement Learning},
  author={Vincent Liu and James Wright and Martha White},
  year={2021}
}
Offline reinforcement learning—learning a policy from a batch of data—is known to be hard for general MDPs. In this work, we explore a restricted class of MDPs to obtain guarantees for offline reinforcement learning. The key property, which we call Action Impact Regularity (AIR), is that actions primarily impact a part of the state (an endogenous component) with limited impact on the remaining part of the state (an exogenous component). We propose an algorithm that exploits the AIR property… 
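
The abstract describes the endogenous/exogenous split only informally. As a rough illustration of the AIR property (a hypothetical sketch, not code from the paper), the environment below has a market price that evolves as a random walk regardless of the action (exogenous), while the action only changes the agent's own inventory and cash (endogenous); the class name, dynamics, and reward are illustrative assumptions.

```python
import numpy as np

class AIRTradingEnv:
    """Hypothetical environment illustrating Action Impact Regularity.

    The exogenous component (a market price) follows a random walk that the
    action cannot influence; the endogenous component (inventory and cash)
    is driven entirely by the action. This is the extreme, zero-impact case
    of the AIR property described in the abstract.
    """

    def __init__(self, horizon=50, seed=0):
        self.horizon = horizon
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.t = 0
        self.price = 100.0   # exogenous: market price
        self.inventory = 0   # endogenous: shares held
        self.cash = 0.0      # endogenous: cash balance
        return np.array([self.price, self.inventory, self.cash])

    def step(self, action):
        # action in {-1, 0, +1}: sell one share, hold, or buy one share.
        # Endogenous transition: depends on the action (and the current price).
        self.inventory += action
        self.cash -= action * self.price
        # Exogenous transition: independent of the action.
        self.price += self.rng.normal(0.0, 1.0)
        self.t += 1
        done = self.t >= self.horizon
        # Reward: portfolio value at the end of the episode, zero otherwise.
        reward = self.cash + self.inventory * self.price if done else 0.0
        return np.array([self.price, self.inventory, self.cash]), reward, done
```

In this sketch the action has zero impact on the exogenous component; as stated in the abstract, AIR only requires that this impact be limited.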

References

Showing 1–10 of 45 references
Off-Policy Deep Reinforcement Learning without Exploration
TLDR
This paper introduces a novel class of off-policy algorithms, batch-constrained reinforcement learning, which restricts the action space in order to force the agent towards behaving close to on-policy with respect to a subset of the given data.
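
As a rough illustration of the batch-constrained idea summarized above, the tabular sketch below picks greedy actions only among actions that appear with the given state in the batch; the function and data-structure names are illustrative assumptions, and the paper's BCQ algorithm additionally handles continuous actions with a learned generative model rather than a tabular restriction.

```python
from collections import defaultdict

def batch_constrained_greedy(Q, batch):
    """Greedy policy restricted to state-action pairs observed in the batch.

    Q: dict mapping (state, action) -> estimated action value.
    batch: iterable of (state, action, reward, next_state) transitions.
    Returns a dict mapping each observed state to its best in-batch action.
    """
    seen_actions = defaultdict(set)
    for s, a, _, _ in batch:
        seen_actions[s].add(a)

    policy = {}
    for s, actions in seen_actions.items():
        # Maximize only over actions the batch actually contains for this
        # state, rather than over the full action space.
        policy[s] = max(actions, key=lambda a: Q.get((s, a), 0.0))
    return policy
```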
Discovering and Removing Exogenous State Variables and Rewards for Reinforcement Learning
TLDR
Exogenous state variables and rewards are formalized, and conditions are identified under which an MDP with exogenous state can be decomposed into an exogenous Markov reward process, involving only the exogenous state and reward, and an endogenous Markov decision process, defined with respect to only the endogenous reward.
Conservative Q-Learning for Offline Reinforcement Learning
TLDR
Conservative Q-learning (CQL) is proposed, which aims to address limitations of offline RL methods by learning a conservative Q-function such that the expected value of a policy under this Q-function lower-bounds its true value.
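
As a sketch of the conservatism described above (illustrative, not the authors' implementation), a commonly used form of the CQL penalty for discrete actions pushes Q-values down via a log-sum-exp over all actions while pushing them up on dataset actions; adding such a term to the usual TD loss yields the lower-bound behaviour mentioned in the summary. The array names and shapes below are assumptions.

```python
import numpy as np

def conservative_penalty(q_values, data_actions):
    """A common form of the CQL conservatism term for discrete actions.

    q_values: array of shape (batch_size, num_actions), Q(s, a) for every
        action at each sampled state.
    data_actions: int array of shape (batch_size,), the action taken in the
        dataset at each of those states.

    The log-sum-exp over actions upper-bounds the soft value of any policy
    at the state; subtracting the Q-value of the dataset action means that,
    when this term is added to the TD loss, Q is pushed down on actions the
    batch does not support and up on actions it does.
    """
    logsumexp_q = np.log(np.exp(q_values).sum(axis=1))
    data_q = q_values[np.arange(len(data_actions)), data_actions]
    return float((logsumexp_q - data_q).mean())
```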
Provably Good Batch Reinforcement Learning Without Great Exploration
TLDR
It is shown that a small modification to the Bellman optimality and evaluation back-ups, taking a more conservative update, can have much stronger guarantees on the performance of the output policy; in certain settings, the resulting algorithms can find the approximately best policy within the state-action space explored by the batch data, without requiring a priori assumptions of concentrability.
Towards Hyperparameter-free Policy Selection for Offline Reinforcement Learning
TLDR
This paper designs hyperparameter-free algorithms for policy selection based on BVFT, a recent theoretical advance in value-function selection, and demonstrates their effectiveness in discrete-action benchmarks such as Atari.
Behavior Regularized Offline Reinforcement Learning
TLDR
A general framework, behavior regularized actor critic (BRAC), is introduced to empirically evaluate recently proposed methods as well as a number of simple baselines across a variety of offline continuous control tasks.
Near-Optimal Reinforcement Learning in Polynomial Time
TLDR
New algorithms for reinforcement learning are presented and it is proved that they have polynomial bounds on the resources required to achieve near-optimal return in general Markov decision processes.
High Confidence Policy Improvement
We present a batch reinforcement learning (RL) algorithm that provides probabilistic guarantees about the quality of each policy that it proposes, and which has no hyper-parameters that require expert tuning.
Reinforcement Learning in Environments with Independent Delayed-sense Dynamics
TLDR
This thesis develops four reinforcement learning algorithms that exploit the structure of independent delayed-sense dynamics (IDSD) problems to achieve better efficiency, and shows experimentally that the proposed algorithms evaluate a given policy more accurately than the corresponding TD(0) baseline.
Exponential Lower Bounds for Batch Reinforcement Learning: Batch RL can be Exponentially Harder than Online RL
TLDR
This work helps formalize the issue known as the deadly triad and explains that the bootstrapping problem is potentially more severe than the extrapolation issue for RL because, unlike the latter, bootstrapping cannot be mitigated by adding more samples, and online exploration is critical to enable sample-efficient RL with function approximation.