Off-Policy Deep Reinforcement Learning by Bootstrapping the Covariate Shift

@inproceedings{Gelada2019OffPolicyDR,
  title={Off-Policy Deep Reinforcement Learning by Bootstrapping the Covariate Shift},
  author={Carles Gelada and Marc G. Bellemare},
  booktitle={AAAI},
  year={2019}
}
In this paper we revisit the method of off-policy corrections for reinforcement learning (COP-TD) pioneered by Hallak et al. (2017). […] We address these two issues by introducing a discount factor into COP-TD. We analyze the behavior of discounted COP-TD and find it better behaved from a theoretical perspective. We also propose an alternative soft normalization penalty that can be minimized online and obviates the need for an explicit projection step. We complement our analysis with an empirical…
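To make the description above concrete, the following is a minimal tabular sketch of how a discounted COP-TD update with a soft normalization penalty could look. It is an illustrative reading of the abstract, not the authors' implementation: the penalty form, the step sizes (alpha, lam), the discount gamma_hat, and the use of a simple mean as the estimate of E_mu[c] are assumptions, and the transition sampling is a stand-in for a real environment.

import numpy as np

# Illustrative tabular sketch (not the authors' code): learn the covariate
# shift ratio c(x) ~ d_pi(x) / d_mu(x) from transitions generated by the
# behavior policy mu, bootstrapping the ratio with a discount gamma_hat and
# replacing the projection step by a soft penalty toward E_mu[c] = 1.
def discounted_cop_td_update(c, x, a, x_next, pi, mu,
                             alpha=0.05, gamma_hat=0.99, lam=0.01):
    rho = pi[x, a] / mu[x, a]                     # per-step importance ratio
    # Discounted COP-TD target: bootstrap the previous ratio, mixed toward 1.
    target = gamma_hat * rho * c[x] + (1.0 - gamma_hat)
    c[x_next] += alpha * (target - c[x_next])
    # Soft normalization (assumed form): nudge ratios so their average under
    # the sampling distribution stays near 1, instead of an explicit projection.
    c[x_next] -= alpha * lam * (c.mean() - 1.0)
    return c

# Toy usage on a 5-state, 2-action problem with random policies.
rng = np.random.default_rng(0)
pi = rng.dirichlet(np.ones(2), size=5)            # target policy pi(a|x)
mu = rng.dirichlet(np.ones(2), size=5)            # behavior policy mu(a|x)
c = np.ones(5)                                    # ratio estimates, start at 1
x = 0
for _ in range(1000):
    a = rng.choice(2, p=mu[x])
    x_next = int(rng.integers(5))                 # stand-in for env dynamics
    c = discounted_cop_td_update(c, x, a, x_next, pi, mu)
    x = x_next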


Representation Balancing Offline Model-based Reinforcement Learning
TLDR
This paper addresses the curse of horizon exhibited by RepBM, which rejects most of the pre-collected data in long-term tasks, and presents a new objective for model learning, motivated by recent advances in the estimation of stationary distribution corrections, that effectively overcomes this limitation of RepBM.
Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction
TLDR
A practical algorithm, bootstrapping error accumulation reduction (BEAR), is proposed and it is demonstrated that BEAR is able to learn robustly from different off-policy distributions, including random and suboptimal demonstrations, on a range of continuous control tasks.
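As a rough illustration of the support constraint usually associated with BEAR, the snippet below computes a sampled kernel MMD between actions drawn from the learned policy and actions from the batch data; in the full algorithm a term of this kind constrains or penalizes the actor objective. The Laplacian kernel, bandwidth, and batch shapes here are assumptions for illustration, not the paper's exact setup.

import numpy as np

# Sampled (biased) estimate of the squared MMD between two batches of actions
# under a Laplacian kernel. In a BEAR-style actor update, a term like this
# keeps the learned policy's actions within the support of the batch data.
def laplacian_mmd(actions_p, actions_q, sigma=10.0):
    def k(x, y):
        # pairwise kernel values: exp(-||x_i - y_j||_1 / sigma)
        d = np.abs(x[:, None, :] - y[None, :, :]).sum(-1)
        return np.exp(-d / sigma)
    return (k(actions_p, actions_p).mean()
            + k(actions_q, actions_q).mean()
            - 2.0 * k(actions_p, actions_q).mean())

# Toy usage: a policy whose samples drift from the batch incurs a larger penalty.
rng = np.random.default_rng(0)
behavior_actions = rng.normal(0.0, 0.2, size=(32, 4))
policy_actions = rng.normal(0.5, 0.2, size=(32, 4))
penalty = laplacian_mmd(policy_actions, behavior_actions)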
DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections
TLDR
This work proposes an algorithm, DualDICE, that is agnostic to knowledge of the behavior policy (or policies) used to generate the dataset and improves accuracy compared to existing techniques.
DualDICE: Efficient Estimation of Off-Policy Stationary Distribution Corrections
TLDR
This work derives and studies an algorithm, DualDICE, for estimating stationary distribution ratios, and finds that this algorithm yields significant accuracy improvements compared to competing techniques.
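The following is a toy tabular sketch of the DualDICE idea as summarized above: learn an auxiliary table nu whose Bellman residual under the target policy recovers the stationary distribution correction, using only logged transitions, initial states, and the target policy. Only the simple squared-loss instance of the objective is shown; the general minimax form with a dual variable is omitted, and all hyperparameters here are assumptions.

import numpy as np

# Toy tabular sketch: estimate stationary distribution corrections without
# knowing the behavior policy. The correction for a logged transition is read
# off as the Bellman residual of the learned table nu.
def dualdice_tabular(transitions, init_states, pi, n_states, n_actions,
                     gamma=0.95, lr=0.1, iters=500, seed=0):
    rng = np.random.default_rng(seed)
    nu = np.zeros((n_states, n_actions))
    for _ in range(iters):
        grad = np.zeros_like(nu)
        for (s, a, s_next) in transitions:
            a_next = rng.choice(n_actions, p=pi[s_next])   # a' ~ pi(.|s')
            resid = nu[s, a] - gamma * nu[s_next, a_next]
            grad[s, a] += resid / len(transitions)
            grad[s_next, a_next] -= gamma * resid / len(transitions)
        for s0 in init_states:
            a0 = rng.choice(n_actions, p=pi[s0])
            grad[s0, a0] -= (1.0 - gamma) / len(init_states)
        nu -= lr * grad
    def correction(s, a, s_next):
        # estimated d_pi(s, a) / d_D(s, a) for a logged transition
        return nu[s, a] - gamma * (pi[s_next] * nu[s_next]).sum()
    return correction

# Toy usage on random 4-state, 2-action logged data.
rng = np.random.default_rng(1)
pi = rng.dirichlet(np.ones(2), size=4)
data = [(int(rng.integers(4)), int(rng.integers(2)), int(rng.integers(4)))
        for _ in range(200)]
corr = dualdice_tabular(data, init_states=[0], pi=pi, n_states=4, n_actions=2)
w_hat = corr(*data[0])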
Off-Policy Policy Gradient Algorithms by Constraining the State Distribution Shift
TLDR
This work first does a systematic analysis of state distribution mismatch in off-policy learning, and develops a novel off-policy policy optimization method to constrain the state distribution shift.
Importance Sampling Techniques for Policy Optimization
TLDR
A class of model-free policy search algorithms is proposed and analyzed that extends the recent Policy Optimization via Importance Sampling by incorporating two advanced variance-reduction techniques: per-decision and multiple importance sampling.
Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog
TLDR
This work develops a novel class of off-policy batch RL algorithms that are able to learn effectively offline, without exploring, from a fixed batch of human interaction data, using models pre-trained on data as a strong prior and KL-control to penalize divergence from this prior during RL training.
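A minimal sketch of the KL-control idea mentioned above, under the assumption that it can be expressed as reward shaping: the per-step reward is reduced in proportion to the KL divergence between the learned policy and the pre-trained prior at that state. The penalty weight and the reward-shaping formulation are illustrative choices, not the paper's exact algorithm.

import numpy as np

# Reward-shaping view of KL-control (assumed formulation): subtract a scaled
# KL divergence between the learned policy and a pre-trained prior policy.
def kl_shaped_reward(reward, policy_probs, prior_probs, c_kl=0.1):
    kl = np.sum(policy_probs * (np.log(policy_probs) - np.log(prior_probs)))
    return reward - c_kl * kl

# Toy usage: a policy that drifts away from the prior sees its reward reduced.
prior = np.array([0.25, 0.25, 0.25, 0.25])
policy = np.array([0.70, 0.10, 0.10, 0.10])
shaped = kl_shaped_reward(1.0, policy, prior)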
Offline Reinforcement Learning with Soft Behavior Regularization
TLDR
This work derives a new policy learning objective that can be used in the offline setting, which corresponds to the advantage function value of the behavior policy multiplied by a state-marginal density ratio, proposes a practical way to compute the density ratio, and demonstrates its equivalence to a state-dependent behavior regularization.
DROMO: Distributionally Robust Offline Model-based Policy Optimization
TLDR
Distributionally robust offline model-based policy optimization (DROMO) is proposed, which leverages the ideas in distributionally robust optimization to penalize a broader range of out-of-distribution state-action pairs beyond the standard empirical out-of-distribution Q-value minimization.
Learning Expected Emphatic Traces for Deep RL
TLDR
This paper develops a multi-step emphatic weighting that can be combined with replay, and a time-reversed n-step TD learning algorithm to learn the required emphatic weighting, and shows that these state weightings reduce variance compared with prior approaches, while providing convergence guarantees.

References

Showing 1-10 of 28 references
An Emphatic Approach to the Problem of Off-policy Temporal-Difference Learning
TLDR
It is shown that varying the emphasis of linear TD(λ)'s updates in a particular way causes its expected update to become stable under off-policy training.
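As an illustration of the emphasis weighting referred to here, the sketch below implements the simplest instance (λ = 0, unit interest, linear values): a followon trace F accumulates discounted importance ratios and scales each TD update. Step sizes and the random features in the usage snippet are placeholders.

import numpy as np

# Emphatic TD(0) sketch with linear function approximation: the followon
# trace F (the emphasis, for lambda = 0 and interest 1) reweights each
# importance-sampled TD update so that expected updates remain stable
# under off-policy training.
def etd0_step(w, F_prev, rho_prev, phi, phi_next, reward, rho,
              gamma=0.99, alpha=0.01):
    F = 1.0 + gamma * rho_prev * F_prev          # followon trace
    delta = reward + gamma * w @ phi_next - w @ phi
    w = w + alpha * F * rho * delta * phi        # emphasis-weighted update
    return w, F

# Toy usage with placeholder features, rewards, and importance ratios.
rng = np.random.default_rng(0)
w, F, rho_prev = np.zeros(8), 0.0, 1.0
for _ in range(100):
    phi, phi_next = rng.normal(size=8), rng.normal(size=8)
    rho = rng.uniform(0.5, 1.5)                  # pi(a|s) / mu(a|s)
    w, F = etd0_step(w, F, rho_prev, phi, phi_next, rng.normal(), rho)
    rho_prev = rho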
The Fixed Points of Off-Policy TD
TLDR
A novel TD algorithm is proposed that has approximation guarantees even in the case of off-policy sampling and which empirically outperforms existing TD methods.
Deep Reinforcement Learning with Double Q-Learning
TLDR
This paper proposes a specific adaptation to the DQN algorithm and shows that the resulting algorithm not only reduces the observed overestimations, as hypothesized, but that this also leads to much better performance on several games.
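For reference, the specific adaptation is commonly summarized as decoupling action selection from action evaluation when forming the bootstrap target: the online network picks the greedy next action and the target network evaluates it. The batch-array form and done masking below are illustrative simplifications rather than the paper's training loop.

import numpy as np

# Double Q-learning target for a batch of transitions: select the next action
# with the online network's Q-values, evaluate it with the target network's.
def double_q_target(reward, done, q_online_next, q_target_next, gamma=0.99):
    best_actions = np.argmax(q_online_next, axis=1)
    evaluated = q_target_next[np.arange(len(best_actions)), best_actions]
    return reward + gamma * (1.0 - done) * evaluated

# Toy usage on a batch of 4 transitions with 3 actions.
rng = np.random.default_rng(0)
targets = double_q_target(reward=rng.normal(size=4),
                          done=np.zeros(4),
                          q_online_next=rng.normal(size=(4, 3)),
                          q_target_next=rng.normal(size=(4, 3)))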
The Reactor: A fast and sample-efficient Actor-Critic agent for Reinforcement Learning
TLDR
This work introduces a new policy evaluation algorithm called Distributional Retrace, which brings multi-step off-policy updates to the distributional reinforcement learning setting, and introduces the β-leave-one-out policy gradient algorithm, which improves the trade-off between variance and bias by using action values as a baseline.
Safe and Efficient Off-Policy Reinforcement Learning
TLDR
A novel algorithm, Retrace(λ), is derived, believed to be the first return-based off-policy control algorithm converging a.s. to Q* without the GLIE assumption (Greedy in the Limit with Infinite Exploration).
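The defining ingredient of Retrace(λ) is its truncated per-step importance weight c_s = λ min(1, π(a_s|x_s)/μ(a_s|x_s)). The sketch below applies these traces to one off-policy trajectory to form a multi-step correction to Q(x_0, a_0); it is an evaluation-style illustration, not the full control algorithm, and the inputs are assumed to be precomputed.

# Multi-step off-policy correction for Q(x_0, a_0):
#   sum_t gamma^t * (c_1 ... c_t) * [r_t + gamma * E_pi Q(x_{t+1}, .) - Q(x_t, a_t)]
# with truncated traces c_s = lam * min(1, rho_s).
def retrace_correction(q_sa, exp_q_next, rewards, rhos, lam=1.0, gamma=0.99):
    total, coef = 0.0, 1.0                       # coef = gamma^t * c_1 ... c_t
    for t in range(len(rewards)):
        if t > 0:
            coef *= gamma * lam * min(1.0, rhos[t])
        td_error = rewards[t] + gamma * exp_q_next[t] - q_sa[t]
        total += coef * td_error
    return total

# Toy usage on a length-3 trajectory (exp_q_next[t] = E_{a~pi} Q(x_{t+1}, a)).
delta_q = retrace_correction(q_sa=[0.5, 0.4, 0.3],
                             exp_q_next=[0.45, 0.35, 0.0],
                             rewards=[1.0, 0.0, 1.0],
                             rhos=[1.0, 0.7, 1.3])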
Eligibility Traces for Off-Policy Policy Evaluation
TLDR
This paper considers the off-policy version of the policy evaluation problem, for which only one eligibility trace algorithm is known, a Monte Carlo method, and analyzes and compares this and four new eligibility trace algorithms, emphasizing their relationships to the classical statistical technique known as importance sampling.
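For concreteness, the classical per-decision importance sampling estimator that these trace algorithms are related to weights each reward only by the ratios of the decisions taken up to and including that step; a short sketch:

# Per-decision importance sampling return estimate for one trajectory:
#   sum_t gamma^t * (prod_{k <= t} rho_k) * r_t,
# with rho_k = pi(a_k|s_k) / mu(a_k|s_k).
def per_decision_is_return(rewards, rhos, gamma=0.99):
    total, weight = 0.0, 1.0
    for t, (r, rho) in enumerate(zip(rewards, rhos)):
        weight *= rho                            # product of ratios up to step t
        total += (gamma ** t) * weight * r
    return total

# Toy usage: a short off-policy trajectory.
estimate = per_decision_is_return(rewards=[1.0, 0.0, 2.0], rhos=[1.2, 0.8, 1.1])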
A Distributional Perspective on Reinforcement Learning
TLDR
This paper argues for the fundamental importance of the value distribution: the distribution of the random return received by a reinforcement learning agent, and designs a new algorithm which applies Bellman's equation to the learning of approximate value distributions.
Fast gradient-descent methods for temporal-difference learning with linear function approximation
TLDR
Two new related algorithms with better convergence rates are introduced: the first algorithm, GTD2, is derived and proved convergent just as GTD was, but uses a different objective function and converges significantly faster (but still not as fast as conventional TD).
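One common statement of the GTD2 update rules is sketched below for linear value estimation: an auxiliary weight vector w tracks the expected TD error in feature space, and the primary weights theta descend a gradient of the projected Bellman error. Step sizes are placeholders, and importance weighting of samples is omitted.

import numpy as np

# GTD2 sketch with linear function approximation: two-time-scale updates of
# the value weights theta and the auxiliary weights w.
def gtd2_step(theta, w, phi, phi_next, reward, gamma=0.99,
              alpha=0.01, beta=0.05):
    delta = reward + gamma * theta @ phi_next - theta @ phi   # TD error
    theta = theta + alpha * (phi - gamma * phi_next) * (phi @ w)
    w = w + beta * (delta - phi @ w) * phi
    return theta, w

# Toy usage with placeholder features and rewards.
rng = np.random.default_rng(0)
theta, w = np.zeros(6), np.zeros(6)
for _ in range(100):
    phi, phi_next = rng.normal(size=6), rng.normal(size=6)
    theta, w = gtd2_step(theta, w, phi, phi_next, reward=rng.normal())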
Dual Representations for Dynamic Programming
TLDR
The dual approach to dynamic programming and reinforcement learning is proposed, based on maintaining an explicit representation of visit distributions as opposed to value functions, offering a viable alternative to standard dynamic programming techniques and opening new avenues for developing algorithms for sequential decision making.
Reinforcement Learning with Unsupervised Auxiliary Tasks
TLDR
The proposed agent significantly outperforms the previous state-of-the-art on Atari, averaging 880% expert human performance, and on a challenging suite of first-person, three-dimensional Labyrinth tasks, leading to a mean speedup in learning of 10× and averaging 87% expert human performance on Labyrinth.