Corpus ID: 492962

Safe and Efficient Off-Policy Reinforcement Learning

@inproceedings{Munos2016SafeAE,
  title={Safe and Efficient Off-Policy Reinforcement Learning},
  author={R{\'e}mi Munos and Tom Stepleton and Anna Harutyunyan and Marc G. Bellemare},
  booktitle={NIPS},
  year={2016}
}
In this work, we take a fresh look at some old and new algorithms for off-policy, return-based reinforcement learning. Expressing these in a common form, we derive a novel algorithm, Retrace($\lambda$), with three desired properties: (1) it has low variance; (2) it safely uses samples collected from any behaviour policy, whatever its degree of "off-policyness"; and (3) it is efficient as it makes the best use of samples collected from near on-policy behaviour policies. We analyze the…
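To make the operator concrete, the following is a minimal NumPy sketch of how a per-trajectory Retrace($\lambda$) target can be assembled from the truncated importance weights $c_t = \lambda \min(1, \pi(a_t|x_t)/\mu(a_t|x_t))$ used in the paper. The function name, array layout, and the assumption that the expectations $\mathbb{E}_\pi Q(x_{t+1},\cdot)$ are precomputed are illustrative choices, not part of the paper.

```python
import numpy as np

def retrace_targets(q, rewards, mu_probs, pi_probs, exp_q_next, gamma=0.99, lam=1.0):
    """Hypothetical per-trajectory Retrace(lambda) targets (a sketch, not the paper's code).

    q          : Q(x_t, a_t) estimates along the trajectory, shape (T,)
    rewards    : r_t, shape (T,)
    mu_probs   : behaviour-policy probabilities mu(a_t | x_t), shape (T,)
    pi_probs   : target-policy probabilities pi(a_t | x_t), shape (T,)
    exp_q_next : E_{a ~ pi} Q(x_{t+1}, a), shape (T,), assumed 0 at the terminal step
    """
    T = len(rewards)
    # Truncated importance weights: c_t = lam * min(1, pi(a_t|x_t) / mu(a_t|x_t)).
    c = lam * np.minimum(1.0, pi_probs / mu_probs)
    # Off-policy TD errors under the target policy:
    # delta_t = r_t + gamma * E_pi Q(x_{t+1}, .) - Q(x_t, a_t).
    deltas = rewards + gamma * exp_q_next - q
    # Backward recursion Delta_t = delta_t + gamma * c_{t+1} * Delta_{t+1},
    # then target_t = Q(x_t, a_t) + Delta_t.
    targets = np.empty(T)
    acc = 0.0
    for t in reversed(range(T)):
        acc = deltas[t] + (gamma * c[t + 1] * acc if t + 1 < T else 0.0)
        targets[t] = q[t] + acc
    return targets
```

Truncating the ratios at 1 is what keeps the variance of the trace bounded (property 1) while still contracting for an arbitrary behaviour policy (property 2); near on-policy, the weights stay close to $\lambda$, so little of the trajectory is cut off (property 3).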

Citations

Safe Policy Learning from Observations
TLDR
A stochastic policy improvement algorithm, termed Rerouted Behavior Improvement (RBI), that safely improves the average behavior; its primary advantages are its stability in the presence of value estimation errors and the elimination of a policy search process.
Revisiting Peng's Q($\lambda$) for Modern Reinforcement Learning
TLDR
Peng's Q(λ), which was thought to be unsafe, is a theoretically sound and practically effective algorithm that outperforms conservative algorithms despite its simplicity.
Gap-Increasing Policy Evaluation for Efficient and Noise-Tolerant Reinforcement Learning
TLDR
Detailed theoretical analysis of the novel policy evaluation algorithm, which leverages gap-increasing value update operators from advantage learning for noise tolerance and the off-policy eligibility trace of the Retrace algorithm for efficient learning, shows that its learning is significantly more efficient than that of a simple learning-rate-based approach.
P3O: Policy-on Policy-off Policy Optimization
TLDR
A simple algorithm named P3O is developed that interleaves off-policy updates with on-policy updates and uses the effective sample size between the behavior policy and the target policy to control how far they can be from each other, without introducing any additional hyper-parameters.
Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction
TLDR
A practical algorithm, bootstrapping error accumulation reduction (BEAR), is proposed and it is demonstrated that BEAR is able to learn robustly from different off-policy distributions, including random and suboptimal demonstrations, on a range of continuous control tasks.
Distributionally Robust Reinforcement Learning
TLDR
This work considers risk-averse exploration in the approximate RL setting, proposes a distributionally robust policy iteration scheme that provides a lower-bound guarantee on state values, and presents a practical algorithm implementing an exploration strategy that acts conservatively in the short term and explores optimistically in the long run.
Relative Importance Sampling For Off-Policy Actor-Critic in Deep Reinforcement Learning
TLDR
This work presents the first relative importance sampling off-policy actor-critic (RIS-Off-PAC) model-free algorithms in RL, which use action values generated from the behavior policy, rather than from the target policy, in the reward function to train the algorithm.
Algorithm selection of off-policy reinforcement learning algorithm
TLDR
The article presents a novel meta-algorithm, called Epochal Stochastic Bandit Algorithm Selection (ESBAS), that freezes the policy updates at each epoch and leaves a rebooted stochastic bandit in charge of the algorithm selection.
Efficiently Breaking the Curse of Horizon: Double Reinforcement Learning in Infinite-Horizon Processes
TLDR
A new estimator based on Double Reinforcement Learning (DRL) that leverages the problem structure for OPE; it remains efficient when both nuisances (the q-function and the stationary density ratio) are estimated at slow, nonparametric rates, and remains consistent when either is estimated consistently.

References

SHOWING 1-10 OF 26 REFERENCES
Off-Policy Temporal Difference Learning with Function Approximation
TLDR
The first algorithm for off-policy temporal-difference learning that is stable with linear function approximation is introduced, and it is proved that, given training under any ε-soft policy, the algorithm converges w.p.1 to a close approximation to the action-value function for an arbitrary target policy.
Off-policy learning with eligibility traces: a survey
TLDR
A comprehensive algorithmic derivation of all algorithms in a recursive and memory-efficient form is described, which suggests that the most standard algorithms (on- and off-policy LSTD(λ), or TD(λ) if the feature space dimension is too large for a least-squares approach) perform the best.
Eligibility Traces for Off-Policy Policy Evaluation
TLDR
This paper considers the off-policy version of the policy evaluation problem, for which only one eligibility trace algorithm is known, a Monte Carlo method, and analyzes and compares this and four new eligibility trace algorithms, emphasizing their relationships to the classical statistical technique known as importance sampling.
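As background for the classical technique mentioned in this summary, here is a small, generic sketch of a per-decision importance-sampling return estimate; it only illustrates the correction itself, not the specific eligibility-trace algorithms analyzed in the paper, and the function name and array shapes are assumptions.

```python
import numpy as np

def per_decision_is_return(rewards, mu_probs, pi_probs, gamma=0.99):
    """Per-decision importance-sampling estimate of the target-policy return
    from one trajectory generated by the behaviour policy (illustrative sketch)."""
    rho = pi_probs / mu_probs              # per-step ratios pi(a_t|x_t) / mu(a_t|x_t)
    cum_rho = np.cumprod(rho)              # product of ratios up to and including step t
    discounts = gamma ** np.arange(len(rewards))
    # Each reward is re-weighted by the likelihood ratio of the actions that produced it.
    return float(np.sum(discounts * cum_rho * rewards))
```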
Q($\lambda$) with Off-Policy Corrections
TLDR
It is proved that approximate corrections are sufficient for off-policy convergence in both policy evaluation and control, provided certain conditions relating the distance between the target and behavior policies, the eligibility trace parameter, and the discount factor hold; these conditions formalize an underlying tradeoff in off-policy TD(λ).
Off-policy learning based on weighted importance sampling with linear computational complexity
TLDR
New off-policy learning algorithms that obtain the benefits of WIS with O(n) computational complexity by maintaining, for each component of the parameter vector, a measure of the extent to which that component has been used in previous examples.
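For context on the weighted importance sampling (WIS) that this reference builds on, a tiny sketch of the ordinary and weighted Monte Carlo estimators is given below; the cited paper's contribution is obtaining WIS-like behaviour with parametric function approximation at O(n) cost, which this sketch does not attempt to reproduce.

```python
import numpy as np

def ois_wis(returns, rho):
    """Ordinary (OIS) vs. weighted (WIS) importance-sampling value estimates.

    returns : Monte Carlo returns G_i from n behaviour-policy trajectories, shape (n,)
    rho     : per-trajectory importance ratios prod_t pi(a_t|x_t)/mu(a_t|x_t), shape (n,)
    """
    ois = np.mean(rho * returns)               # unbiased, but variance can be enormous
    wis = np.sum(rho * returns) / np.sum(rho)  # biased yet consistent, far lower variance
    return ois, wis
```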
Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms
TLDR
This paper examines the convergence of single-step on-policy RL algorithms for control with both decaying exploration and persistent exploration and provides examples of exploration strategies that result in convergence to both optimal values and optimal policies.
Toward Minimax Off-policy Value Estimation
TLDR
It is shown that while the so-called regression estimator is asymptotically optimal, for small sample sizes it may perform suboptimally compared to an ideal oracle up to a multiplicative factor that depends on the number of actions.
Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding
TLDR
It is concluded that reinforcement learning can work robustly in conjunction with function approximators, and that there is little justification at present for avoiding the case of general λ.
Reinforcement Learning: An Introduction
TLDR
This book provides a clear and simple account of the key ideas and algorithms of reinforcement learning, ranging from the history of the field's intellectual foundations to the most recent developments and applications.