How to Combine Tree-Search Methods in Reinforcement Learning

@article{Efroni2019HowTC,
  title={How to Combine Tree-Search Methods in Reinforcement Learning},
  author={Yonathan Efroni and Gal Dalal and Bruno Scherrer and Shie Mannor},
  journal={ArXiv},
  year={2019},
  volume={abs/1809.01843}
}
Finite-horizon lookahead policies are abundantly used in Reinforcement Learning and demonstrate impressive empirical success. […] Our proposed enhancement is straightforward and simple: use the return from the optimal tree path to back up the values at the descendants of the root. This leads to a $\gamma^h$-contracting procedure, where $\gamma$ is the discount factor and $h$ is the tree depth. To establish our results, we first introduce a notion called \emph{multiple-step greedy consistency}. We…
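The proposed backup can be sketched in a few lines. Below is a minimal Python sketch, assuming a small deterministic MDP given by explicit next-state and reward tables; all names (tree_search, backup_optimal_path, P, R, V) are illustrative and not the authors' implementation.

import numpy as np

def tree_search(s, h, P, R, V, gamma):
    # Exhaustive depth-h lookahead from state s in a deterministic MDP.
    # Returns a list [(s_0, G_0), (s_1, G_1), ...] of states on the optimal
    # branch together with their returns-to-go; G_0 is the h-step optimal
    # return from s_0 = s, bootstrapped with V at the leaf.
    if h == 0:
        return [(s, V[s])]
    best = None
    for a in range(R.shape[1]):
        tail = tree_search(P[s, a], h - 1, P, R, V, gamma)
        ret = R[s, a] + gamma * tail[0][1]
        if best is None or ret > best[0][1]:
            best = [(s, ret)] + tail
    return best

def backup_optimal_path(s, h, P, R, V, gamma):
    # The enhancement described above: the return from the optimal tree path
    # backs up the values at the descendants of the root, i.e., every state
    # on the optimal branch receives its own return-to-go.
    for state, ret in tree_search(s, h, P, R, V, gamma):
        V[state] = ret

# Toy usage: a 3-state deterministic chain with 2 actions.
P = np.array([[1, 0], [2, 1], [2, 2]])              # P[s, a] = next state
R = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 0.0]])  # R[s, a] = reward
V = np.zeros(3)
backup_optimal_path(s=0, h=2, P=P, R=R, V=V, gamma=0.9)  # V -> [0.9, 1.0, 0.0]

In this sketch, an error in the leaf values $V$ enters the root's new value only through a factor of $\gamma^h$, which is the source of the $\gamma^h$ contraction claimed above.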


Improve Agents without Retraining: Parallel Tree Search with Off-Policy Correction
TLDR
A novel off-policy correction term is introduced that accounts for the mismatch between the pre-trained value and its corresponding TS policy by penalizing under-sampled trajectories; it is proved that this correction eliminates the mismatch and bounds the probability of sub-optimal action selection.
Planning and Learning with Adaptive Lookahead
TLDR
This work proposes, for the first time, to dynamically adapt the multi-step lookahead horizon as a function of the state and of the value estimate; it devises two PI variants and analyzes the trade-off between iteration count and computational complexity per iteration.
Local Search for Policy Iteration in Continuous Control
TLDR
An algorithm for local, regularized policy improvement in reinforcement learning (RL) that allows model-based and model-free variants to be formulated in a single framework and introduces a form of tree search for continuous action spaces.
A Framework for Reinforcement Learning and Planning
TLDR
A unifying framework for reinforcement learning and planning (FRAP), which identifies the underlying dimensions on which any planning or learning algorithm has to decide, and suggests new approaches to integration of both fields.
Tight Regret Bounds for Model-Based Reinforcement Learning with Greedy Policies
TLDR
It is established that exploring with greedy policies (acting by 1-step planning) can achieve tight minimax performance in terms of regret, so full planning in model-based RL can be avoided altogether without any performance degradation while the computational complexity decreases.
Value-based Algorithms Optimization with Discounted Multiple-step Learning Method in Deep Reinforcement Learning
  • Haibo Deng, Shiqun Yin, Xiaohong Deng, Shiwei Li
  • Computer Science
    2020 IEEE 22nd International Conference on High Performance Computing and Communications; IEEE 18th International Conference on Smart City; IEEE 6th International Conference on Data Science and Systems (HPCC/SmartCity/DSS)
  • 2020
TLDR
This paper proposes a straightforward optimization, the Discounted Multiple-step Learning Method (DMLM), which improves the performance of value-based algorithms by applying a discount factor to the truncated N-step return; the method shows better results in the authors' experiments.
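For context, methods of this kind build on the truncated N-step return bootstrapped by the current value estimate. The exact DMLM weighting is described in the cited paper; the snippet below is only a generic illustration, and the name n_step_target is made up.

def n_step_target(rewards, q_boot, gamma, n):
    # Generic truncated n-step target: sum of the next n discounted rewards
    # plus a bootstrapped tail value q_boot = max_a Q(s_{t+n}, a).
    target = sum(gamma ** k * r for k, r in enumerate(rewards[:n]))
    return target + gamma ** n * q_boot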
The Role of Lookahead and Approximate Policy Evaluation in Policy Iteration with Linear Value Function Approximation
TLDR
This paper shows that when linear function approximation is used to represent the value function, a certain minimum amount of lookahead and multi-step return is needed for the algorithm to even converge, and characterize the performance of policies obtained using such approximate policy iteration.
Greedy Multi-step Off-Policy Reinforcement Learning
TLDR
A novel bootstrapping method, which greedily takes the maximum value among the bootstrapping values with varying steps, is introduced, and new model-free RL algorithms named Greedy Multi-Step Q-Learning and Greedy Multi-Step DQN are derived from it.
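A minimal sketch of the greedy bootstrapping rule summarized above: take the maximum over the n-step bootstrapped targets for n = 1, ..., N. Names are illustrative, not the authors' code.

def greedy_multistep_target(rewards, q_boot, gamma):
    # rewards[k] = r_{t+k} for k = 0..N-1; q_boot[n-1] = max_a Q(s_{t+n}, a).
    targets, partial = [], 0.0
    for n, r in enumerate(rewards, start=1):
        partial += gamma ** (n - 1) * r
        targets.append(partial + gamma ** n * q_boot[n - 1])
    return max(targets)  # greedily pick the largest bootstrapped value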
Multi-Step Greedy and Approximate Real Time Dynamic Programming
TLDR
This paper analyzes the sample, computation, and space complexities of the generalized multi-step greedy version of RTDP and establishes that increasing h improves sample and space complexity, at the cost of additional offline computational operations.
Real-time tree search with pessimistic scenarios: Winning the NeurIPS 2018 Pommerman Competition
TLDR
A tree-search technique in which a deterministic, pessimistic scenario is used after a specified depth; since there is no branching under the deterministic scenario, the search can take into account events that occur far ahead in the future.
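The structure of such a search can be sketched as follows, assuming a generic environment interface (env.step, env.reward, value_fn) and a user-supplied pessimistic_step function; these names are hypothetical and this is not the authors' implementation.

def search(state, depth, branch_cutoff, horizon, env, actions,
           pessimistic_step, value_fn, gamma):
    # Branch over actions only up to branch_cutoff; beyond it, follow a single
    # deterministic, pessimistic rollout (no branching) until the horizon.
    if depth >= horizon:
        return value_fn(state)
    if depth >= branch_cutoff:
        next_state, reward = pessimistic_step(state)
        return reward + gamma * search(next_state, depth + 1, branch_cutoff,
                                       horizon, env, actions,
                                       pessimistic_step, value_fn, gamma)
    return max(
        env.reward(state, a)
        + gamma * search(env.step(state, a), depth + 1, branch_cutoff, horizon,
                         env, actions, pessimistic_step, value_fn, gamma)
        for a in actions
    )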

References

SHOWING 1-10 OF 38 REFERENCES
Feedback-Based Tree Search for Reinforcement Learning
TLDR
This work provides the first sample complexity bounds for a tree search-based RL algorithm and shows that a deep neural network implementation of the technique can create a competitive AI agent for the popular multi-player online battle arena (MOBA) game King of Glory.
Beyond the One Step Greedy Approach in Reinforcement Learning
TLDR
This work formulates variants of multiple-step policy improvement, derives new algorithms using these definitions and proves their convergence, and shows that recent prominent Reinforcement Learning algorithms are, in fact, instances of this framework.
Multiple-Step Greedy Policies in Online and Approximate Reinforcement Learning
TLDR
This work formulates and analyzes online and approximate algorithms that use a multi-step greedy operator, and highlights a counter-intuitive difficulty arising with soft-policy updates: even in the absence of approximations, monotonic policy improvement is not guaranteed unless the update stepsize is sufficiently large.
Is Q-learning Provably Efficient?
Model-free reinforcement learning (RL) algorithms, such as Q-learning, directly parameterize and update value functions or policies without explicitly modeling the environment. They are typically…
From Bandits to Monte-Carlo Tree Search: The Optimistic Principle Applied to Optimization and Planning
  • R. Munos
  • Computer Science
    Found. Trends Mach. Learn.
  • 2014
TLDR
The main idea presented here is that it is possible to decompose a complex decision making problem into a sequence of elementary decisions, where each decision of the sequence is solved using a (stochastic) multi-armed bandit (simple mathematical model for decision making in stochastic environments).
Learning from the hindsight plan — Episodic MPC improvement
TLDR
This work considers the iterative learning setting, where the same task can be repeated several times, and proposes a policy improvement scheme for model predictive control that learns to re-shape the original cost function so that short-horizon planning (as is realistic during real executions) with respect to the shaped cost mimics the hindsight plan.
Temporal Differences-Based Policy Iteration and Applications in Neuro-Dynamic Programming
We introduce a new policy iteration method for dynamic programming problems with discounted and undiscounted cost. The method is based on the notion of temporal differences, and is primarily geared…
Learning-based model predictive control for Markov decision processes
TLDR
This work proposes value functions as a means to deal with issues arising in conventional MPC (e.g., computational requirements and sub-optimality of actions) and uses reinforcement learning to let an MPC agent learn a value function incrementally.
Non-Stationary Approximate Modified Policy Iteration
TLDR
A new algorithmic scheme, Non-Stationary Modified Policy Iteration, is introduced: a family of algorithms parameterized by two integers $m \geq 0$ and $\ell \geq 1$ that generalizes all the above-mentioned algorithms and enjoys an improved $\frac{2\gamma\epsilon}{(1-\gamma)(1-\gamma^{\ell})}$-optimality guarantee.
Reinforcement Learning: An Introduction
TLDR
This book provides a clear and simple account of the key ideas and algorithms of reinforcement learning, which ranges from the history of the field's intellectual foundations to the most recent developments and applications.