Corpus ID: 233219551

Online and Offline Reinforcement Learning by Planning with a Learned Model

@article{Schrittwieser2021OnlineAO,
  title={Online and Offline Reinforcement Learning by Planning with a Learned Model},
  author={Julian Schrittwieser and Thomas Hubert and Amol Mandhane and Mohammadamin Barekatain and Ioannis Antonoglou and David Silver},
  journal={ArXiv},
  year={2021},
  volume={abs/2104.06294}
}
Learning efficiently from small amounts of data has long been the focus of model-based reinforcement learning, both for the online case when interacting with the environment and the offline case when learning from a fixed dataset. However, to date no single unified algorithm has demonstrated state-of-the-art results in both settings. In this work, we describe the Reanalyse algorithm, which uses model-based policy and value improvement operators to compute new improved training targets on existing… 
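The abstract is truncated, but the core idea it names, recomputing training targets on previously collected data by planning with the current learned model, can be sketched in a few lines. The snippet below is a minimal, hypothetical illustration of that idea, not the authors' implementation: Transition, reanalyse_buffer, and toy_search are invented names, and the toy search function merely stands in for search (e.g. MCTS) with learned dynamics and prediction networks.

# Hypothetical sketch of the Reanalyse idea described in the abstract:
# periodically re-run model-based search over stored trajectories to produce
# fresher policy/value targets, instead of reusing the targets recorded at
# data-collection time. Names and structure are illustrative only.

from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple
import random

@dataclass
class Transition:
    observation: Sequence[float]
    # Targets computed when the data was first collected (stale over time).
    stored_policy: Sequence[float]
    stored_value: float

def reanalyse_buffer(
    buffer: List[Transition],
    search: Callable[[Sequence[float]], Tuple[Sequence[float], float]],
    fraction: float = 1.0,
) -> None:
    """Recompute policy/value targets for a fraction of the buffer.

    `search` stands in for planning with the current learned model;
    it maps an observation to an improved (policy, value) pair under
    the latest network parameters.
    """
    n = int(len(buffer) * fraction)
    for transition in random.sample(buffer, n):
        improved_policy, improved_value = search(transition.observation)
        transition.stored_policy = improved_policy
        transition.stored_value = improved_value

if __name__ == "__main__":
    # Toy stand-in for model-based search: a fixed two-action policy prior
    # and a value estimate derived from the observation. A real system would
    # run search with the learned model here.
    def toy_search(obs):
        return [0.6, 0.4], sum(obs) * 0.1

    buffer = [Transition([float(i)], [0.5, 0.5], 0.0) for i in range(10)]
    reanalyse_buffer(buffer, toy_search, fraction=0.5)
    print(buffer[0])

In the online setting, the abstract suggests such reanalysed targets are combined with fresh environment interaction; in the purely offline setting, all targets would come from reanalysing the fixed dataset.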
The Difficulty of Passive Learning in Deep Reinforcement Learning
TLDR
This work proposes the “tandem learning” experimental paradigm, and identifies function approximation in conjunction with fixed data distributions as the strongest factors, thereby extending but also challenging hypotheses stated in past work.
On Multi-objective Policy Optimization as a Tool for Reinforcement Learning
TLDR
The principles underlying multi-objective reinforcement learning (MORL) are studied and a new algorithm, Distillation of a Mixture of Experts (DiME), is introduced that is intuitive, scale-invariant under some conditions, and outperforms the state of the art on two established offline RL benchmarks.
Scalable Online Planning via Reinforcement Learning Fine-Tuning
TLDR
This work replaces tabular search with online model-based fine-tuning of a policy neural network via reinforcement learning, and shows that this approach outperforms state-of-the-art search algorithms in benchmark settings.
Self-Consistent Models and Values
TLDR
This work investigates a way of augmenting model-based RL by additionally encouraging a learned model and value function to be jointly self-consistent, and finds that, with appropriate choices, self-consistency helps both policy evaluation and control.
Improve Agents without Retraining: Parallel Tree Search with Off-Policy Correction
TLDR
A novel off-policy correction term is introduced that accounts for the mismatch between the pre-trained value and its corresponding tree-search (TS) policy by penalizing under-sampled trajectories; it is proved that this correction eliminates the mismatch and bounds the probability of sub-optimal action selection.
Procedural Generalization by Planning with Self-Supervised World Models
TLDR
Overall, this work suggests that building generalizable agents requires moving beyond the single-task, model-free paradigm and towards self-supervised model-based agents that are trained in rich, procedural, multi-task environments.
Q-Mixing Network for Multi-Agent Pathfinding in Partially Observable Grid Environments
TLDR
This paper proposes a reinforcement learning approach in which the agents first learn policies that map observations to actions and then follow these policies to reach their goals, using a mixing Q-network that complements the learning of individual policies.
Muesli: Combining Improvements in Policy Optimization
TLDR
A novel policy update is proposed that combines regularized policy optimization with model learning as an auxiliary loss, and does so without using deep search: it acts directly with a policy network and has computation speed comparable to model-free baselines.
Learning Emergent Random Access Protocol for LEO Satellite Networks
TLDR
This paper proposes a novel grant-free random access solution for LEO SAT networks, dubbed emergent random access channel protocol (eRACH), a model-free approach that emerges through interaction with the non-stationary network environment, using multi-agent deep reinforcement learning (MADRL).
The Benchmark Lottery
TLDR
The notion of a benchmark lottery, which describes the overall fragility of the ML benchmarking process, is proposed, and it is argued that this fragility might lead to biased progress in the community.

References

Showing 1-10 of 46 references
RL Unplugged: Benchmarks for Offline Reinforcement Learning
TLDR
This paper proposes a benchmark called RL Unplugged to evaluate and compare offline RL methods, a suite of benchmarks that will increase the reproducibility of experiments and make it possible to study challenging tasks with a limited computational budget, thus making RL research both more systematic and more accessible across the community.
Deployment-Efficient Reinforcement Learning via Model-Based Offline Optimization
TLDR
A novel model-based algorithm, Behavior-Regularized Model-ENsemble (BREMEN), is proposed that can effectively optimize a policy offline using 10-20 times less data than prior works, and is able to achieve impressive deployment efficiency while maintaining the same or better sample efficiency.
Behavior Regularized Offline Reinforcement Learning
TLDR
A general framework, behavior regularized actor critic (BRAC), is introduced to empirically evaluate recently proposed methods as well as a number of simple baselines across a variety of offline continuous control tasks.
Offline Reinforcement Learning from Images with Latent Space Models
TLDR
This work proposes to learn a latent-state dynamics model, and represent the uncertainty in the latent space of the model predictions, and significantly outperforms previous offline model-free RL methods as well as state-of-the-art online visual model-based RL methods.
DeepAveragers: Offline Reinforcement Learning by Solving Derived Non-Parametric MDPs
TLDR
This work introduces the Deep Averagers with Costs MDP (DAC-MDP), a non-parametric model that can leverage deep representations and account for limited data by introducing costs for exploiting under-represented parts of the model, and investigates its solutions for offline RL.
POPO: Pessimistic Offline Policy Optimization
TLDR
This work proposes a novel offline RL algorithm, Pessimistic Offline Policy Optimization (POPO), which learns a pessimistic value function to obtain a strong policy, and finds that POPO performs surprisingly well and scales to tasks with high-dimensional state and action spaces, matching or outperforming several state-of-the-art offline RL algorithms on benchmark tasks.
IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures
TLDR
A new distributed agent IMPALA (Importance Weighted Actor-Learner Architecture) is developed that not only uses resources more efficiently in single-machine training but also scales to thousands of machines without sacrificing data efficiency or resource utilisation.
Self-improving reactive agents based on reinforcement learning, planning and teaching
TLDR
This paper compares eight reinforcement learning frameworks: adaptive heuristic critic (AHC) learning due to Sutton, Q-learning due to Watkins, and three extensions to both basic methods for speeding up learning; the three extensions are experience replay, learning action models for planning, and teaching.
Reinforcement Learning with Unsupervised Auxiliary Tasks
TLDR
The proposed agent significantly outperforms the previous state of the art on Atari, averaging 880% expert human performance, and on a challenging suite of first-person, three-dimensional Labyrinth tasks, leading to a mean speedup in learning of 10× and averaging 87% expert human performance on Labyrinth.
Distributional Reinforcement Learning with Quantile Regression
TLDR
A distributional approach to reinforcement learning is built in which the distribution over returns is modeled explicitly instead of only estimating the mean, and a novel distributional reinforcement learning algorithm consistent with the theoretical formulation is presented.