No More Pesky Hyperparameters: Offline Hyperparameter Tuning for RL

Han Wang, Archit Sakhadeo, Adam White, James Bell, Vincent Liu, Xutong Zhao, Puer Liu, Tadashi Kozuno, Alona Fyshe, Martha White
The performance of reinforcement learning (RL) agents is sensitive to the choice of hyperparameters. In real-world settings like robotics or industrial control systems, however, testing different hyperparameter configurations directly on the environment can be financially prohibitive, dangerous, or time-consuming. We propose a new approach to tune hyperparameters from offline logs of data, to fully specify the hyperparameters for an RL agent that learns online in the real world. The approach is…

Hyperparameter Selection for Offline Reinforcement Learning

This work focuses on offline hyperparameter selection, i.e., methods for choosing the best policy from a set of many policies trained with different hyperparameters, given only logged data, and shows that offline RL algorithms are not robust to hyperparameter choices.
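One common criterion for this kind of selection is an off-policy estimate of each candidate policy's return, computed from the logged data alone. Below is a minimal sketch using an ordinary importance-sampling estimator over logged trajectories; the function names and data layout are illustrative assumptions, not the estimators from the paper:

```python
import numpy as np

def is_value(policy_probs, trajectories):
    """Ordinary importance-sampling estimate of a policy's expected return.

    policy_probs(state, action) -> probability the candidate policy takes
    `action` in `state`. Each trajectory is a list of
    (state, action, reward, behavior_prob) tuples from the logged data.
    """
    estimates = []
    for traj in trajectories:
        ratio, ret = 1.0, 0.0
        for state, action, reward, b_prob in traj:
            ratio *= policy_probs(state, action) / b_prob  # IS correction
            ret += reward
        estimates.append(ratio * ret)
    return float(np.mean(estimates))

def select_policy(candidates, trajectories):
    """Rank candidate policies (e.g., trained with different
    hyperparameters) by their off-policy value estimate."""
    return max(candidates, key=lambda p: is_value(p, trajectories))
```

Ordinary importance sampling is unbiased but high-variance over long horizons; practical selection methods use lower-variance estimators (e.g., weighted or doubly-robust variants), which is part of why robustness to hyperparameters matters.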

A Self-Tuning Actor-Critic Algorithm

This paper presents the Self-Tuning Actor-Critic (STAC) algorithm, which self-tunes all the differentiable hyperparameters of an actor-critic loss function, discovers auxiliary tasks, and improves off-policy learning using a novel leaky V-trace operator.

Behavior Regularized Offline Reinforcement Learning

A general framework, behavior regularized actor critic (BRAC), is introduced to empirically evaluate recently proposed methods as well as a number of simple baselines across a variety of offline continuous control tasks.

Offline-to-Online Reinforcement Learning via Balanced Replay and Pessimistic Q-Ensemble

A balanced replay scheme is proposed that prioritizes samples encountered online while also encouraging the use of near-on-policy samples from the offline dataset; the method improves the sample-efficiency and performance of fine-tuned robotic agents on various locomotion and manipulation tasks.
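The balancing idea can be sketched as a biased draw between the two buffers. This is a deliberate simplification: the paper learns per-sample priorities from a density ratio, whereas the fixed `p_online` below is a hypothetical stand-in:

```python
import random

def balanced_sample(offline, online, batch_size, p_online=0.5, rng=None):
    """Simplified sketch of a balanced replay draw: with probability
    `p_online`, take a sample collected online; otherwise take an offline
    one, so fine-tuning sees near-on-policy data even when the offline
    buffer is much larger than the online one."""
    rng = rng or random.Random()
    batch = []
    for _ in range(batch_size):
        # Fall back to the offline buffer if no online data exists yet.
        pool = online if (online and rng.random() < p_online) else offline
        batch.append(rng.choice(pool))
    return batch
```

In the paper, the sampling probabilities come from a learned density ratio between online and offline data rather than a fixed constant; the point of the sketch is only the mixed-buffer draw.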

Provably Efficient Online Hyperparameter Optimization with Population-Based Bandits

This work introduces the first provably efficient PBT-style algorithm, Population-Based Bandits (PB2), which uses a probabilistic model to guide the search in an efficient way, making it possible to discover high performing hyperparameter configurations with far fewer agents than typically required by PBT.

Fast Efficient Hyperparameter Tuning for Policy Gradient Methods

This paper proposes Hyperparameter Optimisation on the Fly (HOOF), a gradient-free algorithm that requires no more than one training run to automatically adapt the hyperparameters that affect the policy update directly through the gradient.

Offline Evaluation of Online Reinforcement Learning Algorithms

This work develops three new evaluation approaches which guarantee that, given some history, algorithms are fed samples from the distribution that they would have encountered if they were run online.

Population Based Training of Neural Networks

Population Based Training is presented, a simple asynchronous optimisation algorithm which effectively utilises a fixed computational budget to jointly optimise a population of models and their hyperparameters to maximise performance.
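The exploit-and-explore loop at the heart of PBT can be sketched on a toy problem. Here each population member minimizes f(x) = x² by gradient descent, and the tuned hyperparameter is the step size; the population size, perturbation factors, and schedule below are illustrative choices, not the paper's settings:

```python
import random

def pbt(population_size=8, steps=20, exploit_every=5, seed=0):
    """Minimal synchronous PBT sketch: jointly evolve model parameters
    (x) and a hyperparameter (lr) under a fixed compute budget."""
    rng = random.Random(seed)
    pop = [{"x": rng.uniform(-5, 5), "lr": rng.uniform(1e-3, 0.5)}
           for _ in range(population_size)]
    for t in range(1, steps + 1):
        for m in pop:
            m["x"] -= m["lr"] * 2 * m["x"]        # one gradient step on x^2
        if t % exploit_every == 0:
            pop.sort(key=lambda m: m["x"] ** 2)   # lower loss is better
            half = population_size // 2
            for loser, winner in zip(pop[half:], pop[:half]):
                loser["x"] = winner["x"]           # exploit: copy weights
                loser["lr"] = winner["lr"] * rng.choice([0.8, 1.2])  # explore
    return min(pop, key=lambda m: m["x"] ** 2)
```

The key property, as in the full method, is that hyperparameters are adapted during a single training run: poorly performing members inherit both the weights and a perturbed copy of the hyperparameters of better members.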

BOHB: Robust and Efficient Hyperparameter Optimization at Scale

This work proposes a new practical state-of-the-art hyperparameter optimization method, which consistently outperforms both Bayesian optimization and Hyperband on a wide range of problem types, including high-dimensional toy functions, support vector machines, feed-forward neural networks, Bayesian neural networks, deep reinforcement learning, and convolutional neural networks.
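BOHB combines Hyperband's successive-halving schedule with a model-based sampler for proposing new configurations. The halving schedule itself is simple to sketch; the `evaluate(config, budget)` interface below is an assumption for illustration, and the model-based sampling is omitted:

```python
def successive_halving(configs, evaluate, min_budget=1, eta=3):
    """Successive-halving core used by Hyperband/BOHB: start many
    configurations on a small budget, keep the best 1/eta at each rung,
    and multiply the surviving configurations' budget by eta.

    evaluate(config, budget) -> loss (lower is better).
    """
    budget = min_budget
    while len(configs) > 1:
        losses = [(evaluate(c, budget), c) for c in configs]
        losses.sort(key=lambda t: t[0])
        keep = max(1, len(configs) // eta)
        configs = [c for _, c in losses[:keep]]
        budget *= eta
    return configs[0]
```

Hyperband runs several such brackets with different trade-offs between the number of configurations and the starting budget; BOHB's contribution is replacing random configuration sampling with a density-model-guided one while keeping this schedule.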

Online Hyper-parameter Tuning in Off-policy Learning via Evolutionary Strategies

This work proposes a framework that applies Evolutionary Strategies to online hyper-parameter tuning in off-policy learning, and shows that this method outperforms state-of-the-art off-policy learning baselines with static hyper-parameters, as well as recent prior work, over a wide range of continuous control benchmarks.