Corpus ID: 11227891

Reinforcement Learning with Deep Energy-Based Policies

@inproceedings{Haarnoja2017ReinforcementLW,
  title={Reinforcement Learning with Deep Energy-Based Policies},
  author={Tuomas Haarnoja and Haoran Tang and P. Abbeel and Sergey Levine},
  booktitle={ICML},
  year={2017}
}
We propose a method for learning expressive energy-based policies for continuous states and actions, which has previously been feasible only in tabular domains. We apply our method to learning maximum entropy policies, resulting in a new algorithm, called soft Q-learning, that expresses the optimal policy via a Boltzmann distribution. We use the recently proposed amortized Stein variational gradient descent to learn a stochastic sampling network that approximates samples from this distribution… 
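In this formulation the optimal maximum entropy policy is a Boltzmann distribution over actions, π(a|s) ∝ exp(Q(s,a)/α), and the soft value V(s) = α log ∫ exp(Q(s,a)/α) da plays the role of its log-partition function. Below is a minimal NumPy sketch of these two quantities, estimated with a uniform proposal over a bounded action box; the function names, temperature, and sample counts are illustrative assumptions, not the paper's implementation.

import numpy as np

def soft_value(soft_q, state, alpha=1.0, n_samples=64, action_dim=2, rng=None):
    """Monte Carlo estimate of V(s) = alpha * log E_{a~q}[exp(Q(s,a)/alpha) / q(a)]
    with a uniform proposal q over the action box [-1, 1]^action_dim."""
    rng = np.random.default_rng() if rng is None else rng
    actions = rng.uniform(-1.0, 1.0, size=(n_samples, action_dim))
    q_vals = np.array([soft_q(state, a) for a in actions])
    log_volume = action_dim * np.log(2.0)        # 1/q(a) for the uniform proposal
    m = q_vals.max()                             # for a numerically stable log-mean-exp
    return m + alpha * (np.log(np.mean(np.exp((q_vals - m) / alpha))) + log_volume)

def boltzmann_policy(soft_q, state, candidate_actions, alpha=1.0):
    """Normalized weights of pi(a|s) ∝ exp(Q(s,a)/alpha) over a finite candidate set."""
    logits = np.array([soft_q(state, a) for a in candidate_actions]) / alpha
    logits -= logits.max()                       # stabilize the softmax
    weights = np.exp(logits)
    return weights / weights.sum()

Sampling from this Boltzmann distribution in a continuous action space is the hard part; rather than scoring a finite candidate set as above, the paper trains a stochastic sampling network with amortized Stein variational gradient descent to draw approximate samples directly.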

Citations

Learning Robot Skill Embeddings
TLDR
The main contribution of this work is an entropy-regularized policy gradient formulation for hierarchical policies, and an associated, data-efficient and robust off-policy gradient algorithm based on stochastic value gradients.
Learning Skill Embeddings for Transferable Robot Skills
TLDR
The main contribution of this work is an entropy-regularized policy gradient formulation for hierarchical policies, and an associated, data-efficient and robust off-policy gradient algorithm based on stochastic value gradients.
Bayesian Deep Q-Learning via Continuous-Time Flows
TLDR
This work proposes an efficient algorithm for Bayesian deep Q-learning that performs posterior sampling of actions in the Q-function via continuous-time flows (CTFs), achieving efficient exploration without explicit assumptions on the forms of posterior distributions.
Deep Active Inference as Variational Policy Gradients
Soft Action Particle Deep Reinforcement Learning for a Continuous Action Space
TLDR
A new off-policy actor-critic algorithm is introduced that can reduce the number of parameters significantly compared to existing actor-critic algorithms without any performance loss.
A Regularized Implicit Policy for Offline Reinforcement Learning
TLDR
This work proposes a simple modification to the classical policy-matching methods for regularizing with respect to the dual form of the Jensen–Shannon divergence and the integral probability metrics, and theoretically shows the correctness of the policy-matching approach.
Evolved Policy Gradients
TLDR
Empirical results show that the evolved policy gradient algorithm (EPG) achieves faster learning on several randomized environments compared to an off-the-shelf policy gradient method, that its learned loss can generalize to out-of-distribution test-time tasks, and that it exhibits qualitatively different behavior from other popular meta-learning algorithms.
Quantile Regression Deep Reinforcement Learning
TLDR
This work models the policy implicitly in the network, which gives the agent the freedom to approximate any distribution in each action dimension, not limiting its capabilities to the commonly used unimodal Gaussian parameterization.
Acquiring Diverse Robot Skills via Maximum Entropy Deep Reinforcement Learning
TLDR
This thesis studies how the maximum entropy framework can provide efficient deep reinforcement learning algorithms that solve tasks consistently and sample-efficiently, and devises new algorithms based on this framework, starting from soft Q-learning, which learns expressive energy-based policies, to soft actor-critic, which provides the simplicity and convenience of actor-critic methods.
Particle-Based Adaptive Discretization for Continuous Control using Deep Reinforcement Learning
TLDR
This paper proposes a general, yet simple, framework for improving the action exploration of policy gradient DRL algorithms that adapts ideas from the particle filtering literature to dynamically discretize the continuous action space and track policies represented as a mixture of Gaussians.
…

References

SHOWING 1-10 OF 53 REFERENCES
Continuous control with deep reinforcement learning
TLDR
This work presents an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces, and demonstrates that for many of the tasks the algorithm can learn policies end-to-end: directly from raw pixel inputs.
Actor-Critic Reinforcement Learning with Energy-Based Policies
TLDR
This work introduces the first sound and efficient algorithm for training energy-based policies, based on an actor-critic architecture, that is computationally efficient, converges close to a local optimum, and outperforms Sallans and Hinton (2004) in several high-dimensional domains.
Continuous Deep Q-Learning with Model-based Acceleration
TLDR
This paper derives a continuous variant of the Q-learning algorithm, called normalized advantage functions (NAF), as an alternative to the more commonly used policy gradient and actor-critic methods, and substantially improves performance on a set of simulated robotic control tasks.
Free-energy-based reinforcement learning in a partially observable environment
TLDR
This study extends the FERL framework to handle partially observable MDPs (POMDPs) by incorporating a recurrent neural network that learns a memory representation sufficient for predicting future observations and rewards.
Asynchronous Methods for Deep Reinforcement Learning
TLDR
A conceptually simple and lightweight framework for deep reinforcement learning that uses asynchronous gradient descent for optimization of deep neural network controllers and shows that asynchronous actor-critic succeeds on a wide variety of continuous motor control problems as well as on a new task of navigating random 3D mazes using a visual input.
Stein Variational Policy Gradient
TLDR
A novel Stein variational policy gradient method (SVPG), which combines existing policy gradient methods and a repulsive functional to generate a set of diverse but well-behaved policies, is proposed; a minimal sketch of the underlying SVGD update appears after this reference list.
Playing Atari with Deep Reinforcement Learning
TLDR
This work presents the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning, which outperforms all previous approaches on six of the games and surpasses a human expert on three of them.
Hierarchical Relative Entropy Policy Search
TLDR
This work defines the problem of learning sub-policies in continuous state-action spaces as finding a hierarchical policy composed of a high-level gating policy that selects the low-level sub-policies for execution by the agent, and treats the sub-policies as latent variables, which allows the update information to be distributed between them.
Maximum Entropy Inverse Reinforcement Learning
TLDR
A probabilistic approach based on the principle of maximum entropy that provides a well-defined, globally normalized distribution over decision sequences, while providing the same performance guarantees as existing methods is developed.
Reinforcement Learning with Unsupervised Auxiliary Tasks
TLDR
This paper significantly outperforms the previous state of the art on Atari, averaging 880% expert human performance, and on a challenging suite of first-person, three-dimensional Labyrinth tasks, leading to a mean speedup in learning of 10× and averaging 87% expert human performance on Labyrinth.
…
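The Stein Variational Policy Gradient entry above, like the amortized sampler described in the abstract, builds on the SVGD particle update: samples are moved by a kernel-smoothed gradient of the target log-density plus a repulsive term that keeps them diverse. The following is a minimal NumPy sketch of one SVGD step with an RBF kernel; the bandwidth, step size, and function names are illustrative assumptions rather than code from either paper.

import numpy as np

def svgd_step(particles, grad_log_p, step_size=1e-2, bandwidth=1.0):
    """One SVGD update. particles: (n, d) array of samples; grad_log_p: callable
    mapping an (n, d) array to the score grad_x log p(x) at each particle."""
    n = particles.shape[0]
    diffs = particles[:, None, :] - particles[None, :, :]      # pairwise x_i - x_j, shape (n, n, d)
    sq_dists = np.sum(diffs ** 2, axis=-1)                      # squared distances, shape (n, n)
    kernel = np.exp(-sq_dists / (2.0 * bandwidth ** 2))         # RBF kernel matrix, shape (n, n)
    scores = grad_log_p(particles)                              # shape (n, d)
    # Driving term: kernel-weighted average of scores pulls particles toward high density.
    drive = kernel @ scores / n
    # Repulsive term: kernel gradient pushes particles apart, preserving diversity.
    repulse = np.sum(kernel[:, :, None] * diffs, axis=1) / (n * bandwidth ** 2)
    return particles + step_size * (drive + repulse)

# Example: particles initialized far away drift toward a standard normal target
# (whose score is -x) while the repulsive term keeps them spread out.
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=0.5, size=(50, 2))
for _ in range(200):
    x = svgd_step(x, lambda p: -p)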