Corpus ID: 28202810

Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor

@inproceedings{Haarnoja2018SoftAO,
  title={Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor},
  author={Tuomas Haarnoja and Aurick Zhou and Pieter Abbeel and Sergey Levine},
  booktitle={ICML},
  year={2018}
}
Model-free deep reinforcement learning (RL) algorithms have been demonstrated on a range of challenging decision making and control tasks. [...] In this framework, the actor aims to maximize expected reward while also maximizing entropy; that is, to succeed at the task while acting as randomly as possible. Prior deep RL methods based on this framework have been formulated as Q-learning methods.
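The maximum-entropy objective described in the abstract augments the expected return with an entropy bonus at every step. The snippet below is a minimal illustration of that objective for a discrete-action policy; the temperature `alpha`, discount `gamma`, and toy trajectory are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def entropy(probs):
    """Shannon entropy H(pi(.|s)) of a discrete action distribution."""
    probs = np.asarray(probs, dtype=float)
    return -np.sum(probs * np.log(probs + 1e-12))

def max_entropy_return(rewards, action_probs, alpha=0.2, gamma=0.99):
    """Entropy-augmented discounted return:
    sum_t gamma^t * (r_t + alpha * H(pi(.|s_t))).
    rewards[t] is the reward at step t; action_probs[t] is the policy's
    action distribution at the visited state s_t."""
    return sum(
        gamma ** t * (r + alpha * entropy(p))
        for t, (r, p) in enumerate(zip(rewards, action_probs))
    )

# Toy two-step trajectory: a near-uniform policy earns a larger entropy
# bonus than a near-deterministic one, all else being equal.
print(max_entropy_return(rewards=[1.0, 0.5],
                         action_probs=[[0.5, 0.5], [0.9, 0.1]]))
```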
Soft Actor-Critic Algorithms and Applications
TLDR
Soft Actor-Critic (SAC), the recently introduced off-policy actor-critic algorithm based on the maximum entropy RL framework, achieves state-of-the-art performance, outperforming prior on-policy and off-policy methods in sample efficiency and asymptotic performance.
Soft Policy Gradient Method for Maximum Entropy Deep Reinforcement Learning
TLDR
This paper presents an off-policy, actor-critic, model-free maximum entropy deep RL algorithm called deep soft policy gradient (DSPG), which combines a hard policy gradient with the soft Bellman equation to ensure stable learning while eliminating the need for two separate critics for soft value functions.
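The "soft Bellman equation" mentioned here is the entropy-regularized analogue of the ordinary Bellman backup. In SAC-style notation (with entropy temperature α) it can be written as below; DSPG's exact formulation may differ in detail.

```latex
% Entropy-regularized (soft) Bellman backup; alpha is the temperature.
\[
  Q(s_t, a_t) = r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1}}\!\big[ V(s_{t+1}) \big],
  \qquad
  V(s_t) = \mathbb{E}_{a_t \sim \pi}\!\big[ Q(s_t, a_t) - \alpha \log \pi(a_t \mid s_t) \big].
\]
```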
Generative Actor-Critic: An Off-policy Algorithm Using the Push-forward Model
  • Lingwei Peng, Hui Qian, Zhebang Shen, Chao Zhang, Fei Li
  • Computer Science
  • 2021
TLDR
A density-free off-policy algorithm, Generative Actor-Critic (GAC), is proposed, using the push-forward model to increase the expressiveness of policies; it also includes an entropy-like technique, the MMD-entropy regularizer, to balance exploration and exploitation.
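The MMD-entropy regularizer mentioned above builds on the kernel maximum mean discrepancy between sets of samples. The sketch below is only a generic RBF-kernel MMD estimate between policy action samples and a broad reference distribution; the bandwidth, the choice of reference, and how the quantity enters the loss are assumptions for illustration, not GAC's exact regularizer.

```python
import numpy as np

def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between rows of x and rows of y."""
    sq_dists = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def mmd_squared(actions, reference, bandwidth=1.0):
    """Biased estimate of squared MMD between policy action samples and
    samples from a broad reference distribution (e.g. uniform over the
    action box). Smaller values indicate a more 'spread out' policy."""
    k_aa = rbf_kernel(actions, actions, bandwidth).mean()
    k_rr = rbf_kernel(reference, reference, bandwidth).mean()
    k_ar = rbf_kernel(actions, reference, bandwidth).mean()
    return k_aa + k_rr - 2.0 * k_ar

rng = np.random.default_rng(0)
policy_actions = rng.normal(0.0, 0.1, size=(256, 2))   # concentrated policy samples
reference = rng.uniform(-1.0, 1.0, size=(256, 2))       # broad reference samples
print(mmd_squared(policy_actions, reference))
```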
Soft Action Particle Deep Reinforcement Learning for a Continuous Action Space
TLDR
A new off-policy actor-critic algorithm is introduced that significantly reduces the number of parameters compared to existing actor-critic algorithms without any performance loss.
Off-Policy Actor-Critic in an Ensemble: Achieving Maximum General Entropy and Effective Environment Exploration in Deep Reinforcement Learning
TLDR
A new policy iteration theory is proposed as an important extension of soft policy iteration and Soft Actor-Critic (SAC), one of the most efficient model-free algorithms for deep reinforcement learning, and arbitrary entropy measures that generalize Shannon entropy can be utilized to properly randomize action selection.
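One well-known family of entropy measures that generalizes Shannon entropy is Tsallis entropy, which recovers the Shannon case as q approaches 1. Whether this matches the specific measures used in the cited paper is an assumption; the snippet below only illustrates such a generalization.

```python
import numpy as np

def tsallis_entropy(probs, q=2.0):
    """Tsallis entropy S_q(p) = (1 - sum_i p_i^q) / (q - 1).
    In the limit q -> 1 this reduces to the Shannon entropy
    -sum_i p_i log p_i."""
    probs = np.asarray(probs, dtype=float)
    if np.isclose(q, 1.0):
        return -np.sum(probs * np.log(probs + 1e-12))
    return (1.0 - np.sum(probs ** q)) / (q - 1.0)

p = [0.7, 0.2, 0.1]
print(tsallis_entropy(p, q=1.0))   # Shannon limit
print(tsallis_entropy(p, q=2.0))   # one generalized alternative
```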
CONTINUOUS CONTROL
Some of the most successful applications of deep reinforcement learning to challenging domains in discrete and continuous control have used policy gradient methods in the on-policy setting. However, [...]
Deep Reinforcement Learning with Dynamic Optimism
TLDR
This work shows that the optimal degree of optimism can vary both across tasks and over the course of learning, and introduces a novel deep actor-critic algorithm, Dynamic Optimistic and Pessimistic Estimation (DOPE), that switches between optimistic and pessimistic value learning online by formulating the selection as a multi-armed bandit problem.
Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables
TLDR
This paper develops an off-policy meta-RL algorithm that disentangles task inference and control and performs online probabilistic filtering of latent task variables to infer how to solve a new task from small amounts of experience.
Improving Exploration in Soft-Actor-Critic with Normalizing Flows Policies
TLDR
This work introduces Normalizing Flow policies within the SAC framework that learn more expressive classes of policies than simple factored Gaussians, and shows empirically on continuous grid-world tasks that the approach increases stability and is better suited to difficult exploration in sparse-reward settings.

References

Showing 1-10 of 40 references
Q-Prop: Sample-Efficient Policy Gradient with An Off-Policy Critic
Model-free deep reinforcement learning (RL) methods have been successful in a wide variety of simulated domains. However, a major [...]
The Reactor: A Sample-Efficient Actor-Critic Architecture
TLDR
A new reinforcement learning agent called Reactor (for Retrace-actor) is introduced, based on an off-policy multi-step return actor-critic architecture that is sample-efficient thanks to the use of memory replay and numerically efficient since it uses multi-step returns.
The Reactor: A fast and sample-efficient Actor-Critic agent for Reinforcement Learning
TLDR
This work introduces a new policy evaluation algorithm called Distributional Retrace, which brings multi-step off-policy updates to the distributional reinforcement learning setting, and introduces the β-leave-one-out policy gradient algorithm, which improves the trade-off between variance and bias by using action values as a baseline.
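Distributional Retrace lifts the Retrace(λ) off-policy correction to return distributions. For reference, the standard expected-value Retrace update it builds on is shown below (μ is the behaviour policy, π the target policy); the Reactor's distributional variant differs in the details.

```latex
% Retrace(lambda) off-policy correction (Munos et al., 2016).
\[
  \Delta Q(x_0, a_0)
  = \sum_{t \ge 0} \gamma^{t} \Big( \prod_{s=1}^{t} c_s \Big)
    \Big( r_t + \gamma\, \mathbb{E}_{a \sim \pi} Q(x_{t+1}, a) - Q(x_t, a_t) \Big),
  \qquad
  c_s = \lambda \min\!\Big( 1, \frac{\pi(a_s \mid x_s)}{\mu(a_s \mid x_s)} \Big).
\]
```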
Bridging the Gap Between Value and Policy Based Reinforcement Learning
TLDR
A new RL algorithm, Path Consistency Learning (PCL), is developed that minimizes a notion of soft consistency error along multi-step action sequences extracted from both on- and off-policy traces and significantly outperforms strong actor-critic and Q-learning baselines across several benchmarks.
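The soft consistency error minimized by PCL comes from a path-wise identity that the optimal maximum-entropy policy and value function must satisfy. Sketched below with notation lightly adapted (τ is the entropy weight, d the sub-trajectory length); PCL penalizes the squared violation of this equality on both on-policy and replayed trajectories.

```latex
% Soft path consistency over a sub-trajectory s_t, a_t, ..., s_{t+d}.
\[
  V_\phi(s_t) - \gamma^{d}\, V_\phi(s_{t+d})
  = \sum_{i=0}^{d-1} \gamma^{i}
    \Big( r(s_{t+i}, a_{t+i}) - \tau \log \pi_\theta(a_{t+i} \mid s_{t+i}) \Big).
\]
```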
Continuous control with deep reinforcement learning
TLDR
This work presents an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces, and demonstrates that for many of the tasks the algorithm can learn policies end-to-end: directly from raw pixel inputs.
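The deterministic policy gradient underlying this algorithm (DDPG) updates the actor by backpropagating the critic's action-gradient through the deterministic policy. Written out with notation adapted, the update direction is:

```latex
% Deterministic policy gradient theorem (Silver et al., 2014), the basis of DDPG.
\[
  \nabla_\theta J(\mu_\theta)
  = \mathbb{E}_{s \sim \rho^{\mu}}\!\Big[
      \nabla_a Q(s, a)\big|_{a = \mu_\theta(s)}\, \nabla_\theta \mu_\theta(s)
    \Big].
\]
```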
Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates
TLDR
It is demonstrated that a recent deep reinforcement learning algorithm based on off-policy training of deep Q-functions can scale to complex 3D manipulation tasks and can learn deep neural network policies efficiently enough to train on real physical robots.
Reinforcement Learning with Deep Energy-Based Policies
TLDR
A method for learning expressive energy-based policies for continuous states and actions, previously feasible only in tabular domains, is proposed, and a new algorithm, called soft Q-learning, that expresses the optimal policy via a Boltzmann distribution is applied.
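For a discrete action set, the Boltzmann-distributed optimal policy of soft Q-learning reduces to a softmax over Q-values with temperature α. The sketch below illustrates that relationship; the continuous-action case treated in the paper instead samples from an energy-based policy, which this toy example does not cover.

```python
import numpy as np

def soft_value(q_values, alpha=0.2):
    """Soft state value V(s) = alpha * log sum_a exp(Q(s, a) / alpha)."""
    q = np.asarray(q_values, dtype=float)
    return alpha * np.log(np.sum(np.exp(q / alpha)))

def boltzmann_policy(q_values, alpha=0.2):
    """Optimal max-entropy policy pi(a|s) = exp((Q(s, a) - V(s)) / alpha),
    i.e. a softmax over Q-values with temperature alpha."""
    q = np.asarray(q_values, dtype=float)
    return np.exp((q - soft_value(q, alpha)) / alpha)

q = [1.0, 0.8, -0.5]
print(boltzmann_policy(q, alpha=0.2))   # probabilities sum to 1; higher Q gets more mass
```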
Addressing Function Approximation Error in Actor-Critic Methods
TLDR
This paper builds on Double Q-learning by taking the minimum value between a pair of critics to limit overestimation, and draws the connection between target networks and overestimation bias.
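The clipped double-Q idea from this paper (TD3), which later SAC variants also adopt, forms the bootstrap target from the minimum of two critic estimates. A minimal sketch, assuming the next-state Q-values are already computed and ignoring target-network and actor details:

```python
import numpy as np

def clipped_double_q_target(reward, next_q1, next_q2, done, gamma=0.99):
    """TD target r + gamma * (1 - done) * min(Q1', Q2'). Taking the minimum
    of two critics limits the overestimation bias that a single critic
    accumulates through the maximizing bootstrap."""
    next_q = np.minimum(next_q1, next_q2)
    return reward + gamma * (1.0 - done) * next_q

# Toy transition where the two critics disagree about the next state-action value.
print(clipped_double_q_target(reward=1.0, next_q1=5.2, next_q2=4.1, done=0.0))
```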
Benchmarking Deep Reinforcement Learning for Continuous Control
TLDR
This work presents a benchmark suite of continuous control tasks, including classic tasks like cart-pole swing-up, tasks with very high state and action dimensionality such as 3D humanoid locomotion, tasks with partial observations, and tasks with hierarchical structure.
Taming the Noise in Reinforcement Learning via Soft Updates
TLDR
G-learning is proposed, a new off-policy learning algorithm that regularizes the noise in the space of optimal actions by penalizing deterministic policies early in learning, which enables naturally incorporating prior distributions over optimal actions when available.