• Corpus ID: 224819491

Iterative Amortized Policy Optimization

@article{Marino2020IterativeAP,
  title={Iterative Amortized Policy Optimization},
  author={Joseph Marino and Alexandre Pich{\'e} and Alessandro Davide Ialongo and Yisong Yue},
  journal={ArXiv},
  year={2020},
  volume={abs/2010.10670}
}
Policy networks are a central feature of deep reinforcement learning (RL) algorithms for continuous control, enabling the estimation and sampling of high-value actions. From the variational inference perspective on RL, policy networks, when employed with entropy or KL regularization, are a form of amortized optimization, optimizing network parameters rather than the policy distributions directly. However, this direct amortized mapping can empirically yield suboptimal policy estimates. Given… 
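To make the distinction concrete, the sketch below contrasts direct amortization, where a network emits the policy parameters in one shot, with iterative refinement of those parameters against the estimated objective. It is only an illustration under assumed pieces, not the paper's code: the stand-in critic soft_q, the finite-difference gradient, and all names are hypothetical.

    import numpy as np

    def soft_q(state, action):
        # Stand-in critic: a smooth, made-up Q(s, a), used only for illustration.
        return -np.sum((action - np.tanh(state)) ** 2)

    def objective(state, mean, log_std, n_samples=64, alpha=0.1):
        # Entropy-regularized objective E_q[Q(s, a)] + alpha * H[q] for a
        # reparameterized Gaussian q(a|s) = N(mean, exp(log_std)^2).
        rng = np.random.default_rng(0)  # common random numbers across evaluations
        eps = rng.standard_normal((n_samples, mean.size))
        actions = mean + np.exp(log_std) * eps
        entropy = np.sum(log_std + 0.5 * np.log(2 * np.pi * np.e))
        return np.mean([soft_q(state, a) for a in actions]) + alpha * entropy

    def refine(state, mean, log_std, lr=0.05, eps=1e-3):
        # One iterative-refinement step: move the policy mean uphill on the
        # estimated objective (finite differences keep the sketch dependency-free).
        grad = np.zeros_like(mean)
        for i in range(mean.size):
            d = np.zeros_like(mean)
            d[i] = eps
            grad[i] = (objective(state, mean + d, log_std)
                       - objective(state, mean - d, log_std)) / (2 * eps)
        return mean + lr * grad, log_std

    state = np.array([0.3, -0.7])
    mean, log_std = np.zeros(2), np.zeros(2)  # direct amortization would emit these in one shot
    for _ in range(20):                       # iterative amortization refines them instead
        mean, log_std = refine(state, mean, log_std)
    print(mean)  # moves toward tanh(state), the maximizer of the stand-in critic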
Improving Actor-Critic Reinforcement Learning via Hamiltonian Monte Carlo Method
  • Duo Xu, F. Fekri
  • Computer Science, Mathematics
  • 2021
TLDR
This work proposes to integrate the policy network of actor-critic RL with HMC, termed the Hamiltonian Policy, and finds that the proposed method can not only improve the achieved return but also reduce safety constraint violations by discarding potentially unsafe actions.
Improving Actor-Critic Reinforcement Learning via Hamiltonian Policy
TLDR
This work, inspired by the previous use of Hamiltonian Monte Carlo (HMC) in VI, proposes to integrate policy optimization with HMC and proposes a new leapfrog operator to simulate the Hamiltonian dynamics.
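For reference, the standard leapfrog integrator that underlies HMC is sketched below; the cited paper's learned leapfrog operator is not reproduced here, and the target density is a placeholder.

    import numpy as np

    def leapfrog(q, p, grad_log_prob, step_size, n_steps):
        # Standard leapfrog integration of Hamiltonian dynamics with
        # H(q, p) = -log_prob(q) + 0.5 * p @ p (unit mass matrix).
        p = p + 0.5 * step_size * grad_log_prob(q)   # half step on momentum
        for _ in range(n_steps - 1):
            q = q + step_size * p                    # full step on position
            p = p + step_size * grad_log_prob(q)     # full step on momentum
        q = q + step_size * p
        p = p + 0.5 * step_size * grad_log_prob(q)   # closing half step
        return q, p

    # Illustrative target: a standard normal, so grad log p(q) = -q.
    q_new, p_new = leapfrog(np.array([2.0]), np.array([0.5]),
                            lambda x: -x, step_size=0.1, n_steps=10)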
On the model-based stochastic value gradient for continuous reinforcement learning
TLDR
This paper surpasses the asymptotic performance of other model-based methods on the proprioceptive MuJoCo locomotion tasks from the OpenAI gym, including a humanoid, and achieves these results with a simple deterministic world model without requiring an ensemble.
Predictive Coding, Variational Autoencoders, and Biological Connections
TLDR
A review of predictive coding, from theoretical neuroscience, and variational autoencoders, from machine learning, identifying the common origin and mathematical framework underlying both areas and discussing two possible correspondences implied by this perspective.
Scalable Online Planning via Reinforcement Learning Fine-Tuning
TLDR
This work replaces tabular search with online model-based fine-tuning of a policy neural network via reinforcement learning, and shows that this approach outperforms state-of-the-art search algorithms in benchmark settings.
Stochastic Iterative Graph Matching
TLDR
This work devises several techniques to improve the learning of GNNs and obtains a new model, Stochastic Iterative Graph MAtching (SIGMA), which predicts a distribution of matchings, instead of a single matching, for a graph pair so that the model can explore several probable matchings.

References

SHOWING 1-10 OF 94 REFERENCES
Inference Suboptimality in Variational Autoencoders
TLDR
It is found that divergence from the true posterior is often due to imperfect recognition networks, rather than the limited complexity of the approximating distribution, and the parameters used to increase the expressiveness of the approximation play a role in generalizing inference.
Deep Reinforcement Learning with Double Q-Learning
TLDR
This paper proposes a specific adaptation to the DQN algorithm and shows that the resulting algorithm not only reduces the observed overestimations, as hypothesized, but that this also leads to much better performance on several games.
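The adaptation in question is the Double DQN target, which selects the next action with the online network but evaluates it with the target network. A minimal sketch with hypothetical array-valued inputs:

    import numpy as np

    def double_dqn_target(reward, done, q_online_next, q_target_next, gamma=0.99):
        # q_online_next, q_target_next: [batch, n_actions] estimates of Q(s', .)
        # under the online and target networks, respectively.
        best = np.argmax(q_online_next, axis=1)             # select with the online net
        q_next = q_target_next[np.arange(len(best)), best]  # evaluate with the target net
        return reward + gamma * (1.0 - done) * q_next

    # Vanilla DQN would instead use np.max(q_target_next, axis=1), letting one
    # network both select and evaluate the action, which inflates the target.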
Adam: A Method for Stochastic Optimization
TLDR
This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
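For convenience, the Adam update rule restated as a small function (hyperparameter defaults follow the paper; the code itself is illustrative):

    import numpy as np

    def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        # Exponential moving averages of the gradient and its elementwise square.
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad ** 2
        # Bias correction for the zero-initialized moment estimates.
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        return param - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

    # One step on f(x) = x^2 (gradient 2x), starting from x = 1.
    x, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
    x, m, v = adam_step(x, 2 * x, m, v, t=1)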
MuJoCo: A physics engine for model-based control
TLDR
A new physics engine tailored to model-based control, based on the modern velocity-stepping approach that avoids the difficulties with spring-dampers; the engine can compute both forward and inverse dynamics.
The Cross-Entropy Method: A Unified Approach to Combinatorial Optimization, Monte-Carlo Simulation, and Machine Learning
  • L. Deng
  • Mathematics, Computer Science
  • Technometrics
  • 2006
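As a reminder of the method itself, a minimal cross-entropy method loop for maximizing a black-box objective (function and parameter names here are illustrative):

    import numpy as np

    def cross_entropy_method(objective, dim, n_iters=50, pop=100, elite_frac=0.1, seed=0):
        # Iteratively fit a Gaussian sampling distribution to the elite fraction
        # of each population, concentrating probability mass on high-value solutions.
        rng = np.random.default_rng(seed)
        mean, std = np.zeros(dim), np.ones(dim)
        n_elite = max(1, int(pop * elite_frac))
        for _ in range(n_iters):
            samples = mean + std * rng.standard_normal((pop, dim))
            scores = np.array([objective(s) for s in samples])
            elites = samples[np.argsort(scores)[-n_elite:]]
            mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
        return mean

    # Illustrative use: maximize -||x - 3||^2, whose optimum is x = [3, 3].
    print(cross_entropy_method(lambda x: -np.sum((x - 3.0) ** 2), dim=2))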
Addressing Function Approximation Error in Actor-Critic Methods
TLDR
This paper builds on Double Q-learning by taking the minimum value between a pair of critics to limit overestimation, and draws the connection between target networks and overestimation bias.
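The clipped double-Q target described above, sketched with two hypothetical target-critic estimates:

    import numpy as np

    def clipped_double_q_target(reward, done, q1_next, q2_next, gamma=0.99):
        # q1_next, q2_next: the two target critics' estimates of Q(s', a') for the
        # target action; taking the elementwise minimum limits overestimation.
        q_next = np.minimum(q1_next, q2_next)
        return reward + gamma * (1.0 - done) * q_next

    # Example with a batch of two transitions.
    target = clipped_double_q_target(np.array([1.0, 0.0]), np.array([0.0, 1.0]),
                                     np.array([5.0, 2.0]), np.array([4.5, 2.5]))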
Latent Space Policies for Hierarchical Reinforcement Learning
TLDR
This work addresses the problem of learning hierarchical deep neural network policies for reinforcement learning by constraining the mapping from latent variables to actions to be invertible, and shows that this method can solve more complex sparse-reward tasks by learning higher-level policies on top of high-entropy skills optimized for simple low-level objectives.
Maximum a Posteriori Policy Optimisation
TLDR
This work introduces a new algorithm for reinforcement learning called Maximum a Posteriori Policy Optimisation (MPO) based on coordinate ascent on a relative entropy objective and develops two off-policy algorithms that are competitive with the state-of-the-art in deep reinforcement learning.
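A rough sketch of the MPO-style E-step, which reweights sampled actions by exponentiated Q-values before the parametric policy is re-fit by weighted maximum likelihood; the temperature update and the KL-constrained M-step are omitted, and all names are illustrative:

    import numpy as np

    def mpo_e_step_weights(q_values, temperature=1.0):
        # Weights for actions sampled from the current policy, proportional to
        # exp(Q(s, a) / temperature): a non-parametric improved policy.
        logits = q_values / temperature - np.max(q_values / temperature)  # numerical stability
        weights = np.exp(logits)
        return weights / weights.sum()

    # Per-action Q estimates for a handful of sampled actions.
    weights = mpo_e_step_weights(np.array([1.0, 0.5, 2.0, -0.3]))
    # The M-step would re-fit the parametric policy to these weighted samples
    # under a KL trust-region constraint.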
Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review
TLDR
This article will discuss how a generalization of the reinforcement learning or optimal control problem, which is sometimes termed maximum entropy reinforcement learning, is equivalent to exact probabilistic inference in the case of deterministic dynamics, and variational inference in the case of stochastic dynamics.
Soft Actor-Critic Algorithms and Applications
TLDR
Soft Actor-Critic (SAC), the recently introduced off-policy actor-critic algorithm based on the maximum entropy RL framework, achieves state-of-the-art performance, outperforming prior on-policy and off-policy methods in sample efficiency and asymptotic performance.
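The soft (entropy-regularized) Bellman target that SAC regresses its critics toward, sketched with hypothetical inputs:

    import numpy as np

    def soft_q_target(reward, done, q1_next, q2_next, log_prob_next, alpha=0.2, gamma=0.99):
        # Soft Bellman backup: the entropy bonus -alpha * log pi(a'|s') is folded
        # into the next-state value, and the minimum of two critics limits overestimation.
        v_next = np.minimum(q1_next, q2_next) - alpha * log_prob_next
        return reward + gamma * (1.0 - done) * v_next

    # Example with a single transition.
    target = soft_q_target(reward=1.0, done=0.0, q1_next=3.0, q2_next=2.8, log_prob_next=-1.2)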