Latent State Marginalization as a Low-cost Approach for Improving Exploration

Dinghuai Zhang, Aaron C. Courville, Yoshua Bengio, Qinqing Zheng, Amy Zhang, Ricky T. Q. Chen

While the maximum entropy (MaxEnt) reinforcement learning (RL) framework, often touted for its exploration and robustness capabilities, is usually motivated from a probabilistic perspective, the use of deep probabilistic models has not gained much traction in practice due to their inherent complexity. In this work, we propose the adoption of latent variable policies within the MaxEnt framework, which we show can provably approximate any policy distribution, and additionally, naturally emerges…

Convex Potential Flows: Universal Probability Distributions with Optimal Transport and Convex Optimization

This paper introduces Convex Potential Flows (CP-Flow), a natural and efficient parameterization of invertible models inspired by optimal transport (OT) theory, and proves that CP-Flows are universal density approximators and are optimal in the OT sense.



Flow-based Recurrent Belief State Learning for POMDPs

This paper introduces the FlOw-based Recurrent BElief State model (FORBES), which incorporates normalizing flows into variational inference to learn general continuous belief states for POMDPs, and shows that the learned belief states can be plugged into downstream RL algorithms to improve performance.

Deep Variational Reinforcement Learning for POMDPs

Deep variational reinforcement learning (DVRL) is proposed, which introduces an inductive bias that allows an agent to learn a generative model of the environment and perform inference in that model to effectively aggregate the available information.

Latent Space Policies for Hierarchical Reinforcement Learning

This work addresses the problem of learning hierarchical deep neural network policies for reinforcement learning by constraining the mapping from latent variables to actions to be invertible, and shows that this method can solve more complex sparse-reward tasks by learning higher-level policies on top of high-entropy skills optimized for simple low-level objectives.

Importance Weighted Autoencoders

The importance weighted autoencoder (IWAE) is a generative model with the same architecture as the VAE, but it uses a strictly tighter log-likelihood lower bound derived from importance weighting. Empirically, IWAEs learn richer latent space representations than VAEs, leading to improved test log-likelihood on density estimation benchmarks.

Munchausen Reinforcement Learning

It is shown that slightly modifying Deep Q-Network (DQN) by adding the scaled log-policy to the immediate reward provides an agent that is competitive with distributional methods on Atari games, without making use of distributional RL, n-step returns, or prioritized replay.

Temporal Predictive Coding For Model-Based Planning In Latent Space

This work presents an information-theoretic approach that employs temporal predictive coding to encode elements in the environment that can be predicted across time. The approach is superior to existing methods in the challenging complex-background setting while remaining competitive with current state-of-the-art models in the standard setting.

Provably Efficient Maximum Entropy Exploration

This work studies a broad class of objectives that are defined solely as functions of the state-visitation frequencies that are induced by how the agent behaves, and provides an efficient algorithm to optimize such intrinsically defined objectives, when given access to a black box planning oracle.

Soft Actor-Critic Algorithms and Applications

Soft Actor-Critic (SAC), the recently introduced off-policy actor-critic algorithm based on the maximum entropy RL framework, achieves state-of-the-art performance, outperforming prior on-policy and off-policy methods in sample efficiency and asymptotic performance.

A unified view of entropy-regularized Markov decision processes

A general framework for entropy-regularized average-reward reinforcement learning in Markov decision processes (MDPs) is proposed, showing that using the conditional entropy of the joint state-action distributions as regularization yields a dual optimization problem closely resembling the Bellman optimality equations.

Reinforcement Learning with Deep Energy-Based Policies

This paper proposes a method for learning expressive energy-based policies for continuous states and actions, which had previously been feasible only in tabular domains, and applies a new algorithm, called soft Q-learning, that expresses the optimal policy via a Boltzmann distribution.