Corpus ID: 224704606

Variational Dynamic for Self-Supervised Exploration in Deep Reinforcement Learning

Chenjia Bai, Peng Liu, Zhaoran Wang, Kaiyu Liu, Lingxiao Wang, Yingnan Zhao
Efficient exploration remains a challenging problem in reinforcement learning, especially for tasks where extrinsic rewards from environments are sparse or even totally disregarded. Significant advances based on intrinsic motivation show promising results in simple environments but often get stuck in environments with multimodal and stochastic dynamics. In this work, we propose a variational dynamic model based on the conditional variational inference to model the multimodality and…
Self-Supervised Exploration via Latent Bayesian Surprise
A curiosity-based bonus as intrinsic reward for Reinforcement Learning is proposed, computed as the Bayesian surprise with respect to a latent state variable, learnt by reconstructing fixed random features.
Exploration in Deep Reinforcement Learning: A Comprehensive Survey
  • Tianpei Yang, Hongyao Tang, +4 authors Peng Liu
  • Computer Science
  • ArXiv
  • 2021
A comprehensive survey of existing exploration methods for both single-agent and multi-agent reinforcement learning, aimed at providing understanding of and insight into the critical problems and solutions; it observes a general improvement of uncertainty-oriented exploration methods in almost all the environments.
Principled Exploration via Optimistic Bootstrapping and Backward Induction
OB2I constructs a general-purpose UCB-bonus through non-parametric bootstrap in DRL and propagates future uncertainty in a time-consistent manner through episodic backward update, which exploits the theoretical advantage and empirically improves the sample efficiency.


Self-Supervised Exploration via Disagreement
This paper proposes a formulation for exploration inspired by the work in active learning literature and trains an ensemble of dynamics models and incentivizes the agent to explore such that the disagreement of those ensembles is maximized, which results in a sample-efficient exploration.
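The disagreement signal described above can be sketched as the variance across an ensemble's next-state predictions: where the models agree, the dynamics are well learned and the bonus is small. This is a minimal illustrative sketch, not the paper's implementation; the function name and shapes are assumptions.

```python
import numpy as np

def disagreement_bonus(ensemble_preds):
    """Intrinsic reward as the variance across an ensemble's next-state
    predictions for the same (state, action) pair. High variance marks
    transitions the models have not yet learned, i.e. novel dynamics.

    ensemble_preds: array-like of shape (n_models, state_dim), each row
    one model's predicted next state.
    """
    preds = np.asarray(ensemble_preds, dtype=float)
    # Per-dimension variance across models, averaged over state dimensions.
    return float(preds.var(axis=0).mean())
```

When all models output the same prediction the bonus is zero, so the agent loses interest in transitions once the ensemble has converged on them.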
Planning to Explore via Self-Supervised World Models
Without any training supervision or task-specific interaction, Plan2Explore outperforms prior self-supervised exploration methods and, in fact, almost matches the performance of an oracle that has access to rewards.
Scheduled Intrinsic Drive: A Hierarchical Take on Intrinsically Motivated Exploration
A new type of intrinsic reward denoted as successor feature control (SFC) is introduced, which takes into account statistics over complete trajectories and thus differs from previous methods that only use local information to evaluate intrinsic motivation.
VIME: Variational Information Maximizing Exploration
VIME is introduced, an exploration strategy based on maximization of information gain about the agent's belief of environment dynamics which efficiently handles continuous state and action spaces and can be applied with several different underlying RL algorithms.
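The information gain above is typically measured as the KL divergence between the posterior and prior over the dynamics model's parameters after observing a transition. The sketch below uses the closed-form KL between diagonal Gaussians as a hypothetical simplification of that bonus; the function name and interface are assumptions, not VIME's actual code.

```python
import numpy as np

def info_gain_bonus(mu_post, var_post, mu_prior, var_prior):
    """Intrinsic reward as the KL divergence KL(posterior || prior)
    between two diagonal Gaussian beliefs over dynamics-model weights.
    A large KL means the transition taught the agent a lot about the
    environment dynamics."""
    mu_post = np.asarray(mu_post, dtype=float)
    var_post = np.asarray(var_post, dtype=float)
    mu_prior = np.asarray(mu_prior, dtype=float)
    var_prior = np.asarray(var_prior, dtype=float)
    # Standard per-dimension Gaussian KL, summed over dimensions.
    kl = 0.5 * (np.log(var_prior / var_post)
                + (var_post + (mu_post - mu_prior) ** 2) / var_prior
                - 1.0)
    return float(kl.sum())
```

Transitions that leave the belief unchanged yield zero bonus, so the reward naturally decays as the dynamics model converges.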
Diversity-Driven Exploration Strategy for Deep Reinforcement Learning
By simply adding a distance measure to the loss function, the proposed methodology significantly enhances an agent's exploratory behaviors, thereby preventing the policy from being trapped in local optima; an adaptive scaling method for stabilizing the learning process is also proposed.
Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models
This paper proposes a new algorithm called probabilistic ensembles with trajectory sampling (PETS) that combines uncertainty-aware deep network dynamics models with sampling-based uncertainty propagation, which matches the asymptotic performance of model-free algorithms on several challenging benchmark tasks, while requiring significantly fewer samples.
Curiosity-Driven Exploration by Self-Supervised Prediction
This work formulates curiosity as the error in an agent's ability to predict the consequences of its own actions in a visual feature space learned by a self-supervised inverse dynamics model, which scales to high-dimensional continuous state spaces like images, bypasses the difficulties of directly predicting pixels, and ignores the aspects of the environment that cannot affect the agent.
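The prediction-error curiosity described above can be sketched as the squared distance between the encoded next state and the forward model's prediction of it, scaled by a coefficient. This is an illustrative sketch under assumed names (`phi_next` for the encoded next state, `phi_pred` for the forward model's output, `eta` for the scale), not the paper's code.

```python
import numpy as np

def curiosity_reward(phi_next, phi_pred, eta=0.5):
    """Curiosity bonus as the scaled squared error between the learned
    feature encoding of the observed next state and the forward model's
    predicted encoding. Transitions the model predicts poorly (novel
    dynamics) receive a large bonus."""
    diff = np.asarray(phi_next, dtype=float) - np.asarray(phi_pred, dtype=float)
    return eta * float(np.dot(diff, diff))
```

Because the error is computed in the learned feature space rather than pixel space, unpredictable but agent-irrelevant noise (e.g. on-screen static) contributes little to the bonus.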
EMI: Exploration with Mutual Information
This work proposes EMI, an exploration method that constructs embedding representations of states and actions; it does not rely on generative decoding of the full observation but extracts predictive signals that can be used to guide exploration based on forward prediction in the representation space.
MULEX: Disentangling Exploitation from Exploration in Deep RL
This work adopts a disruptive but simple and generic perspective, explicitly disentangling exploration from exploitation and using off-policy methods to optimize each loss.
Never Give Up: Learning Directed Exploration Strategies
This work constructs an episodic memory-based intrinsic reward using k-nearest neighbors over the agent's recent experience to train the directed exploratory policies, thereby encouraging the agent to repeatedly revisit all states in its environment.
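The episodic k-nearest-neighbor bonus above can be sketched as follows: states similar to many embeddings already in the episode's memory receive a small bonus, while states far from everything seen this episode receive a large one. This is a hypothetical simplified form of the Never Give Up episodic reward; the function name, the kernel, and the constants are assumptions.

```python
import numpy as np

def episodic_bonus(embedding, memory, k=3, eps=1e-3):
    """Episodic novelty bonus: inverse of the summed kernel similarities
    to the k nearest embeddings stored so far in this episode.

    embedding: feature vector for the current state.
    memory: list of feature vectors visited earlier in the episode.
    """
    if not memory:
        return 1.0  # nothing seen yet this episode: maximally novel
    mem = np.asarray(memory, dtype=float)
    d2 = ((mem - np.asarray(embedding, dtype=float)) ** 2).sum(axis=1)
    nearest = np.sort(d2)[:k]
    # Kernel is ~1 for near-duplicate states and ~0 for distant ones,
    # so revisits accumulate similarity and shrink the bonus.
    kernel = eps / (nearest + eps)
    return float(1.0 / np.sqrt(kernel.sum() + eps))
```

Because the memory is cleared at episode boundaries, the bonus encourages revisiting states across episodes while still rewarding within-episode novelty.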