Recurrent networks, hidden states and beliefs in partially observable environments

Gaspard Lambrechts, Adrien Bolland, Damien Ernst
Reinforcement learning aims to learn optimal policies from interaction with environments whose dynamics are unknown. Many methods rely on the approximation of a value function to derive near-optimal policies. In partially observable environments, these functions depend on the complete sequence of observations and past actions, called the history. In this work, we show empirically that recurrent neural networks trained to approximate such value functions internally filter the posterior…
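
For readers unfamiliar with filtering, the sketch below shows one step of the standard Bayesian belief update in a discrete partially observable environment: the posterior over hidden states is pushed through the transition model and then reweighted by the likelihood of the new observation. It is an illustration only, not the method of the paper; the function name `belief_update`, the nested-list layout of `T` and `O`, and the toy numbers are all assumptions made for this example.

```python
def belief_update(belief, action, observation, T, O):
    """One step of Bayesian filtering over hidden states.

    Assumed (hypothetical) layout:
      T[a][s][s2] = P(s2 | s, a)   transition model
      O[a][s2][o] = P(o | s2, a)   observation model
    """
    n = len(belief)
    # Predict: push the current belief through the transition model.
    predicted = [
        sum(belief[s] * T[action][s][s2] for s in range(n)) for s2 in range(n)
    ]
    # Correct: weight each predicted state by the observation likelihood.
    unnormalized = [predicted[s2] * O[action][s2][observation] for s2 in range(n)]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief")
    # Normalize so the new belief sums to one.
    return [u / total for u in unnormalized]


# Toy example: two hidden states, one action (index 0), two observations.
T = [[[0.9, 0.1], [0.2, 0.8]]]
O = [[[0.8, 0.2], [0.3, 0.7]]]
b0 = [0.5, 0.5]
b1 = belief_update(b0, action=0, observation=1, T=T, O=O)
```

Applying `belief_update` once per (action, observation) pair of a history yields the belief for that history, which is the kind of quantity a recurrent hidden state could in principle encode.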

PhysQ: A Physics Informed Reinforcement Learning Framework for Building Control

Large-scale integration of intermittent renewable energy sources calls for substantial demand side flexibility. Given that the built environment accounts for approximately 40% of total energy



Deep Variational Reinforcement Learning for POMDPs

Deep variational reinforcement learning (DVRL) is proposed, which introduces an inductive bias that allows an agent to learn a generative model of the environment and perform inference in that model to effectively aggregate the available information.

Memory-based control with recurrent neural networks

This work extends two related model-free algorithms for continuous control to partially observed domains, using recurrent neural networks trained with backpropagation through time. It finds that recurrent deterministic and stochastic policies learn similarly good solutions to these tasks, including the water maze, where the agent must learn effective search strategies.

Deep Recurrent Q-Learning for Partially Observable MDPs

The effects of adding recurrency to a Deep Q-Network are investigated by replacing the first post-convolutional fully-connected layer with a recurrent LSTM, which successfully integrates information through time and replicates DQN's performance on standard Atari games and partially observed equivalents featuring flickering game screens.

On Improving Deep Reinforcement Learning for POMDPs

This work proposes a new architecture called Action-specific Deep Recurrent Q-Network (ADRQN) to enhance learning performance in partially observable domains and demonstrates the effectiveness of the new architecture in several partially observable domains, including flickering Atari games.

Human-level control through deep reinforcement learning

This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks.

Continuous control with deep reinforcement learning

This work presents an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces, and demonstrates that for many of the tasks the algorithm can learn policies end-to-end: directly from raw pixel inputs.

Meta-trained agents implement Bayes-optimal agents

It is shown that meta-learned and Bayes-optimal agents not only behave alike, but they even share a similar computational structure, in the sense that one agent system can approximately simulate the other.

The Optimal Control of Partially Observable Markov Processes over a Finite Horizon

If there are only a finite number of control intervals remaining, then the optimal payoff function is a piecewise-linear, convex function of the current state probabilities of the internal Markov process, and an algorithm for utilizing this property to calculate the optimal control policy and payoff function for any finite horizon is outlined.
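
A piecewise-linear, convex function of the state probabilities can be written as a maximum over finitely many linear functions of the belief. The sketch below evaluates such a function; the linear pieces are commonly called alpha vectors in the POMDP literature, a term not used in the summary above. The function name `pwlc_value` and the three example vectors are assumptions chosen for illustration, not values from the paper.

```python
def pwlc_value(belief, alpha_vectors):
    """Evaluate V(b) = max_k <alpha_k, b> and return (value, argmax index)."""
    # Each alpha vector defines one linear piece of the value function.
    values = [
        sum(a_s * b_s for a_s, b_s in zip(alpha, belief))
        for alpha in alpha_vectors
    ]
    best = max(range(len(values)), key=values.__getitem__)
    return values[best], best


# Hypothetical linear pieces over a two-state belief simplex.
alphas = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.6]]

v_mid, k_mid = pwlc_value([0.5, 0.5], alphas)    # uncertain belief
v_left, k_left = pwlc_value([0.9, 0.1], alphas)  # confident belief
v_s0, _ = pwlc_value([1.0, 0.0], alphas)
v_s1, _ = pwlc_value([0.0, 1.0], alphas)
```

In this toy setup the third vector wins at the uncertain belief and the first wins at the confident one, and the value at the midpoint does not exceed the average of the values at the two corners, as convexity requires.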

Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor

This paper proposes soft actor-critic, an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework, and achieves state-of-the-art performance on a range of continuous control benchmark tasks, outperforming prior on-policy and off-policy methods.

QMDP-Net: Deep Learning for Planning under Partial Observability

While QMDP-net encodes the QMDP algorithm, it sometimes outperforms the QMDP algorithm in the experiments, as a result of end-to-end learning.