Corpus ID: 229923742

Robust Asymmetric Learning in POMDPs

Andrew Warrington, Jonathan Wilder Lavington, A. Scibior, Mark W. Schmidt, Frank D. Wood
Policies for partially observed Markov decision processes can be efficiently learned by imitating expert policies learned using asymmetric information. Unfortunately, existing approaches for this kind of imitation learning have a serious flaw: the expert does not know what the trainee cannot see, and may therefore encourage actions that are sub-optimal or unsafe under partial information. To address this flaw, we derive an update that, when applied iteratively to an expert, maximizes the… 
Unbiased Asymmetric Actor-Critic for Partially Observable Reinforcement Learning
This work proposes an unbiased asymmetric actor-critic variant that exploits state information while remaining theoretically sound, preserving the validity of the policy gradient theorem and introducing no bias and relatively low variance into the training process.
GridToPix: Training Embodied Agents with Minimal Supervision
GRIDTOPIX is proposed to 1) train agents with terminal rewards in gridworlds that generically mirror Embodied AI environments, i.e., are independent of the task, and 2) distill the learned policy into agents that reside in complex visual worlds.
Hindsight Learning for MDPs with Exogenous Inputs
This work proposes an alternative approach based on hindsight learning that sidesteps modeling the exogenous process and learns better policies than domain-specific heuristics and Sim2Real RL baselines, and develops an algorithm to allocate compute resources for real-world Microsoft Azure workloads.
Research and Challenges of Reinforcement Learning in Cyber Defense Decision-Making for Intranet Security
This work proposes a framework that defines four modules based on the threat life cycle: pentest, design, response, and recovery; it provides a systematic view for understanding and solving decision-making problems in the application of reinforcement learning to cyber defense.
Planning as Inference in Epidemiological Dynamics Models
This work demonstrates the use of a probabilistic programming language that automates inference in existing simulators and shows how such simulation-based models and inference automation tools applied in support of policy-making could lead to less economically damaging policy prescriptions, particularly during the current COVID-19 pandemic.
Seeing Differently, Acting Similarly: Imitation Learning with Heterogeneous Observations
This work proposes the Importance Weighting with REjection (IWRE) algorithm based on the techniques of importance-weighting, learning with rejection, and active querying to solve the key challenge of occupancy measure matching.
Seeing Differently, Acting Similarly: Heterogeneously Observable Imitation Learning
The Importance Weighting with REjection (IWRE) algorithm, based on importance weighting and learning with rejection, is proposed to solve HOIL problems; results show that IWRE can solve various HOIL tasks, including the challenging task of transforming vision-based demonstrations into random-access-memory (RAM)-based policies in the Atari domain, even with limited visual observations.
Learning Visible Connectivity Dynamics for Cloth Smoothing
This work proposes to learn a particle-based dynamics model from a partial point cloud observation to overcome the challenges of partial observability, and shows that the method greatly outperforms previous state-of-the-art model-based and model-free reinforcement learning methods in simulation.
Bridging the Imitation Gap by Adaptive Insubordination
This work proposes 'Adaptive Insubordination' (ADVISOR), which dynamically reweights imitation and reward-based reinforcement learning losses during training, enabling switching between imitation and exploration.
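The reweighting idea lends itself to a compact sketch. The weight below is my simplification (an exponential of an auxiliary imitation policy's cross-entropy against the expert), not the exact ADVISOR weight from the paper:

```python
import numpy as np

def advisor_style_loss(imitation_loss, rl_loss, aux_ce, temperature=1.0):
    """Adaptive weighting between imitation and RL losses (a sketch).

    aux_ce: per-state cross-entropy of an auxiliary imitation-only policy
    against the expert. Low aux_ce means the expert's advice is followable
    under partial observation, so the imitation loss dominates; high aux_ce
    means an imitation gap, so the loss falls back to reinforcement learning.
    """
    w = np.exp(-temperature * aux_ce)  # weight in (0, 1]
    return w * imitation_loss + (1.0 - w) * rl_loss
```

When `aux_ce` is near zero the combined loss is essentially the imitation loss; as the auxiliary policy's disagreement with the expert grows, the RL term takes over.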

A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning
This paper proposes a new iterative algorithm that trains a stationary deterministic policy and can be viewed as a no-regret algorithm in an online learning setting, and demonstrates that this new approach outperforms previous approaches on two challenging imitation learning problems and a benchmark sequence labeling problem.
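The iterative algorithm described here is DAgger. A minimal sketch of its aggregate-and-retrain loop, with hypothetical helper signatures (`expert_action`, `train_classifier`, `env_reset`, `env_step`); the paper additionally mixes expert and learner actions with a decaying coefficient, omitted here:

```python
def dagger(expert_action, train_classifier, env_reset, env_step,
           horizon, iterations):
    """Sketch of DAgger's dataset-aggregation loop."""
    dataset = []
    policy = expert_action  # the first rollout can simply follow the expert
    for _ in range(iterations):
        state = env_reset()
        for _ in range(horizon):
            # The expert relabels every state the current policy visits...
            dataset.append((state, expert_action(state)))
            # ...but the rollout follows the learner's own actions,
            # so training data covers the states the learner induces.
            state = env_step(state, policy(state))
        # Retrain via supervised learning on the aggregate of all data.
        policy = train_classifier(dataset)
    return policy
```

The key point is that the dataset is never discarded between iterations, which is what yields the no-regret guarantee.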
Deep Variational Reinforcement Learning for POMDPs
Deep variational reinforcement learning (DVRL) is proposed, which introduces an inductive bias that allows an agent to learn a generative model of the environment and perform inference in that model to effectively aggregate the available information.
Co-training for Policy Learning
This work presents a meta-algorithm for co-training for sequential decision making, and demonstrates the effectiveness of the approach across a wide range of tasks, including discrete/continuous control and combinatorial optimization.
Learning Policies for Partially Observable Environments: Scaling Up
Privileged Information Dropout in Reinforcement Learning
It is demonstrated that Privileged Information Dropout (PID) outperforms alternatives for leveraging privileged information, including distillation and auxiliary tasks, and can successfully utilise different types of privileged information.
Asymmetric multiagent reinforcement learning
A novel method for asymmetric multiagent reinforcement learning based on Markov games is introduced, which addresses the problem where the information states of the agents involved in the learning task are not equal.
Truncated Horizon Policy Search: Combining Reinforcement Learning & Imitation Learning
This paper proposes Truncated HORizon Policy Search (THOR), a method that searches for policies maximizing the total reshaped reward over a finite planning horizon when the oracle is sub-optimal, and experimentally demonstrates that a gradient-based implementation of THOR can outperform both RL and IL baselines.
Asymmetric Actor Critic for Image-Based Robot Learning
This work exploits full state observability in the simulator to train better policies that take as input only partial observations (RGB-D images); it combines this method with domain randomization and shows real-robot experiments for several tasks such as picking, pushing, and moving a block.
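The asymmetric split (full state to the critic, partial observation to the actor) can be sketched with linear function approximators; all names, shapes, and the toy `observe` map below are illustrative, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy asymmetric setup: the actor sees a partial observation, while the
# critic, used only during training in simulation, sees the full state.
state_dim, obs_dim, n_actions = 4, 2, 3
actor_w = rng.normal(size=(obs_dim, n_actions)) * 0.01
critic_w = rng.normal(size=(state_dim,)) * 0.01   # V(s) over the FULL state

def observe(state):
    return state[:obs_dim]          # partial view: first two coordinates

def value(state):
    return state @ critic_w         # critic exploits full-state access

def policy_probs(obs):
    logits = obs @ actor_w          # actor conditions only on the observation
    e = np.exp(logits - logits.max())
    return e / e.sum()

def asymmetric_ac_step(state, action, reward, next_state, lr=0.1, gamma=0.99):
    """One actor-critic update with the asymmetric information split."""
    global actor_w, critic_w
    obs = observe(state)
    # TD error computed by the full-state critic.
    td = reward + gamma * value(next_state) - value(state)
    # Semi-gradient TD update for the critic (input: full state).
    critic_w += lr * td * state
    # Policy-gradient update for the partially observing actor:
    # grad of log softmax w.r.t. actor_w is obs * (1{k == action} - p_k).
    probs = policy_probs(obs)
    grad_logp = -np.outer(obs, probs)
    grad_logp[:, action] += obs
    actor_w += lr * td * grad_logp
    return td
```

At deployment only `observe` and `policy_probs` are needed, which is what lets the trained policy run from images alone.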
Proximal Policy Optimization Algorithms
We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent.
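The clipped form of that surrogate objective, L^CLIP, can be sketched in NumPy (an illustration of the objective, not the authors' implementation):

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO's clipped surrogate objective (to be maximized).

    ratio: pi_theta(a|s) / pi_theta_old(a|s) for the sampled actions
    advantage: advantage estimates for those actions
    eps: clip range (0.2 is the value used in the paper's experiments)
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # Pessimistic bound: take the elementwise minimum, so updates that
    # would move the probability ratio far from 1 gain no extra credit.
    return np.minimum(unclipped, clipped).mean()
```

Taking the minimum makes the bound pessimistic: the objective stops improving once the ratio leaves [1 - eps, 1 + eps] in the direction the advantage favors, which discourages destructively large policy steps.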
Bayesian Nonparametric Methods for Partially-Observable Reinforcement Learning
This article explores learning representations of stochastic systems using Bayesian nonparametric statistics and shows that the Bayesian aspects of the methods achieve state-of-the-art performance in decision making with relatively few samples, while the nonparametric aspects often result in fewer computations.