• Corpus ID: 233423379

Autoregressive Dynamics Models for Offline Policy Evaluation and Optimization

Michael R. Zhang, Tom Le Paine, Ofir Nachum, Cosmin Paduraru, G. Tucker, Ziyun Wang, Mohammad Norouzi
Standard dynamics models for continuous control make use of feedforward computation to predict the conditional distribution of next state and reward given current state and action using a multivariate Gaussian with a diagonal covariance structure. This modeling choice assumes that different dimensions of the next state and reward are conditionally independent given the current state and action and may be driven by the fact that fully observable physics-based simulation environments entail… 
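
The contrast drawn in the abstract can be written out as a short sketch. The notation is assumed for illustration and is not taken from the paper: x = (s', r) stacks the next state and reward into a D-dimensional vector, and μ_i, σ_i stand for the outputs of a hypothetical feedforward network. The first expression is the diagonal-Gaussian model, which treats the dimensions of x as conditionally independent given (s, a). The second is the chain-rule factorization that an autoregressive model would parameterize, in which each dimension may depend on the dimensions before it.

```latex
% Assumed notation: x = (s', r), with D dimensions in total.
% Feedforward model with diagonal covariance: dimensions are conditionally independent.
p_{\mathrm{ff}}(x \mid s, a) = \prod_{i=1}^{D} \mathcal{N}\!\left(x_i ;\, \mu_i(s, a),\, \sigma_i^2(s, a)\right)

% Autoregressive model: chain rule over dimensions, no independence assumption.
p_{\mathrm{ar}}(x \mid s, a) = \prod_{i=1}^{D} p\!\left(x_i \mid x_{<i},\, s,\, a\right)
```

The second factorization holds for any joint distribution, so it removes the independence assumption. The first is the special case in which each factor ignores x_{<i} and is Gaussian.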

A Minimax Learning Approach to Off-Policy Evaluation in Partially Observable Markov Decision Processes

This work proposes novel identification methods for OPE in POMDPs with latent confounders by introducing bridge functions that link the target policy’s value and the observed data distribution, together with minimax estimation methods for learning these bridge functions.

A Minimax Learning Approach to Off-Policy Evaluation in Confounded Partially Observable Markov Decision Processes

We consider off-policy evaluation (OPE) in Partially Observable Markov Decision Processes (POMDPs), where the evaluation policy depends only on observable variables and the behavior policy depends on

On Instrumental Variable Regression for Deep Offline Policy Evaluation

This paper analyzes and compares a wide range of recent IV methods in the context of offline policy evaluation (OPE), finding empirically that state-of-the-art OPE methods are closely matched in performance by some IV methods such as AGMM, which were not developed for OPE.

Active Offline Policy Selection

This paper introduces active offline policy selection — a novel sequential decision approach that combines logged data with online interaction to identify the best policy.

Offline Policy Comparison with Confidence: Benchmarks and Baselines

This work creates benchmarks for OPC with Confidence (OPCC), derived by adding sets of policy comparison queries to datasets from offline reinforcement learning, and presents an empirical evaluation of the risk versus coverage trade-off for a class of model-based baselines.

Reliable Offline Model-based Optimization for Industrial Process Control

This work proposes a dynamics model based on an ensemble of conditional generative adversarial networks to achieve accurate reward calculation in industrial scenarios, and an epistemic-uncertainty-penalized reward evaluation function that can effectively avoid giving over-estimated rewards to out-of-distribution inputs during the learning/searching of the optimal control policy.

Empirical Study of Off-Policy Policy Evaluation for Reinforcement Learning

This work presents the first comprehensive empirical analysis of a broad suite of OPE methods, offers a summarized set of guidelines for effectively using OPE in practice, and suggests directions for future research.

Supervised Off-Policy Ranking

This work defines a new problem, supervised off-policy ranking (SOPR), which aims to rank a set of new/target policies based on supervised learning by leveraging off-policy data and policies with known performance, and proposes a method for supervised off-policy ranking that learns a policy scoring model by correctly ranking training policies with known performance.

User-Interactive Offline Reinforcement Learning

An algorithm is proposed that allows the user to tune this hyperparameter at runtime, thereby overcoming both of the above-mentioned issues simultaneously. Users can start with the original behavior and grant successively greater deviation, and can stop at any time when the policy deteriorates or the behavior is too far from the familiar one.



Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models

This paper proposes a new algorithm called probabilistic ensembles with trajectory sampling (PETS) that combines uncertainty-aware deep network dynamics models with sampling-based uncertainty propagation, which matches the asymptotic performance of model-free algorithms on several challenging benchmark tasks, while requiring significantly fewer samples.

Keep Doing What Worked: Behavioral Modelling Priors for Offline Reinforcement Learning

This paper admits the use of data generated by arbitrary behavior policies and uses a learned prior -- the advantage-weighted behavior model (ABM) -- to bias the RL policy towards actions that have previously been executed and are likely to be successful on the new task.

DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections

This work proposes an algorithm, DualDICE, that is agnostic to knowledge of the behavior policy (or policies) used to generate the dataset and improves accuracy compared to existing techniques.

Objective Mismatch in Model-based Reinforcement Learning

It is demonstrated that the likelihood of one-step ahead predictions is not always correlated with control performance, a critical limitation in the standard MBRL framework which will require further research to be fully understood and addressed.

Off-policy Model-based Learning under Unknown Factored Dynamics

The G-SCOPE algorithm is introduced that evaluates a new policy based on data generated by the existing policy and is both computationally and sample efficient because it greedily learns to exploit factored structure in the dynamics of the environment.

Bootstrapping with Models: Confidence Intervals for Off-Policy Evaluation

Two bootstrapping off-policy evaluation methods which use learned MDP transition models in order to estimate lower confidence bounds on policy performance with limited data are proposed.

Model-Ensemble Trust-Region Policy Optimization

This paper analyzes the behavior of vanilla model-based reinforcement learning methods when deep neural networks are used to learn both the model and the policy, and shows that the learned policy tends to exploit regions where insufficient data is available for the model to be learned, causing instability in training.

Doubly Robust Policy Evaluation and Learning

It is proved that the doubly robust approach uniformly improves over existing techniques, achieving both lower variance in value estimation and better policies, and is expected to become common practice.

Eligibility Traces for Off-Policy Policy Evaluation

This paper considers the off-policy version of the policy evaluation problem, for which only one eligibility trace algorithm is known, a Monte Carlo method, and analyzes and compares this and four new eligibility trace algorithms, emphasizing their relationships to the classical statistical technique known as importance sampling.

Batch Policy Learning under Constraints

This work proposes a new and simple method for off-policy policy evaluation (OPE), derives PAC-style bounds, and achieves strong empirical results in different domains, including a challenging problem of simulated car driving subject to multiple constraints such as lane keeping and smooth driving.