Corpus ID: 233423379

Autoregressive Dynamics Models for Offline Policy Evaluation and Optimization

@article{Zhang2021AutoregressiveDM,
  title={Autoregressive Dynamics Models for Offline Policy Evaluation and Optimization},
  author={Michael R. Zhang and Tom Le Paine and Ofir Nachum and Cosmin Paduraru and G. Tucker and Ziyun Wang and Mohammad Norouzi},
  journal={ArXiv},
  year={2021},
  volume={abs/2104.13877}
}
Standard dynamics models for continuous control make use of feedforward computation to predict the conditional distribution of next state and reward given current state and action using a multivariate Gaussian with a diagonal covariance structure. This modeling choice assumes that different dimensions of the next state and reward are conditionally independent given the current state and action and may be driven by the fact that fully observable physics-based simulation environments entail… 
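As a worked sketch of the contrast the abstract draws, with the notation, the ordering of state dimensions, and the placement of the reward term assumed for illustration (the abstract above is truncated): a diagonal-Gaussian model factorizes the next state s_{t+1} ∈ R^d and reward r_t as

\[ p_\theta(s_{t+1}, r_t \mid s_t, a_t) = \mathcal{N}\big(r_t;\, \mu_r(s_t, a_t),\, \sigma_r^2(s_t, a_t)\big) \prod_{i=1}^{d} \mathcal{N}\big(s_{t+1}^{(i)};\, \mu_i(s_t, a_t),\, \sigma_i^2(s_t, a_t)\big), \]

so all coordinates of (s_{t+1}, r_t) are conditionally independent given (s_t, a_t). An autoregressive dynamics model instead applies the chain rule across dimensions,

\[ p_\theta(s_{t+1}, r_t \mid s_t, a_t) = \Big[ \prod_{i=1}^{d} p_\theta\big(s_{t+1}^{(i)} \mid s_t, a_t, s_{t+1}^{(<i)}\big) \Big]\; p_\theta\big(r_t \mid s_t, a_t, s_{t+1}\big), \]

which removes the conditional-independence assumption at the cost of predicting (and sampling) one dimension at a time.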
Reinforcement Learning as One Big Sequence Modeling Problem
TLDR
This work explores how RL can be reframed as “one big sequence modeling” problem, using state-of-the-art Transformer architectures to model distributions over sequences of states, actions, and rewards.
A Minimax Learning Approach to Off-Policy Evaluation in Partially Observable Markov Decision Processes
TLDR
This work proposes novel identification methods for OPE in POMDPs with latent confounders, by introducing bridge functions that link the target policy’s value and the observed data distribution, and proposes minimax estimation methods for learning these bridge functions.
A Minimax Learning Approach to Off-Policy Evaluation in Confounded Partially Observable Markov Decision Processes
We consider off-policy evaluation (OPE) in Partially Observable Markov Decision Processes (POMDPs), where the evaluation policy depends only on observable variables and the behavior policy depends on…
On Instrumental Variable Regression for Deep Offline Policy Evaluation
TLDR
This paper analyzes and compares a wide range of recent IV methods in the context of offline policy evaluation (OPE), finding empirically that state-of-the-art OPE methods are closely matched in performance by some IV methods such as AGMM, which were not developed for OPE.
Importance of Representation Learning for Off-Policy Fitted Q-Evaluation
TLDR
Divergence does occur with simple feed-forward architectures, but can be mitigated using various architectures and algorithmic techniques, such as ResNet architectures, learning a shared representation between multiple target policies, and hypermodels.
Offline Policy Comparison with Confidence: Benchmarks and Baselines
TLDR
This work creates benchmarks for offline policy comparison with confidence (OPCC), derived by adding sets of policy comparison queries to datasets from offline reinforcement learning, and presents an empirical evaluation of the risk versus coverage trade-off for a class of model-based baselines.
Reliable Offline Model-based Optimization for Industrial Process Control
TLDR
A dynamics model based on an ensemble of conditional generative adversarial networks is proposed to achieve accurate reward calculation in industrial scenarios, together with an epistemic-uncertainty-penalized reward evaluation function that avoids assigning over-estimated rewards to out-of-distribution inputs during the learning/searching of the optimal control policy.
Empirical Study of Off-Policy Policy Evaluation for Reinforcement Learning
TLDR
This work presents the first comprehensive empirical analysis of a broad suite of OPE methods, offers a summarized set of guidelines for effectively using OPE in practice, and suggests directions for future research.
Comparing Model-free and Model-based Algorithms for Offline Reinforcement Learning
TLDR
It is found that on the IB, hybrid approaches face severe difficulties and that simpler algorithms, such as rollout-based algorithms or model-free algorithms with simpler regularizers, perform best on the datasets.
Supervised Off-Policy Ranking
TLDR
This work defines a new problem, supervised off-policy ranking (SOPR), which aims to rank a set of new/target policies via supervised learning by leveraging off-policy data and policies with known performance, and proposes a method for supervised off-policy ranking that learns a policy scoring model by correctly ranking training policies with known performance.
...
...

References

SHOWING 1-10 OF 51 REFERENCES
Keep Doing What Worked: Behavioral Modelling Priors for Offline Reinforcement Learning
TLDR
This paper admits the use of data generated by arbitrary behavior policies and uses a learned prior -- the advantage-weighted behavior model (ABM) -- to bias the RL policy towards actions that have previously been executed and are likely to be successful on the new task.
DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections
TLDR
This work proposes an algorithm, DualDICE, that is agnostic to knowledge of the behavior policy (or policies) used to generate the dataset and improves accuracy compared to existing techniques.
Objective Mismatch in Model-based Reinforcement Learning
TLDR
It is demonstrated that the likelihood of one-step ahead predictions is not always correlated with control performance, a critical limitation in the standard MBRL framework which will require further research to be fully understood and addressed.
Off-policy Model-based Learning under Unknown Factored Dynamics
TLDR
The G-SCOPE algorithm is introduced, which evaluates a new policy based on data generated by the existing policy and is both computationally and sample efficient because it greedily learns to exploit factored structure in the dynamics of the environment.
Bootstrapping with Models: Confidence Intervals for Off-Policy Evaluation
TLDR
Two bootstrapping off-policy evaluation methods are proposed which use learned MDP transition models to estimate lower confidence bounds on policy performance with limited data.
Model-Ensemble Trust-Region Policy Optimization
TLDR
This paper analyzes the behavior of vanilla model-based reinforcement learning methods when deep neural networks are used to learn both the model and the policy, and shows that the learned policy tends to exploit regions where insufficient data is available for the model to be learned, causing instability in training.
Doubly Robust Policy Evaluation and Learning
TLDR
It is proved that the doubly robust approach uniformly improves over existing techniques, achieving both lower variance in value estimation and better policies, and is expected to become common practice.
Eligibility Traces for Off-Policy Policy Evaluation
TLDR
This paper considers the off-policy version of the policy evaluation problem, for which only one eligibility trace algorithm is known, a Monte Carlo method, and analyzes and compares this and four new eligibility trace algorithms, emphasizing their relationships to the classical statistical technique known as importance sampling (a generic sketch of that estimator appears after this reference list).
Batch Policy Learning under Constraints
TLDR
A new and simple method for off-policy policy evaluation (OPE) with PAC-style bounds is proposed; it achieves strong empirical results in different domains, including a challenging problem of simulated car driving subject to multiple constraints such as lane keeping and smooth driving.
Batch Stationary Distribution Estimation
TLDR
A variational power method (VPM) is developed that provides provably consistent estimates under general conditions and is found to yield significantly better estimates across a range of problems, including queueing, stochastic differential equations, post-processing MCMC, and off-policy evaluation.
...
...
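The "Eligibility Traces for Off-Policy Policy Evaluation" entry above appeals to the classical importance sampling estimator. Below is a minimal, generic sketch of trajectory-wise importance sampling for off-policy evaluation, not that paper's eligibility-trace algorithms; the function and argument names are illustrative assumptions.

import numpy as np

def is_policy_value(trajectories, pi_prob, mu_prob, gamma=0.99):
    # Ordinary (trajectory-wise) importance sampling estimate of the target
    # policy's value from trajectories gathered under a behavior policy.
    #   trajectories: list of trajectories, each a list of (state, action, reward)
    #   pi_prob(s, a): probability of action a in state s under the target policy
    #   mu_prob(s, a): probability of action a in state s under the behavior policy
    estimates = []
    for traj in trajectories:
        weight = 1.0   # running product of per-step likelihood ratios pi/mu
        ret = 0.0      # discounted return of this trajectory
        for t, (s, a, r) in enumerate(traj):
            weight *= pi_prob(s, a) / mu_prob(s, a)
            ret += (gamma ** t) * r
        estimates.append(weight * ret)
    return float(np.mean(estimates))

This ordinary estimator is unbiased when the behavior policy's action probabilities are known and nonzero wherever the target policy acts, but its variance grows with the horizon, which is one motivation for the model-based and doubly robust estimators listed above.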