Autoregressive Dynamics Models for Offline Policy Evaluation and Optimization
@article{Zhang2021AutoregressiveDM,
  title   = {Autoregressive Dynamics Models for Offline Policy Evaluation and Optimization},
  author  = {Michael R. Zhang and Tom Le Paine and Ofir Nachum and Cosmin Paduraru and G. Tucker and Ziyun Wang and Mohammad Norouzi},
  journal = {ArXiv},
  year    = {2021},
  volume  = {abs/2104.13877}
}
Standard dynamics models for continuous control use feedforward computation to predict the conditional distribution of the next state and reward given the current state and action as a multivariate Gaussian with a diagonal covariance structure. This modeling choice assumes that different dimensions of the next state and reward are conditionally independent given the current state and action, and may be driven by the fact that fully observable physics-based simulation environments entail…
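To make the contrast concrete, here is a minimal sketch (in PyTorch, not the authors' code) of the two factorizations: a standard feedforward model that outputs a diagonal Gaussian over all next-state and reward dimensions at once, and an autoregressive model that generates one dimension at a time conditioned on the dimensions already produced. The layer sizes, per-dimension Gaussian heads, and fixed dimension ordering are illustrative assumptions.

```python
import torch
import torch.nn as nn


class DiagonalGaussianDynamics(nn.Module):
    """Standard model: p(s', r | s, a) = N(mu(s, a), diag(sigma(s, a)^2)).
    All output dimensions are conditionally independent given (s, a)."""

    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        out_dim = state_dim + 1  # next-state dimensions + reward
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * out_dim),  # means and log-stds
        )

    def forward(self, s, a):
        mu, log_std = self.net(torch.cat([s, a], dim=-1)).chunk(2, dim=-1)
        return torch.distributions.Normal(mu, log_std.exp())


class AutoregressiveDynamics(nn.Module):
    """Autoregressive model: p(s', r | s, a) = prod_i p(x_i | s, a, x_{<i}),
    with x = (s'_1, ..., s'_d, r) generated one dimension at a time."""

    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.out_dim = state_dim + 1
        # One conditional head per output dimension; head i additionally sees
        # the i dimensions generated before it.
        self.heads = nn.ModuleList([
            nn.Sequential(
                nn.Linear(state_dim + action_dim + i, hidden), nn.ReLU(),
                nn.Linear(hidden, 2),  # mean and log-std of dimension i
            )
            for i in range(self.out_dim)
        ])

    def sample(self, s, a):
        context, outputs = torch.cat([s, a], dim=-1), []
        for head in self.heads:
            mu, log_std = head(context).chunk(2, dim=-1)
            x_i = torch.distributions.Normal(mu, log_std.exp()).sample()
            outputs.append(x_i)
            context = torch.cat([context, x_i], dim=-1)
        x = torch.cat(outputs, dim=-1)
        return x[..., :-1], x[..., -1]  # predicted next state, predicted reward
```

Under the diagonal model, a prediction for one dimension cannot inform the prediction for another within the same step; the autoregressive factorization removes that restriction at the cost of sequential sampling.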
13 Citations
Reinforcement Learning as One Big Sequence Modeling Problem
- Computer Science, NeurIPS
- 2021
This work explores how RL can be reframed as “one big sequence modeling” problem, using state-of-the-art Transformer architectures to model distributions over sequences of states, actions, and rewards.
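For orientation, a hedged sketch of the "one big sequence" view: a trajectory is flattened into a single stream of discretized state, action, and reward tokens, and a standard autoregressive Transformer is trained on that stream. The uniform binning and interleaving order below are illustrative assumptions, not the paper's exact tokenization.

```python
import numpy as np

def flatten_trajectory(states, actions, rewards, n_bins=100, low=-1.0, high=1.0):
    """Discretize every dimension into n_bins tokens and interleave
    s_1, a_1, r_1, s_2, a_2, r_2, ... into one token sequence."""
    def tokenize(x):
        x = np.asarray(x, dtype=float).ravel()
        bins = ((x - low) / (high - low) * n_bins).astype(int)
        return np.clip(bins, 0, n_bins - 1)

    tokens = []
    for s, a, r in zip(states, actions, rewards):
        tokens.extend(tokenize(s))
        tokens.extend(tokenize(a))
        tokens.extend(tokenize(r))
    return np.array(tokens)  # input/target stream for an autoregressive sequence model
```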
A Minimax Learning Approach to Off-Policy Evaluation in Partially Observable Markov Decision Processes
- Mathematics, ArXiv
- 2021
This work introduces bridge functions that link the target policy's value to the observed data distribution, yielding novel identification methods for OPE in POMDPs with latent confounders, and proposes minimax estimation methods for learning these bridge functions.
A Minimax Learning Approach to Off-Policy Evaluation in Confounded Partially Observable Markov Decision Processes
- Mathematics, ICML
- 2022
We consider off-policy evaluation (OPE) in Partially Observable Markov Decision Processes (POMDPs), where the evaluation policy depends only on observable variables and the behavior policy depends on…
On Instrumental Variable Regression for Deep Offline Policy Evaluation
- Economics, ArXiv
- 2021
This paper analyzes and compares a wide range of recent IV methods in the context of offline policy evaluation (OPE), finding empirically that state-of-the-art OPE methods are closely matched in performance by some IV methods such as AGMM, which were not developed for OPE.
Importance of Representation Learning for Off-Policy Fitted Q-Evaluation
- Computer Science
- 2021
Divergence of fitted Q-evaluation does occur with simple feed-forward architectures, but can be mitigated by architectural and algorithmic techniques such as ResNet architectures, learning a shared representation across multiple target policies, and hypermodels.
Offline Policy Comparison with Confidence: Benchmarks and Baselines
- Computer Science
- 2022
This work creates benchmarks for offline policy comparison with confidence (OPCC), derived by adding sets of policy comparison queries to datasets from offline reinforcement learning, and presents an empirical evaluation of the risk-versus-coverage trade-off for a class of model-based baselines.
Reliable Offline Model-based Optimization for Industrial Process Control
- Computer Science, ArXiv
- 2022
This work proposes a dynamics model based on an ensemble of conditional generative adversarial networks for accurate reward calculation in industrial scenarios, together with an epistemic-uncertainty-penalized reward evaluation function that avoids assigning over-estimated rewards to out-of-distribution inputs while learning or searching for the optimal control policy.
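As a rough illustration of the uncertainty-penalized evaluation described in this entry, the sketch below aggregates reward predictions from an ensemble of learned models and subtracts their disagreement, so out-of-distribution inputs are not assigned optimistic rewards. The penalty coefficient and the ensemble interface are assumptions for illustration; the cited work uses an ensemble of conditional GANs.

```python
import numpy as np

def penalized_reward(ensemble, state, action, beta=1.0):
    """ensemble: list of learned models, each mapping (state, action) -> predicted reward."""
    preds = np.array([model(state, action) for model in ensemble])
    # Mean prediction minus an epistemic-uncertainty penalty (ensemble disagreement).
    return preds.mean() - beta * preds.std()
```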
Empirical Study of Off-Policy Policy Evaluation for Reinforcement Learning
- Computer Science, NeurIPS Datasets and Benchmarks
- 2021
This work presents the first comprehensive empirical analysis of a broad suite of OPE methods, offers a summarized set of guidelines for using OPE effectively in practice, and suggests directions for future research.
Comparing Model-free and Model-based Algorithms for Offline Reinforcement Learning
- Computer Science, ArXiv
- 2022
It is found that hybrid approaches face severe difficulties on the IB and that simpler algorithms, such as rollout-based algorithms or model-free algorithms with simpler regularizers, perform best on the datasets.
Supervised Off-Policy Ranking
- Computer Science, ICML
- 2022
This work defines a new problem, supervised off-policy ranking (SOPR), which aims to rank a set of new/target policies via supervised learning by leveraging off-policy data and policies with known performance, and proposes a method that learns a policy scoring model by correctly ranking training policies with known performance.
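A minimal sketch of that supervised ranking idea, assuming a hypothetical policy-feature encoding and scoring network (not the paper's architecture): the scoring model is trained with a pairwise margin ranking loss so that training policies with higher known performance receive higher scores.

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(score_model, feats_a, feats_b, a_is_better, margin=0.1):
    """feats_a, feats_b: tensors encoding two batches of policies;
    a_is_better: +1.0 where the first policy's known performance is higher, else -1.0."""
    s_a = score_model(feats_a).squeeze(-1)
    s_b = score_model(feats_b).squeeze(-1)
    return F.margin_ranking_loss(s_a, s_b, a_is_better, margin=margin)
```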
References
Showing 1-10 of 51 references.
Keep Doing What Worked: Behavioral Modelling Priors for Offline Reinforcement Learning
- Computer Science, ICLR
- 2020
This paper admits the use of data generated by arbitrary behavior policies and uses a learned prior -- the advantage-weighted behavior model (ABM) -- to bias the RL policy towards actions that have previously been executed and are likely to be successful on the new task.
DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections
- Computer Science, NeurIPS
- 2019
This work proposes an algorithm, DualDICE, that is agnostic to knowledge of the behavior policy (or policies) used to generate the dataset and improves accuracy compared to existing techniques.
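Once correction ratios w(s, a) ≈ d^pi(s, a) / d^D(s, a) have been estimated (the step DualDICE performs without knowledge of the behavior policy), the target policy's normalized per-step value reduces to a weighted average of rewards in the offline data. A minimal sketch of that final step, with the estimation of w omitted:

```python
import numpy as np

def corrected_value(w, rewards):
    """w: estimated stationary distribution correction per logged transition;
    rewards: the corresponding observed rewards."""
    w, rewards = np.asarray(w), np.asarray(rewards)
    return float(np.mean(w * rewards))  # normalized per-step value of the target policy
```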
Objective Mismatch in Model-based Reinforcement Learning
- Computer Science, L4DC
- 2020
It is demonstrated that the likelihood of one-step-ahead predictions is not always correlated with control performance, a critical limitation of the standard MBRL framework that will require further research to be fully understood and addressed.
Off-policy Model-based Learning under Unknown Factored Dynamics
- Computer Science, ICML
- 2015
This work introduces the G-SCOPE algorithm, which evaluates a new policy from data generated by an existing policy and is both computationally and sample efficient because it greedily learns to exploit factored structure in the dynamics of the environment.
Bootstrapping with Models: Confidence Intervals for Off-Policy Evaluation
- Computer Science, AAAI
- 2017
This work proposes two bootstrapping off-policy evaluation methods that use learned MDP transition models to estimate lower confidence bounds on policy performance from limited data.
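A hedged sketch of model-based bootstrapping of the kind this reference studies: resample the offline dataset with replacement, fit a dynamics model to each resample, estimate the policy's value by rolling out in each learned model, and read a lower confidence bound off an empirical quantile. `fit_model` and `rollout_value` are hypothetical user-supplied helpers, not functions from the cited paper.

```python
import numpy as np

def bootstrap_lower_bound(dataset, policy, fit_model, rollout_value,
                          n_bootstrap=50, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    values = []
    for _ in range(n_bootstrap):
        idx = rng.integers(len(dataset), size=len(dataset))  # resample with replacement
        model = fit_model([dataset[i] for i in idx])
        values.append(rollout_value(model, policy))
    return np.quantile(values, alpha)  # lower confidence bound on policy value
```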
Model-Ensemble Trust-Region Policy Optimization
- Computer Science, ICLR
- 2018
This paper analyzes the behavior of vanilla model-based reinforcement learning methods when deep neural networks are used to learn both the model and the policy, and shows that the learned policy tends to exploit regions where insufficient data is available for the model to be learned, causing instability in training.
Doubly Robust Policy Evaluation and Learning
- Computer Science, ICML
- 2011
It is proved that the doubly robust approach uniformly improves over existing techniques, achieving both lower variance in value estimation and better policies, and is expected to become common practice.
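For reference, a minimal sketch of the doubly robust estimator in the contextual bandit setting treated by this paper: a learned reward model supplies a direct estimate, and an importance-weighted correction on the logged action removes its bias, so the estimate is accurate if either the reward model or the logged propensities are. The function interfaces below are illustrative assumptions.

```python
import numpy as np

def doubly_robust_value(contexts, actions, rewards, propensities,
                        target_policy, reward_model, action_space):
    """target_policy(x, a) -> pi(a|x); reward_model(x, a) -> estimated reward r_hat(x, a);
    propensities: behavior probabilities mu(a|x) of the logged actions."""
    estimates = []
    for x, a, r, mu in zip(contexts, actions, rewards, propensities):
        direct = sum(target_policy(x, b) * reward_model(x, b) for b in action_space)
        correction = target_policy(x, a) / mu * (r - reward_model(x, a))
        estimates.append(direct + correction)
    return float(np.mean(estimates))
```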
Eligibility Traces for Off-Policy Policy Evaluation
- Computer Science, ICML
- 2000
This paper considers the off-policy version of the policy evaluation problem, for which only one eligibility trace algorithm is known, a Monte Carlo method, and analyzes and compares this and four new eligibility trace algorithms, emphasizing their relationships to the classical statistical technique known as importance sampling.
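A minimal sketch of ordinary (trajectory-wise) importance sampling, the classical estimator these eligibility-trace algorithms relate to: each trajectory's discounted return is reweighted by the product of per-step probability ratios between the evaluation and behavior policies. The policy interfaces are illustrative assumptions.

```python
import numpy as np

def importance_sampling_value(trajectories, pi_eval, pi_behavior, gamma=0.99):
    """Each trajectory is a list of (state, action, reward) tuples."""
    estimates = []
    for traj in trajectories:
        ratio, ret = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            ratio *= pi_eval(s, a) / pi_behavior(s, a)  # cumulative importance weight
            ret += gamma ** t * r                        # discounted return
        estimates.append(ratio * ret)
    return float(np.mean(estimates))
```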
Batch Policy Learning under Constraints
- Computer Science, ICML
- 2019
A new and simple method for off-policy policy evaluation (OPE) with PAC-style bounds is proposed; it achieves strong empirical results in different domains, including a challenging simulated car driving problem subject to multiple constraints such as lane keeping and smooth driving.
Batch Stationary Distribution Estimation
- Computer Science, Mathematics, ICML
- 2020
A variational power method (VPM) is developed that provides provably consistent estimates under general conditions and is found to yield significantly better estimates across a range of problems, including queueing, stochastic differential equations, post-processing MCMC, and off-policy evaluation.