Corpus ID: 211677237

Batch Stationary Distribution Estimation

Junfeng Wen, Bo Dai, Lihong Li, Dale Schuurmans
We consider the problem of approximating the stationary distribution of an ergodic Markov chain given a set of sampled transitions. Classical simulation-based approaches assume access to the underlying process so that trajectories of sufficient length can be gathered to approximate stationary sampling. Instead, we consider an alternative setting where a fixed set of transitions has been collected beforehand, by a separate, possibly unknown procedure. The goal is still to estimate properties of…
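To make the batch setting concrete, here is a minimal sketch for a small finite-state chain. It is not the method proposed in the paper: it is a simple plug-in baseline that builds an empirical transition matrix from the fixed set of (state, next state) pairs and then runs power iteration. The function name `estimate_stationary`, the uniform fallback for states with no outgoing samples, and the toy two-state data are all assumptions made for illustration.

```python
def estimate_stationary(transitions, n_states, iters=1000, tol=1e-12):
    """Plug-in estimate of a stationary distribution from (s, s_next) pairs.

    Illustrative baseline only; assumes a finite, ergodic chain whose states
    are labelled 0 .. n_states - 1.
    """
    # Count observed transitions to form an empirical transition matrix.
    counts = [[0.0] * n_states for _ in range(n_states)]
    for s, s_next in transitions:
        counts[s][s_next] += 1.0

    P = []
    for row in counts:
        total = sum(row)
        if total == 0:
            # Assumption: a state with no outgoing samples gets a uniform row.
            P.append([1.0 / n_states] * n_states)
        else:
            P.append([c / total for c in row])

    # Power iteration: apply pi <- pi P until the update is below tol.
    pi = [1.0 / n_states] * n_states
    for _ in range(iters):
        new_pi = [
            sum(pi[i] * P[i][j] for i in range(n_states))
            for j in range(n_states)
        ]
        converged = max(abs(a - b) for a, b in zip(new_pi, pi)) < tol
        pi = new_pi
        if converged:
            break
    return pi


# Toy batch: the start states are NOT drawn from the stationary distribution
# (10 samples from each state), yet the estimate only depends on the
# conditional frequencies of the next state.
batch = [(0, 0)] * 9 + [(0, 1)] * 1 + [(1, 0)] * 5 + [(1, 1)] * 5
pi_hat = estimate_stationary(batch, n_states=2)
print(pi_hat)  # close to [5/6, 1/6]
```

The toy batch corresponds to the empirical matrix with rows (0.9, 0.1) and (0.5, 0.5), whose stationary distribution is (5/6, 1/6). A tabular estimate like this does not scale to large or continuous state spaces, where the transition matrix cannot be stored explicitly.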
Autoregressive Dynamics Models for Offline Policy Evaluation and Optimization
It is demonstrated that autoregressive dynamics models outperform standard feedforward models in log-likelihood on held-out transitions, and that they are useful for offline policy optimization both by enriching the replay buffer through data augmentation and by improving performance through model-based planning.
A maximum-entropy approach to off-policy evaluation in average-reward MDPs
This work provides the first finite-sample OPE error bound, and proposes a new approach for estimating stationary distributions with function approximation in infinite-horizon undiscounted Markov decision processes.
Non-asymptotic Confidence Intervals of Off-policy Evaluation: Primal and Dual Bounds
This work considers the problem of constructing non-asymptotic confidence intervals in infinite-horizon off-policy evaluation, which remains a challenging open question, and develops a practical algorithm through a primal-dual optimization-based approach that leverages the kernel Bellman loss of Feng et al. (2019).
Provably Good Batch Reinforcement Learning Without Great Exploration
Batch reinforcement learning (RL) is important for applying RL algorithms to many high-stakes tasks. Doing batch RL in a way that yields a reliable new policy in large domains is challenging: a new…
Cautious Policy Programming: Exploiting KL Regularization in Monotonic Policy Improvement for Reinforcement Learning
A new entropy-regularization-aware lower bound of policy improvement that only requires estimating the expected policy advantage function is derived, and cautious policy programming (CPP) is proposed, a novel value-based reinforcement learning (RL) algorithm that can ensure monotonic policy improvement during learning.
Benchmarks for Deep Off-Policy Evaluation
The goal of the benchmark is to provide a standardized measure of progress that is motivated by a set of principles designed to challenge and test the limits of existing OPE methods.
Learning and Planning in Average-Reward Markov Decision Processes
Improved learning and planning algorithms for average-reward MDPs are introduced, including the first general proven-convergent off-policy model-free control algorithm without reference states, and the first learning algorithms that converge to the actual value function rather than to the value function plus an offset.
Deep Reinforcement and InfoMax Learning
An objective based on Deep InfoMax (DIM) is introduced which trains the agent to predict the future by maximizing the mutual information between its internal representation of successive timesteps.
Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems
In this tutorial article, we aim to provide the reader with the conceptual tools needed to get started on research on offline reinforcement learning algorithms: reinforcement learning algorithms that…


Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation
A new off-policy estimation method that applies importance sampling directly on the stationary state-visitation distributions to avoid the exploding variance issue faced by existing estimators is proposed.
DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections
This work proposes an algorithm, DualDICE, that is agnostic to knowledge of the behavior policy (or policies) used to generate the dataset and improves accuracy compared to existing techniques.
Slice Sampling
Markov chain sampling methods that automatically adapt to characteristics of the distribution being sampled can be constructed by exploiting the principle that one can sample from a distribution by…
Variational Inference with Normalizing Flows
It is demonstrated that the theoretical advantages of having posteriors that better match the true posterior, combined with the scalability of amortized variational approaches, provides a clear improvement in performance and applicability of variational inference.
Toward Minimax Off-policy Value Estimation
It is shown that while the so-called regression estimator is asymptotically optimal, for small sample sizes it may perform suboptimally compared to an ideal oracle up to a multiplicative factor that depends on the number of actions.
The No-U-turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo
This work introduces the No-U-Turn Sampler (NUTS), an extension to HMC that eliminates the need to set a number of steps L, and derives a method for adapting the step size parameter ε on the fly based on primal-dual averaging.
Bayesian Learning via Stochastic Gradient Langevin Dynamics
In this paper we propose a new framework for learning from large scale datasets based on iterative learning from small mini-batches. By adding the right amount of noise to a standard stochastic…
MCMC Using Hamiltonian Dynamics
Hamiltonian dynamics can be used to produce distant proposals for the Metropolis algorithm, thereby avoiding the slow exploration of the state space that results from the diffusive behaviour of…
Markov chains and mixing times
For our purposes, a Markov chain is a (finite or countable) collection of states S and transition probabilities p_ij, where i, j ∈ S. We write P = [p_ij] for the matrix of transition probabilities.
Control functionals for Monte Carlo integration
A non-parametric extension of control variates is presented. These leverage gradient information on the sampling density to achieve substantial variance reduction. It is not required that…