• Corpus ID: 212628535

Active Model Estimation in Markov Decision Processes

@inproceedings{Tarbouriech2020ActiveME,
  title={Active Model Estimation in Markov Decision Processes},
  author={Jean Tarbouriech and Shubhanshu Shekhar and Matteo Pirotta and Mohammad Ghavamzadeh and Alessandro Lazaric},
  booktitle={UAI},
  year={2020}
}
We study the problem of efficient exploration in order to learn an accurate model of an environment, modeled as a Markov decision process (MDP). Efficient exploration in this problem requires the agent to identify the regions in which estimating the model is more difficult and then exploit this knowledge to collect more samples there. In this paper, we formalize this problem, introduce the first algorithm to learn an $\epsilon$-accurate estimate of the dynamics, and provide its sample… 
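The exploration idea in the abstract (directing samples toward the state-action pairs whose dynamics are hardest to estimate) can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes generative-model access to a toy tabular MDP, and the count-based proxy $\sqrt{S/N(s,a)}$, a Weissman-style $L_1$ bound, is only an illustrative stand-in for a finer, distribution-dependent difficulty measure.

```python
import numpy as np

# Minimal illustrative sketch (not the paper's algorithm): estimate the
# transition model of a small tabular MDP and greedily query the (s, a) pair
# whose empirical model is least certain. Assumes generative-model access,
# i.e. any (s, a) can be sampled directly, which simplifies the exploration
# problem the paper actually studies.

rng = np.random.default_rng(0)
S, A = 6, 3
P_true = rng.dirichlet(np.ones(S), size=(S, A))      # ground-truth dynamics P[s, a, s']
counts = np.zeros((S, A, S))                          # N(s, a, s')

for _ in range(2000):
    n_sa = counts.sum(axis=2)                         # N(s, a)
    # count-based L1-concentration proxy; with this homogeneous proxy the
    # allocation is near-uniform, a finer proxy would differentiate pairs
    uncertainty = np.sqrt(S / np.maximum(n_sa, 1))
    s, a = np.unravel_index(np.argmax(uncertainty), n_sa.shape)
    s_next = rng.choice(S, p=P_true[s, a])
    counts[s, a, s_next] += 1

P_hat = counts / np.maximum(counts.sum(axis=2, keepdims=True), 1)
err = np.abs(P_hat - P_true).sum(axis=2).max()        # worst-case L1 model error
print(f"max_(s,a) ||P_hat(.|s,a) - P(.|s,a)||_1 = {err:.3f}")
```

The argmax rule here is the simplest instance of "collect more samples where the model is harder to estimate"; the paper's contribution is to make that allocation principled and to bound its sample complexity.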
Improved Sample Complexity for Incremental Autonomous Exploration in MDPs
TLDR
A novel model-based approach that interleaves discovering new states from $s_0$ and improving the accuracy of a model estimate that is used to compute goal-conditioned policies is introduced, and is the first algorithm that can return an $\epsilon/c_{\min}$-optimal policy for any cost-sensitive shortest-path problem defined on the $L$-reachable states with minimum cost $c_{\min}$.
A Policy Gradient Method for Task-Agnostic Exploration
TLDR
It is argued that the entropy of the state distribution induced by limited-horizon trajectories is a sensible target, and a novel and practical policy-search algorithm, Maximum Entropy POLicy optimization (MEPOL), is presented to learn a policy that maximizes a non-parametric, $k$-nearest neighbors estimate of the state distribution entropy.
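As background on the estimator mentioned above, here is a minimal sketch of a k-nearest-neighbor (Kozachenko-Leonenko-style) entropy estimate computed from a batch of sampled states. It is one standard estimator of this family, not necessarily the exact one used by MEPOL, and the function name knn_entropy is hypothetical.

```python
import numpy as np
from scipy.special import digamma, gammaln

def knn_entropy(states, k=4):
    """Kozachenko-Leonenko-style k-NN entropy estimate of a batch of states."""
    n, d = states.shape
    # pairwise Euclidean distances; exclude self-matches via an infinite diagonal
    dists = np.linalg.norm(states[:, None, :] - states[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    eps = np.sort(dists, axis=1)[:, k - 1]            # distance to the k-th neighbor
    # log-volume of the d-dimensional Euclidean ball of radius eps
    log_ball = d * np.log(eps) + (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1)
    return digamma(n) - digamma(k) + log_ball.mean()

# Example: a more spread-out batch of states gets a larger entropy estimate.
rng = np.random.default_rng(0)
print(knn_entropy(rng.normal(scale=1.0, size=(500, 2))))
print(knn_entropy(rng.normal(scale=3.0, size=(500, 2))))
```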
Adaptive Multi-Goal Exploration
TLDR
It is shown how AdaGoal can be used to tackle the objective of learning an $\epsilon$-optimal goal-conditioned policy for all the goal states that are reachable within $L$ steps in expectation from a reference state $s_0$ in a reward-free Markov decision process.
A Provably Efficient Sample Collection Strategy for Reinforcement Learning
TLDR
This paper derives an algorithm that requires $\tilde{O}(BD + D^{3/2} S^2 A)$ time steps to collect the $b(s,a)$ desired samples in any unknown and communicating MDP with $S$ states, $A$ actions, and diameter $D$.
Task-Agnostic Exploration via Policy Gradient of a Non-Parametric State Entropy Estimate
TLDR
It is argued that the entropy of the state distribution induced by finite-horizon trajectories is a sensible target, and a novel and practical policy-search algorithm, Maximum Entropy POLicy optimization (MEPOL), is presented to learn a policy that maximizes a non-parametric, $k$-nearest neighbors estimate of the state distribution entropy.
On Reward-Free Reinforcement Learning with Linear Function Approximation
TLDR
An algorithm for reward-free RL in the linear Markov decision process setting where both the transition and the reward admit linear representations is given, and the sample complexity is polynomial in the feature dimension and the planning horizon, and is completely independent of the number of states and actions.

References

Showing 1-10 of 27 references
Active Exploration in Markov Decision Processes
TLDR
A novel learning algorithm is introduced to solve the active exploration problem in Markov decision processes, showing that active exploration in MDPs may be significantly more difficult than in multi-armed bandits (MABs).
Provably Efficient Maximum Entropy Exploration
TLDR
This work studies a broad class of objectives that are defined solely as functions of the state-visitation frequencies that are induced by how the agent behaves, and provides an efficient algorithm to optimize such intrinsically defined objectives, when given access to a black box planning oracle.
Active Learning of MDP Models
TLDR
The active learning task is cast as a utility maximization problem using Bayesian reinforcement learning with belief-dependent rewards, and a simple algorithm is proposed to approximately solve this optimization problem.
An analysis of model-based Interval Estimation for Markov Decision Processes
TLDR
A theoretical analysis of Model-based Interval Estimation and a new variation called MBIE-EB are presented, proving their efficiency even under worst-case conditions.
Exploration-Exploitation Trade-off in Reinforcement Learning on Online Markov Decision Processes with Global Concave Rewards
TLDR
A no-regret algorithm is proposed, based on online convex optimization tools and a novel gradient threshold procedure, which carefully controls the switches among actions to handle the subtle trade-off of alternating among different actions to balance the vectorial outcomes.
Markov Decision Processes: Discrete Stochastic Dynamic Programming
  • M. Puterman
  • Mathematics, Computer Science
    Wiley Series in Probability and Statistics
  • 1994
TLDR
Markov Decision Processes covers recent research advances in such areas as countable state space models with average reward criterion, constrained models, and models with risk sensitive optimality criteria, and explores several topics that have received little or no attention in other books.
Adaptive Sampling for Estimating Probability Distributions
TLDR
The techniques developed in the paper can be easily extended to learn some classes of continuous distributions as well as to the related setting of minimizing the average error (rather than the maximum error) in learning a set of distributions.
Adaptive Sampling for Estimating Multiple Probability Distributions
TLDR
The techniques developed in the paper can be easily extended to the related setting of minimizing the average error (in terms of the four distances) in learning a set of distributions.
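A minimal sketch of the adaptive-allocation idea behind these two entries: given K unknown discrete distributions that can be sampled individually, send the next sample to the distribution whose plug-in error proxy is currently largest, so that the maximum estimation error shrinks roughly uniformly. The proxy and the toy setup are assumptions for the demo, not the papers' exact procedure.

```python
import numpy as np

# Illustrative sketch (not the papers' exact algorithm): adaptively split a
# sampling budget across K unknown discrete distributions so that the maximum
# estimation error stays small. The plug-in proxy sum_j sqrt(p(1-p))/sqrt(n)
# is an assumed stand-in for the error bound driving the allocation.

rng = np.random.default_rng(1)
K, M, budget = 5, 8, 4000                      # K distributions over M outcomes
P = rng.dirichlet(np.full(M, 0.3), size=K)     # unknown ground truth, varied "hardness"
counts = np.zeros((K, M))

for i in range(K):                             # small uniform warm-up phase
    for _ in range(10):
        counts[i, rng.choice(M, p=P[i])] += 1

for _ in range(budget):
    n = counts.sum(axis=1)
    p_hat = counts / n[:, None]
    proxy = np.sqrt(p_hat * (1 - p_hat)).sum(axis=1) / np.sqrt(n)
    i = int(np.argmax(proxy))                  # hardest distribution gets the next sample
    counts[i, rng.choice(M, p=P[i])] += 1

p_hat = counts / counts.sum(axis=1, keepdims=True)
print("per-distribution samples:", counts.sum(axis=1).astype(int))
print("max L1 error:", np.abs(p_hat - P).sum(axis=1).max())
```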
Near-Optimal Reinforcement Learning in Polynomial Time
TLDR
New algorithms for reinforcement learning are presented and it is proved that they have polynomial bounds on the resources required to achieve near-optimal return in general Markov decision processes.
Tuning Bandit Algorithms in Stochastic Environments
TLDR
A variant of the basic algorithm for the stochastic multi-armed bandit problem that takes into account the empirical variance of the different arms is considered, and for the first time the concentration of the regret is analyzed.
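A minimal sketch of a variance-aware index in the spirit of this paper (the UCB-V family): the exploration bonus depends on each arm's empirical variance plus a 1/n correction term. The toy bandit, reward clipping, and constants below are assumptions for illustration, not the paper's exact tuning.

```python
import numpy as np

# Illustrative variance-aware UCB index in the spirit of UCB-V: the bonus uses
# each arm's empirical variance plus a 1/n correction term. The toy bandit,
# reward clipping, and constants are assumptions for the demo.

rng = np.random.default_rng(2)
means, stds = np.array([0.50, 0.55, 0.60]), np.array([0.05, 0.30, 0.10])
K, T = len(means), 5000
rewards = [[] for _ in range(K)]

def index(arm, t):
    x = np.asarray(rewards[arm])
    n = len(x)
    if n == 0:
        return np.inf                           # pull every arm at least once
    return x.mean() + np.sqrt(2 * x.var() * np.log(t) / n) + 3 * np.log(t) / n

for t in range(1, T + 1):
    arm = int(np.argmax([index(a, t) for a in range(K)]))
    r = np.clip(rng.normal(means[arm], stds[arm]), 0.0, 1.0)   # rewards in [0, 1]
    rewards[arm].append(r)

print("pull counts:", [len(r) for r in rewards])
```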