# Active Model Estimation in Markov Decision Processes

```bibtex
@inproceedings{Tarbouriech2020ActiveME,
  title     = {Active Model Estimation in Markov Decision Processes},
  author    = {Jean Tarbouriech and Shubhanshu Shekhar and Matteo Pirotta and Mohammad Ghavamzadeh and Alessandro Lazaric},
  booktitle = {UAI},
  year      = {2020}
}
```

We study the problem of efficient exploration in order to learn an accurate model of an environment, modeled as a Markov decision process (MDP). Efficient exploration in this problem requires the agent to identify the regions in which estimating the model is more difficult and then exploit this knowledge to collect more samples there. In this paper, we formalize this problem, introduce the first algorithm to learn an $\epsilon$-accurate estimate of the dynamics, and provide its sample…
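As a rough illustration of the estimation target (not the paper's algorithm): an $\epsilon$-accurate model here means the empirical transition probabilities are close to the true ones in $L_1$ norm for every state–action pair. A minimal sketch with uniform sampling, assuming a hypothetical generative sampler `sample_next_state` on a toy two-state chain:

```python
import random
from collections import defaultdict

def estimate_model(sample_next_state, states, actions, n_samples):
    """Empirical transition model hat_p(s' | s, a) built from sample counts."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in states:
        for a in actions:
            for _ in range(n_samples):
                counts[(s, a)][sample_next_state(s, a)] += 1
    return {
        (s, a): {s_next: c / n_samples for s_next, c in nxt.items()}
        for (s, a), nxt in counts.items()
    }

# Hypothetical toy chain: action 0 stays put, action 1 flips the state w.p. 0.7.
rng = random.Random(0)

def sample_next_state(s, a):
    if a == 0:
        return s
    return 1 - s if rng.random() < 0.7 else s

model = estimate_model(sample_next_state, states=[0, 1], actions=[0, 1], n_samples=5000)
# L1 error at (s=0, a=1) against the true distribution {1: 0.7, 0: 0.3}:
l1_err = abs(model[(0, 1)].get(1, 0.0) - 0.7) + abs(model[(0, 1)].get(0, 0.0) - 0.3)
```

Uniform sampling like this is exactly what active exploration improves on: the paper's point is to direct more samples toward the $(s, a)$ pairs whose next-state distributions are hardest to estimate.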

## 6 Citations

Improved Sample Complexity for Incremental Autonomous Exploration in MDPs

- Computer Science, Mathematics
- NeurIPS
- 2020

A novel model-based approach is introduced that interleaves discovering new states from s0 and improving the accuracy of a model estimate used to compute goal-conditioned policies; it is the first algorithm that can return an ε/c_min-optimal policy for any cost-sensitive shortest-path problem defined on the L-reachable states with minimum cost c_min.

A Policy Gradient Method for Task-Agnostic Exploration

- Computer Science, Mathematics
- ArXiv
- 2020

It is argued that the entropy of the state distribution induced by limited-horizon trajectories is a sensible target, and a novel and practical policy-search algorithm, Maximum Entropy POLicy optimization (MEPOL), is presented to learn a policy that maximizes a non-parametric, $k$-nearest neighbors estimate of the state distribution entropy.
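The k-nearest-neighbors entropy estimate mentioned here can be sketched with the classical Kozachenko–Leonenko estimator, shown below for 1-D samples (an illustrative reconstruction, not MEPOL's code, which operates on multi-dimensional state features):

```python
import math
import random

EULER_GAMMA = 0.5772156649015329

def _digamma_int(m):
    """Digamma at a positive integer: psi(m) = -gamma + sum_{j=1}^{m-1} 1/j."""
    return -EULER_GAMMA + sum(1.0 / j for j in range(1, m))

def knn_entropy_1d(xs, k=3):
    """Kozachenko-Leonenko k-NN estimate of differential entropy (1-D samples):
    H_hat = psi(n) - psi(k) + (1/n) * sum_i log(2 * r_{i,k}),
    where r_{i,k} is the distance from x_i to its k-th nearest neighbor."""
    n = len(xs)
    xs_sorted = sorted(xs)
    log_sum = 0.0
    for i, x in enumerate(xs_sorted):
        # In sorted order the k nearest neighbors lie among the k points
        # on either side, so only a small window needs to be examined.
        window = range(max(0, i - k), min(n, i + k + 1))
        dists = sorted(abs(x - xs_sorted[j]) for j in window if j != i)
        log_sum += math.log(2.0 * dists[k - 1])
    return _digamma_int(n) - _digamma_int(k) + log_sum / n

rng = random.Random(1)
h_u01 = knn_entropy_1d([rng.random() for _ in range(1000)])        # Uniform[0,1]: true entropy 0
h_u02 = knn_entropy_1d([2.0 * rng.random() for _ in range(1000)])  # Uniform[0,2]: true entropy log 2
```

The estimator is non-parametric in exactly the sense the snippet describes: it needs only sampled states, not a density model, which is why it pairs naturally with policy-gradient optimization.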

Adaptive Multi-Goal Exploration

- Computer Science
- ArXiv
- 2021

It is shown how AdaGoal can be used to tackle the objective of learning an ε-optimal goal-conditioned policy for all the goal states that are reachable within L steps in expectation from a reference state s0 in a reward-free Markov decision process.

A Provably Efficient Sample Collection Strategy for Reinforcement Learning

- Computer Science, Mathematics
- ArXiv
- 2020

This paper derives an algorithm that requires $\tilde{O}(BD + D^{3/2} S^2 A)$ time steps to collect the $b(s,a)$ desired samples, in any unknown and communicating MDP with $S$ states, $A$ actions, and diameter $D$.

Task-Agnostic Exploration via Policy Gradient of a Non-Parametric State Entropy Estimate

- Computer Science
- AAAI
- 2021

It is argued that the entropy of the state distribution induced by finite-horizon trajectories is a sensible target, and a novel and practical policy-search algorithm, Maximum Entropy POLicy optimization (MEPOL), is presented to learn a policy that maximizes a non-parametric, k-nearest neighbors estimate of the state distribution entropy.

On Reward-Free Reinforcement Learning with Linear Function Approximation

- Computer Science, Mathematics
- NeurIPS
- 2020

An algorithm for reward-free RL in the linear Markov decision process setting where both the transition and the reward admit linear representations is given, and the sample complexity is polynomial in the feature dimension and the planning horizon, and is completely independent of the number of states and actions.

## References

Showing 1–10 of 27 references

Active Exploration in Markov Decision Processes

- Mathematics, Computer Science
- AISTATS
- 2019

A novel learning algorithm is introduced to solve the active exploration problem in Markov decision processes, showing that active exploration in MDPs may be significantly more difficult than in multi-armed bandits (MAB).

Provably Efficient Maximum Entropy Exploration

- Computer Science, Mathematics
- ICML
- 2019

This work studies a broad class of objectives that are defined solely as functions of the state-visitation frequencies that are induced by how the agent behaves, and provides an efficient algorithm to optimize such intrinsically defined objectives, when given access to a black box planning oracle.

Active Learning of MDP Models

- Computer Science
- EWRL
- 2011

The proposal is to cast the active learning task as a utility maximization problem using Bayesian reinforcement learning with belief-dependent rewards, together with a simple algorithm to approximately solve this optimization problem.

An analysis of model-based Interval Estimation for Markov Decision Processes

- Computer Science
- J. Comput. Syst. Sci.
- 2008

A theoretical analysis of Model-based Interval Estimation and a new variation called MBIE-EB are presented, proving their efficiency even under worst-case conditions.

Exploration-Exploitation Trade-off in Reinforcement Learning on Online Markov Decision Processes with Global Concave Rewards

- Computer Science, Mathematics
- ArXiv
- 2019

A no-regret algorithm is proposed, based on online convex optimization tools and a novel gradient-threshold procedure that carefully controls the switches among actions to handle the subtle trade-off in balancing the vectorial outcomes.

Markov Decision Processes: Discrete Stochastic Dynamic Programming

- Mathematics, Computer Science
- Wiley Series in Probability and Statistics
- 1994

Markov Decision Processes covers recent research advances in such areas as countable state space models with average reward criterion, constrained models, and models with risk sensitive optimality criteria, and explores several topics that have received little or no attention in other books.

Adaptive Sampling for Estimating Probability Distributions

- Computer Science
- ICML
- 2020

The techniques developed in the paper can be easily extended to learn some classes of continuous distributions as well as to the related setting of minimizing the average error (rather than the maximum error) in learning a set of distributions.

Adaptive Sampling for Estimating Multiple Probability Distributions

- Computer Science, Mathematics
- ArXiv
- 2019

The techniques developed in the paper can be easily extended to the related setting of minimizing the average error (in terms of the four distances) in learning a set of distributions.
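A toy version of such an adaptive scheme (a heuristic sketch under assumed names, not the paper's algorithm): repeatedly give the next sample to whichever distribution currently has the largest plug-in proxy for its L1 estimation error, so that the maximum error across distributions shrinks fastest.

```python
import math
import random
from collections import Counter

def l1_error_proxy(counter, n_i):
    """Plug-in proxy for the expected L1 error of an empirical distribution:
    sum_x sqrt(p_hat(x) * (1 - p_hat(x)) / n)."""
    if n_i == 0:
        return float("inf")
    return sum(math.sqrt((c / n_i) * (1.0 - c / n_i) / n_i) for c in counter.values())

def adaptive_sample(samplers, budget):
    """Greedy allocation: each round, draw one sample from the distribution
    whose current error proxy is largest (targets the max error, not the average)."""
    counts = [Counter() for _ in samplers]
    n = [0] * len(samplers)
    for _ in range(budget):
        i = max(range(len(samplers)), key=lambda j: l1_error_proxy(counts[j], n[j]))
        counts[i][samplers[i]()] += 1
        n[i] += 1
    estimates = [{x: c / m for x, c in cnt.items()} for cnt, m in zip(counts, n)]
    return estimates, n

rng = random.Random(0)
fair = lambda: int(rng.random() < 0.5)     # Bernoulli(0.5): high variance, hard
skewed = lambda: int(rng.random() < 0.99)  # Bernoulli(0.99): low variance, easy
estimates, n = adaptive_sample([fair, skewed], budget=1000)
```

The fair coin, which is intrinsically harder to estimate, ends up receiving the bulk of the budget, which is the behavior an adaptive allocation is meant to produce.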

Near-Optimal Reinforcement Learning in Polynomial Time

- Computer Science, Mathematics
- Machine Learning
- 2004

New algorithms for reinforcement learning are presented and it is proved that they have polynomial bounds on the resources required to achieve near-optimal return in general Markov decision processes.

Tuning Bandit Algorithms in Stochastic Environments

- Mathematics, Computer Science
- ALT
- 2007

A variant of the basic algorithm for the stochastic, multi-armed bandit problem that takes into account the empirical variance of the different arms is considered and for the first time the concentration of the regret is analyzed.