Corpus ID: 237397798

The Importance of Non-Markovianity in Maximum State Entropy Exploration

Mirco Mutti, Ric De Santi, Marcello Restelli
In the maximum state entropy exploration framework, an agent interacts with a reward-free environment to learn a policy that maximizes the entropy of the expected state visitations it induces. Hazan et al. (2019) noted that the class of Markovian stochastic policies is sufficient for the maximum state entropy objective, and exploiting non-Markovianity is generally considered pointless in this setting. In this paper, we argue that non-Markovianity is instead paramount for maximum state…
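To make the objective concrete, here is a minimal sketch (illustrative, not the paper's algorithm): in a discrete MDP, the quantity being maximized can be estimated as the Shannon entropy of the empirical state-visitation distribution of sampled trajectories. The function name and toy data below are assumptions for illustration.

```python
import numpy as np

def state_visitation_entropy(trajectories, num_states):
    """Shannon entropy of the empirical state-visitation distribution
    induced by a batch of trajectories (lists of discrete state indices)."""
    counts = np.zeros(num_states)
    for traj in trajectories:
        for s in traj:
            counts[s] += 1
    p = counts / counts.sum()
    p = p[p > 0]  # convention: 0 * log 0 = 0
    return float(-(p * np.log(p)).sum())

# Uniform visitation over 4 states attains the maximum, log 4 ≈ 1.386.
print(state_visitation_entropy([[0, 1], [2, 3]], num_states=4))
```

A policy that spreads its visits uniformly over the state space maximizes this quantity, which is what makes it a natural reward-free exploration objective.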


Non-Markovian Policies for Unsupervised Reinforcement Learning in Multiple Environments

This work proposes a novel non-Markovian policy architecture to be pre-trained with the common maximum state entropy objective, and showcases significant empirical advantages over state-of-the-art Markovian agents for unsupervised reinforcement learning (URL).

Offline Estimation of Controlled Markov Chains: Minimax Nonparametric Estimators and Sample Efficiency

This work considers the estimation of the transition probabilities of a finite-state, finite-control controlled Markov chain (CMC) and develops minimax sample complexity bounds for nonparametric estimation of these transition probability matrices.

Challenging Common Assumptions in Convex Reinforcement Learning

The classic Reinforcement Learning (RL) formulation concerns the maximization of a scalar reward function. More recently, convex RL has been introduced to extend the RL formulation to all the convex functions of the state distribution induced by a policy.



Task-Agnostic Exploration via Policy Gradient of a Non-Parametric State Entropy Estimate

It is argued that the entropy of the state distribution induced by finite-horizon trajectories is a sensible target, and a novel and practical policy-search algorithm, Maximum Entropy POLicy optimization (MEPOL), is presented to learn a policy that maximizes a non-parametric, $k$-nearest neighbors estimate of the state distribution entropy.
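The $k$-nearest neighbors entropy estimate mentioned above can be sketched as follows. This is a simplified, dependency-free variant in the spirit of the Kozachenko-Leonenko family of estimators (the digamma bias-correction terms are replaced by plain logarithms for brevity), not MEPOL's exact implementation.

```python
import math

def knn_entropy(points, k=3):
    """Simplified k-nearest-neighbor differential entropy estimate.
    points: list of d-dimensional tuples. O(n^2) brute-force neighbors,
    fine for a sketch; real estimators use k-d trees and digamma terms."""
    n, d = len(points), len(points[0])
    log_rk_sum = 0.0
    for i, p in enumerate(points):
        # sorted distances to all other points; take the k-th nearest
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        log_rk_sum += math.log(dists[k - 1])
    # log-volume of the d-dimensional unit ball
    log_unit_ball = (d / 2) * math.log(math.pi) - math.lgamma(d / 2 + 1)
    return math.log(n) - math.log(k) + log_unit_ball + (d / n) * log_rk_sum
```

A useful sanity check on any such estimator: rescaling all samples by a factor $c$ shifts the estimate by exactly $d \log c$, matching the behavior of differential entropy under scaling.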

Provably Efficient Maximum Entropy Exploration

This work studies a broad class of objectives defined solely as functions of the state-visitation frequencies induced by the agent's behavior, and provides an efficient algorithm to optimize such intrinsically defined objectives given access to a black-box planning oracle.

Exploration-Exploitation Trade-off in Reinforcement Learning on Online Markov Decision Processes with Global Concave Rewards

A no-regret algorithm is proposed, based on online convex optimization tools and a novel gradient-threshold procedure that carefully controls the switches among actions, handling the subtle trade-off of alternating among different actions to balance the vectorial outcomes.

RL for Latent MDPs: Regret Guarantees and a Lower Bound

This work considers the regret minimization problem for reinforcement learning in latent Markov Decision Processes (LMDPs) and shows that the key link is a notion of separation between the MDP system dynamics, providing an efficient algorithm with a local guarantee.

k-Means Maximum Entropy Exploration

This work introduces an artificial curiosity algorithm based on lower bounding an approximation to the entropy of the state visitation distribution. The algorithm is competitive on benchmarks for exploration in high-dimensional, continuous spaces, especially on tasks where reinforcement learning algorithms are unable to create rewards.
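The general idea can be illustrated with a hedged sketch (not the paper's algorithm; all names and details below are assumptions for illustration): discretize the visited states with k-means and take the Shannon entropy of the cluster occupancy as a tractable surrogate for the state-visitation entropy in continuous spaces.

```python
import math
import random

def kmeans_entropy_surrogate(states, k=4, iters=20, seed=0):
    """Illustrative surrogate: cluster visited states with Lloyd's k-means,
    then return the Shannon entropy of the cluster-occupancy distribution."""
    rng = random.Random(seed)
    centroids = rng.sample(states, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # assign each state to its nearest centroid
        clusters = [[] for _ in range(k)]
        for s in states:
            j = min(range(k), key=lambda c: math.dist(s, centroids[c]))
            clusters[j].append(s)
        # recompute centroids (keep the old one if a cluster empties)
        for j, members in enumerate(clusters):
            if members:
                d = len(members[0])
                centroids[j] = tuple(
                    sum(m[i] for m in members) / len(members) for i in range(d)
                )
    counts = [len(c) for c in clusters]
    n = sum(counts)
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)
```

Spreading visits evenly across well-separated regions of the state space drives the surrogate toward its maximum of $\log k$, so maximizing it encourages broad coverage.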

An Intrinsically-Motivated Approach for Learning Highly Exploring and Fast Mixing Policies

This paper proposes a novel surrogate objective for learning highly exploring and fast mixing policies, which focuses on maximizing a lower bound to the entropy of the steady-state distribution induced by the policy.

Non-Markovian policies occupancy measures

A central object of study in Reinforcement Learning (RL) is the Markovian policy, in which an agent’s actions are chosen from a memoryless probability distribution, conditioned only on its current state.

Exploration by Maximizing Renyi Entropy for Reward-Free RL Framework

A reward-free RL framework is considered that completely separates exploration from exploitation and brings new challenges for exploration algorithms; maximizing the Rényi entropy of the state visitations during exploration results in superior policies for arbitrary reward functions in the planning phase.

Geometric Entropic Exploration

Geometric Entropy Maximisation (GEM) is introduced, a new algorithm that maximises the geometry-aware Shannon entropy of state-visits in both discrete and continuous domains and is shown to be efficient in solving complex Reinforcement Learning tasks with sparse rewards.

Active Exploration in Markov Decision Processes

A novel learning algorithm is introduced to solve the active exploration problem in Markov decision processes, showing that active exploration in MDPs may be significantly more difficult than in multi-armed bandits (MAB).