Policy space identification in configurable environments

@article{Metelli2019PolicySI,
  title={Policy space identification in configurable environments},
  author={Alberto Maria Metelli and Guglielmo Manneschi and Marcello Restelli},
  journal={Machine Learning},
  year={2019},
  volume={111},
  pages={2093--2145}
}
We study the problem of identifying the policy space available to an agent in a learning process, having access to a set of demonstrations generated by the agent playing the optimal policy in the considered space. We introduce an approach based on frequentist statistical testing to identify the set of policy parameters that the agent can control, within a larger parametric policy space. After presenting two identification rules (combinatorial and simplified), applicable under different… 
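The abstract names the statistical machinery but not the test itself. As a rough illustration (not the paper's exact identification rule), the sketch below runs a generalized likelihood-ratio test on a set of demonstrations to decide, parameter by parameter, whether the agent appears to control it; the linear-Gaussian policy, the fixed noise scale, and all function names are assumptions made for the example.

# Hypothetical sketch: likelihood-ratio test for policy space identification.
# We ask whether the j-th parameter of a linear-Gaussian policy
# a ~ N(theta^T s, sigma^2) is actually controlled by the agent.
import numpy as np
from scipy import stats

def fit_theta(S, A, free_mask):
    """Maximum-likelihood estimate of theta with some components clamped to 0."""
    idx = np.flatnonzero(free_mask)
    theta = np.zeros(S.shape[1])
    if idx.size > 0:
        # Least squares on the free coordinates == Gaussian MLE for the mean.
        sol, *_ = np.linalg.lstsq(S[:, idx], A, rcond=None)
        theta[idx] = sol
    return theta

def log_likelihood(S, A, theta, sigma=1.0):
    resid = A - S @ theta
    return np.sum(stats.norm.logpdf(resid, scale=sigma))

def identify_controlled_params(S, A, alpha=0.05, sigma=1.0):
    """Boolean mask: True where H0 'parameter j is not controlled' is rejected."""
    d = S.shape[1]
    full = log_likelihood(S, A, fit_theta(S, A, np.ones(d, bool)), sigma)
    controlled = np.zeros(d, bool)
    for j in range(d):
        mask = np.ones(d, bool)
        mask[j] = False  # H0: theta_j = 0 (agent cannot set this parameter)
        restricted = log_likelihood(S, A, fit_theta(S, A, mask), sigma)
        glr = 2.0 * (full - restricted)  # ~ chi^2(1) under H0, asymptotically
        controlled[j] = glr > stats.chi2.ppf(1.0 - alpha, df=1)
    return controlled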

Configurable Environments in Reinforcement Learning: An Overview

An overview of the main aspects of environment configurability is provided, the formalism of Configurable Markov Decision Processes (Conf-MDPs) is introduced, and the solution concepts are illustrated.

Learning in Non-Cooperative Configurable Markov Decision Processes

This paper introduces the Non-Cooperative Configurable Markov Decision Process, a framework that allows modeling two (possibly different) reward functions for the configurator and the agent, and proposes two learning algorithms to minimize the configurator's expected regret.

Online Learning in Non-Cooperative Configurable Markov Decision Process

This paper proposes a learning algorithm to minimize the configurator's expected regret, which exploits the structure of the problem, and empirically shows the performance of the algorithm in simulated domains.

A unified view of configurable Markov Decision Processes: Solution concepts, value functions, and operators

A. Metelli, Intelligenza Artificiale, 2022
This paper provides a unified presentation of the Configurable Markov Decision Process (Conf-MDP) framework and illustrates how to extend the traditional value functions for MDPs and Bellman operators to this new framework.

Research Project Proposal: A Non-Cooperative approach in Configurable Markov Decision Process

Reinforcement Learning (RL) studies how software agents ought to take actions in an environment in order to maximize a cumulative reward consistent with their goal.

State of the Art on: Configurable Markov Decision Process

Reinforcement learning (RL) is one of the three basic machine learning paradigms, alongside supervised and unsupervised learning; it leverages statistical techniques to develop algorithms that learn from data.

Optimizing Empty Container Repositioning and Fleet Deployment via Configurable Semi-POMDPs

A novel framework, Configurable Semi-POMDPs, is introduced to model Empty Container Repositioning (ECR) problems, and a two-stage learning algorithm, "Configure & Conquer" (CC), is provided, which successfully optimizes both the ECR policy and the fleet of vessels, leading to superior performance in world trade environments.

Exploiting environment configurability in reinforcement learning

References

Showing 1-10 of 58 references

Configurable Markov Decision Processes

A novel framework is proposed, Configurable Markov Decision Processes (Conf-MDPs), to model this new type of interaction with the environment, and a new learning algorithm is provided, Safe Policy-Model Iteration (SPMI), to jointly and adaptively optimize the policy and the environment configuration.
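As a very rough illustration of jointly optimizing the policy and the environment configuration (not the SPMI update rule, which relies on safe-update bounds), the sketch below alternates greedy policy improvement with a greedy choice among a finite set of candidate transition models on a tabular Conf-MDP; the uniform initial-state weighting and all names are assumptions of the example.

# Rough tabular sketch: treat the transition model as a decision variable and
# alternate policy improvement with configuration selection. Not SPMI itself.
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.95, tol=1e-8):
    """V^pi for tabular P[s, a, s'], R[s, a] and deterministic policy pi[s]."""
    nS = P.shape[0]
    V = np.zeros(nS)
    while True:
        V_new = np.array([R[s, pi[s]] + gamma * P[s, pi[s]] @ V for s in range(nS)])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

def alternate_policy_and_config(P_configs, R, gamma=0.95, iters=50):
    """Alternate greedy policy improvement and greedy configuration choice."""
    nS, nA = R.shape
    pi = np.zeros(nS, dtype=int)
    c = 0  # index of the current environment configuration
    for _ in range(iters):
        V = policy_evaluation(P_configs[c], R, pi, gamma)
        # Greedy policy improvement under the current configuration.
        Q = R + gamma * np.einsum("sap,p->sa", P_configs[c], V)
        pi = np.argmax(Q, axis=1)
        # Greedy configuration choice: pick the model that best serves pi
        # (uniform initial-state weighting is an assumption of this sketch).
        scores = [policy_evaluation(P, R, pi, gamma).mean() for P in P_configs]
        c = int(np.argmax(scores))
    return pi, c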

Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path

A finite-sample, high-probability bound on the performance of the computed policy is found that depends on the mixing rate of the trajectory, the capacity of the function set as measured by a novel capacity concept, the approximation power of the function set, and the controllability properties of the MDP.
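As a bare-bones illustration of the Bellman-residual-minimization step (not the paper's estimator or analysis), the sketch below fits a linear action-value function by minimizing the empirical squared Bellman residual along a single sample path; it deliberately ignores the double-sampling bias and the capacity and mixing considerations the bound depends on.

# Minimal sketch: linear Bellman-residual minimization on one trajectory of
# transitions (s, a, r, s', a'). The naive squared-residual objective below is
# biased under stochastic transitions (double-sampling problem).
import numpy as np

def bellman_residual_weights(phi_sa, rewards, phi_next_sa, gamma=0.99, reg=1e-6):
    """
    Minimize sum_t (phi(s_t,a_t)^T w - r_t - gamma * phi(s_{t+1},a_{t+1})^T w)^2
    over w, in closed form (ridge-regularized least squares).
    """
    D = phi_sa - gamma * phi_next_sa            # residual feature differences
    A = D.T @ D + reg * np.eye(D.shape[1])      # normal-equations matrix
    b = D.T @ rewards
    return np.linalg.solve(A, b)

# Fitted policy iteration would then alternate: (1) estimate w for the current
# policy from the single sample path, (2) act greedily w.r.t. Q(s,a) = phi(s,a)^T w.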

A Theoretical and Algorithmic Analysis of Configurable MDPs

A computational complexity analysis shows that, in general, solving a configurable MDP is NP-hard, and a gradient-based approach is formally derived that sheds some light on the correctness and limitations of existing methods.

What if the World Were Different? Gradient-Based Exploration for New Optimal Policies

An approach is presented that models feasible changes to the world as modifications to the probability transition function, and it is shown that the problem of computing the configuration of the world that allows the most rewarding optimal policy can be formulated as a constrained optimization problem.
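A toy version of that constrained optimization, assuming a single scalar configuration parameter that interpolates between two transition kernels and using finite differences in place of the paper's analytic gradient:

# Toy sketch: search over environment configurations for the one admitting the
# best optimal policy. c in [0, 1] interpolates between two transition kernels.
import numpy as np

def optimal_value(P, R, gamma=0.95, iters=500):
    """Value iteration on P[s, a, s'], R[s, a]; returns the mean optimal value."""
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        V = np.max(R + gamma * np.einsum("sap,p->sa", P, V), axis=1)
    return V.mean()

def best_configuration(P0, P1, R, gamma=0.95, lr=0.5, steps=100, eps=1e-3):
    """Projected (clipped) finite-difference ascent over the configuration c."""
    J = lambda c: optimal_value((1 - c) * P0 + c * P1, R, gamma)
    c = 0.5
    for _ in range(steps):
        grad = (J(min(c + eps, 1.0)) - J(max(c - eps, 0.0))) / (2 * eps)
        c = float(np.clip(c + lr * grad, 0.0, 1.0))
    return c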

A Survey on Policy Search for Robotics

This work classifies model-free methods based on their policy evaluation strategy, policy update strategy, and exploration strategy, and presents a unified view of existing algorithms.

Importance Sampling Techniques for Policy Optimization

A class of model-free policy search algorithms is proposed and analyzed that extends the recent Policy Optimization via Importance Sampling by incorporating two advanced variance-reduction techniques: per-decision and multiple importance sampling.
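For concreteness, the sketch below implements the two variance-reduction devices mentioned, per-decision importance weights and balance-heuristic multiple importance sampling over several behavioural policies; the function names and array layouts are illustrative, not the paper's API.

# Illustrative estimators: per-decision importance sampling (each reward is
# weighted only by the ratios of the actions that preceded it) and the balance
# heuristic for multiple behavioural policies.
import numpy as np

def per_decision_is(logp_target, logp_behav, rewards, gamma=0.99):
    """
    logp_target, logp_behav, rewards: arrays of shape (T,) for one trajectory.
    Each reward r_t is weighted by prod_{k<=t} pi(a_k|s_k) / b(a_k|s_k).
    """
    log_w = np.cumsum(logp_target - logp_behav)       # cumulative log-ratios
    discounts = gamma ** np.arange(len(rewards))
    return np.sum(np.exp(log_w) * discounts * rewards)

def balance_heuristic_weights(logp_target, logp_behaviors, n_per_behavior):
    """
    Multiple importance sampling (balance heuristic) for trajectory returns.
    logp_target: (N,) log-prob of each trajectory under the target policy.
    logp_behaviors: (N, K) log-prob under the K behaviour policies.
    n_per_behavior: (K,) number of trajectories drawn from each behaviour.
    """
    # Mixture denominator: sum_k (n_k / N) * q_k(trajectory).
    N = np.sum(n_per_behavior)
    mix = np.log(np.sum((n_per_behavior / N) * np.exp(logp_behaviors), axis=1))
    return np.exp(logp_target - mix)   # one weight per trajectory

A POIS-style surrogate objective would then aggregate such weighted returns over the batch, penalizing estimates whose importance weights have high variance, before each offline update.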

PILCO: A Model-Based and Data-Efficient Approach to Policy Search

PILCO reduces model bias, one of the key problems of model-based reinforcement learning, in a principled way by learning a probabilistic dynamics model and explicitly incorporating model uncertainty into long-term planning.

Policy Optimization via Importance Sampling

This paper proposes a novel, model-free policy search algorithm, POIS, applicable in both action-based and parameter-based settings, and defines a surrogate objective function, which is optimized offline whenever a new batch of trajectories is collected.

Compatible Reward Inverse Reinforcement Learning

A novel model-free IRL approach is proposed that, unlike most existing IRL algorithms, does not require specifying a function space in which to search for the expert's reward function.

Inverse Reinforcement Learning through Policy Gradient Minimization

This paper proposes a new IRL approach that allows recovering the reward function without the need to solve any "direct" RL problem, and presents an empirical evaluation of the proposed approach on a multidimensional version of the Linear-Quadratic Regulator.
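Under the simplifying assumptions of a linear reward r(s,a) = w^T psi(s,a) and REINFORCE-style gradient estimates, the gradient-minimization idea can be sketched as follows: the expert's estimated policy gradient is linear in w, so a unit-norm w that approximately zeroes it can be read off a singular value decomposition. This is a simplified reading of the idea, not the paper's exact estimator.

# Simplified sketch: recover linear reward weights by making the expert's
# estimated policy gradient vanish.
import numpy as np

def reward_weights_from_gradient(score_sums, feature_returns):
    """
    score_sums:      (N, d_theta)   sum_t grad_theta log pi(a_t|s_t), per trajectory
    feature_returns: (N, d_reward)  sum_t gamma^t psi(s_t, a_t), per trajectory
    The REINFORCE gradient under reward weights w is G w, with
    G = E[score_sums^T feature_returns]; return the unit-norm w minimizing
    ||G w||, i.e. the right singular vector with the smallest singular value.
    """
    G = score_sums.T @ feature_returns / score_sums.shape[0]
    _, _, Vt = np.linalg.svd(G)
    return Vt[-1]   # unit-norm (approximate) zero of the estimated gradient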
...