• Corpus ID: 211133193

Control Frequency Adaptation via Action Persistence in Batch Reinforcement Learning

Alberto Maria Metelli, Flavio Mazzolini, Lorenzo Bisi, Luca Sabbioni, Marcello Restelli
The choice of the control frequency of a system has a relevant impact on the ability of reinforcement learning algorithms to learn a highly performing policy. In this paper, we introduce the notion of action persistence, which consists in the repetition of an action for a fixed number of decision steps, having the effect of modifying the control frequency. We start by analyzing how action persistence affects the performance of the optimal policy, and then we present a novel algorithm, Persistent… 
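
The idea of action persistence described above can be illustrated with a minimal sketch. This is not the authors' algorithm (the abstract is truncated before it is described); it only shows what repeating an action for a fixed number of base steps looks like. The names `persistent_step` and `toy_step`, the toy dynamics, and the discount factor `gamma` are all assumptions made for illustration.

```python
from typing import Callable, Tuple

# Hypothetical types for a one-dimensional toy problem (illustration only).
State = float
Action = float
StepFn = Callable[[State, Action], Tuple[State, float]]


def persistent_step(
    step: StepFn, state: State, action: Action, k: int, gamma: float = 0.99
) -> Tuple[State, float]:
    """Repeat `action` for `k` base steps of `step`.

    Returns the state reached after k steps and the discounted sum of the
    k intermediate rewards, so one call acts as a single decision made at
    a control frequency k times lower than the base one.
    """
    if k < 1:
        raise ValueError("persistence k must be at least 1")
    total = 0.0
    discount = 1.0
    for _ in range(k):
        state, reward = step(state, action)
        total += discount * reward
        discount *= gamma
    return state, total


def toy_step(state: State, action: Action) -> Tuple[State, float]:
    """Assumed toy dynamics: move by `action`, pay the distance from 0."""
    next_state = state + action
    return next_state, -abs(next_state)


if __name__ == "__main__":
    # With k = 1 this is an ordinary base-frequency step.
    print(persistent_step(toy_step, 0.0, 1.0, k=1))
    # With k = 3 the same action is held for three base steps.
    print(persistent_step(toy_step, 0.0, 1.0, k=3, gamma=0.5))
```

Setting `k = 1` recovers the base control frequency; larger `k` coarsens it, which is the trade-off the abstract refers to.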

Simultaneously Updating All Persistence Values in Reinforcement Learning

This work derives a novel All-Persistence Bellman Operator, which allows an effective use of both the low-persistence experience, by decomposition into sub-transitions, and the high-persistence experience, thanks to the introduction of a suitable bootstrap procedure.

Reinforcement Learning for Control with Multiple Frequencies

The proposed method, Action-Persistent Policy Iteration (AP-PI), provides a theoretical guarantee on the convergence to an optimal solution while incurring only a factor of |A| increase in time complexity during the policy improvement step, compared to the standard policy iteration for FA-MDPs.

Configurable Environments in Reinforcement Learning: An Overview

An overview of the main aspects of environment configurability is provided, the formalism of Configurable Markov Decision Processes (Conf-MDPs) is introduced, and the solution concepts are illustrated.

Managing Temporal Resolution in Continuous Value Estimation: A Fundamental Trade-off

By analyzing Monte-Carlo value estimation for LQR systems in both finite-horizon and infinite-horizon settings, this work uncovers a fundamental trade-off between approximation and statistical error in value estimation.

Addressing Non-Stationarity in FX Trading with Online Model Selection of Offline RL Experts

This work proposes a method for the dynamic selection of the best RL agent which is only driven by profit performance, and employs two state-of-the-art algorithms: Fitted Q-Iteration for the RL layer and Optimistic Adapt ML-Prod for the online learning one.

Exploiting environment configurability in reinforcement learning

Delayed Reinforcement Learning by Imitation

A novel algorithm, Delayed Imitation with Dataset Aggregation (DIDA), which builds upon imitation learning methods to learn how to act in a delayed environment from undelayed demonstrations, and shows empirically that DIDA achieves high performance with remarkable sample efficiency on a variety of tasks.

Learning FX trading strategies with FQI and persistent actions

Automated Trading Systems are constantly increasing their impact on financial markets, but learning from historical data, detecting interesting patterns and producing profitable strategies are still

Time Discretization-Invariant Safe Action Repetition for Policy Gradient Methods

This work identifies the underlying reasons that cause PG methods to fail as δ → 0, and proves that the variance of the PG estimator can diverge to infinity in stochastic environments under a certain assumption of stochasticity.

Towards Automatic Actor-Critic Solutions to Continuous Control

This paper creates an evolutionary approach that automatically tunes these design decisions and eliminates the RL-specific hyperparameters from the Soft Actor-Critic algorithm, and shows that this agent outperforms well-tuned hyperparameter settings in popular benchmarks from the DeepMind Control Suite.



Reinforcement Learning in Continuous Time and Space

  • K. Doya
  • Computer Science
    Neural Computation
  • 2000
This article presents a reinforcement learning framework for continuous-time dynamical systems without a priori discretization of time, state, and action. Based on the Hamilton-Jacobi-Bellman (HJB)

Regularization in reinforcement learning

It is proved that the regularization-based Approximate Value/Policy Iteration algorithms introduced in this thesis enjoy an oracle-like property and may be used to achieve adaptivity: the performance is almost as good as the performance of the unknown best parameters.

Markov Decision Processes: Discrete Stochastic Dynamic Programming

  • M. Puterman
  • Computer Science
    Wiley Series in Probability and Statistics
  • 1994
Markov Decision Processes covers recent research advances in such areas as countable state space models with average reward criterion, constrained models, and models with risk sensitive optimality criteria, and explores several topics that have received little or no attention in other books.

Neuronlike adaptive elements that can solve difficult learning control problems

It is shown how a system consisting of two neuronlike adaptive elements can solve a difficult learning control problem and the relation of this work to classical and instrumental conditioning in animal learning studies and its possible implications for research in the neurosciences.

Tree-Based Batch Mode Reinforcement Learning

Within this framework, several classical tree-based supervised learning methods and two newly proposed ensemble algorithms, namely extremely and totally randomized trees, are described, and the ensemble methods based on regression trees are found to perform well in extracting relevant information about the optimal control policy from sets of four-tuples.

Policy gradient in Lipschitz Markov Decision Processes

This paper shows that both the expected return of a policy and its gradient are Lipschitz continuous w.r.t. policy parameters and defines policy-parameter updates that guarantee a performance improvement at each iteration.

Reinforcement learning for robotic manipulation using simulated locomotion demonstrations

This paper introduces a framework whereby an object locomotion policy is initially obtained using a realistic physics simulator, and this policy is then used to generate auxiliary rewards, called simulated locomotion demonstration rewards (SLDRs), which enable us to learn the robot manipulation policy.

Assistive Gym: A Physics Simulation Framework for Assistive Robotics

Assistive Gym is presented, an open source physics simulation framework for assistive robots that models multiple tasks and demonstrates that modeling human motion results in better assistance and compares the performance of different robots.

Reinforcement Learning in Configurable Continuous Environments

This paper proposes a trust-region method, Relative Entropy Model Policy Search (REMPS), able to learn both the policy and the MDP configuration in continuous domains without requiring the knowledge of the true model of the environment.

Integral Probability Metrics and Their Generating Classes of Functions

  • A. Müller
  • Mathematics, Computer Science
    Advances in Applied Probability
  • 1997
A unified study of integral probability metrics of the following type is given, and it is shown how some interesting properties of these probability metrics arise directly from conditions on the generating class of functions.