Criticality-based Varying Step-number Algorithm for Reinforcement Learning

@article{Spielberg2021CriticalitybasedVS,
  title={Criticality-based Varying Step-number Algorithm for Reinforcement Learning},
  author={Yitzhak Spielberg and Amos Azaria},
  journal={ArXiv},
  year={2021},
  volume={abs/2201.05034}
}
In the context of reinforcement learning, we introduce the concept of criticality of a state, which indicates the extent to which the choice of action in that particular state influences the expected return. That is, a state in which the choice of action is more likely to influence the final outcome is considered more critical than a state in which it is less likely to do so. We formulate a criticality-based varying step-number algorithm (CVS), a flexible step number… 
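
The abstract only gestures at the algorithm, so the following is a minimal illustrative sketch rather than the authors' implementation: an n-step SARSA-style target whose step number is chosen from a human-provided criticality score. The function names, the [n_min, n_max] range, and the direction of the mapping (here, more critical states bootstrap after fewer steps) are assumptions made for illustration only.

```python
def steps_from_criticality(criticality, n_min=1, n_max=8):
    # Map a criticality score in [0, 1] to a step number.
    # Assumption: higher criticality -> fewer steps before bootstrapping;
    # the mapping actually used by CVS may differ.
    return int(round(n_max - criticality * (n_max - n_min)))

def n_step_sarsa_target(rewards, q_bootstrap, gamma, n):
    # Standard n-step SARSA target: n discounted rewards plus a
    # discounted bootstrap from Q(s_{t+n}, a_{t+n}).
    n = min(n, len(rewards))
    target = sum(gamma**k * rewards[k] for k in range(n))
    return target + gamma**n * q_bootstrap

# Hypothetical usage inside a tabular agent:
#   n = steps_from_criticality(criticality_fn(state))
#   target = n_step_sarsa_target(future_rewards, Q[s_n, a_n], gamma, n)
```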


References

Showing 1-10 of 32 references
The Concept of Criticality in Reinforcement Learning
TLDR
A practical application of criticality in reinforcement learning is formulated: the criticality-based varying step-number algorithm (CVS), a flexible step-number algorithm that utilizes the criticality function, provided by a human, in order to avoid the problem of choosing an appropriate step number in n-step algorithms such as n-step SARSA and n-step Tree Backup.
Per-decision Multi-step Temporal Difference Learning with Control Variates
TLDR
The results show that including the control variates can greatly improve performance on both on-policy and off-policy multi-step temporal-difference learning tasks.
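
For context, the per-decision off-policy return with a control variate is usually written as a recursion; the sketch below shows the state-value version and illustrates the idea referenced here, not necessarily the exact variant studied in the paper.

```python
def off_policy_return_with_cv(rewards, states, rhos, V, gamma, t, h):
    # Per-decision off-policy return with a control variate:
    #   G_{t:h} = rho_t * (R_{t+1} + gamma * G_{t+1:h}) + (1 - rho_t) * V(S_t)
    # with G_{h:h} = V(S_h).  rhos[t] = pi(A_t|S_t) / b(A_t|S_t),
    # and rewards[t] holds R_{t+1}.
    if t == h:
        return V(states[h])
    g_next = off_policy_return_with_cv(rewards, states, rhos, V, gamma, t + 1, h)
    return rhos[t] * (rewards[t] + gamma * g_next) + (1.0 - rhos[t]) * V(states[t])
```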
Deep Reinforcement Learning with Double Q-Learning
TLDR
This paper proposes a specific adaptation to the DQN algorithm and shows that the resulting algorithm not only reduces the observed overestimations, as hypothesized, but also leads to much better performance on several games.
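
The adaptation referenced here is the Double DQN target, in which the online network selects the next action and the target network evaluates it; the sketch below assumes the Q-values for the next state have already been computed by both networks.

```python
import numpy as np

def double_dqn_target(reward, next_q_online, next_q_target, gamma, done):
    # Online network selects the greedy action, target network evaluates it.
    # This decoupling is what reduces the overestimation of max_a Q(s', a).
    best_action = int(np.argmax(next_q_online))
    bootstrap = 0.0 if done else gamma * next_q_target[best_action]
    return reward + bootstrap
```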
Model-Based Reinforcement Learning for Time-Optimal Velocity Control
TLDR
This letter develops a model-based deep reinforcement learning approach to the time-optimal velocity control problem and introduces a method that uses a numerical solution to predict whether the vehicle may become unstable, intervening if needed.
An Emphatic Approach to the Problem of Off-policy Temporal-Difference Learning
TLDR
It is shown that varying the emphasis of linear TD(λ)'s updates in a particular way causes its expected update to become stable under off-policy training.
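
As a rough illustration of the emphatic weighting mentioned here, the sketch below implements emphatic TD(λ) with linear function approximation and constant discount, bootstrapping, and interest parameters; treat it as an approximate reading of the method rather than a faithful reproduction.

```python
import numpy as np

def emphatic_td_lambda(features, rewards, rhos, theta, alpha, gamma, lam, interest=1.0):
    # features[t] is the feature vector x(S_t); rhos[t] = pi(A_t|S_t)/b(A_t|S_t).
    e = np.zeros_like(theta)          # eligibility trace
    F = interest                      # follow-on trace, F_0 = i(S_0)
    for t in range(len(rewards)):
        x, x_next = features[t], features[t + 1]
        M = lam * interest + (1.0 - lam) * F          # emphasis M_t
        e = rhos[t] * (gamma * lam * e + M * x)       # emphatic trace e_t
        delta = rewards[t] + gamma * theta @ x_next - theta @ x
        theta = theta + alpha * delta * e
        F = rhos[t] * gamma * F + interest            # follow-on for step t+1
    return theta
```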
Deep TAMER: Interactive Agent Shaping in High-Dimensional State Spaces
TLDR
Deep TAMER is proposed, an extension of the TAMER framework that leverages the representational power of deep neural networks in order to learn complex tasks with a human trainer in just a short amount of time; using it with just 15 minutes of human-provided feedback, an agent is trained that performs better than humans on the Atari game of Bowling.
Deep Reinforcement Learning for Time Optimal Velocity Control using Prior Knowledge
TLDR
It is shown that the reinforcement learner outperforms the numerically derived solution, and that the hybrid approach (combining learning with the numerical solution) speeds up the training process.
Doubly Robust Off-policy Value Evaluation for Reinforcement Learning
TLDR
This work extends the doubly robust estimator for bandits to sequential decision-making problems, which gets the best of both worlds: it is guaranteed to be unbiased and can have a much lower variance than the popular importance sampling estimators.
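
The sequential doubly robust estimator referenced here is typically computed as a backward recursion over a trajectory; the sketch below is a generic version of that recursion, with Q_hat and V_hat standing for an approximate model of the action-value and state-value functions.

```python
def doubly_robust_value(trajectory, rhos, Q_hat, V_hat, gamma):
    # Backward recursion for the doubly robust off-policy value estimate:
    #   V_DR = V_hat(s) + rho * (r + gamma * V_DR_next - Q_hat(s, a))
    # trajectory is a list of (state, action, reward); rhos[t] is the
    # per-step importance ratio pi(a_t|s_t) / b(a_t|s_t).
    v_dr = 0.0
    for t in reversed(range(len(trajectory))):
        s, a, r = trajectory[t]
        v_dr = V_hat(s) + rhos[t] * (r + gamma * v_dr - Q_hat(s, a))
    return v_dr
```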
Reinforcement Learning from Demonstration through Shaping
TLDR
This paper investigates the intersection of reinforcement learning and expert demonstrations, leveraging the theoretical guarantees provided by reinforcement learning, and using expert demonstrations to speed up this learning by biasing exploration through a process called reward shaping.
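
The shaping process mentioned here is potential-based reward shaping; a generic sketch is below, where the potential function is assumed (not taken from the paper) to score how similar a state is to the expert demonstrations.

```python
def shaped_reward(reward, state, next_state, potential, gamma):
    # Potential-based shaping: adding gamma*Phi(s') - Phi(s) to the reward
    # biases exploration toward high-potential states (e.g. states that
    # resemble the demonstrations) without changing the optimal policy.
    return reward + gamma * potential(next_state) - potential(state)
```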
Policy Shaping: Integrating Human Feedback with Reinforcement Learning
TLDR
This paper introduces Advise, a Bayesian approach that attempts to maximize the information gained from human feedback by utilizing it as direct policy labels and shows that it can outperform state-of-the-art approaches and is robust to infrequent and inconsistent human feedback.
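
A rough sketch of the Advise-style combination is below: human feedback is summarized per action as delta = (number of "right" labels minus "wrong" labels), turned into a probability of optimality under an assumed feedback consistency C, and multiplied into the agent's own policy. The exact policy representation in the paper may differ.

```python
import numpy as np

def advise_policy(q_values, deltas, consistency, temperature=1.0):
    # Agent's Boltzmann policy over its Q-value estimates.
    prefs = np.exp(np.asarray(q_values, dtype=float) / temperature)
    pi_rl = prefs / prefs.sum()
    # Feedback policy: P(a is optimal) = C^d / (C^d + (1 - C)^d),
    # where d = deltas[a] and C is the assumed human consistency.
    d = np.asarray(deltas, dtype=float)
    pi_fb = consistency**d / (consistency**d + (1.0 - consistency)**d)
    # Combine by multiplying and renormalizing.
    combined = pi_rl * pi_fb
    return combined / combined.sum()
```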