A Unified Algorithm for Stochastic Path Problems
@article{Dann2022AUA,
  title   = {A Unified Algorithm for Stochastic Path Problems},
  author  = {Christoph Dann and Chen-Yu Wei and Julian Zimmert},
  journal = {ArXiv},
  year    = {2022},
  volume  = {abs/2210.09255}
}
We study reinforcement learning in stochastic path (SP) problems. The goal in these problems is to maximize the expected sum of rewards until the agent reaches a terminal state. We provide the first regret guarantees in this general problem by analyzing a simple optimistic algorithm. Our regret bound matches the best known results for the well-studied special case of stochastic shortest path (SSP) with all non-positive rewards. For SSP, we present an adaptation procedure for the case when the…
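To make the objective precise, the SP setting described above can be formalized as follows (a standard formulation with assumed notation, not taken verbatim from the paper): starting from an initial state, the agent acts until it reaches the terminal state $g$, and regret compares its return to that of the best proper policy.
\[
V^\pi(s) \;=\; \mathbb{E}\Big[\textstyle\sum_{t=1}^{\tau} r(s_t, a_t) \,\Big|\, s_1 = s,\; \pi\Big],
\qquad
\tau \;=\; \min\{t \ge 1 : s_t = g\},
\]
\[
\mathrm{Reg}_K \;=\; \sum_{k=1}^{K} \Big( V^{\pi^\star}(s_{\mathrm{init}}) - V^{\pi_k}(s_{\mathrm{init}}) \Big),
\]
where $\pi_k$ is the policy played in episode $k$ and $\pi^\star$ maximizes $V^\pi(s_{\mathrm{init}})$ among policies that reach $g$ with probability one.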
References
Showing 1-10 of 28 references
Near-optimal Regret Bounds for Stochastic Shortest Path
- ICML, 2020
This work gives an algorithm that guarantees a regret bound of $\widetilde{O}(B_\star |S| \sqrt{|A| K})$ and shows that any learning algorithm must have at least $\Omega(B_\star \sqrt{|S| |A| K})$ regret in the worst case.
Policy Optimization for Stochastic Shortest Path
- COLT, 2022
This work begins the study of policy optimization for the stochastic shortest path (SSP) problem, a goal-oriented reinforcement learning model that strictly generalizes the finite-horizon model and better captures many applications.
Minimax Regret for Stochastic Shortest Path
- NeurIPS, 2021
An algorithm is provided whose leading regret term depends polynomially on the expected cost of the optimal policy and only logarithmically on the horizon; the algorithm is based on a novel reduction from SSP to finite-horizon MDPs.
Online Learning for Stochastic Shortest Path Model via Posterior Sampling
- arXiv, 2021
This work proposes PSRL-SSP, a simple posterior sampling-based reinforcement learning algorithm for the SSP problem; it is the first such posterior sampling algorithm and numerically outperforms previously proposed optimism-based algorithms.
Stochastic Shortest Path: Minimax, Parameter-Free and Towards Horizon-Free Regret
- NeurIPS, 2021
It is proved that EB-SSP achieves the minimax regret rate, closing the gap with the lower bound and giving the first horizon-free regret bound beyond the finite-horizon MDP setting.
Improved No-Regret Algorithms for Stochastic Shortest Path with Linear MDP
- ICML, 2022
We introduce two new no-regret algorithms for the stochastic shortest path (SSP) problem with a linear MDP that significantly improve over the only existing results of Vial et al. (2021). Our first…
Learning Stochastic Shortest Path with Linear Function Approximation
- ICML, 2022
A novel algorithm with Hoeffding-type confidence sets is given for learning the linear mixture SSP, which provably achieves a near-optimal regret guarantee; a lower bound of $\Omega(d B_\star \sqrt{K})$ is also proved.
Near-Optimal Goal-Oriented Reinforcement Learning in Non-Stationary Environments
- arXiv, 2022
A lower bound is established for dynamic regret minimization in goal-oriented reinforcement learning, modeled as a non-stationary stochastic shortest path problem with changing cost and transition functions, and algorithms are developed that estimate costs and transitions separately.
No-Regret Exploration in Goal-Oriented Reinforcement Learning
- ICML, 2020
UC-SSP is introduced, the first no-regret algorithm in this setting, and a regret bound scaling as $\widetilde{\mathcal{O}}(DS\sqrt{ADK})$ is proved after $K$ episodes in any unknown SSP with $S$ states, $A$ actions, positive costs, and SSP-diameter $D$, defined as the smallest expected hitting time from any starting state to the goal.
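For reference, the SSP-diameter used in this bound admits a simple formalization (standard notation assumed here, not quoted from the paper):
\[
D \;=\; \max_{s \in S} \, \min_{\pi} \, \mathbb{E}\big[\tau^\pi(s \to g)\big],
\]
where $\tau^\pi(s \to g)$ is the hitting time of the goal $g$ when policy $\pi$ is run from state $s$.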
Reinforcement Learning with Trajectory Feedback
- AAAI, 2021
This work extends reinforcement learning algorithms to the trajectory-feedback setting, where the agent observes only the cumulative reward of each trajectory, using least-squares estimation of the unknown reward for both the known and unknown transition model cases, and studies the performance of these algorithms by analyzing their regret.
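To illustrate the least-squares idea in the trajectory-feedback setting (a minimal sketch with assumed variable names, not the paper's algorithm): since only the total reward of each trajectory is observed, the per-(state, action) reward vector can be estimated by regressing observed returns on state-action visit counts.

import numpy as np

def estimate_rewards(trajectories, returns, n_states, n_actions):
    # Least-squares estimate of per-(state, action) rewards from trajectory
    # feedback: solve min_r ||Phi r - R||^2, where Phi[i, j] counts how often
    # trajectory i visits the j-th (state, action) pair and R[i] is the
    # observed cumulative reward of trajectory i.
    d = n_states * n_actions
    Phi = np.zeros((len(trajectories), d))
    for i, traj in enumerate(trajectories):
        for s, a in traj:
            Phi[i, s * n_actions + a] += 1.0  # visit-count features
    R = np.asarray(returns, dtype=float)
    r_hat, *_ = np.linalg.lstsq(Phi, R, rcond=None)  # min-norm least squares
    return r_hat.reshape(n_states, n_actions)

# Noiseless toy example with 2 states and 2 actions: the estimate recovers
# the true rewards once the visit-count matrix has full column rank.
trajs = [[(0, 0), (1, 1)], [(0, 1), (1, 0)], [(0, 0), (1, 0)], [(0, 0)]]
true_r = np.array([[1.0, 0.5], [0.2, 2.0]])
rets = [sum(true_r[s, a] for s, a in t) for t in trajs]
print(estimate_rewards(trajs, rets, 2, 2))  # approximately recovers true_r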