# A Unified Algorithm for Stochastic Path Problems

```bibtex
@article{Dann2022AUA,
  title   = {A Unified Algorithm for Stochastic Path Problems},
  author  = {Christoph Dann and Chen-Yu Wei and Julian Zimmert},
  journal = {ArXiv},
  year    = {2022},
  volume  = {abs/2210.09255}
}
```

We study reinforcement learning in stochastic path (SP) problems. The goal in these problems is to maximize the expected sum of rewards until the agent reaches a terminal state. We provide the first regret guarantees in this general problem by analyzing a simple optimistic algorithm. Our regret bound matches the best known results for the well-studied special case of stochastic shortest path (SSP) with all non-positive rewards. For SSP, we present an adaptation procedure for the case when the…
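To make the SP objective concrete, the following is a minimal sketch (not the paper's algorithm) of value iteration on a toy SSP instance with non-positive rewards: the agent maximizes the expected total reward until it reaches a terminal goal state. All transition probabilities and rewards below are illustrative assumptions.

```python
import numpy as np

# Toy SSP: 2 non-terminal states plus a terminal goal state g.
# Rewards are non-positive (costs); the agent maximizes the expected
# total reward until reaching g. Numbers are illustrative only.
S, A = 2, 2                      # non-terminal states, actions
# P[s, a, s'] = probability of moving to non-terminal state s';
# the remaining mass 1 - sum_{s'} P[s, a, s'] goes to the goal g.
P = np.array([
    [[0.6, 0.3], [0.1, 0.1]],    # transitions from state 0
    [[0.2, 0.5], [0.0, 0.2]],    # transitions from state 1
])
r = np.array([
    [-1.0, -2.0],                # rewards (negative costs) in state 0
    [-1.0, -4.0],                # rewards in state 1
])

# Value iteration for the SSP objective: V(g) = 0 and
# V(s) = max_a [ r(s, a) + sum_{s'} P(s, a, s') V(s') ].
V = np.zeros(S)
for _ in range(1000):
    Q = r + (P * V[None, None, :]).sum(axis=2)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-10:
        V = V_new
        break
    V = V_new

pi = Q.argmax(axis=1)            # greedy policy w.r.t. the converged values
print(V, pi)
```

Because every action leaves positive probability of reaching the goal, the Bellman operator here is a contraction and the iteration converges; the regret analyses in the references below concern *learning* such a policy when `P` and `r` are unknown.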

## References

Showing 1–10 of 28 references.

### Near-optimal Regret Bounds for Stochastic Shortest Path

- Computer Science, ICML
- 2020

This work gives an algorithm that guarantees a regret bound of $\widetilde{O}(B_\star |S| \sqrt{|A| K})$ and shows that any learning algorithm must incur regret of at least $\Omega(\cdot)$ in the worst case.

### Policy Optimization for Stochastic Shortest Path

- Computer Science, COLT
- 2022

This work begins the study of policy optimization for the stochastic shortest path (SSP) problem, a goal-oriented reinforcement learning model that strictly generalizes the finite-horizon model and better captures many applications.

### Minimax Regret for Stochastic Shortest Path

- Computer Science, NeurIPS
- 2021

An algorithm is provided whose leading regret term depends polynomially on the expected cost of the optimal policy and only logarithmically on the horizon; the algorithm is based on a novel reduction from SSP to finite-horizon MDPs.

### Online Learning for Stochastic Shortest Path Model via Posterior Sampling

- Computer Science, ArXiv
- 2021

This work proposes PSRL-SSP, a simple posterior sampling-based reinforcement learning algorithm for the SSP problem; it is the first such posterior sampling algorithm and numerically outperforms previously proposed optimism-based algorithms.

### Stochastic Shortest Path: Minimax, Parameter-Free and Towards Horizon-Free Regret

- Computer Science, NeurIPS
- 2021

It is proved that EB-SSP achieves the minimax regret rate, closing the gap with the lower bound and yielding the first horizon-free regret bound beyond the finite-horizon MDP setting.

### Improved No-Regret Algorithms for Stochastic Shortest Path with Linear MDP

- Computer Science, ICML
- 2022

We introduce two new no-regret algorithms for the stochastic shortest path (SSP) problem with a linear MDP that significantly improve over the only existing results of (Vial et al., 2021). Our first…

### Learning Stochastic Shortest Path with Linear Function Approximation

- Computer Science, ICML
- 2022

A novel algorithm with Hoeffding-type confidence sets for learning the linear mixture SSP, which provably achieves a near-optimal regret guarantee, together with a proven lower bound of $\Omega(d B_\star \sqrt{K})$.

### Near-Optimal Goal-Oriented Reinforcement Learning in Non-Stationary Environments

- Computer Science, ArXiv
- 2022

A lower bound is established for dynamic regret minimization in goal-oriented reinforcement learning, modeled as a non-stationary stochastic shortest path problem with changing cost and transition functions. Algorithms are developed that estimate costs and transitions separately.

### No-Regret Exploration in Goal-Oriented Reinforcement Learning

- Computer Science, ICML
- 2020

UC-SSP is introduced, the first no-regret algorithm in this setting, with a proven regret bound scaling as $\widetilde{\mathcal{O}}(D S \sqrt{A D K})$ after $K$ episodes in any unknown SSP with $S$ states, $A$ actions, positive costs, and SSP-diameter $D$, defined as the smallest expected hitting time from any starting state to the goal.

### Reinforcement Learning with Trajectory Feedback

- Computer Science, AAAI
- 2021

This work extends reinforcement learning algorithms to the trajectory-feedback setting, based on least-squares estimation of the unknown reward, for both the known and unknown transition model cases, and studies the performance of these algorithms by analyzing their regret.