A Unified Algorithm for Stochastic Path Problems

Christoph Dann, Chen-Yu Wei, Julian Zimmert
We study reinforcement learning in stochastic path (SP) problems. The goal in these problems is to maximize the expected sum of rewards until the agent reaches a terminal state. We provide the first regret guarantees in this general problem by analyzing a simple optimistic algorithm. Our regret bound matches the best known results for the well-studied special case of stochastic shortest path (SSP) with all non-positive rewards. For SSP, we present an adaptation procedure for the case when the… 
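The abstract describes an optimistic model-based approach: plan greedily in an empirical model inflated by an exploration bonus, with the terminal state's value pinned to zero. As a minimal illustration (not the paper's actual algorithm), the sketch below runs optimistic value iteration on a toy tabular model; the Hoeffding-style bonus shape, the toy MDP, and the assumption that value iteration converges (i.e., the model is proper) are all assumptions for illustration.

```python
import numpy as np

def optimistic_value_iteration(P_hat, r_hat, bonus, goal, n_iter=1000, tol=1e-8):
    """Optimistic value iteration on an estimated path model (illustrative sketch).

    P_hat: (S, A, S) empirical transition probabilities
    r_hat: (S, A) empirical mean rewards
    bonus: (S, A) optimism bonus added to rewards (e.g., Hoeffding-style)
    goal:  index of the terminal state, whose value is fixed at 0
    Assumes the induced optimal values are bounded (proper model).
    """
    S, A = r_hat.shape
    V = np.zeros(S)
    for _ in range(n_iter):
        # Optimistic Bellman backup: Q(s,a) = r(s,a) + bonus(s,a) + E[V(s')]
        Q = r_hat + bonus + P_hat @ V   # (S, A, S) @ (S,) -> (S, A)
        V_new = Q.max(axis=1)
        V_new[goal] = 0.0               # terminal state earns nothing further
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    return V, Q.argmax(axis=1)

# Toy 3-state chain: 0 -> 1 -> goal (state 2); action 0 self-loops, action 1 advances.
P = np.zeros((3, 2, 3))
P[0, 0, 0] = 1.0; P[0, 1, 1] = 1.0
P[1, 0, 1] = 1.0; P[1, 1, 2] = 1.0
P[2, :, 2] = 1.0
r = -np.ones((3, 2)); r[2, :] = 0.0     # unit cost per step (SSP-style rewards)
V, pi = optimistic_value_iteration(P, r, np.zeros((3, 2)), goal=2)
# V recovers the negated hitting times: [-2, -1, 0]
```

With zero bonus this is just value iteration on the empirical model; in an optimistic algorithm the bonus shrinks with visit counts, so under-explored state-action pairs look artificially rewarding and get visited.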


Near-optimal Regret Bounds for Stochastic Shortest Path

This work gives an algorithm that guarantees a regret bound of $\widetilde{O}(B_\star |S| \sqrt{|A| K})$ and shows that any learning algorithm must incur at least $\Omega(B_\star \sqrt{|S||A|K})$ regret in the worst case.

Policy Optimization for Stochastic Shortest Path

This work begins the study of policy optimization for the stochastic shortest path (SSP) problem, a goal-oriented reinforcement learning model that strictly generalizes the finite-horizon model and better captures many applications.

Minimax Regret for Stochastic Shortest Path

An algorithm is provided for the finite-horizon setting whose leading regret term depends polynomially on the expected cost of the optimal policy and only logarithmically on the horizon; the algorithm is based on a novel reduction from SSP to finite-horizon MDPs.

Online Learning for Stochastic Shortest Path Model via Posterior Sampling

This work proposes PSRL-SSP, a simple posterior sampling-based reinforcement learning algorithm for the SSP problem, which is the first such posterior sampling algorithm and outperforms numerically previously proposed optimism-based algorithms.

Stochastic Shortest Path: Minimax, Parameter-Free and Towards Horizon-Free Regret

It is proved that EB-SSP achieves the minimax regret rate, closing the gap with the lower bound, and obtains the first horizon-free regret bound beyond the finite-horizon MDP setting.

Improved No-Regret Algorithms for Stochastic Shortest Path with Linear MDP

We introduce two new no-regret algorithms for the stochastic shortest path (SSP) problem with a linear MDP that significantly improve over the only existing results of Vial et al. (2021).

Learning Stochastic Shortest Path with Linear Function Approximation

A novel algorithm with Hoeffding-type confidence sets for learning the linear mixture SSP, which provably achieves a near-optimal regret guarantee, together with a lower bound of $\Omega(d B_\star \sqrt{K})$.

Near-Optimal Goal-Oriented Reinforcement Learning in Non-Stationary Environments

A lower bound is established for dynamic regret minimization for goal-oriented reinforcement learning modeled by a non-stationary stochastic shortest path problem with changing cost and transition functions and algorithms are developed that estimate costs and transitions separately.

No-Regret Exploration in Goal-Oriented Reinforcement Learning

UC-SSP is introduced, the first no-regret algorithm in this setting, with a proven regret bound of $\widetilde{\mathcal{O}}(D S \sqrt{A D K})$ for any unknown SSP with $S$ states, $A$ actions, positive costs, and SSP-diameter $D$, defined as the smallest expected hitting time from any starting state to the goal.

Reinforcement Learning with Trajectory Feedback

This work extends reinforcement learning algorithms to this setting based on least-squares estimation of the unknown reward, for both known and unknown transition models, and studies the performance of these algorithms by analyzing their regret.