Near-Optimal Goal-Oriented Reinforcement Learning in Non-Stationary Environments

@article{Chen2022NearOptimalGR,
  title={Near-Optimal Goal-Oriented Reinforcement Learning in Non-Stationary Environments},
  author={Liyu Chen and Haipeng Luo},
  journal={ArXiv},
  year={2022},
  volume={abs/2205.13044}
}
We initiate the study of dynamic regret minimization for goal-oriented reinforcement learning modeled by a non-stationary stochastic shortest path problem with changing cost and transition functions. We start by establishing a lower bound $\Omega\big((B_\star S A T_\star (\Delta_c + B_\star^2 \Delta_P))^{1/3} K^{2/3}\big)$, where $B_\star$ is the maximum expected cost of the optimal policy of any episode starting from any state, $T_\star$ is the maximum hitting time of the optimal policy of any episode starting from the initial state, $SA$ is…

Reaching Goals is Hard: Settling the Sample Complexity of the Stochastic Shortest Path

It is shown that there exists a worst-case SSP instance with $S$ states, $A$ actions, minimum cost $c_{\min}$, and maximum expected cost of the optimal policy over all states $B_\star$, and it is proved that horizon-free regret is impossible in SSPs under general costs, resolving an open problem.

A Unified Algorithm for Stochastic Path Problems

The first regret guarantees for this general problem are provided by analyzing a simple optimistic algorithm, and the regret bound matches the best known results for the well-studied special case of stochastic shortest path (SSP) with all non-positive rewards.

Nonstationary Reinforcement Learning with Linear Function Approximation

This work develops the first dynamic regret analysis for nonstationary reinforcement learning with linear function approximation in episodic Markov decision processes under a drifting environment, and proposes a parameter-free algorithm that works without knowing the variation budgets, at the cost of a slightly worse dynamic regret bound.

References

Near-Optimal Model-Free Reinforcement Learning in Non-Stationary Episodic MDPs

Restarted Q-Learning with Upper Confidence Bounds (RestartQ-UCB), the first model-free algorithm for non-stationary RL, is proposed, and it is shown to outperform existing solutions in terms of dynamic regret.

Is Reinforcement Learning More Difficult Than Bandits? A Near-optimal Algorithm Escaping the Curse of Horizon

The current paper shows that the long planning horizon and the unknown state-dependent transitions (at most) pose little additional difficulty on sample complexity, and improves the state-of-the-art polynomial-time algorithms.

Non-stationary Reinforcement Learning without Prior Knowledge: An Optimal Black-box Approach

We propose a black-box reduction that turns a certain reinforcement learning algorithm with optimal regret in a (near-)stationary environment into another algorithm with optimal dynamic regret in a

Minimax Regret for Stochastic Shortest Path

An algorithm is provided for the finite-horizon setting whose leading term in the regret depends polynomially on the expected cost of the optimal policy and only logarithmically on the horizon. The algorithm is based on a novel reduction from SSP to finite-horizon MDPs.

Combinatorial Semi-Bandit in the Non-Stationary Environment

A parameter-free algorithm is designed that achieves nearly optimal regret both in the switching case and in the dynamic case without knowing the parameters in advance.

Bandit Algorithms

sets of environments and policies respectively and $\ell : \mathcal{E} \times \Pi \to [0, 1]$ a bounded loss function. Given a policy $\pi$ let $\ell(\pi) = (\ell(\nu_1, \pi), \ldots, \ell(\nu_N, \pi))$ be the loss vector resulting from policy $\pi$.

Test 2 is the same as the third test of Algorithm 4, which guards the magnitude


Policy Optimization for Stochastic Shortest Path

This work begins the study of policy optimization for the stochastic shortest path (SSP) problem, a goal-oriented reinforcement learning model that strictly generalizes the finite-horizon model and better captures many applications.

Improved No-Regret Algorithms for Stochastic Shortest Path with Linear MDP

We introduce two new no-regret algorithms for the stochastic shortest path (SSP) problem with a linear MDP that significantly improve over the only existing results of (Vial et al., 2021). Our first

Implicit Finite-Horizon Approximation and Efficient Optimal Algorithms for Stochastic Shortest Path

A generic template for developing regret minimization algorithms in the Stochastic Shortest Path (SSP) model is introduced, which achieves minimax optimal regret as long as certain properties are ensured, and two new algorithms are developed, both computationally more efficient than all existing algorithms.