Corpus ID: 236429138

Model-Free Non-Stationary RL: Near-Optimal Regret and Applications in Multi-Agent RL and Inventory Control

  • Weichao Mao, Kaiqing Zhang, Ruihao Zhu, David Simchi-Levi, Tamer Başar
We consider model-free reinforcement learning (RL) in non-stationary Markov decision processes. Both the reward functions and the state transition functions are allowed to vary arbitrarily over time as long as their cumulative variations do not exceed certain variation budgets. We propose Restarted Q-Learning with Upper Confidence Bounds (RestartQ-UCB), the first model-free algorithm for non-stationary RL, and show that it outperforms existing solutions in terms of dynamic regret. Specifically…
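The restart-plus-optimism idea behind RestartQ-UCB can be conveyed with a minimal sketch: tabular Q-learning with an optimism bonus, where the learner periodically discards its estimates so that stale data cannot mask environment drift. Everything below is illustrative rather than the paper's exact construction: the fixed restart period `restart_every`, the bonus constant `c`, and the toy drifting environment are assumptions.

```python
import numpy as np

def make_drifting_env(switch_step):
    """Toy 2-state, 2-action environment whose best action flips mid-run."""
    state = {"t": 0}
    def step(s, a, rng):
        state["t"] += 1
        best = 0 if state["t"] < switch_step else 1  # non-stationarity: optimum drifts
        p = 0.9 if a == best else 0.1
        return (s + a) % 2, float(rng.random() < p)
    return step

def restart_q_ucb(env_step, S, A, H, K, restart_every, c=1.0):
    """Q-learning with a UCB-style bonus, restarted periodically.

    Sketch only: restart_every and c are placeholders, not the schedule
    and constants derived in the paper.
    """
    rng = np.random.default_rng(0)
    total_reward = 0.0
    for k in range(K):
        if k % restart_every == 0:                 # restart: forget the past
            Q = np.full((H, S, A), float(H))       # optimistic initialization
            N = np.zeros((H, S, A), dtype=int)
        s = 0
        for h in range(H):
            a = int(np.argmax(Q[h, s]))            # greedy w.r.t. optimistic Q
            s2, r = env_step(s, a, rng)
            total_reward += r
            N[h, s, a] += 1
            n = N[h, s, a]
            alpha = (H + 1) / (H + n)              # step size decaying in n
            bonus = c * np.sqrt(H**3 / n)          # optimism bonus (illustrative)
            v_next = Q[h + 1, s2].max() if h + 1 < H else 0.0
            target = r + v_next + bonus
            Q[h, s, a] = min((1 - alpha) * Q[h, s, a] + alpha * target, float(H))
            s = s2
    return total_reward

env = make_drifting_env(switch_step=500)
total = restart_q_ucb(env, S=2, A=2, H=5, K=200, restart_every=50)
```

In the paper, the restart schedule is tuned to the variation budgets; the fixed period used here only conveys the mechanism.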



Reinforcement Learning for Non-Stationary Markov Decision Processes: The Blessing of (More) Optimism
This work develops the Sliding Window Upper Confidence bound for Reinforcement Learning with Confidence Widening (SWUCRL2-CW) algorithm and establishes its dynamic regret bound when the variation budgets are known, and proposes the Bandit-over-Reinforcement Learning (BORL) algorithm to adaptively tune SWUCRL2-CW and achieve the same dynamic regret bound without that knowledge.
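The sliding-window idea at the heart of SWUCRL2-CW is easiest to see in a bandit analogue: only the most recent W observations feed the confidence bounds, so estimates can track a drifting environment. This is a hedged sketch, not the SWUCRL2-CW algorithm itself; the window size W, the bonus constant c, and the toy drifting arm are assumptions.

```python
from collections import deque
import math, random

def sliding_window_ucb(pull, n_arms, T, W, c=1.0):
    """Sliding-window UCB: indices use only the last W observations.

    Bandit analogue of the sliding-window idea; W and c are placeholders,
    not tuned to any variation budget.
    """
    history = deque()                      # (arm, reward) pairs, newest last
    rewards = 0.0
    for t in range(1, T + 1):
        counts = [0] * n_arms
        sums = [0.0] * n_arms
        for arm, r in history:             # statistics from the window only
            counts[arm] += 1
            sums[arm] += r
        untried = [a for a in range(n_arms) if counts[a] == 0]
        if untried:
            a = untried[0]                 # play any arm absent from the window
        else:
            a = max(range(n_arms),
                    key=lambda i: sums[i] / counts[i]
                    + c * math.sqrt(math.log(min(t, W)) / counts[i]))
        r = pull(a, t)
        rewards += r
        history.append((a, r))
        if len(history) > W:
            history.popleft()              # forget stale observations
    return rewards

random.seed(1)
def drifting_pull(arm, t):
    best = 0 if t <= 1000 else 1           # best arm switches at t = 1000
    return 1.0 if random.random() < (0.8 if arm == best else 0.2) else 0.0

total = sliding_window_ucb(drifting_pull, n_arms=2, T=2000, W=300)
```

A plain UCB index over all past data would keep favoring arm 0 long after the switch; the window is what lets the index recover.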
Nonstationary Reinforcement Learning with Linear Function Approximation
This work develops the first dynamic regret analysis for nonstationary reinforcement learning with linear function approximation, studying episodic Markov decision processes in a drifting environment, and proposes a parameter-free algorithm that works without knowing the variation budgets at the cost of a slightly worse dynamic regret bound.
Online Markov Decision Processes Under Bandit Feedback
It is shown that after T time steps, the expected regret of this algorithm (more precisely, of a slightly modified version thereof) is O(√T ln T), giving the first rigorously proven, essentially tight regret bound for the problem.
Near-optimal Regret Bounds for Reinforcement Learning
This work presents a reinforcement learning algorithm with total regret O(DS√(AT)) after T steps for any unknown MDP with S states, A actions per state, and diameter D. The diameter is a new parameter: an MDP has diameter D if, for any pair of states s, s′, there is a policy that moves from s to s′ in at most D steps on average.
Dynamic Regret of Policy Optimization in Non-stationary Environments
This work proposes two model-free policy optimization algorithms, POWER and POWER++, and establishes guarantees for their dynamic regret, showing that POWER++ improves over POWER on the second component of the dynamic regret by actively adapting to non-stationarity through prediction.
Online algorithms for the multi-armed bandit problem with Markovian rewards
  • Cem Tekin, M. Liu
  • Computer Science, Mathematics
  • 2010 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton)
  • 2010
It is shown that, under certain conditions on the state transition probabilities of the arms, a sample-mean-based index policy achieves logarithmic regret uniformly over the total number of trials.
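The sample-mean index policy referenced above follows the familiar UCB1 pattern: play the arm maximizing its empirical mean plus an exploration bonus that shrinks with the play count. The sketch below uses i.i.d. Bernoulli arms for simplicity; the paper's Markovian-reward setting imposes extra conditions on the arms' transition probabilities that this toy ignores.

```python
import math, random

def ucb1(pull, n_arms, T):
    """UCB1-style sample-mean index policy on i.i.d. arms (illustrative)."""
    counts = [0] * n_arms
    means = [0.0] * n_arms
    total = 0.0
    for t in range(1, T + 1):
        if t <= n_arms:
            a = t - 1                      # initialization: play each arm once
        else:
            a = max(range(n_arms),
                    key=lambda i: means[i]
                    + math.sqrt(2 * math.log(t) / counts[i]))  # index = mean + bonus
        r = pull(a)
        total += r
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]   # incremental sample mean
    return counts, total

random.seed(0)
# two Bernoulli arms with means 0.3 and 0.7; arm 1 is optimal
counts, total = ucb1(lambda a: float(random.random() < [0.3, 0.7][a]), 2, 2000)
```

Because the bonus decays as the play count grows, suboptimal arms are pulled only O(log T) times, which is where the logarithmic regret comes from.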
Is Q-learning Provably Efficient?
Model-free reinforcement learning (RL) algorithms, such as Q-learning, directly parameterize and update value functions or policies without explicitly modeling the environment. They are typically…
Decentralized Q-Learning in Zero-sum Markov Games
A radically uncoupled Q-learning dynamics that is both rational and convergent: the dynamics converges to the best response to the opponent’s strategy when the opponent follows an asymptotically stationary strategy; the value function estimates converge to the payoffs at a Nash equilibrium when both agents adopt the dynamics.
Stochastic Multi-Armed-Bandit Problem with Non-stationary Rewards
This paper fully characterizes the (regret) complexity of this class of MAB problems by establishing a direct link between the extent of allowable reward variation and the minimal achievable regret, and by establishing a connection between the adversarial and the stochastic MAB frameworks.
Learning in Structured MDPs with Convex Cost Functions: Improved Regret Bounds for Inventory Management
The main contribution is a learning algorithm with a regret bound of Õ(L√T + D) for the inventory control problem, significantly improving the previously best known regret bounds for this problem, where the dependence on L was exponential and many further assumptions on the demand distribution were required.