• Corpus ID: 11871994

Relax but stay in control: from value to algorithms for online Markov decision processes

  • Peng Guan, Maxim Raginsky, Rebecca M. Willett
Online learning algorithms are designed to perform in non-stationary environments, but generally there is no notion of a dynamic state to model constraints on current and future actions as a function of past actions. State-based models are common in stochastic control settings, but commonly used frameworks such as Markov Decision Processes (MDPs) assume a known stationary environment. In recent years, there has been a growing interest in combining the above two frameworks and considering an MDP… 

From minimax value to low-regret algorithms for online Markov decision processes

This paper builds on recent results of Rakhlin et al. to give a general framework for deriving algorithms in an MDP setting with arbitrarily changing costs that leads to a unifying view of existing methods and provides a general procedure for constructing new ones.

Topics in Online Markov Decision Processes

A doctoral dissertation by Peng Guan, Department of Electrical and Computer Engineering, Duke University.



Online Learning in Markov Decision Processes with Changing Cost Sequences

This paper considers online learning in finite Markov decision processes (MDPs) with changing cost sequences under full and bandit information, proposing two methods for this problem: MD² (mirror descent with approximate projections) and the continuous exponential weights algorithm with Dikin walks.
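The multiplicative-weight idea underlying both methods can be illustrated with the classical full-information exponential-weights baseline. This is a hedged sketch, not the paper's MD² or Dikin-walk algorithm: `cost_sequence` and `eta` are assumed inputs, with per-action losses in [0, 1].

```python
import math

def exponential_weights(n_actions, cost_sequence, eta=0.1):
    """Classical exponential-weights sketch: keep a weight per action and
    reweight by the exponentiated negative cost after each round.
    Illustrative only; not the paper's MD^2 or Dikin-walk method."""
    weights = [1.0] * n_actions
    total_cost = 0.0
    for costs in cost_sequence:  # costs: per-action losses in [0, 1]
        z = sum(weights)
        probs = [w / z for w in weights]
        # expected cost of the randomized play this round
        total_cost += sum(p * c for p, c in zip(probs, costs))
        # multiplicative update: good actions keep their weight
        weights = [w * math.exp(-eta * c) for w, c in zip(weights, costs)]
    return total_cost, weights
```

With a fixed cost sequence that always favors one action, the weights concentrate on it and the cumulative expected cost stays bounded, which is the mechanism behind the O(√T)-type regret guarantees these papers refine.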

Online Markov Decision Processes Under Bandit Feedback

It is shown that after T time steps, the expected regret of this algorithm (more precisely, a slightly modified version thereof) is O(√T ln T), giving the first rigorously proven, essentially tight regret bound for the problem.

Markov Decision Processes with Arbitrary Reward Processes

An efficient online algorithm is presented that ensures that the agent's average performance loss vanishes over time, provided that the environment is oblivious to the agent's actions.

Markov Decision Processes: Discrete Stochastic Dynamic Programming

  • M. Puterman
  • Computer Science
    Wiley Series in Probability and Statistics
  • 1994
Markov Decision Processes covers recent research advances in such areas as countable state space models with average reward criterion, constrained models, and models with risk sensitive optimality criteria, and explores several topics that have received little or no attention in other books.

Online learning in episodic Markovian decision processes by relative entropy policy search

A variant of the recently proposed Relative Entropy Policy Search algorithm is described, and it is shown that its regret after T episodes is 2√(L|X||A|T log(|X||A|/L)) in the bandit setting and 2L√(T log(|X||A|/L)) in the full information setting, given that the learner has perfect knowledge of the transition probabilities of the underlying MDP.

Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions

It is shown that designing efficient algorithms for the adversarial online shortest path problem is as hard as learning parity with noise, a notoriously difficult problem that has been used to design efficient cryptographic schemes.

Discrete-time controlled Markov processes with average cost criterion: a survey

This work is a survey of the average cost control problem for discrete-time Markov processes. The authors have attempted to put together a comprehensive account of the considerable research on this problem.


Q-learning

This paper presents and proves in detail a convergence theorem for Q-learning based on that outlined in Watkins (1989), showing that Q-learning converges to the optimum action-values with probability 1 so long as all actions are repeatedly sampled in all states and the action-values are represented discretely.
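The two convergence conditions in this abstract, repeated sampling of all state-action pairs and discretely represented values, are easy to see in a minimal tabular sketch. This is an illustrative implementation under assumed inputs (a caller-supplied `step(s, a) -> (reward, next_state)` transition function), not the construction used in the convergence proof.

```python
import random

def q_learning(n_states, n_actions, step, episodes=2000,
               gamma=0.9, alpha=0.1, eps=0.2, seed=0):
    """Minimal tabular Q-learning sketch. Epsilon-greedy exploration keeps
    all actions repeatedly sampled; the Q-table is the discrete value
    representation required by the convergence theorem."""
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    s = 0
    for _ in range(episodes):
        # epsilon-greedy: explore with probability eps, else act greedily
        if rng.random() < eps:
            a = rng.randrange(n_actions)
        else:
            a = max(range(n_actions), key=lambda x: Q[s][x])
        r, s2 = step(s, a)
        # standard update toward the Bellman optimality target
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2
    return Q
```

On a one-state environment where action 1 pays reward 1 and action 0 pays nothing, the table converges near the discounted optima Q*(0,1) = 1/(1-γ) = 10 and Q*(0,0) = γ·10 = 9.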

Planning in the Presence of Cost Functions Controlled by an Adversary

This work investigates methods for planning in a Markov decision process where the cost function is chosen by an adversary after the planner fixes its policy, and develops efficient algorithms for matrix games where such best-response oracles exist.
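The best-response-oracle idea can be illustrated with fictitious play on a small zero-sum matrix game: each side repeatedly best-responds to the other's empirical play. This is a hedged sketch of the oracle concept only, not the paper's double-oracle algorithm; the payoff matrix `A` is an assumed input.

```python
def fictitious_play(A, rounds=5000):
    """Fictitious play for a zero-sum matrix game. A[i][j] is the cost
    paid by the row player (row minimizes, column maximizes). Each round,
    both players best-respond to the opponent's empirical mixture; the
    empirical frequencies converge to an equilibrium (Robinson, 1951)."""
    m, n = len(A), len(A[0])
    row_counts = [0] * m  # how often each row action was played
    col_counts = [0] * n  # how often each column (adversary) action was played
    row_counts[0] += 1
    col_counts[0] += 1
    for _ in range(rounds):
        # row player: minimize expected cost against column empirics
        i = min(range(m),
                key=lambda i: sum(col_counts[j] * A[i][j] for j in range(n)))
        # adversary: maximize expected cost against row empirics
        j = max(range(n),
                key=lambda j: sum(row_counts[i] * A[i][j] for i in range(m)))
        row_counts[i] += 1
        col_counts[j] += 1
    return ([c / sum(row_counts) for c in row_counts],
            [c / sum(col_counts) for c in col_counts])
```

On matching pennies both empirical mixtures drift toward the (1/2, 1/2) equilibrium, the same fixed point a best-response oracle would be queried against in the double-oracle construction.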

Deterministic MDPs with Adversarial Rewards and Bandit Feedback

Under mild assumptions on the structure of the transition dynamics, it is proved that MarcoPolo enjoys a regret of O(T^(3/4) √(log T)) against the best deterministic policy in hindsight.