# Near-optimal Policy Optimization Algorithms for Learning Adversarial Linear Mixture MDPs

```bibtex
@inproceedings{He2021NearoptimalPO,
  title     = {Near-optimal Policy Optimization Algorithms for Learning Adversarial Linear Mixture MDPs},
  author    = {Jiafan He and Dongruo Zhou and Quanquan Gu},
  booktitle = {International Conference on Artificial Intelligence and Statistics},
  year      = {2021}
}
```

Learning Markov decision processes (MDPs) in the presence of an adversary is a challenging problem in reinforcement learning (RL). In this paper, we study RL in episodic MDPs with adversarial rewards and full-information feedback, where the unknown transition probability function is a linear function of a given feature mapping, and the reward function can change arbitrarily from episode to episode. We propose an optimistic policy optimization algorithm, POWERS, and show that it can achieve Õ…
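The linear mixture assumption in the abstract can be made concrete: the transition kernel is parameterized as P(s′ | s, a) = ⟨φ(s′ | s, a), θ⋆⟩ for a known feature map φ and an unknown parameter θ⋆. Below is a minimal illustrative sketch; the random features, positive parameter, and explicit normalization are assumptions for the demo (in a true linear mixture MDP each φ(· | s, a)·θ⋆ is already a valid distribution), not the paper's construction:

```python
import numpy as np

# Sketch of the linear mixture MDP model: the transition kernel is a
# linear function of a known d-dimensional feature map,
#   P(s' | s, a) = <phi(s' | s, a), theta*>,
# with theta* unknown to the learner.
rng = np.random.default_rng(0)
num_states, num_actions, d = 4, 2, 3

# Known feature map phi(s' | s, a) in R^d (illustrative random features).
phi = rng.random((num_states, num_actions, num_states, d))

# Unknown positive parameter theta* (hypothetical, for the demo).
theta_star = rng.random(d)

# Induced transition scores, normalized over next states s' so that each
# (s, a) pair yields a probability distribution.
scores = phi @ theta_star                       # shape (S, A, S')
P = scores / scores.sum(axis=-1, keepdims=True)

assert np.allclose(P.sum(axis=-1), 1.0)         # each (s, a) row sums to 1
```

The key point for the algorithm is that estimating the S·A·S-entry transition kernel reduces to estimating the single d-dimensional vector θ⋆.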

## 5 Citations

### Near-Optimal Regret for Adversarial MDP with Delayed Bandit Feedback

- Computer Science, ArXiv
- 2022

This paper presents the first algorithms that achieve near-optimal √(K + D) regret, where K is the number of episodes and D = ∑_{k=1}^K d_k is the total delay, significantly improving upon the best known regret bound of (K + D).

### Nearly Minimax Optimal Regret for Learning Infinite-horizon Average-reward MDPs with Linear Function Approximation

- Computer Science, AISTATS
- 2022

The proposed UCRL2-VTR with a Bernstein-type bonus is the first nearly minimax optimal RL algorithm with function approximation in the infinite-horizon average-reward setting, and a matching lower bound is proved, which shows that the algorithm is minimax optimal up to logarithmic factors.

### Learning Stochastic Shortest Path with Linear Function Approximation

- Computer Science, ICML
- 2022

A novel algorithm with Hoeffding-type confidence sets for learning the linear mixture SSP, which provably achieves a near-optimal regret guarantee, together with a proven lower bound of Ω(dB⋆√K).

### Refined Regret for Adversarial MDPs with Linear Function Approximation

- Computer Science, Mathematics
- 2023

Two algorithms that improve the regret to Õ(√K) in the same setting, by using a refined analysis of the Follow-the-Regularized-Leader algorithm with the log-barrier regularizer and by developing a magnitude-reduced loss estimator.

### Improved Regret for Efficient Online Reinforcement Learning with Linear Function Approximation

- Computer Science
- 2023

This work presents a computationally efficient policy optimization algorithm for the challenging general setting of unknown dynamics and bandit feedback, featuring a combination of mirror-descent and least squares policy evaluation in an auxiliary MDP used to compute exploration bonuses.
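The mirror-descent component mentioned above can be illustrated in isolation. This is a generic sketch of the KL-regularized (exponentiated-gradient) policy update that such policy-optimization methods build on, not the cited paper's exact algorithm; the tabular shapes, toy Q-values, and step size are assumptions for the demo:

```python
import numpy as np

# One mirror-descent (multiplicative-weights) policy update: with a KL
# regularizer, the solution has the closed form
#   pi_{k+1}(a | s)  ∝  pi_k(a | s) * exp(eta * Q_k(s, a)).
def mirror_descent_step(pi, Q, eta):
    """One exponentiated-gradient update of a tabular policy.

    pi : (S, A) current policy, rows sum to 1.
    Q  : (S, A) estimated action values (e.g. from least-squares evaluation).
    eta: step size.
    """
    new_pi = pi * np.exp(eta * Q)
    return new_pi / new_pi.sum(axis=1, keepdims=True)

pi = np.full((3, 2), 0.5)                       # uniform policy over 2 actions
Q = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
pi = mirror_descent_step(pi, Q, eta=0.1)
assert np.allclose(pi.sum(axis=1), 1.0)         # still a valid policy
```

After the step, each state shifts probability mass toward the action with the higher estimated value, at a rate controlled by eta.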

## References

Showing 1–10 of 48 references

### Nearly Minimax Optimal Regret for Learning Infinite-horizon Average-reward MDPs with Linear Function Approximation

- Computer Science, AISTATS
- 2022

The proposed UCRL2-VTR with a Bernstein-type bonus is the first nearly minimax optimal RL algorithm with function approximation in the infinite-horizon average-reward setting, and a matching lower bound is proved, which shows that the algorithm is minimax optimal up to logarithmic factors.

### Online Convex Optimization in Adversarial Markov Decision Processes

- Computer Science, ICML
- 2019

We consider online learning in episodic loop-free Markov decision processes (MDPs), where the loss function can change arbitrarily between episodes, and the transition function is not known to the…

### Provably Efficient Adaptive Approximate Policy Iteration

- Computer Science, ArXiv
- 2020

Model-free reinforcement learning algorithms combined with value function approximation have recently achieved impressive performance in a variety of application domains, including games and…

### Optimistic Policy Optimization with Bandit Feedback

- Computer Science, ICML
- 2020

This paper considers model-based RL in the tabular finite-horizon MDP setting with unknown transitions and bandit feedback, and proposes an optimistic trust-region policy optimization (TRPO) algorithm, establishing regret bounds for both stochastic and adversarial rewards.

### Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov Decision Processes

- Computer Science, COLT
- 2021

A new Bernstein-type concentration inequality for self-normalized martingales in linear bandit problems with bounded noise is proposed, together with a new, computationally efficient algorithm with linear function approximation, named UCRL-VTR, for the aforementioned linear mixture MDPs in the episodic undiscounted setting.

### Provably Efficient Exploration in Policy Optimization

- Computer Science, ICML
- 2020

This paper proves that, for episodic Markov decision processes with linear function approximation, unknown transitions, and adversarial rewards with full-information feedback, OPPO achieves a sublinear regret bound.

### Learning Adversarial Markov Decision Processes with Bandit Feedback and Unknown Transition

- Computer Science, ICML
- 2020

We consider the problem of learning in episodic finite-horizon Markov decision processes with unknown transition function, bandit feedback, and adversarial losses. We propose an efficient algorithm…

### The adversarial stochastic shortest path problem with unknown transition probabilities

- Computer Science, Mathematics, AISTATS
- 2012

This paper proposes an algorithm called "follow the perturbed optimistic policy", which learns and controls the stochastic and adversarial components online at the same time, and proves that the expected cumulative regret of the algorithm is of order L|X||A|√T up to logarithmic factors.

### Online learning in episodic Markovian decision processes by relative entropy policy search

- Computer Science, NIPS
- 2013

A variant of the recently proposed Relative Entropy Policy Search algorithm is described, and it is shown that its regret after T episodes is 2√(L|X||A|T log(|X||A|/L)) in the bandit setting and 2L√(T log(|X||A|/L)) in the full-information setting, given that the learner has perfect knowledge of the transition probabilities of the underlying MDP.

### A unified view of entropy-regularized Markov decision processes

- Computer Science, ArXiv
- 2017

A general framework for entropy-regularized average-reward reinforcement learning in Markov decision processes (MDPs) is proposed, showing that using the conditional entropy of the joint state-action distributions as regularization yields a dual optimization problem closely resembling the Bellman optimality equations.
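The entropy-regularized viewpoint can be made concrete with the soft (log-sum-exp) Bellman backup, which replaces the hard max over actions by tau · log ∑_a exp(Q(s, a)/tau) and whose maximizing policy is a Boltzmann distribution. This is a generic sketch of that standard construction, not the cited paper's specific framework; the temperature tau and the toy Q-values are assumptions:

```python
import numpy as np

# Soft-max (log-sum-exp) value and its induced Boltzmann policy:
#   V(s)         = tau * log sum_a exp(Q(s, a) / tau)
#   pi(a | s)    = exp((Q(s, a) - V(s)) / tau)
def soft_value_and_policy(Q, tau):
    # subtract the row max before exponentiating for numerical stability
    m = Q.max(axis=1, keepdims=True)
    V = tau * np.log(np.exp((Q - m) / tau).sum(axis=1, keepdims=True)) + m
    pi = np.exp((Q - V) / tau)                  # softmax policy over actions
    return V.squeeze(1), pi

Q = np.array([[2.0, 0.0], [1.0, 1.0]])
V, pi = soft_value_and_policy(Q, tau=0.5)
assert np.all(V >= Q.max(axis=1))               # soft value upper-bounds the max
assert np.allclose(pi.sum(axis=1), 1.0)
```

As tau → 0 the soft value approaches max_a Q(s, a) and the policy becomes greedy; larger tau spreads probability mass and keeps the policy exploratory, which is exactly the regularization effect the framework studies.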