Corpus ID: 244425379

Near-optimal Policy Optimization Algorithms for Learning Adversarial Linear Mixture MDPs

@inproceedings{He2021NearoptimalPO,
  title={Near-optimal Policy Optimization Algorithms for Learning Adversarial Linear Mixture MDPs},
  author={Jiafan He and Dongruo Zhou and Quanquan Gu},
  booktitle={International Conference on Artificial Intelligence and Statistics},
  year={2021}
}
Learning Markov decision processes (MDPs) in the presence of an adversary is a challenging problem in reinforcement learning (RL). In this paper, we study RL in episodic MDPs with adversarial reward and full information feedback, where the unknown transition probability function is a linear function of a given feature mapping, and the reward function can change arbitrarily episode by episode. We propose an optimistic policy optimization algorithm, POWERS, and show that it can achieve Õ(…
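To make the setting concrete, here is a minimal sketch, not the paper's POWERS algorithm, of the two ingredients the abstract points to: estimating the linear mixture parameter θ* in P(s′|s,a) = ⟨φ(s′|s,a), θ*⟩ by ridge regression, and updating the policy with an exponential-weights (KL mirror-descent) step on an optimistic action-value estimate. Function names, dimensions, and the bonus coefficient beta are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Illustrative sketch only (not the POWERS algorithm from the paper).
# Linear mixture assumption: P(s'|s,a) = <phi(s'|s,a), theta_star>.

def ridge_estimate(Phi, targets, lam=1.0):
    """Ridge regression for theta_star from n regression features Phi (n x d)
    and scalar targets (n,), e.g. realized next-state value estimates."""
    d = Phi.shape[1]
    Sigma = lam * np.eye(d) + Phi.T @ Phi            # regularized Gram matrix
    theta_hat = np.linalg.solve(Sigma, Phi.T @ targets)
    return theta_hat, Sigma

def optimistic_value(psi, theta_hat, Sigma, beta):
    """Optimistic estimate of E_{s'}[V(s')] for one (s, a), where
    psi = sum_{s'} phi(s'|s,a) * V(s') is a d-dimensional feature vector."""
    mean = psi @ theta_hat
    bonus = beta * np.sqrt(psi @ np.linalg.solve(Sigma, psi))  # elliptical bonus
    return mean + bonus

def exp_weights_update(pi_old, q_opt, eta):
    """KL mirror-descent (exponential weights) update of pi(.|s) against an
    optimistic Q estimate; full-information adversarial rewards assumed."""
    logits = np.log(pi_old) + eta * q_opt
    w = np.exp(logits - logits.max())                 # numerical stability
    return w / w.sum()
```

A full algorithm would interleave these updates across episodes and recompute the bonus at every step of backward dynamic programming; the sketch only fixes notation for the setting described above.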

Citations

Near-Optimal Regret for Adversarial MDP with Delayed Bandit Feedback

This paper presents the first algorithms that achieve near-optimal √(K + D) regret, where K is the number of episodes and D = ∑_{k=1}^{K} d_k is the total delay, significantly improving upon the best known regret bound of (K + D).

Nearly Minimax Optimal Regret for Learning Infinite-horizon Average-reward MDPs with Linear Function Approximation

The proposed UCRL2-VTR with a Bernstein-type bonus is the first nearly minimax optimal RL algorithm with function approximation in the infinite-horizon average-reward setting, and a matching lower bound is proved, showing that the algorithm is minimax optimal up to logarithmic factors.

Learning Stochastic Shortest Path with Linear Function Approximation

A novel algorithm with Hoeffding-type confidence sets is proposed for learning the linear mixture SSP; it provably achieves a near-optimal regret guarantee, and a lower bound of Ω(dB_⋆√K) is proved.
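As background for what a "Hoeffding-type confidence set" for a linear mixture parameter typically looks like, here is the generic ellipsoidal form centered at the ridge estimate; this is a standard template, not necessarily the exact set used in that paper, and the radius β_k is shown only up to its typical dependence on the dimension d and confidence level δ.

```latex
% Generic Hoeffding-type confidence ellipsoid for a linear mixture parameter
% (schematic; radius beta_k shown only up to its typical d, log dependence).
\mathcal{C}_k \;=\; \Bigl\{\theta \;:\; \bigl\|\theta-\hat{\theta}_k\bigr\|_{\Sigma_k} \le \beta_k \Bigr\},
\qquad
\Sigma_k \;=\; \lambda I + \sum_{i<k} \phi_i \phi_i^{\top},
\qquad
\beta_k \;=\; O\!\Bigl(\sqrt{d\,\log(k/\delta)}\Bigr).
```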

Refined Regret for Adversarial MDPs with Linear Function Approximation

  • Yan Dai, Haipeng Luo, Chen-Yu Wei, Julian Zimmert
  • Computer Science, Mathematics
  • 2023
Two algorithms that improve the regret to Õ(√K) in the same setting, by using a refined analysis of the Follow-the-Regularized-Leader algorithm with the log-barrier regularizer and by developing a magnitude-reduced loss estimator.
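For reference, the Follow-the-Regularized-Leader update with a log-barrier regularizer mentioned above has the generic form below, written over a probability simplex Δ with estimated losses ℓ̂_s; this is the standard FTRL template, not the authors' exact formulation.

```latex
% FTRL with log-barrier regularizer over the simplex (standard template).
x_{t+1} \;=\; \operatorname*{arg\,min}_{x \in \Delta}
\;\Bigl\langle \textstyle\sum_{s \le t} \hat{\ell}_s,\; x \Bigr\rangle
\;+\; \frac{1}{\eta} \sum_{i} \log \frac{1}{x_i}.
```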

Improved Regret for Efficient Online Reinforcement Learning with Linear Function Approximation

This work presents a computationally efficient policy optimization algorithm for the challenging general setting of unknown dynamics and bandit feedback, featuring a combination of mirror-descent and least squares policy evaluation in an auxiliary MDP used to compute exploration bonuses.

References

Showing 1-10 of 48 references

Nearly Minimax Optimal Regret for Learning Infinite-horizon Average-reward MDPs with Linear Function Approximation

The proposed UCRL2-VTR with a Bernstein-type bonus is the first nearly minimax optimal RL algorithm with function approximation in the infinite-horizon average-reward setting, and a matching lower bound is proved, showing that the algorithm is minimax optimal up to logarithmic factors.

Online Convex Optimization in Adversarial Markov Decision Processes

We consider online learning in episodic loop-free Markov decision processes (MDPs), where the loss function can change arbitrarily between episodes, and the transition function is not known to the learner.

Provably Efficient Adaptive Approximate Policy Iteration

Model-free reinforcement learning algorithms combined with value function approximation have recently achieved impressive performance in a variety of application domains, including games and robotics.

Optimistic Policy Optimization with Bandit Feedback

This paper considers model-based RL in the tabular finite-horizon MDP setting with unknown transitions and bandit feedback, and proposes an optimistic trust region policy optimization (TRPO) algorithm, for which regret bounds are established for both stochastic and adversarial rewards.

Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov Decision Processes

A new Bernstein-type concentration inequality for self-normalized martingales, designed for linear bandit problems with bounded noise, is proposed, together with a new computationally efficient algorithm with linear function approximation, UCRL-VTR, for linear mixture MDPs in the episodic undiscounted setting.
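Schematically (constants and exact log factors suppressed), a Bernstein-type self-normalized concentration bound of the kind referenced here controls a vector-valued martingale sum by both the conditional variance σ² and the almost-sure range R of the noise, rather than by R alone as a Hoeffding-type bound would:

```latex
% Schematic Bernstein-type self-normalized bound (constants omitted);
% x_i are adapted d-dimensional features, eta_i is martingale-difference noise
% with |eta_i| <= R and conditional variance <= sigma^2.
\Bigl\| \sum_{i \le t} x_i \eta_i \Bigr\|_{\bar{Z}_t^{-1}}
\;\lesssim\;
\sigma \sqrt{d \,\log\!\tfrac{t}{\delta}} \;+\; R\,\log\!\tfrac{t}{\delta},
\qquad
\bar{Z}_t \;=\; \lambda I + \sum_{i \le t} x_i x_i^{\top}.
```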

Provably Efficient Exploration in Policy Optimization

This paper proves that, in episodic Markov decision processes with linear function approximation, unknown transitions, and adversarial rewards with full-information feedback, the proposed OPPO algorithm achieves sublinear regret.

Learning Adversarial Markov Decision Processes with Bandit Feedback and Unknown Transition

We consider the problem of learning in episodic finite-horizon Markov decision processes with unknown transition function, bandit feedback, and adversarial losses. We propose an efficient algorithm …

The adversarial stochastic shortest path problem with unknown transition probabilities

This paper proposes "follow the perturbed optimistic policy", an algorithm that learns and controls the stochastic and adversarial components in an online fashion at the same time, and proves that its expected cumulative regret is of order L|X||A|√T up to logarithmic factors.

Online learning in episodic Markovian decision processes by relative entropy policy search

A variant of the recently proposed Relative Entropy Policy Search algorithm is described, and it is shown that its regret after T episodes is 2√(L|X||A|T log(|X||A|/L)) in the bandit setting and 2L√(T log(|X||A|/L)) in the full information setting, given that the learner has perfect knowledge of the transition probabilities of the underlying MDP.
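The Relative Entropy Policy Search variant referenced here (often called O-REPS in later work) runs mirror descent directly over occupancy measures; a schematic of the per-episode update, with Δ the set of occupancy measures consistent with the known transitions and D(·‖·) the unnormalized relative entropy, is shown below. In the bandit setting the true loss ℓ_t is replaced by an importance-weighted estimator.

```latex
% Schematic O-REPS update over occupancy measures q(x, a)
% (mirror descent with relative-entropy Bregman divergence).
q_{t+1} \;=\; \operatorname*{arg\,min}_{q \in \Delta}
\;\eta \,\langle q,\; \ell_t \rangle \;+\; D\bigl(q \,\|\, q_t\bigr).
```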

A unified view of entropy-regularized Markov decision processes

A general framework for entropy-regularized average-reward reinforcement learning in Markov decision processes (MDPs) is proposed, showing that using the conditional entropy of the joint state-action distributions as regularization yields a dual optimization problem closely resembling the Bellman optimality equations.
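As a familiar illustration of how entropy regularization turns the Bellman optimality equations into a smooth ("soft") recursion, the discounted soft Bellman optimality equation is shown below; the paper itself works in the average-reward setting with a conditional-entropy regularizer, so this is an analogy rather than its exact dual.

```latex
% Discounted soft Bellman optimality equation with temperature tau
% (log-sum-exp replaces the max of the unregularized equations).
V^{*}_{\tau}(s) \;=\; \tau \log \sum_{a} \exp\!\Bigl(
\tfrac{1}{\tau}\bigl( r(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)} \bigl[ V^{*}_{\tau}(s') \bigr] \bigr)
\Bigr).
```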