# Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov Decision Processes

@article{Zhou2020NearlyMO, title={Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov Decision Processes}, author={Dongruo Zhou and Quanquan Gu and Csaba Szepesvari}, journal={ArXiv}, year={2020}, volume={abs/2012.08507} }

We study reinforcement learning (RL) with linear function approximation where the underlying transition probability kernel of the Markov decision process (MDP) is a linear mixture model (Jia et al., 2020; Ayoub et al., 2020; Zhou et al., 2020) and the learning agent has access to either an integration or a sampling oracle of the individual basis kernels. We propose a new Bernstein-type concentration inequality for self-normalized martingales for linear bandit problems with bounded noise. Based…
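For orientation, the linear mixture assumption named in the abstract is commonly written as below. This is a sketch in standard notation, not a quotation from the paper: the symbols $d$, $\theta^*$, and $P_i$ are assumptions introduced here, since the truncated abstract does not define them.

```latex
% Linear mixture MDP (notation assumed for illustration):
% the unknown transition kernel is a linear combination of d known basis kernels P_i,
% weighted by an unknown parameter vector theta^* in R^d.
\[
  \mathbb{P}(s' \mid s, a) \;=\; \sum_{i=1}^{d} \theta^*_i \, P_i(s' \mid s, a)
\]
```

Under this reading, the learner estimates $\theta^*$ rather than the full kernel, and the integration or sampling oracle mentioned in the abstract provides access to the individual $P_i$.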

## 96 Citations

### Nearly Minimax Optimal Reinforcement Learning for Linear Markov Decision Processes

- Mathematics, Computer Science
- 2022

This work proposes the first computationally efficient algorithm that achieves nearly minimax optimal regret for episodic time-inhomogeneous linear Markov decision processes (linear MDPs).

### Nearly Minimax Optimal Reinforcement Learning with Linear Function Approximation

- Computer Science
- ICML
- 2022

This work considers the episodic inhomogeneous linear Markov Decision Process (MDP), and proposes a novel computation-efficient algorithm, LSVI-UCB$^+$, which achieves an $\widetilde{O}(Hd\sqrt{T})$ regret bound.

### Computationally Efficient Horizon-Free Reinforcement Learning for Linear Mixture MDPs

- Computer Science
- ArXiv
- 2022

This paper proposes the first computationally efficient horizon-free algorithm for linear mixture MDPs, which achieves the optimal $\widetilde{O}(d\sqrt{K} + d^2)$ regret up to logarithmic factors.

### Nearly Minimax Optimal Regret for Learning Infinite-horizon Average-reward MDPs with Linear Function Approximation

- Computer Science
- AISTATS
- 2022

The proposed UCRL2-VTR with Bernstein-type bonus is the first nearly minimax optimal RL algorithm with function approximation in the infinite-horizon average-reward setting. A matching lower bound is proved, which suggests that the algorithm is minimax optimal up to logarithmic factors.

### Near-optimal Policy Optimization Algorithms for Learning Adversarial Linear Mixture MDPs

- Computer Science
- AISTATS
- 2022

This paper proposes an optimistic policy optimization algorithm, POWERS, shows that it can achieve $\widetilde{O}(dH\sqrt{T})$ regret, and proves a matching lower bound of $\widetilde{\Omega}(dH\sqrt{T})$ up to logarithmic factors.

### Model Selection with Near Optimal Rates for Reinforcement Learning with General Model Classes

- Computer Science
- ArXiv
- 2021

A novel algorithm, namely Adaptive Reinforcement Learning (General) (ARL-GEN), is proposed that adapts to the smallest such family where the true transition kernel $P^*$ lies, and obtains regret identical to that of an oracle with knowledge of the true model class.

### Sample-Efficient Reinforcement Learning for POMDPs with Linear Function Approximations

- Mathematics, Computer Science
- ArXiv
- 2022

An RL algorithm is proposed that constructs optimistic estimators of undercomplete POMDPs with linear function approximations via reproducing kernel Hilbert space (RKHS) embedding. It is proved theoretically that the algorithm finds an $\varepsilon$-optimal policy using $\widetilde{O}(1/\varepsilon^2)$ episodes of exploration.

### Model Selection for Generic Reinforcement Learning

- Computer Science
- 2021

A novel algorithm, namely Adaptive Reinforcement Learning (General) (ARL-GEN), is proposed that adapts to the smallest such family where the true transition kernel $P^*$ lies, and obtains regret identical to that of an oracle with knowledge of the true model class.

### Model-Based Reinforcement Learning with Multinomial Logistic Function Approximation

- Computer Science
- ArXiv
- 2022

This paper establishes a provably efficient RL algorithm for MDPs whose state transition is given by a multinomial logistic model. A comprehensive numerical evaluation shows that the proposed algorithm consistently outperforms existing methods, achieving both provable efficiency and superior practical performance.

### Reward-Free Model-Based Reinforcement Learning with Linear Function Approximation

- Computer Science
- NeurIPS
- 2021

It is proved that any reward-free algorithm needs to sample at least $\widetilde{\Omega}(H^2 d \epsilon^{-2})$ episodes to obtain an $\epsilon$-optimal policy. A new provably efficient algorithm, called UCRL-RFE, is proposed under the Linear Mixture MDP assumption, where the transition probability kernel of the MDP can be parameterized by a linear function over certain feature mappings defined on the triplet of state, action, and next state.

## References

### Learning Near Optimal Policies with Low Inherent Bellman Error

- Computer Science
- ICML
- 2020

We study the exploration problem with approximate linear action-value functions in episodic reinforcement learning under the notion of low inherent Bellman error, a condition normally employed to…

### Sample-Optimal Parametric Q-Learning Using Linearly Additive Features

- Computer Science
- ICML
- 2019

This work proposes a parametric Q-learning algorithm that finds an approximately optimal policy using a sample size proportional to the feature dimension $K$ and invariant with respect to the size of the state space, and that exploits the monotonicity property and intrinsic noise structure of the Bellman operator.

### Almost Optimal Model-Free Reinforcement Learning via Reference-Advantage Decomposition

- Computer Science
- NeurIPS
- 2020

A model-free algorithm UCB-Advantage is proposed and it is proved that it achieves $\tilde{O}(\sqrt{H^2SAT})$ regret where $T = KH$ and $K$ is the number of episodes to play.

### Near-Optimal Time and Sample Complexities for Solving Discounted Markov Decision Process with a Generative Model

- Computer Science
- 2019

An algorithm which computes an $\epsilon$-optimal policy with probability $1 - \delta$ and matches the sample complexity lower bounds proved in Azar et al. (2013) up to logarithmic factors is provided.

### Reinforcement Learning with General Value Function Approximation: Provably Efficient Approach via Bounded Eluder Dimension

- Computer Science, Mathematics
- NeurIPS
- 2020

This paper establishes a provably efficient RL algorithm with general value function approximation that achieves a regret bound of $\widetilde{O}(\mathrm{poly}(dH)\sqrt{T})$ and provides a framework to justify the effectiveness of algorithms used in practice.

### Learning with Good Feature Representations in Bandits and in RL with a Generative Model

- Computer Science
- ICML
- 2020

The Kiefer-Wolfowitz theorem is used to prove a positive result that by checking only a few actions, a learner can always find an action that is suboptimal with an error of at most $O(\epsilon \sqrt{d})$.

### Minimax Optimal Reinforcement Learning for Discounted MDPs

- Computer Science
- ArXiv
- 2020

The upper bound matches the minimax lower bound up to logarithmic factors, which suggests that UCBVI-$\gamma$ is near optimal for discounted MDPs.

### Reinforcement Leaning in Feature Space: Matrix Bandit, Kernels, and Regret Bound

- Computer Science
- ICML
- 2020

These results are the first regret bounds that are near-optimal in time $T$ and dimension $d$ (or $\widetilde{d}$) and polynomial in the planning horizon $H$.

### Is Reinforcement Learning More Difficult Than Bandits? A Near-optimal Algorithm Escaping the Curse of Horizon

- Computer Science
- COLT
- 2021

The current paper shows that the long planning horizon and the unknown state-dependent transitions (at most) pose little additional difficulty on sample complexity, and improves the state-of-the-art polynomial-time algorithms.

### Frequentist Regret Bounds for Randomized Least-Squares Value Iteration

- Computer Science
- AISTATS
- 2020

It is proved that the frequentist regret of RLSVI is upper-bounded by $\widetilde O(d^2 H^2 \sqrt{T})$, where $d$ is the feature dimension, $H$ is the horizon, and $T$ is the total number of steps.