• Corpus ID: 229181126

# Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov Decision Processes

@article{Zhou2020NearlyMO,
title={Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov Decision Processes},
author={Dongruo Zhou and Quanquan Gu and Csaba Szepesvari},
journal={ArXiv},
year={2020},
volume={abs/2012.08507}
}
• Published 15 December 2020
• Computer Science
• ArXiv
We study reinforcement learning (RL) with linear function approximation where the underlying transition probability kernel of the Markov decision process (MDP) is a linear mixture model (Jia et al., 2020; Ayoub et al., 2020; Zhou et al., 2020) and the learning agent has access to either an integration or a sampling oracle of the individual basis kernels. We propose a new Bernstein-type concentration inequality for self-normalized martingales for linear bandit problems with bounded noise. Based…
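The linear mixture assumption described in the abstract can be made concrete with a small sketch. Below is a toy instance (all sizes and names hypothetical, not from the paper) where the unknown transition kernel is a weighted combination of known basis kernels, $P(s' \mid s,a) = \sum_i \theta_i P_i(s' \mid s,a)$, and the agent interacts through a sampling oracle:

```python
import numpy as np

# Toy linear mixture MDP: the unknown kernel is a convex combination of
# d known basis kernels, and only the mixture weights theta are unknown.
rng = np.random.default_rng(0)
S, A, d = 4, 2, 3  # number of states, actions, basis kernels (toy sizes)

# Known basis kernels P_i(s' | s, a): each is a row-stochastic (S*A) x S matrix.
basis = rng.dirichlet(np.ones(S), size=(d, S * A))

# Unknown mixture weights theta on the probability simplex.
theta = rng.dirichlet(np.ones(d))

# True transition kernel: the theta-weighted mixture of the basis kernels.
# Each row is still a valid probability distribution over next states.
P = np.tensordot(theta, basis, axes=1)  # shape (S*A, S)

def sample_next_state(s, a):
    """Sampling oracle: draw s' ~ P(. | s, a) from the mixture kernel."""
    return rng.choice(S, p=P[s * A + a])

assert np.allclose(P.sum(axis=1), 1.0)
```

The point of the structure is that estimating the transition model reduces to estimating the low-dimensional weight vector theta, which is what makes linear-bandit-style confidence sets (such as the Bernstein-type self-normalized bounds mentioned above) applicable.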
## 96 Citations


• Mathematics, Computer Science
• 2022
This work proposes the first computationally efficient algorithm that achieves the nearly minimax optimal regret for episodic time-inhomogeneous linear Markov decision processes (linear MDPs).
• Pihe Hu, Longbo Huang
• Computer Science
ICML
• 2022
This work considers the episodic inhomogeneous linear Markov decision process (MDP) and proposes a novel computation-efficient algorithm, LSVI-UCB+, which achieves an $\widetilde{O}(Hd\sqrt{T})$ regret bound.
• Computer Science
ArXiv
• 2022
This paper proposes the first computationally efficient horizon-free algorithm for linear mixture MDPs, which achieves the optimal $\widetilde{O}(d\sqrt{K} + d^2)$ regret up to logarithmic factors.
• Computer Science
AISTATS
• 2022
The proposed UCRL2-VTR with Bernstein-type bonus is the first nearly minimax optimal RL algorithm with function approximation in the infinite-horizon average-reward setting, and a matching lower bound is proved, which suggests that this algorithm is minimax optimal up to logarithmic factors.
• Computer Science
AISTATS
• 2022
This paper proposes an optimistic policy optimization algorithm, POWERS, shows that it can achieve an $\widetilde{O}(dH\sqrt{T})$ regret, and proves a matching lower bound of $\widetilde{\Omega}(dH\sqrt{T})$ up to logarithmic factors.
• Computer Science
ArXiv
• 2021
A novel algorithm, namely Adaptive Reinforcement Learning (General) (ARL-GEN), is proposed that adapts to the smallest such family in which the true transition kernel $P^*$ lies, and obtains regret identical to that of an oracle with knowledge of the true model class.
• Mathematics, Computer Science
ArXiv
• 2022
An RL algorithm is proposed that constructs optimistic estimators of undercomplete POMDPs with linear function approximations via reproducing kernel Hilbert space (RKHS) embedding, and it is theoretically proved that the proposed algorithm finds an $\varepsilon$-optimal policy with $\widetilde{O}(1/\varepsilon^2)$ episodes of exploration.
• Computer Science
ArXiv
• 2022
This paper establishes a provably efficient RL algorithm for the MDP whose state transition is given by a multinomial logistic model, and comprehensively evaluates the proposed algorithm numerically, showing that it consistently outperforms the existing methods, hence achieving both provable efficiency and superior practical performance.
• Computer Science
NeurIPS
• 2021
It is proved that any reward-free algorithm needs to sample at least $\widetilde{\Omega}(Hd\epsilon^{-2})$ episodes to obtain an $\epsilon$-optimal policy, and a new provably efficient algorithm, called UCRL-RFE, is proposed under the Linear Mixture MDP assumption, where the transition probability kernel of the MDP can be parameterized by a linear function over certain feature mappings defined on the triplet of state, action, and next state.

## References

SHOWING 1-10 OF 76 REFERENCES

• Computer Science
ICML
• 2020
We study the exploration problem with approximate linear action-value functions in episodic reinforcement learning under the notion of low inherent Bellman error, a condition normally employed to…
• Computer Science
ICML
• 2019
This work proposes a parametric Q-learning algorithm that finds an approximate-optimal policy using a sample size proportional to the feature dimension $K$ and invariant with respect to the size of the state space, and exploits the monotonicity property and intrinsic noise structure of the Bellman operator.
• Computer Science
NeurIPS
• 2020
A model-free algorithm UCB-Advantage is proposed and it is proved that it achieves $\tilde{O}(\sqrt{H^2SAT})$ regret where $T = KH$ and $K$ is the number of episodes to play.
• Computer Science
• 2019
An algorithm which computes an $\epsilon$-optimal policy with probability $1 - \delta$ and matches the sample complexity lower bounds proved in Azar et al. (2013) up to logarithmic factors is provided.
• Computer Science, Mathematics
NeurIPS
• 2020
This paper establishes a provably efficient RL algorithm with general value function approximation that achieves a regret bound of $\widetilde{O}(\mathrm{poly}(dH)\sqrt{T})$ and provides a framework to justify the effectiveness of algorithms used in practice.
• Computer Science
ICML
• 2020
The Kiefer-Wolfowitz theorem is used to prove a positive result that by checking only a few actions, a learner can always find an action that is suboptimal with an error of at most $O(\epsilon \sqrt{d})$.
• Computer Science
ArXiv
• 2020
The upper bound matches the minimax lower bound up to logarithmic factors, which suggests that UCBVI-$\gamma$ is near optimal for discounted MDPs.
• Computer Science
ICML
• 2020
These results are the first regret bounds that are near-optimal in time $T$ and dimension $d$ (or $\widetilde{d}$) and polynomial in the planning horizon $H$.
• Computer Science
COLT
• 2021
The current paper shows that the long planning horizon and the unknown state-dependent transitions (at most) pose little additional difficulty on sample complexity, and improves the state-of-the-art polynomial-time algorithms.
• Computer Science
AISTATS
• 2020
It is proved that the frequentist regret of RLSVI is upper-bounded by $\widetilde O(d^2 H^2 \sqrt{T})$, where $d$ is the feature dimension, $H$ is the horizon, and $T$ is the total number of steps.