Nearly Minimax Optimal Reinforcement Learning with Linear Function Approximation

@inproceedings{Hu2022NearlyMO,
title={Nearly Minimax Optimal Reinforcement Learning with Linear Function Approximation},
author={Pihe Hu and Yu Chen and Longbo Huang},
booktitle={International Conference on Machine Learning},
year={2022}
}
Pihe Hu, Yu Chen, Longbo Huang · Published in International Conference on Machine Learning, 23 June 2022 · Computer Science
We study reinforcement learning with linear function approximation, where the transition probability and reward functions are linear with respect to a feature mapping φ(s, a). Specifically, we consider the episodic inhomogeneous linear Markov Decision Process (MDP) and propose a novel computation-efficient algorithm, LSVI-UCB⁺, which achieves an Õ(Hd√T) regret bound, where H is the episode length, d is the feature dimension, and T is the number of steps. LSVI-UCB⁺ builds on…
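To make the setting concrete, here is a minimal sketch of a standard LSVI-UCB-style least-squares backup with an elliptical-potential exploration bonus. This illustrates the family of algorithms the abstract refers to, not the paper's specific LSVI-UCB⁺ variant (which uses weighted ridge regression); the function name and the bonus coefficient `beta` are illustrative assumptions.

```python
import numpy as np

def lsvi_ucb_backup(phi, rewards, next_values, beta=1.0, lam=1.0):
    """One LSVI-UCB-style least-squares backup (illustrative sketch).

    phi:         (n, d) feature matrix phi(s_i, a_i) of observed pairs
    rewards:     (n,)   observed rewards r_i
    next_values: (n,)   estimated values V(s'_i) of the next states
    Returns (w, q_value): ridge-regression weights and an optimistic
    Q-value function with a UCB bonus.
    """
    n, d = phi.shape
    # Gram matrix: Lambda = lam * I + sum_i phi_i phi_i^T
    Lambda = lam * np.eye(d) + phi.T @ phi
    # Ridge solution: w = Lambda^{-1} sum_i phi_i (r_i + V(s'_i))
    w = np.linalg.solve(Lambda, phi.T @ (rewards + next_values))
    Lambda_inv = np.linalg.inv(Lambda)

    def q_value(feat):
        # Optimistic estimate: linear prediction plus exploration
        # bonus beta * sqrt(phi^T Lambda^{-1} phi)
        return feat @ w + beta * np.sqrt(feat @ Lambda_inv @ feat)

    return w, q_value
```

In a full episodic algorithm this backup would be applied once per step h = H, ..., 1, with `next_values` computed by maximizing the step-(h+1) optimistic Q over actions.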
