Nearly Minimax Optimal Reinforcement Learning with Linear Function Approximation
@inproceedings{Hu2022NearlyMO, title={Nearly Minimax Optimal Reinforcement Learning with Linear Function Approximation}, author={Pihe Hu and Yu Chen and Longbo Huang}, booktitle={International Conference on Machine Learning}, year={2022} }
We study reinforcement learning with linear function approximation, where the transition probability and reward functions are linear with respect to a feature mapping $\phi(s, a)$. Specifically, we consider the episodic inhomogeneous linear Markov Decision Process (MDP), and propose a novel computationally efficient algorithm, LSVI-UCB$^+$, which achieves an $\widetilde{O}(Hd\sqrt{T})$ regret bound, where H is the episode length, d is the feature dimension, and T is the number of steps. LSVI-UCB$^+$ builds on…
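The abstract describes LSVI-UCB$^+$ only at a high level. Below is a minimal, hypothetical NumPy sketch of the generic LSVI-UCB template (optimistic least-squares value iteration with an ellipsoidal exploration bonus) that such algorithms build on; it is not the paper's LSVI-UCB$^+$ itself. The feature map `phi`, the small toy action space, the bonus coefficient `beta`, and the synthetic trajectories are all illustrative assumptions.

```python
# Minimal sketch of the LSVI-UCB template: backward least-squares value iteration
# with a UCB-style exploration bonus. Illustrative only; the feature map, toy
# action space, and bonus coefficient beta are assumptions, not the paper's setup.
import numpy as np

def lsvi_ucb_backup(phi, episodes, H, d, beta, n_actions=2, lam=1.0):
    """Backward least-squares value iteration over previously collected episodes.

    phi(s, a)  -> length-d feature vector (assumed feature map)
    episodes   -> list of trajectories, each a list of (s, a, r, s_next) of length H
    Returns per-step regression weights w[h] and Gram matrices Lambda[h].
    """
    w = [np.zeros(d) for _ in range(H)]
    Lambda = [lam * np.eye(d) for _ in range(H)]

    def q_value(h, s, a):
        # Optimistic Q estimate: linear prediction + ellipsoidal bonus, truncated at H.
        f = phi(s, a)
        bonus = beta * np.sqrt(f @ np.linalg.solve(Lambda[h], f))
        return min(w[h] @ f + bonus, H)

    for h in reversed(range(H)):           # h = H-1, ..., 0
        A = lam * np.eye(d)                # regularized Gram matrix
        b = np.zeros(d)
        for traj in episodes:
            s, a, r, s_next = traj[h]
            f = phi(s, a)
            A += np.outer(f, f)
            # Regression target: reward + greedy optimistic value at the next step.
            v_next = 0.0 if h + 1 == H else max(
                q_value(h + 1, s_next, a2) for a2 in range(n_actions))
            b += f * (r + v_next)
        Lambda[h] = A
        w[h] = np.linalg.solve(A, b)       # ridge-regression weights
    return w, Lambda

# Toy usage with a random feature map and synthetic trajectories (illustrative only).
rng = np.random.default_rng(0)
d, H, n_states, n_actions = 4, 3, 5, 2
feat = rng.random((n_states, n_actions, d))
phi = lambda s, a: feat[s, a]
episodes = [[(int(rng.integers(n_states)), int(rng.integers(n_actions)),
              float(rng.random()), int(rng.integers(n_states)))
             for _ in range(H)] for _ in range(10)]
w, Lam = lsvi_ucb_backup(phi, episodes, H, d, beta=1.0)
```

The bonus term $\beta\sqrt{\phi^\top \Lambda_h^{-1}\phi}$ measures how uncertain the least-squares estimate is in the direction of $\phi(s,a)$, which is what drives optimism-based exploration in this family of algorithms.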
References
Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov Decision Processes
- Computer Science, COLT
- 2021
Proposes a new Bernstein-type concentration inequality for self-normalized martingales for linear bandit problems with bounded noise, together with a new, computationally efficient algorithm with linear function approximation, UCRL-VTR, for the aforementioned linear mixture MDPs in the episodic undiscounted setting.
Provably Efficient Reinforcement Learning with Linear Function Approximation
- Computer Science, COLT
- 2020
This paper proves that an optimistic modification of Least-Squares Value Iteration (LSVI) achieves an $\widetilde{O}(\sqrt{d^3H^3T})$ regret bound, where d is the ambient dimension of the feature space, H is the length of each episode, and T is the total number of steps; the bound is independent of the number of states and actions.
Bandit Algorithms
- Mathematics
- 2020
sets of environments and policies respectively, and $\ell : E \times \Pi \to [0, 1]$ a bounded loss function. Given a policy $\pi$, let $\ell(\pi) = (\ell(\nu_1, \pi), \ldots, \ell(\nu_N, \pi))$ be the loss vector resulting from policy $\pi$.…
Nearly Minimax Optimal Reinforcement Learning for Linear Markov Decision Processes
- Mathematics, Computer Science, ArXiv
- 2022
This work proposes the first computationally efficient algorithm that achieves nearly minimax optimal regret for episodic time-inhomogeneous linear Markov decision processes (linear MDPs).
Reinforcement Learning with General Value Function Approximation: Provably Efficient Approach via Bounded Eluder Dimension
- Computer Science, Mathematics, NeurIPS
- 2020
This paper establishes a provably efficient RL algorithm with general value function approximation that achieves a regret bound of $\widetilde{O}(\mathrm{poly}(dH)\sqrt{T})$ and provides a framework to justify the effectiveness of algorithms used in practice.
Model-Based Reinforcement Learning with Value-Targeted Regression
- Computer Science, L4DC
- 2020
This paper proposes a model-based RL algorithm based on the optimism principle and derives a regret bound that is independent of the total number of states or actions and is close to a lower bound of $\Omega(\sqrt{HdT})$.
Improved Optimistic Algorithms for Logistic Bandits
- Computer Science, ICML
- 2020
A new optimistic algorithm is proposed based on a finer examination of the non-linearities of the reward function; it enjoys an $\tilde{\mathcal{O}}(\sqrt{T})$ regret with no dependency on $\kappa$ except in a second-order term.
Provably Efficient Exploration in Policy Optimization
- Computer Science, ICML
- 2020
This paper proves that, in the problem of episodic Markov decision processes with linear function approximation, unknown transitions, and adversarial rewards with full-information feedback, OPPO achieves an $\widetilde{O}(\sqrt{d^2H^3T})$ regret.
Optimism in Reinforcement Learning with Generalized Linear Function Approximation
- Computer Science, ICLR
- 2021
This work designs a new provably efficient algorithm for episodic reinforcement learning with generalized linear function approximation that enjoys a regret bound of $\tilde{O}(\sqrt{d^3 T})$ where d is the dimensionality of the state-action features and T is the number of episodes.