Nearly Minimax Optimal Reinforcement Learning with Linear Function Approximation

  • Pihe Hu, Yu Chen, Longbo Huang
  • Published in International Conference on Machine Learning, 23 June 2022
  • Computer Science
We study reinforcement learning with linear function approximation where the transition probability and reward functions are linear with respect to a feature mapping $\phi(s, a)$. Specifically, we consider the episodic inhomogeneous linear Markov Decision Process (MDP), and propose a novel computation-efficient algorithm, LSVI-UCB$^+$, which achieves an $\widetilde{O}(Hd\sqrt{T})$ regret bound where $H$ is the episode length, $d$ is the feature dimension, and $T$ is the number of steps. LSVI-UCB$^+$ builds on…
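
The linearity assumption in the abstract can be illustrated with a toy example. The sketch below is not taken from the paper: it assumes a small tabular MDP at a single step, encoded with one-hot features so that both the transition probabilities and the rewards are inner products with the feature vector. The names `mu`, `theta`, `transition`, and `reward` are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 3, 2
d = n_states * n_actions  # feature dimension when using one-hot features


def phi(s, a):
    """One-hot feature vector for the state-action pair (s, a)."""
    v = np.zeros(d)
    v[s * n_actions + a] = 1.0
    return v


# Assumed (hypothetical) parameters for one step of the episode:
# mu[i, s_next] is the probability of s_next given the i-th (s, a) pair,
# theta[i] is the expected reward of the i-th (s, a) pair.
mu = rng.dirichlet(np.ones(n_states), size=d)  # shape (d, n_states), rows sum to 1
theta = rng.uniform(0.0, 1.0, size=d)


def transition(s, a):
    """Next-state distribution, linear in phi(s, a)."""
    return phi(s, a) @ mu


def reward(s, a):
    """Expected reward, linear in phi(s, a)."""
    return float(phi(s, a) @ theta)
```

With one-hot features the linear structure is trivially satisfied, which is why tabular MDPs are often treated as a special case with $d$ equal to the number of state-action pairs; the interesting regime is when $d$ is much smaller than that.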

Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov Decision Processes

This paper proposes a new Bernstein-type concentration inequality for self-normalized martingales for linear bandit problems with bounded noise, and a new, computationally efficient algorithm with linear function approximation, named UCRL-VTR, for linear mixture MDPs in the episodic undiscounted setting.

Provably Efficient Reinforcement Learning with Linear Function Approximation

This paper proves that an optimistic modification of Least-Squares Value Iteration (LSVI) achieves a regret bound that depends on d (the ambient dimension of the feature space), H (the length of each episode), and T (the total number of steps), and is independent of the number of states and actions.
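
As a rough illustration of what an "optimistic modification" of least-squares value iteration can look like, the sketch below fits a ridge regression on past features and adds an exploration bonus that is large in directions the data has not covered. It is a minimal sketch under stated assumptions, not the algorithm from either paper: the function names, the bonus scale `beta`, the regularizer `lam`, and the truncation at `H` are all illustrative choices.

```python
import numpy as np


def gram_matrix(phi_hist, lam=1.0):
    """Regularized Gram matrix: lam * I + sum of outer products of past features."""
    d = phi_hist.shape[1]
    return lam * np.eye(d) + phi_hist.T @ phi_hist


def exploration_bonus(Lambda, phi_query, beta):
    """Elliptical bonus beta * sqrt(phi^T Lambda^{-1} phi)."""
    return beta * float(np.sqrt(phi_query @ np.linalg.solve(Lambda, phi_query)))


def optimistic_q(phi_hist, targets, phi_query, beta, lam=1.0, H=1.0):
    """Ridge-regression value estimate plus bonus, truncated at H.

    phi_hist:  (n, d) features of previously visited state-action pairs
    targets:   (n,) regression targets, e.g. reward plus a next-step value estimate
    phi_query: (d,) feature of the pair being evaluated
    """
    Lambda = gram_matrix(phi_hist, lam)
    w = np.linalg.solve(Lambda, phi_hist.T @ targets)  # ridge solution
    q = float(w @ phi_query) + exploration_bonus(Lambda, phi_query, beta)
    return min(q, H)
```

In this sketch, a direction that has been observed many times gets a small bonus, while an unobserved direction gets a bonus close to `beta / sqrt(lam)`, which is what drives exploration.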

On Tail Probabilities for Martingales

Bandit Algorithms

sets of environments and policies respectively and $\ell : \mathcal{E} \times \Pi \to [0, 1]$ a bounded loss function. Given a policy $\pi$, let $\ell(\pi) = (\ell(\nu_1, \pi), \ldots, \ell(\nu_N, \pi))$ be the loss vector resulting from policy $\pi$.

Nearly Minimax Optimal Reinforcement Learning for Linear Markov Decision Processes

This work proposes the first computationally efficient algorithm that achieves the nearly minimax optimal regret for episodic time-inhomogeneous linear Markov decision processes (linear MDPs).

Reinforcement Learning with General Value Function Approximation: Provably Efficient Approach via Bounded Eluder Dimension

This paper establishes a provably efficient RL algorithm with general value function approximation that achieves a regret bound of $\widetilde{O}(\mathrm{poly}(dH)\sqrt{T})$ and provides a framework to justify the effectiveness of algorithms used in practice.

Model-Based Reinforcement Learning with Value-Targeted Regression

This paper proposes a model-based RL algorithm built on the optimism principle and derives a regret bound that is independent of the total number of states or actions and is close to a lower bound of $\Omega(\sqrt{HdT})$.

Improved Optimistic Algorithms for Logistic Bandits

A new optimistic algorithm is proposed, based on a finer examination of the non-linearities of the reward function, that enjoys a $\tilde{\mathcal{O}}(\sqrt{T})$ regret with no dependence on $\kappa$ except in a second-order term.

Provably Efficient Exploration in Policy Optimization

This paper proves that, for episodic Markov decision processes with linear function approximation, unknown transition, and adversarial reward with full-information feedback, OPPO achieves a sublinear regret bound.

Optimism in Reinforcement Learning with Generalized Linear Function Approximation

This work designs a new provably efficient algorithm for episodic reinforcement learning with generalized linear function approximation that enjoys a regret bound of $\tilde{O}(\sqrt{d^3 T})$ where d is the dimensionality of the state-action features and T is the number of episodes.