# Provably Efficient Lifelong Reinforcement Learning with Linear Function Approximation

@article{Amani2022ProvablyEL, title={Provably Efficient Lifelong Reinforcement Learning with Linear Function Approximation}, author={Sanae Amani and Lin F. Yang and Ching-An Cheng}, journal={ArXiv}, year={2022}, volume={abs/2206.00270} }

We study lifelong reinforcement learning (RL) in a regret minimization setting of linear contextual Markov decision process (MDP), where the agent needs to learn a multi-task policy while solving a streaming sequence of tasks. We propose an algorithm, called UCB Lifelong Value Distillation (UCBlvd), that provably achieves sublinear regret for any sequence of tasks, which may be adaptively chosen based on the agent’s past behaviors. Remarkably, our algorithm uses only sublinear number of…

## References


No-regret Exploration in Contextual Reinforcement Learning

- Computer Science
- UAI
- 2020

This paper proposes and analyzes optimistic and randomized exploration methods that make time- and space-efficient online updates, demonstrates a generic template for deriving confidence sets using an online learning oracle, and gives a lower bound for the setting.

Policy and Value Transfer in Lifelong Reinforcement Learning

- Computer Science
- ICML
- 2018

This work identifies the initial policy that optimizes expected performance over the distribution of tasks for increasingly complex classes of policy and task distributions, and considers value-function initialization methods that preserve PAC guarantees while simultaneously minimizing the learning required in two learning algorithms.

Safe Policy Search for Lifelong Reinforcement Learning with Sublinear Regret

- Computer Science
- ICML
- 2015

This work demonstrates, for the first time, sublinear regret for lifelong policy search, and validates the algorithm on several benchmark dynamical systems and an application to quadrotor control.

Provably Efficient Reinforcement Learning with Linear Function Approximation

- Computer Science
- COLT
- 2020

This paper proves that an optimistic modification of Least-Squares Value Iteration (LSVI) achieves $\tilde{O}(\sqrt{d^3 H^3 T})$ regret, where d is the ambient dimension of the feature space, H is the length of each episode, and T is the total number of steps; the bound is independent of the number of states and actions.

Lipschitz Lifelong Reinforcement Learning

- Computer Science, Mathematics
- AAAI
- 2021

A novel metric between Markov decision processes (MDPs) is introduced, and it is established that close MDPs have close optimal value functions. This leads to a value transfer method for lifelong RL, which is used to build a PAC-MDP algorithm with an improved convergence rate.

Accommodating Picky Customers: Regret Bound and Exploration Complexity for Multi-Objective Reinforcement Learning

- Computer Science
- NeurIPS
- 2021

This paper provides a model-based algorithm that achieves a regret bound and considers preference-free exploration, i.e., the agent first interacts with the environment without specifying any preference and then is able to accommodate arbitrary preference vectors up to $\epsilon$ error.

Markov Decision Processes with Continuous Side Information

- Computer Science
- ALT
- 2018

This work considers a reinforcement learning (RL) setting in which the agent interacts with a sequence of episodic MDPs and proposes algorithms for learning in such Contextual Markov Decision Processes (CMDPs) under an assumption that the unobserved MDP parameters vary smoothly with the observed context.

PAC-inspired Option Discovery in Lifelong Reinforcement Learning

- Computer Science
- ICML
- 2014

This work provides the first formal analysis of the sample complexity, a measure of learning speed, of reinforcement learning with options, and inspires a novel option-discovery algorithm that aims at minimizing overall sample complexity in lifelong reinforcement learning.

A Provably Efficient Algorithm for Linear Markov Decision Process with Low Switching Cost

- Computer Science
- ArXiv
- 2021

This paper presents the first algorithm for linear MDPs with a low switching cost. It achieves an $\tilde{O}(\sqrt{d^3 H^4 K})$ regret bound with a near-optimal $O(dH \log K)$ global switching cost, where d is the feature dimension, H is the planning horizon, and K is the number of episodes the agent plays.