Corpus ID: 246294942

Reward-Free RL is No Harder Than Reward-Aware RL in Linear Markov Decision Processes

Andrew J. Wagenmaker, Yifang Chen, Max Simchowitz, Simon Shaolei Du, Kevin G. Jamieson. International Conference on Machine Learning.
Reward-free reinforcement learning (RL) considers the setting where the agent does not have access to a reward function during exploration, but must propose a near-optimal policy for an arbitrary reward function revealed only after exploring. In the tabular setting, it is well known that this is a more difficult problem than reward-aware (PAC) RL, where the agent has access to the reward function during exploration, with optimal sample complexities in the two settings differing by a factor of… 
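As a concrete toy illustration of this two-phase protocol, the sketch below explores a small tabular MDP without observing any reward, estimates the transition kernel from the logged trajectories, and only then receives a reward function and plans against the estimated model. The MDP, episode count, and planning routine are illustrative assumptions for exposition, not any paper's actual algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, H = 3, 2, 4  # states, actions, horizon

# True (unknown to the agent) transition kernel: action 0 moves right, action 1 stays.
P_true = np.zeros((S, A, S))
for s in range(S):
    P_true[s, 0, min(s + 1, S - 1)] = 1.0
    P_true[s, 1, s] = 1.0

# Phase 1: reward-free exploration with a uniform policy; only transitions are logged.
counts = np.zeros((S, A, S))
for _ in range(2000):
    s = 0
    for _ in range(H):
        a = int(rng.integers(A))
        s_next = int(rng.choice(S, p=P_true[s, a]))
        counts[s, a, s_next] += 1
        s = s_next
P_hat = counts / np.maximum(counts.sum(axis=2, keepdims=True), 1)

# Phase 2: an arbitrary reward is revealed only now; plan by finite-horizon
# value iteration on the estimated model.
def plan(reward):  # reward: (S, A) array
    V = np.zeros(S)
    pi = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = reward + P_hat @ V          # (S, A): immediate reward + estimated next value
        pi[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return pi, V

reward = np.zeros((S, A))
reward[S - 1, :] = 1.0                  # reward for occupying the last state
pi, V = plan(reward)
print(pi[0], V[0])
```

Because the same exploration data can be reused in phase 2 for any revealed reward, `plan` can be called repeatedly with different reward arrays at no additional sample cost, which is the defining feature of the reward-free setting.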

On the Statistical Efficiency of Reward-Free Exploration in Non-Linear RL

The RFOLIVE (Reward-Free OLIVE) algorithm is proposed for sample-efficient reward-free exploration under minimal structural assumptions, covering the previously studied settings of linear MDPs, linear completeness, and low-rank MDPs with unknown representation.

Unified Algorithms for RL with Decision-Estimation Coefficients: No-Regret, PAC, and Reward-Free Learning

Two new DEC-type complexity measures are proposed: the Explorative DEC (EDEC) and the Reward-Free DEC (RFDEC), which are shown to be necessary and sufficient for sample-efficient PAC learning and reward-free learning, thereby extending the original DEC, which captures only no-regret learning.

Best Policy Identification in Linear MDPs

An instance-specific lower bound is derived on the expected number of samples required to identify an ε-optimal policy with probability 1 − δ; it characterizes the optimal sampling rule as the solution of an intricate non-convex optimization program, but can be used as the starting point to devise simple and near-optimal sampling rules and algorithms.

Instance-Dependent Near-Optimal Policy Identification in Linear MDPs via Online Experiment Design

This work proposes an algorithm, Pedel, which achieves a fine-grained instance-dependent measure of complexity, the first of its kind in the RL with function approximation setting, thereby capturing the difficulty of learning on each particular problem instance.

Leveraging Offline Data in Online Reinforcement Learning

This work characterizes the number of online samples needed in this setting given access to some offline dataset, and develops an algorithm, FTPedel, which is provably optimal for MDPs with linear structure.

Provable Benefits of Representational Transfer in Reinforcement Learning

A new notion of task relatedness between source and target tasks is proposed, and a novel approach for representational transfer under this assumption is developed, showing that, given generative access to the source tasks, one can discover a representation with which subsequent linear RL techniques quickly converge to a near-optimal policy.

On Reward-Free Reinforcement Learning with Linear Function Approximation

An algorithm for reward-free RL in the linear Markov decision process setting where both the transition and the reward admit linear representations is given, and the sample complexity is polynomial in the feature dimension and the planning horizon, and is completely independent of the number of states and actions.
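For reference, the linear MDP condition assumed here, in which "both the transition and the reward admit linear representations," is standardly written for a known d-dimensional feature map as:

```latex
% Linear MDP: dynamics and rewards are both linear in a known feature map
% \phi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d.
\mathbb{P}_h(s' \mid s, a) = \langle \phi(s, a), \mu_h(s') \rangle,
\qquad
r_h(s, a) = \langle \phi(s, a), \theta_h \rangle,
```

where the μ_h are unknown (signed) measures over states and θ_h ∈ ℝ^d is unknown; this structure is why the resulting sample complexity can depend on d rather than on the number of states and actions.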

Reward-Free Model-Based Reinforcement Learning with Linear Function Approximation

It is proved that any reward-free algorithm needs to sample at least Ω̃(H²dε⁻²) episodes to obtain an ε-optimal policy, and a new provably efficient algorithm, UCRL-RFE, is proposed under the linear mixture MDP assumption, in which the transition probability kernel of the MDP can be parameterized by a linear function over certain feature mappings defined on the triplet of state, action, and next state.
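The linear mixture assumption mentioned here is standardly written, for a known feature map on (state, action, next state) triplets, as:

```latex
% Linear mixture MDP: the transition kernel is a linear function of a known
% feature map \phi : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}^d.
\mathbb{P}(s' \mid s, a) = \langle \phi(s, a, s'), \theta^\star \rangle,
\qquad \theta^\star \in \mathbb{R}^d \text{ unknown.}
```

Note the contrast with the linear MDP model above: here the unknown parameter is a single vector θ* rather than a measure-valued map, which is what enables the model-based approach.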

Reward-Free Exploration for Reinforcement Learning

An efficient algorithm is given that conducts episodes of exploration and returns near-optimal policies for an arbitrary number of reward functions, and a nearly-matching $\Omega(S^2AH^2/\epsilon^2)$ lower bound is given, demonstrating the near-optimality of the algorithm in this setting.

Gap-Dependent Unsupervised Exploration for Reinforcement Learning

An efficient algorithm is provided that takes only Õ(1/ε · (H³SA/ρ + H⁴SA)) episodes of exploration and is able to obtain an ε-optimal policy for a post-revealed reward with sub-optimality gap at least ρ, a nearly quadratic saving in terms of ε.

Provably Efficient Reward-Agnostic Navigation with Linear Value Iteration

This work shows that, under a more standard notion of low inherent Bellman error, typically employed in least-squares value iteration-style algorithms, the algorithm can provide strong PAC guarantees on learning a near-optimal value function, provided that the linear space is sufficiently "explorable".
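For context, the inherent Bellman error of a value-function class 𝒬 is typically defined in this line of work as the worst-case distance between the class and its image under the Bellman operator:

```latex
% Inherent Bellman error of a class \mathcal{Q} = (\mathcal{Q}_h)_h with
% respect to the Bellman operator \mathcal{T}_h:
\mathcal{I} \;=\; \max_{h} \;
\sup_{Q_{h+1} \in \mathcal{Q}_{h+1}} \;
\inf_{Q_h \in \mathcal{Q}_h} \;
\bigl\| Q_h - \mathcal{T}_h Q_{h+1} \bigr\|_{\infty}.
```

"Low inherent Bellman error" means 𝓘 is small, i.e. the class is approximately closed under Bellman backups.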

Online Sparse Reinforcement Learning

A lower bound is provided showing that if the learner has oracle access to a policy that collects well-conditioned data then a variant of Lasso fitted Q-iteration enjoys a nearly dimension-free regret, which shows that in the large-action setting, the difficulty of learning can be attributed to the difficulties of finding a good exploratory policy.

First-Order Regret in Reinforcement Learning with Linear Function Approximation: A Robust Estimation Approach

It is shown that it is possible to obtain regret scaling as $\tilde{O}\big(\sqrt{d^3 H^3 \cdot V_1^\star \cdot K}\big)$ plus lower-order terms, where $V_1^\star$ is the value of the optimal policy.

Nearly Minimax Optimal Reward-free Reinforcement Learning

A new efficient algorithm is given, which interacts with the environment at most $O\left( \frac{S^2A}{\epsilon^2}\text{poly}\log\left(\frac{SAH}{\epsilon}\right) \right)$ episodes in the exploration phase, and guarantees to output a near-optimal policy for arbitrary reward functions in the planning phase.

Is Q-learning Provably Efficient?

Q-learning with UCB exploration achieves $\tilde{O}(\sqrt{H^3SAT})$ regret in an episodic MDP setting; this is the first analysis in the model-free setting that establishes $\sqrt{T}$ regret without requiring access to a "simulator."
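A toy sketch of the idea, in the spirit of the result summarized above: tabular Q-learning that acts greedily with respect to optimistically initialized Q-values and adds a count-based UCB bonus on each update. The environment, bonus constant `c`, and horizon are illustrative assumptions; the learning rate `(H + 1) / (H + t)` follows the standard tabular analysis.

```python
import numpy as np

S, A, H, K = 3, 2, 3, 3000  # states, actions, horizon, episodes

def step(s, a):
    # Deterministic toy dynamics: action 1 advances toward the last state,
    # which yields reward 1 on entering (or staying in) it.
    s_next = min(s + 1, S - 1) if a == 1 else s
    r = 1.0 if s_next == S - 1 else 0.0
    return s_next, r

Q = np.full((H, S, A), float(H))  # optimistic initialization at H
N = np.zeros((H, S, A))           # visit counts per (step, state, action)
c = 0.5                           # bonus scale (assumed constant)

for _ in range(K):
    s = 0
    for h in range(H):
        a = int(Q[h, s].argmax())          # act greedily w.r.t. optimistic Q
        s_next, r = step(s, a)
        N[h, s, a] += 1
        t = N[h, s, a]
        alpha = (H + 1) / (H + t)          # learning rate from the tabular analysis
        bonus = c * np.sqrt(H**3 * np.log(K) / t)  # UCB exploration bonus
        V_next = min(float(H), Q[h + 1, s_next].max()) if h + 1 < H else 0.0
        Q[h, s, a] = (1 - alpha) * Q[h, s, a] + alpha * (r + V_next + bonus)
        s = s_next

print(Q[0, 0])  # optimistic value estimates at the initial state
```

The bonus keeps every estimate at or above its true optimal value, so greedy action selection doubles as exploration; no separate exploration policy or simulator is needed.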

Beyond No Regret: Instance-Dependent PAC Reinforcement Learning

This work shows that there exists a fundamental tradeoff between achieving low regret and identifying an ε-optimal policy at the instance-optimal rate, and proposes a new measure of instance-dependent sample complexity for PAC tabular reinforcement learning that explicitly accounts for the attainable state visitation distributions in the underlying MDP.