• Corpus ID: 246294942

# Reward-Free RL is No Harder Than Reward-Aware RL in Linear Markov Decision Processes

@inproceedings{Wagenmaker2022RewardFreeRI,
title={Reward-Free RL is No Harder Than Reward-Aware RL in Linear Markov Decision Processes},
author={Andrew J. Wagenmaker and Yifang Chen and Max Simchowitz and Simon Shaolei Du and Kevin G. Jamieson},
booktitle={International Conference on Machine Learning},
year={2022}
}
• Published in International Conference on Machine Learning, 26 January 2022
• Computer Science
Reward-free reinforcement learning (RL) considers the setting where the agent does not have access to a reward function during exploration, but must propose a near-optimal policy for an arbitrary reward function revealed only after exploring. In the tabular setting, it is well known that this is a more difficult problem than reward-aware (PAC) RL—where the agent has access to the reward function during exploration—with optimal sample complexities in the two settings differing by a factor of…
• Computer Science
ArXiv
• 2022
The RFOlive (Reward-Free Olive) algorithm is proposed for sample-efficient reward-free exploration under minimal structural assumptions, covering the previously studied settings of linear MDPs, linear completeness, and low-rank MDPs with unknown representation.
• Computer Science
• 2022
Two new DEC-type complexity measures are proposed: Explorative DEC (EDEC), and Reward-Free DEC (RFDEC) which are shown to be necessary and sufficient for sample-efficient PAC learning and reward-free learning, thereby extending the original DEC which only captures no-regret learning.
• Computer Science
ArXiv
• 2022
An instance-specific lower bound is derived on the expected number of samples required to identify an ε-optimal policy with probability 1 − δ that characterizes the optimal sampling rule as the solution of an intricate non-convex optimization program, but can be used as the starting point to devise simple and near-optimal sampling rules and algorithms.
• Computer Science
ArXiv
• 2022
This work proposes an algorithm, Pedel, which achieves a fine-grained instance-dependent measure of complexity, the first of its kind in the RL with function approximation setting, thereby capturing the difficulty of learning on each particular problem instance.
• Computer Science
ArXiv
• 2022
This work characterizes the necessary number of online samples needed in this setting given access to some offline dataset, and develops an algorithm, FTPedel, which is provably optimal for MDPs with linear structure.
• Computer Science
ArXiv
• 2022
A new notion of task relatedness between source and target tasks is proposed, and a novel approach for representational transfer under this assumption is developed, showing that given generative access to source tasks, one can discover a representation using which subsequent linear RL techniques quickly converge to a near-optimal policy.

## References

Showing 1–10 of 51 references

• Computer Science
NeurIPS
• 2020
An algorithm for reward-free RL in the linear Markov decision process setting where both the transition and the reward admit linear representations is given, and the sample complexity is polynomial in the feature dimension and the planning horizon, and is completely independent of the number of states and actions.
• Computer Science
NeurIPS
• 2021
It is proved that any reward-free algorithm needs to sample at least Ω̃(Hdε⁻²) episodes to obtain an ε-optimal policy, and a new provably efficient algorithm, called UCRL-RFE, is proposed under the linear mixture MDP assumption, where the transition probability kernel of the MDP can be parameterized by a linear function over certain feature mappings defined on the triplet of state, action, and next state.
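The linear mixture structure described above can be sketched in a few lines. The names, dimensions, and basis models below are illustrative assumptions, not from the cited paper; the only point is that P(s′ | s, a) = ⟨θ, φ(s, a, s′)⟩ for a feature map φ defined on (state, action, next state):

```python
import numpy as np

# Sketch of a linear mixture MDP transition kernel, assuming d known basis
# kernels P_1..P_d and unknown mixing weights theta on the simplex:
#   P(s' | s, a) = <theta, phi(s, a, s')>,  phi_i(s, a, s') = P_i(s' | s, a).
rng = np.random.default_rng(0)
n_states, n_actions, d = 4, 2, 3

# d basis transition kernels; each (s, a) row is a distribution over s'
basis = rng.random((d, n_states, n_actions, n_states))
basis /= basis.sum(axis=-1, keepdims=True)

theta = np.array([0.5, 0.3, 0.2])  # mixing weights (illustrative values)

def phi(s, a, s_next):
    """Feature map on (state, action, next state): basis-model probabilities."""
    return basis[:, s, a, s_next]

def transition_prob(s, a, s_next):
    """P(s' | s, a) = <theta, phi(s, a, s')>."""
    return float(theta @ phi(s, a, s_next))

# Because theta lies on the simplex, each (s, a) induces a valid distribution.
total = sum(transition_prob(0, 1, sp) for sp in range(n_states))
print(round(total, 6))  # -> 1.0
```

Estimating θ from observed transitions is what algorithms like UCRL-RFE do; the sketch only illustrates the parameterization itself.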
• Computer Science
ICML
• 2020
An efficient algorithm is given that conducts episodes of exploration and returns near-optimal policies for an arbitrary number of reward functions, and a nearly matching $\Omega(S^2AH^2/\epsilon^2)$ lower bound is given, demonstrating the near-optimality of the algorithm in this setting.
• Computer Science
AISTATS
• 2022
An efficient algorithm is provided that takes only Õ(1/ε · (HSA/ρ + HSA)) episodes of exploration, and is able to obtain an ε-optimal policy for a post-revealed reward with suboptimality gap at least ρ, obtaining a nearly quadratic saving in terms of ε.
• Computer Science
NeurIPS
• 2020
This work shows how under a more standard notion of low inherent Bellman error, typically employed in least-square value iteration-style algorithms, this algorithm can provide strong PAC guarantees on learning a near optimal value function provided that the linear space is sufficiently "explorable".
• Computer Science
AISTATS
• 2021
A lower bound is provided showing that if the learner has oracle access to a policy that collects well-conditioned data then a variant of Lasso fitted Q-iteration enjoys a nearly dimension-free regret, which shows that in the large-action setting, the difficulty of learning can be attributed to the difficulties of finding a good exploratory policy.
• Computer Science
ICML
• 2022
It is shown that it is possible to obtain regret scaling as Õ(√(d³H³ · V₁⋆ · K) + d³…
• Computer Science
ArXiv
• 2020
A new efficient algorithm is given, which interacts with the environment for at most $O\left( \frac{S^2A}{\epsilon^2}\text{poly}\log\left(\frac{SAH}{\epsilon}\right) \right)$ episodes in the exploration phase, and guarantees to output a near-optimal policy for arbitrary reward functions in the planning phase.
• Computer Science
NeurIPS
• 2018
Q-learning with UCB exploration achieves $\tilde{O}(\sqrt{H^3SAT})$ regret in an episodic MDP setting, and this is the first analysis in the model-free setting that establishes $\sqrt{T}$ regret without requiring access to a "simulator."
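The optimistic Q-update at the heart of this style of algorithm can be sketched as follows. The toy dimensions, the bonus constant, and the log term are illustrative placeholders, not the exact quantities from the cited paper; the sketch only shows the shape of the update: optimistic initialization, learning rate αₜ = (H+1)/(H+t), and a count-based exploration bonus added to the target.

```python
import math
import numpy as np

# Minimal sketch of an optimistic (UCB-bonus) tabular Q-learning step.
# H is the horizon; Q is optimistically initialized at H.
n_states, n_actions, H = 3, 2, 5
Q = np.full((H, n_states, n_actions), float(H))
counts = np.zeros((H, n_states, n_actions), dtype=int)

def ucb_update(h, s, a, r, s_next, c=1.0):
    """One optimistic Q-update with learning rate alpha_t = (H+1)/(H+t)."""
    counts[h, s, a] += 1
    t = counts[h, s, a]
    alpha = (H + 1) / (H + t)
    # Exploration bonus shrinking like sqrt(1/t); the H^3 scaling and the
    # log factor here are placeholders for the real confidence term.
    bonus = c * math.sqrt(H**3 * math.log(2) / t)
    v_next = 0.0 if h + 1 == H else min(H, Q[h + 1, s_next].max())
    Q[h, s, a] = (1 - alpha) * Q[h, s, a] + alpha * (r + v_next + bonus)

ucb_update(0, 0, 1, r=0.5, s_next=2)
print(counts[0, 0, 1])  # -> 1
```

Because the bonus overestimates the statistical error, Q stays an upper bound on the optimal value function with high probability, which is what drives the regret analysis.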
• Computer Science
COLT
• 2022
This work shows that there exists a fundamental tradeoff between achieving low regret and identifying an ε-optimal policy at the instance-optimal rate, and proposes a new measure of instance-dependent sample complexity for PAC tabular reinforcement learning which explicitly accounts for the attainable state visitation distributions in the underlying MDP.