Corpus ID: 239998350

Towards Hyperparameter-free Policy Selection for Offline Reinforcement Learning

@inproceedings{Zhang2021TowardsHP,
  title={Towards Hyperparameter-free Policy Selection for Offline Reinforcement Learning},
  author={Siyuan Zhang and Nan Jiang},
  booktitle={NeurIPS},
  year={2021}
}
How to select between policies and value functions produced by different training algorithms in offline reinforcement learning (RL)—which is crucial for hyperparameter tuning—is an important open question. Existing approaches based on off-policy evaluation (OPE) often require additional function approximation and hence hyperparameters, creating a chicken-and-egg situation. In this paper, we design hyperparameter-free algorithms for policy selection based on BVFT [XJ21], a recent theoretical… 
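The entries that follow are papers citing this work. For context, the BVFT tournament from [XJ21] that the paper builds on can be sketched as below; the interface (precomputed Q-values on the dataset, a fixed discretization resolution eps) is an illustrative assumption, and the paper's own algorithms go further by removing the need to hand-pick that resolution.

    import numpy as np

    def bvft_scores(q_sa, v_next, rewards, gamma, eps):
        # q_sa[k]:   Q_k(s_i, a_i) over the dataset, shape (N,)
        # v_next[k]: max_a' Q_k(s'_i, a') over the dataset, shape (N,)
        # eps:       resolution used to discretize the pairwise partitions
        K, N = len(q_sa), len(rewards)
        scores = np.zeros(K)
        for i in range(K):
            target = rewards + gamma * v_next[i]   # sampled Bellman backup of Q_i
            worst = 0.0
            for j in range(K):
                # Partition the data by the joint (discretized) level sets of Q_i and Q_j.
                keys = np.stack([np.floor(q_sa[i] / eps), np.floor(q_sa[j] / eps)], axis=1)
                cells = {}
                for n in range(N):
                    cells.setdefault(tuple(keys[n]), []).append(n)
                # Project the backup onto piecewise-constant functions over that
                # partition: within each cell the projection is just the cell mean.
                proj = np.empty(N)
                for idx in cells.values():
                    proj[idx] = target[idx].mean()
                worst = max(worst, np.sqrt(np.mean((q_sa[i] - proj) ** 2)))
            scores[i] = worst   # worst-case pairwise projected Bellman error
        return scores           # select the candidate Q (and its greedy policy) with the smallest score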
Hyperparameter Selection Methods for Fitted Q-Evaluation with Error Guarantee
TLDR
A framework of approximate hyperparameter selection (AHS) for FQE is proposed, which defines a notion of optimality in a quantitative and interpretable manner without hyperparameters, and derives four AHS methods, each with different characteristics such as distribution-mismatch tolerance and time complexity.
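For reference, fitted Q-evaluation itself, whose hyperparameters (regressor class, its settings, number of iterations) are the objects being selected, is roughly the loop below; the sklearn regressor and the (state, action, reward, next state) tuple format are illustrative assumptions.

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    def fitted_q_evaluation(transitions, pi, gamma, n_iters):
        # transitions: list of (s, a, r, s') with s and a given as flat numpy arrays
        # pi(s) -> action (flat numpy array) chosen by the target policy at state s
        X      = np.array([np.concatenate([s, a]) for s, a, _, _ in transitions])
        R      = np.array([r for _, _, r, _ in transitions])
        X_next = np.array([np.concatenate([sp, pi(sp)]) for _, _, _, sp in transitions])

        q = None
        for _ in range(n_iters):
            # Regression target: one-step Bellman backup under the target policy.
            y = R if q is None else R + gamma * q.predict(X_next)
            q = GradientBoostingRegressor().fit(X, y)   # regressor choice/settings are the hyperparameters
        return q   # estimate of Q^pi; average its predictions at (s_0, pi(s_0)) for the return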
Exploiting Action Impact Regularity and Exogenous State Variables for Offline Reinforcement Learning
TLDR
This work proposes an algorithm that exploits the AIR property, provides theoretical guarantees for the output policy, and demonstrates that the algorithm outperforms existing offline reinforcement learning algorithms across different data-collection policies in simulated and real-world environments where the regularity holds.
Adversarially Trained Actor Critic for Offline Reinforcement Learning
TLDR
It is proved that, when the actor attains no regret in the two-player game, running ATAC produces a policy that provably outperforms the behavior policy over a wide range of hyperparameters, and competes with the best policy covered by data with appropriately chosen hyperparameters.
Model Selection in Batch Policy Optimization
TLDR
This work identifies three sources of error that any model selection algorithm should optimally trade off in order to be competitive, and shows that relaxing any one of the three error sources enables the design of algorithms achieving near-oracle inequalities for the remaining two.
RAMBO-RL: Robust Adversarial Model-Based Offline Reinforcement Learning
TLDR
RAMBO is presented, a novel approach to model-based offline RL that addresses the problem as a two-player zero-sum game against an adversarial environment model, resulting in a PAC performance guarantee and a pessimistic value function which lower-bounds the value function in the true environment.
Don't Change the Algorithm, Change the Data: Exploratory Data for Offline Reinforcement Learning
TLDR
It is suggested that data generation is as important as algorithmic advances for offline RL and hence requires careful consideration from the community, and that exploratory data allows vanilla off-policy RL algorithms to outperform or match state-of-the-art offline RL algorithms on downstream tasks.
Incorporating Explicit Uncertainty Estimates into Deep Offline Reinforcement Learning
TLDR
This work develops a novel method for incorporating scalable uncertainty estimates into an offline reinforcement learning algorithm called deep-SPIBB, which extends the SPIBB family of algorithms to environments with larger state and action spaces, and argues that the SPIBB mechanism for incorporating uncertainty is more robust and flexible than pessimistic approaches that incorporate the uncertainty as a value function penalty.
How Far I'll Go: Offline Goal-Conditioned Reinforcement Learning via f-Advantage Regression
TLDR
It is demonstrated that GoFAR’s training objectives can be re-purposed to learn an agent-independent goal-conditioned planner from purely offline source-domain data, which enables zero-shot transfer to new target domains and significantly outperforms prior state-of-the-art methods.
Reinforcement Learning in Practice: Opportunities and Challenges
This article is a gentle discussion of the field of reinforcement learning in practice, covering opportunities and challenges and touching on a broad range of topics, with perspectives and without technical details.
Beyond the Return: Off-policy Function Estimation under User-specified Error-measuring Distributions
Off-policy evaluation often refers to two related tasks: estimating the expected return of a policy and estimating its value function (or other functions of interest, such as density ratios). While…

References

SHOWING 1-10 OF 41 REFERENCES
Off-Policy Deep Reinforcement Learning without Exploration
TLDR
This paper introduces a novel class of off-policy algorithms, batch-constrained reinforcement learning, which restricts the action space in order to force the agent towards behaving close to on-policy with respect to a subset of the given data.
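The original (continuous-action) method learns a generative model of the batch actions to impose this restriction; for discrete actions the same idea is often written as a simple mask on the Q-learning target, roughly as in the sketch below, where the behavior-cloning probabilities and the threshold tau are illustrative assumptions.

    import numpy as np

    def batch_constrained_target(q_next, bc_probs, rewards, dones, gamma, tau=0.3):
        # q_next:   (N, A) Q-values at the next states
        # bc_probs: (N, A) estimated action probabilities of the data-collecting policy at the next states
        # Only actions whose relative probability under the data exceeds tau may be
        # maximized over, keeping the backup close to actions the batch supports.
        allowed  = bc_probs / bc_probs.max(axis=1, keepdims=True) > tau
        masked_q = np.where(allowed, q_next, -np.inf)
        return rewards + gamma * (1.0 - dones) * masked_q.max(axis=1)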
Hyperparameter Selection for Offline Reinforcement Learning
TLDR
This work focuses on offline hyperparameter selection, i.e. methods for choosing the best policy from a set of many policies trained using different hyperparameters, given only logged data, and shows that offline RL algorithms are not robust to hyperparameter choices.
Conservative Q-Learning for Offline Reinforcement Learning
TLDR
Conservative Q-learning (CQL) is proposed, which aims to address limitations of offline RL methods by learning a conservative Q-function such that the expected value of a policy under this Q-function lower-bounds its true value.
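Concretely, the conservative critic is trained with a regularized Bellman objective roughly of the form below (alpha is the penalty weight, mu a chosen action-sampling distribution, and \hat{\mathcal{B}}^{\pi} the empirical Bellman backup; exact CQL variants differ in how the first term is instantiated):

    \min_Q \; \alpha \Big( \mathbb{E}_{s \sim \mathcal{D},\, a \sim \mu(\cdot \mid s)}[Q(s,a)] - \mathbb{E}_{(s,a) \sim \mathcal{D}}[Q(s,a)] \Big) + \tfrac{1}{2}\, \mathbb{E}_{(s,a,s') \sim \mathcal{D}}\Big[ \big( Q(s,a) - \hat{\mathcal{B}}^{\pi} \bar{Q}(s,a) \big)^2 \Big]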
Provably Good Batch Reinforcement Learning Without Great Exploration
TLDR
It is shown that a small modification to the Bellman optimality and evaluation backups, taking a more conservative update, can give much stronger guarantees on the performance of the output policy; in certain settings, such methods can find the approximately best policy within the state-action space explored by the batch data, without requiring a priori assumptions of concentrability.
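As an illustration of such a conservative update (a sketch, not the paper's exact operator): the backup only trusts next-state actions with sufficient data coverage, e.g.

    (T_b Q)(s,a) = r(s,a) + \gamma\, \mathbb{E}_{s'}\Big[ \max_{a'} \mathbb{1}\{\hat{\mu}(s',a') \ge b\}\, Q(s',a') \Big],

where \hat{\mu} estimates the data distribution and b is a coverage threshold.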
Minimax Weight and Q-Function Learning for Off-Policy Evaluation
TLDR
A new estimator, MWL, is introduced that directly estimates importance ratios over the state-action distributions, removing the reliance on knowledge of the behavior policy as in prior work.
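In this framework the return of the target policy \pi is estimated as \mathbb{E}_{\mathcal{D}}[w(s,a)\, r], with the weight function w learned against a discriminator class \mathcal{F} (the choice of \mathcal{F} and of the class for w is left open here) via a minimax objective of roughly the form

    \min_{w} \max_{f \in \mathcal{F}} \Big( \mathbb{E}_{(s,a,s') \sim \mathcal{D}}\big[ w(s,a)\,\big( \gamma\, f(s',\pi) - f(s,a) \big) \big] + (1-\gamma)\, \mathbb{E}_{s_0 \sim d_0}\big[ f(s_0,\pi) \big] \Big)^2,

where f(s,\pi) := \mathbb{E}_{a \sim \pi(\cdot \mid s)}[f(s,a)].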
Doubly Robust Off-policy Value Evaluation for Reinforcement Learning
TLDR
This work extends the doubly robust estimator for bandits to sequential decision-making problems, which gets the best of both worlds: it is guaranteed to be unbiased and can have a much lower variance than the popular importance sampling estimators.
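The sequential doubly robust estimator referred to here combines per-step importance weights \rho_t = \pi(a_t \mid s_t)/\mu(a_t \mid s_t) with value-function control variates through the backward recursion

    V_{DR}^{(t)} = \hat{V}(s_t) + \rho_t \big( r_t + \gamma\, V_{DR}^{(t+1)} - \hat{Q}(s_t,a_t) \big), \qquad V_{DR}^{(H+1)} = 0,

and reports V_{DR}^{(1)} averaged over trajectories; the importance weights keep the estimate unbiased while \hat{Q} and \hat{V} serve to reduce variance.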
Batch Policy Learning under Constraints
TLDR
A new and simple method for off-policy policy evaluation (OPE) is proposed, with PAC-style bounds, and achieves strong empirical results in different domains, including a challenging problem of simulated car driving subject to multiple constraints such as lane keeping and smooth driving.
Model Selection in Reinforcement Learning
TLDR
A complexity regularization-based model selection algorithm is proposed and its adaptivity is proved: the procedure is shown to perform almost as well as if the best parameter setting were known ahead of time.
MOReL : Model-Based Offline Reinforcement Learning
TLDR
Theoretically, it is shown that MOReL is minimax optimal (up to log factors) for offline RL, and through experiments, it matches or exceeds state-of-the-art results in widely studied offline RL benchmarks.
Benchmarking Batch Deep Reinforcement Learning Algorithms
TLDR
This paper benchmark the performance of recent off-policy and batch reinforcement learning algorithms under unified settings on the Atari domain, with data generated by a single partially-trained behavioral policy, and finds that many of these algorithms underperform DQN trained online with the same amount of data.