Publications
An Optimistic Perspective on Offline Reinforcement Learning
TLDR
It is demonstrated that recent off-policy deep RL algorithms, even when trained solely on the logged replay data of a DQN agent, can outperform the fully trained DQN agent, and Random Ensemble Mixture (REM), a robust Q-learning algorithm that enforces optimal Bellman consistency on random convex combinations of multiple Q-value estimates, is presented.
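As a rough sketch of the random-convex-combination idea (head count, mixture sampling, and array shapes are illustrative assumptions, not the paper's exact implementation):

import numpy as np

def rem_targets(q_heads_next, rewards, dones, gamma=0.99):
    # q_heads_next: target-network Q-values, shape (K, batch, num_actions).
    k = q_heads_next.shape[0]
    # Draw a random convex combination over the K heads.
    alphas = np.random.uniform(size=k)
    alphas /= alphas.sum()
    # Mix the heads, then take the greedy value of the mixture.
    mixed_q_next = np.tensordot(alphas, q_heads_next, axes=1)  # (batch, num_actions)
    greedy_value = mixed_q_next.max(axis=-1)                   # (batch,)
    # Standard one-step Bellman target against the mixed estimate.
    return rewards + gamma * (1.0 - dones) * greedy_value

The same mixture weights would also be applied to the online heads when forming the TD error, so that Bellman consistency is enforced on the combined estimate rather than on each head separately.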
AlgaeDICE: Policy Gradient from Arbitrary Experience
TLDR
A new formulation of max-return optimization is presented that allows the problem to be re-expressed as an expectation over an arbitrary behavior-agnostic, off-policy data distribution; it is shown that, if the auxiliary dual variables of the objective are optimized, the gradient of the off-policy objective is exactly the on-policy policy gradient, without any use of importance weighting.
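For reference, the on-policy policy gradient that the off-policy objective's gradient is claimed to recover has the standard form (up to a normalization constant, writing $d^{\pi_\theta}$ for the discounted state-action occupancy of the current policy):

$$ \nabla_\theta J(\pi_\theta) \;=\; \mathbb{E}_{(s,a)\sim d^{\pi_\theta}}\big[\, Q^{\pi_\theta}(s,a)\, \nabla_\theta \log \pi_\theta(a\mid s) \,\big], $$

whereas the objective itself is evaluated only under the fixed off-policy data distribution.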
On the Global Convergence Rates of Softmax Policy Gradient Methods
TLDR
It is shown that, with access to the true gradient, policy gradient with a softmax parametrization converges at an $O(1/t)$ rate, with constants depending on the problem and the initialization, significantly strengthening recent asymptotic convergence results.
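Written out with schematic constants (the precise dependence is in the paper), the rate statement for the softmax parametrization $\pi_\theta(a\mid s) \propto \exp(\theta_{s,a})$ takes the form

$$ V^{*}(\rho) - V^{\pi_{\theta_t}}(\rho) \;\le\; \frac{C(\mathrm{MDP},\, \theta_0)}{t}, $$

where $\theta_t$ is the iterate after $t$ exact gradient-ascent steps and $C$ depends on the problem and the initialization $\theta_0$.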
GenDICE: Generalized Offline Estimation of Stationary Values
TLDR
This work proves the consistency of the proposed method under general conditions, provides a detailed error analysis, and demonstrates strong empirical performance on benchmark tasks, including offline PageRank and off-policy policy evaluation.
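At a high level, DICE-style estimation targets the stationary density ratio between the evaluation policy's distribution and the data distribution (the notation below follows a common DICE convention and is an assumption, not necessarily the paper's):

$$ \tau(s,a) \;=\; \frac{d^{\pi}(s,a)}{d^{\mathcal{D}}(s,a)}, \qquad \rho(\pi) \;=\; \mathbb{E}_{(s,a)\sim d^{\mathcal{D}}}\big[\tau(s,a)\, r(s,a)\big], $$

so that once $\tau$ is estimated from offline data, the value of $\pi$ (or, in the PageRank case, a stationary distribution) is obtained by simple re-weighting.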
Systolic Peak Detection in Acceleration Photoplethysmograms Measured from Emergency Responders in Tropical Conditions
TLDR
A novel algorithm is proposed that can detect systolic peaks under challenging conditions, such as those of emergency responders working in tropical environments; by avoiding manual threshold determination, it is well suited to real-time applications.
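A minimal, generic adaptive-threshold peak detector is sketched below purely for illustration; the window length and decision rule are assumptions, and this is not the paper's algorithm:

import numpy as np

def detect_systolic_peaks(ppg, fs, window_s=0.75):
    # ppg: 1-D photoplethysmogram samples; fs: sampling rate in Hz.
    ppg = np.asarray(ppg, dtype=float)
    win = max(1, int(window_s * fs))
    # Adaptive threshold: centered moving average, no manual tuning required.
    threshold = np.convolve(ppg, np.ones(win) / win, mode="same")
    peaks = []
    for i in range(1, len(ppg) - 1):
        # Local maximum that also rises above the adaptive threshold.
        if ppg[i] > threshold[i] and ppg[i] >= ppg[i - 1] and ppg[i] > ppg[i + 1]:
            peaks.append(i)
    return peaks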
Domain Aggregation Networks for Multi-Source Domain Adaptation
TLDR
The proposed algorithm, Domain AggRegation Network (DARN), dynamically adjusts the weight of each source domain during training so that more relevant domains are given greater importance for adaptation.
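The general flavour of source re-weighting can be sketched as follows; the softmax-over-discrepancy rule here is an illustrative assumption, not DARN's actual weighting scheme:

import numpy as np

def weighted_source_loss(source_losses, source_discrepancies, temperature=1.0):
    # Give more weight to source domains that look closer to the target
    # (i.e. lower estimated discrepancy). Purely illustrative weighting rule.
    w = np.exp(-np.asarray(source_discrepancies, dtype=float) / temperature)
    w /= w.sum()
    # Aggregate the per-source losses with the domain weights.
    return float(np.dot(w, source_losses)), w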
Off-Policy Evaluation via the Regularized Lagrangian
TLDR
Unifying DICE estimators as regularized Lagrangians of the same linear program reveals that dual solutions offer greater flexibility in navigating the tradeoff between optimization stability and estimation bias, and generally provide superior estimates in practice.
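Schematically (notation assumed), the linear program in question evaluates the policy through a state-action occupancy constrained by the Bellman flow equations,

$$ \rho(\pi) \;=\; \mathbb{E}_{(s,a)\sim d^{\pi}}\big[r(s,a)\big], \qquad d^{\pi}(s,a) \;=\; (1-\gamma)\,\mu_0(s)\,\pi(a\mid s) \;+\; \gamma\,\pi(a\mid s)\sum_{s',a'} P(s\mid s',a')\, d^{\pi}(s',a'), $$

and the various DICE estimators arise as Lagrangians of this program with different regularizers and different choices of which primal or dual variables to solve for.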
EMaQ: Expected-Max Q-Learning Operator for Simple Yet Effective Offline and Online RL
TLDR
A novel backup operator, Expected-Max Q-Learning (EMaQ), is presented; it naturally restricts learned policies to remain within the support of the offline dataset without any explicit regularization, while retaining desirable theoretical properties such as contraction.
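The expected-max backup can be written roughly as follows (notation assumed), with $\mu$ the behavior policy underlying the offline data and $N$ controlling how aggressively the operator maximizes:

$$ (\mathcal{T}^{N}_{\mu} Q)(s,a) \;=\; r(s,a) \;+\; \gamma\, \mathbb{E}_{s'}\Big[\, \mathbb{E}_{a'_1,\dots,a'_N \sim \mu(\cdot\mid s')}\big[\, \max_{1\le i\le N} Q(s', a'_i) \,\big] \Big]; $$

with $N=1$ this evaluates $\mu$ itself, and as $N$ grows it approaches the standard max backup while only ever maximizing over actions that have support under $\mu$.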
CoinDICE: Off-Policy Confidence Interval Estimation
TLDR
This work proposes CoinDICE, a novel and efficient algorithm for computing confidence intervals in high-confidence, behavior-agnostic off-policy evaluation in reinforcement learning, and proves that the obtained confidence intervals are valid in both asymptotic and finite-sample regimes.
Variational Rejection Sampling
TLDR
This work proposes a novel rejection sampling step that discards samples from the variational posterior that are assigned low likelihoods by the model, providing an arbitrarily accurate approximation of the true posterior at the expense of extra computation.
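A simplified rejection step of this flavour might look like the sketch below; the acceptance rule and the threshold parameter are assumptions for illustration, not the paper's exact scheme:

import numpy as np

def resample_posterior(sample_q, log_joint, log_q, log_threshold, max_tries=1000):
    # Draw from the variational posterior q and keep a sample only if the
    # model assigns it a sufficiently high (unnormalized) posterior weight.
    z = sample_q()
    for _ in range(max_tries):
        z = sample_q()
        log_w = log_joint(z) - log_q(z)                 # log importance weight
        accept_prob = min(1.0, np.exp(log_w - log_threshold))
        if np.random.uniform() < accept_prob:
            break
    return z  # falls back to the last draw if nothing is accepted

Raising the threshold makes the accepted samples track the true posterior more closely, at the cost of more rejected draws, which matches the accuracy-versus-computation tradeoff described above.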
...
...