Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning
A new way of predicting the performance of a reinforcement learning policy from historical data that may have been generated by a different policy. The approach extends the doubly robust estimator and introduces a new way to blend model-based estimates with importance-sampling estimates.
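A minimal sketch of the per-step doubly robust idea the summary describes: importance-sampled rewards are combined with a model-based control variate. The function names `pi_e`, `pi_b`, `q_hat`, and `v_hat` are illustrative placeholders, not the paper's API.

```python
import numpy as np

def doubly_robust_estimate(trajectories, pi_e, pi_b, q_hat, v_hat, gamma=0.99):
    """Sketch of a per-step doubly robust off-policy estimate of pi_e's return.

    trajectories: list of [(s, a, r), ...] generated by the behavior policy
    pi_e, pi_b:   functions giving pi(a | s) for evaluation/behavior policies
    q_hat, v_hat: approximate model-based Q and V for pi_e (hypothetical names)
    """
    estimates = []
    for traj in trajectories:
        rho = 1.0          # running importance weight
        total = 0.0
        for t, (s, a, r) in enumerate(traj):
            rho_prev = rho
            rho *= pi_e(a, s) / pi_b(a, s)
            # importance-sampled reward, with the model as a control variate:
            # subtract q_hat under the new weight, add back v_hat under the old one
            total += gamma**t * (rho * r - rho * q_hat(s, a) + rho_prev * v_hat(s))
        estimates.append(total)
    return float(np.mean(estimates))
```

With a zero model (`q_hat = v_hat = 0`) this reduces to per-decision importance sampling; a good model shrinks the variance contributed by the importance weights.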
High-Confidence Off-Policy Evaluation
This paper proposes an off-policy method for computing a lower confidence bound on the expected return of a policy, together with guarantees on the accuracy of its estimates.
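A simplified sketch of the kind of bound involved, assuming per-trajectory importance-sampled returns clipped to a known range and a basic Hoeffding inequality. The paper develops considerably tighter concentration-based and statistical bounds; this only illustrates the shape of a high-confidence lower bound.

```python
import numpy as np

def hoeffding_lower_bound(is_returns, delta=0.05, b=1.0):
    """1 - delta lower confidence bound on the mean of importance-sampled
    returns assumed to lie in [0, b], via Hoeffding's inequality.

    is_returns: per-trajectory importance-weighted return estimates
    delta:      allowed failure probability of the bound
    """
    x = np.clip(np.asarray(is_returns, dtype=float), 0.0, b)
    n = len(x)
    # Hoeffding: mean exceeds this value with probability at least 1 - delta
    return float(x.mean() - b * np.sqrt(np.log(1.0 / delta) / (2.0 * n)))
```

The bound tightens as more historical trajectories are collected, which is what makes it usable as a safety test before deploying a new policy.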
High Confidence Policy Improvement
We present a batch reinforcement learning (RL) algorithm that provides probabilistic guarantees about the quality of each policy that it proposes, and which has no hyper-parameters that require expert tuning.
Value Function Approximation in Reinforcement Learning Using the Fourier Basis
The Fourier basis is described: a linear value-function approximation scheme based on the Fourier series that performs well compared to radial basis functions and the polynomial basis, and is competitive with learned proto-value functions.
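The scheme above is simple to state concretely. Assuming the state is normalized to the unit hypercube, an order-n Fourier basis uses one cosine feature per integer coefficient vector:

```python
import numpy as np

def fourier_features(state, order):
    """Order-n Fourier basis for a state normalized to [0, 1]^d.

    Produces one feature phi_c(s) = cos(pi * c . s) for every integer
    coefficient vector c in {0, ..., order}^d, i.e. (order + 1)^d features.
    """
    s = np.asarray(state, dtype=float)
    d = len(s)
    # enumerate all coefficient vectors c in {0, ..., order}^d
    coeffs = np.indices((order + 1,) * d).reshape(d, -1).T
    return np.cos(np.pi * coeffs @ s)
```

A value function is then approximated linearly as `w @ fourier_features(s, order)`, with the weight vector `w` learned by a standard method such as Sarsa(lambda).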
Increasing the Action Gap: New Operators for Reinforcement Learning
An operator for tabular representations, the consistent Bellman operator, is described. It incorporates a notion of local policy consistency, which increases the action gap at each state; a larger gap mitigates the undesirable effects of approximation and estimation errors on the induced greedy policies.
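A tabular sketch of the consistent Bellman operator under the assumption of a finite MDP given as explicit transition and reward arrays: on a self-transition it backs up Q(s, a) itself rather than max_b Q(s, b), which is what widens the action gap relative to the standard operator.

```python
import numpy as np

def consistent_bellman_backup(Q, P, R, gamma=0.9):
    """One sweep of the consistent Bellman operator on a tabular MDP.

    Q: (S, A) action-value table
    P: (S, A, S) transition probabilities
    R: (S, A) expected rewards
    """
    S, A = Q.shape
    V = Q.max(axis=1)                      # standard greedy backup values, (S,)
    Q_new = np.empty_like(Q)
    for s in range(S):
        for a in range(A):
            # local policy consistency: when the transition stays at s,
            # back up Q(s, a) instead of max_b Q(s, b)
            next_vals = V.copy()
            next_vals[s] = Q[s, a]
            Q_new[s, a] = R[s, a] + gamma * P[s, a] @ next_vals
    return Q_new
```

On a single self-looping state the fixed point becomes Q(s, a) = R(s, a) / (1 - gamma), so the gap between the best and second-best action grows by a factor of 1 / (1 - gamma) compared with the standard operator.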
Safe Reinforcement Learning
Proximal Reinforcement Learning: A New Theory of Sequential Decision Making in Primal-Dual Spaces
A new vision of reinforcement learning is set forth, one that yields mathematically rigorous solutions to longstanding questions that have remained unresolved. Proximal operator theory enables the systematic development of operator-splitting methods that show how to safely and reliably decompose complex products of gradients.
Personalized Ad Recommendation Systems for Life-Time Value Optimization with Guarantees
Results show that an RL algorithm equipped with these off-policy evaluation techniques outperforms myopic approaches, and give fundamental insights on the difference between the click-through rate (CTR) and life-time value (LTV) metrics for evaluating the performance of a PAR algorithm.
Learning Action Representations for Reinforcement Learning
This work provides an algorithm to both learn and use action representations, gives conditions for its convergence, and demonstrates the efficacy of the proposed method on large-scale real-world problems.
Preventing undesirable behavior of intelligent machines
A general framework for algorithm design is introduced in which the burden of avoiding undesirable behavior is shifted from the user to the designer of the algorithm; this framework simplifies the problem of specifying and regulating undesirable behavior.