Corpus ID: 210023574

Reinforcement Learning via Fenchel-Rockafellar Duality

@article{Nachum2020ReinforcementLV,
  title={Reinforcement Learning via Fenchel-Rockafellar Duality},
  author={Ofir Nachum and Bo Dai},
  journal={ArXiv},
  year={2020},
  volume={abs/2001.01866}
}
We review basic concepts of convex duality, focusing on the very general and supremely useful Fenchel-Rockafellar duality. We summarize how this duality may be applied to a variety of reinforcement learning (RL) settings, including policy evaluation or optimization, online or offline learning, and discounted or undiscounted rewards. The derivations yield a number of intriguing results, including the ability to perform policy evaluation and on-policy policy gradient with behavior-agnostic… 
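For readers skimming this page, the duality named in the title can be stated compactly. The following is a standard textbook form of the convex (Fenchel) conjugate and the Fenchel-Rockafellar duality theorem, given here only as background; sign conventions and regularity conditions vary across sources and may differ from the paper's exact notation.

% Convex (Fenchel) conjugate of a proper convex function f:
f^{*}(y) \;=\; \sup_{x}\, \langle x, y \rangle - f(x)

% Fenchel-Rockafellar duality: for proper convex f and g, a linear map A with
% adjoint A^{*}, and a suitable constraint qualification,
\min_{x}\; f(x) + g(Ax) \;=\; \max_{y}\; -f^{*}(-A^{*}y) - g^{*}(y)

Roughly speaking, in the RL settings summarized in the abstract the primal variable plays the role of a value (or Q-) function, the dual variable a state-action visitation distribution, and the linear operator a Bellman-style backup; estimating the dual side from samples is what enables the behavior-agnostic, offline procedures mentioned above.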

Citations

Efficient Performance Bounds for Primal-Dual Reinforcement Learning from Demonstrations
TLDR
To bridge the gap between theory and practice, a novel bilinear saddle-point framework using Lagrangian duality is introduced, and a model-free, provably efficient algorithm is developed through the lens of stochastic convex optimization.
Bellman Residual Orthogonalization for Offline Reinforcement Learning
TLDR
A new reinforcement learning principle is introduced that approximates the Bellman equations by enforcing their validity only along a user-defined space of test functions, and an oracle inequality is proved for the resulting policy optimization procedure in terms of a trade-off between the value and uncertainty of an arbitrary comparator policy.
Near Optimal Policy Optimization via REPS
TLDR
This paper considers the practical setting of stochastic gradients, and introduces a technique that uses generative access to the underlying Markov decision process to compute parameter updates that maintain favorable convergence to the optimal regularized policy.
A Functional Mirror Descent Perspective on Reinforcement Learning
TLDR
This work argues for a much tighter integration of ideas from (online) convex optimization into the design of RL algorithms for continuous MDPs, by showing how a number of existing approaches can be framed as approximate mirror descent on the space of probability measures and by indicating yet-unexplored directions uncovered by this perspective.
Offline Reinforcement Learning with Realizability and Single-policy Concentrability
TLDR
A simple algorithm based on the primal-dual formulation of MDPs, in which the dual variables are modeled using a density-ratio function against offline data, is shown to enjoy polynomial sample complexity under only realizability and single-policy concentrability.
Variational Policy Gradient Method for Reinforcement Learning with General Utilities
TLDR
A new Variational Policy Gradient Theorem for RL with general utilities is derived, which establishes that the parametrized policy gradient may be obtained as the solution of a stochastic saddle point problem involving the Fenchel dual of the utility function.
The $f$-Divergence Reinforcement Learning Framework
TLDR
This paper presents a novel DRL framework, termed f-Divergence Reinforcement Learning (FRL), which trains agents by minimizing the f-divergence between the learning policy and the sampling policy, in contrast to conventional DRL algorithms that aim to maximize the expected cumulative reward.
Combing Policy Evaluation and Policy Improvement in a Unified f-Divergence Framework
TLDR
This paper derives a novel DRL framework, termed f-Divergence Reinforcement Learning (FRL), which achieves two advantages: (1) policy evaluation and policy improvement processes are derived simultaneously via f-divergence; (2) the overestimation issue of the value function is alleviated.
Convex Regularization in Monte-Carlo Tree Search
TLDR
This paper introduces a unifying theory on the use of generic convex regularizers in MCTS, derives the corresponding regret analysis, provides guarantees of an exponential convergence rate, and empirically evaluates the proposed operators in AlphaGo and AlphaZero on problems of increasing dimensionality and branching factor.
Marginalized Operators for Off-policy Reinforcement Learning
TLDR
It is shown that the estimates for marginalized operators can be computed in a scalable way, which also generalizes prior results on marginalized importance sampling as special cases.
...

References

SHOWING 1-10 OF 62 REFERENCES
A unified view of entropy-regularized Markov decision processes
TLDR
A general framework for entropy-regularized average-reward reinforcement learning in Markov decision processes (MDPs) is proposed, showing that using the conditional entropy of the joint state-action distributions as regularization yields a dual optimization problem closely resembling the Bellman optimality equations.
Efficiently Breaking the Curse of Horizon: Double Reinforcement Learning in Infinite-Horizon Processes
TLDR
A new estimator based on Double Reinforcement Learning (DRL) is proposed that leverages this problem structure for OPE; it remains efficient when both nuisance functions are estimated at slow, nonparametric rates, and remains consistent when either one is estimated consistently.
Dual Representations for Dynamic Programming
TLDR
A dual approach to dynamic programming and reinforcement learning is proposed, based on maintaining an explicit representation of visit distributions as opposed to value functions, offering a viable alternative to standard dynamic programming techniques and opening new avenues for developing algorithms for sequential decision making.
A Divergence Minimization Perspective on Imitation Learning Methods
TLDR
A unified probabilistic perspective on IL algorithms based on divergence minimization is presented; it identifies IRL's state-marginal matching objective as the main contributor to its superior performance and applies this understanding of IL methods to the problem of state-marginal matching.
Imitation Learning via Off-Policy Distribution Matching
TLDR
This work shows how the original distribution ratio estimation objective may be transformed in a principled manner to yield a completely off-policy objective and calls the resulting algorithm ValueDICE, finding that it can achieve state-of-the-art sample efficiency and performance.
Minimax Weight and Q-Function Learning for Off-Policy Evaluation
TLDR
A new estimator, MWL, is introduced that directly estimates importance ratios over the state-action distributions, removing the reliance on knowledge of the behavior policy as in prior work.
Doubly Robust Off-policy Value Evaluation for Reinforcement Learning
TLDR
This work extends the doubly robust estimator for bandits to sequential decision-making problems, which gets the best of both worlds: it is guaranteed to be unbiased and can have a much lower variance than the popular importance sampling estimators.
Policy Gradient Methods for Reinforcement Learning with Function Approximation
TLDR
This paper proves for the first time that a version of policy iteration with arbitrary differentiable function approximation is convergent to a locally optimal policy.
Learning Robust Rewards with Adversarial Inverse Reinforcement Learning
TLDR
It is demonstrated that AIRL is able to recover reward functions that are robust to changes in dynamics, enabling us to learn policies even under significant variation in the environment seen during training.
A new Q(lambda) with interim forward view and Monte Carlo equivalence
TLDR
A new version of Q(λ) is introduced that achieves this interim-forward-view and Monte Carlo equivalence without significantly increased algorithmic complexity, along with a new derivation technique based on the forward-view/backward-view analysis familiar from TD(λ), extended to apply at every time step rather than only at the end of episodes.
...