Corpus ID: 234790445

A Stochastic Composite Augmented Lagrangian Method For Reinforcement Learning

@article{Li2021ASC,
  title={A Stochastic Composite Augmented Lagrangian Method For Reinforcement Learning},
  author={Yongfeng Li and Mingming Zhao and Weijie Chen and Zaiwen Wen},
  journal={ArXiv},
  year={2021},
  volume={abs/2105.09716}
}
In this paper, we consider the linear programming (LP) formulation for deep reinforcement learning. The number of constraints depends on the size of the state and action spaces, which makes the problem intractable in large or continuous environments. The general augmented Lagrangian method suffers from the double-sampling obstacle when solving the LP. Namely, the conditional expectations originating from the constraint functions and the quadratic penalties in the augmented Lagrangian function impose…
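As a rough illustration of the setting described in the abstract, a schematic value-function LP for a discounted MDP and a generic augmented Lagrangian are sketched below (an assumed textbook form, not necessarily the paper's exact formulation). The quadratic penalty squares a conditional expectation over the next state, so an unbiased stochastic gradient would need two independent next-state samples per state-action pair, which is the double-sampling obstacle mentioned above.

```latex
% Schematic value-function LP for a discounted MDP (assumed form): one
% Bellman-inequality constraint per state-action pair, which is why the
% constraint count scales with |S| x |A|.
\begin{align*}
\min_{V}\quad & (1-\gamma)\,\mathbb{E}_{s\sim\mu_0}\big[V(s)\big] \\
\text{s.t.}\quad & V(s)\ \ge\ r(s,a) + \gamma\,\mathbb{E}_{s'\sim P(\cdot\mid s,a)}\big[V(s')\big]
  \qquad \forall\,(s,a).
\end{align*}
% With residual g(V;s,a) = r(s,a) + \gamma\,\mathbb{E}_{s'}[V(s')] - V(s),
% multipliers \lambda(s,a) \ge 0, and penalty parameter \rho > 0, one schematic
% augmented Lagrangian is
\begin{align*}
L_\rho(V,\lambda) \;=\; (1-\gamma)\,\mathbb{E}_{\mu_0}\big[V\big]
  + \mathbb{E}_{(s,a)}\Big[\lambda(s,a)\,g(V;s,a)
  + \tfrac{\rho}{2}\,\big(\max\{0,\,g(V;s,a)\}\big)^2\Big].
\end{align*}
% The penalty squares \mathbb{E}_{s'}[\cdot]; since \mathbb{E}[X]^2 \ne \mathbb{E}[X^2]
% in general, a single sampled transition gives a biased gradient, and two
% independent draws of s' would be needed (double sampling).
```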

Citations

A Near-Optimal Primal-Dual Method for Off-Policy Learning in CMDP

A near-optimal primal-dual learning algorithm called DPDL is proposed that provably guarantees zero constraint violation, and its sample complexity matches the above lower bound except for an $\tilde{O}((1-\gamma)^{-1})$ factor.

References

Showing 1-10 of 30 references

Deep Primal-Dual Reinforcement Learning: Accelerating Actor-Critic using Bellman Duality

A parameterized Primal-Dual $\pi$ Learning method based on deep neural networks is proposed for Markov decision processes with large state spaces and off-policy reinforcement learning; it significantly outperforms the one-step temporal-difference actor-critic method.

SBEED: Convergent Reinforcement Learning with Nonlinear Function Approximation

This paper revisits the Bellman equation and reformulates it into a novel primal-dual optimization problem using Nesterov's smoothing technique and the Legendre-Fenchel transformation, then develops a new algorithm, called Smoothed Bellman Error Embedding (SBEED), to solve this optimization problem, in which any differentiable function class may be used.
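The reformulation referenced here hinges on the Legendre-Fenchel (convex conjugate) representation of the square function, which trades a squared conditional expectation for a min-max problem whose inner term is linear in that expectation; a generic sketch of the trick (not SBEED's exact smoothed objective) follows.

```latex
% Conjugate trick for a squared Bellman residual (generic sketch).
% For any real x:  \tfrac{1}{2}x^2 = \max_{\nu\in\mathbb{R}} \big(\nu x - \tfrac{1}{2}\nu^2\big).
% Applying it pointwise with a dual function \nu(s,a):
\begin{align*}
\tfrac{1}{2}\,\mathbb{E}_{(s,a)}\Big[\big(\mathbb{E}_{s'}[\delta(s,a,s')]\big)^2\Big]
 \;=\; \max_{\nu}\ \mathbb{E}_{(s,a)}\Big[\nu(s,a)\,\mathbb{E}_{s'}\big[\delta(s,a,s')\big]
      \;-\; \tfrac{1}{2}\,\nu(s,a)^2\Big],
\end{align*}
% where \delta denotes the Bellman residual. The inner objective is linear in the
% conditional expectation, so a single sampled s' per (s,a) already yields an
% unbiased stochastic gradient; this is the standard route around double sampling.
```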

Borrowing From the Future: Addressing Double Sampling in Model-free Control

It is proved that BFF is close to unbiased stochastic gradient descent (SGD) when the underlying dynamics vary slowly with respect to actions.
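To make the double-sampling issue concrete, here is a small, self-contained Python sketch (an illustration from general knowledge, not code from the BFF paper): for the gradient of one half of the squared expected Bellman residual, the naive estimator that reuses a single sampled next state is biased, while using two independent next-state samples is unbiased.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: linear value function V_w(s) = w * phi(s), one (s, a) pair with
# reward r and two possible next states (features phi_next) of equal probability.
gamma, phi_s, r = 0.9, 1.0, 1.0
phi_next = np.array([0.0, 2.0])
probs = np.array([0.5, 0.5])

def residual(w, phi_sp):
    # Bellman residual delta = r + gamma * V(s') - V(s) for a sampled next state
    return r + gamma * w * phi_sp - w * phi_s

w = 0.3
# Exact gradient of 0.5 * (E_{s'}[delta])^2 with respect to w
mean_delta = np.sum(probs * residual(w, phi_next))
mean_grad = np.sum(probs * (gamma * phi_next - phi_s))
exact = mean_delta * mean_grad

# Monte Carlo: single-sample (biased) vs. two independent samples (unbiased)
n = 200_000
s1 = rng.choice(phi_next, size=n, p=probs)
s2 = rng.choice(phi_next, size=n, p=probs)                 # independent second draw
single = np.mean(residual(w, s1) * (gamma * s1 - phi_s))   # reuses the same s'
double = np.mean(residual(w, s1) * (gamma * s2 - phi_s))   # independent s'

print(f"exact grad            : {exact:.4f}")
print(f"single-sample (biased): {single:.4f}")
print(f"double-sample         : {double:.4f}")
```

BFF-style methods sidestep the need for the second independent draw by approximating it with information from subsequent transitions along the same trajectory, which is where the slow-variation assumption above comes in.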

Continuous control with deep reinforcement learning

This work presents an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces, and demonstrates that for many of the tasks the algorithm can learn policies end-to-end: directly from raw pixel inputs.

Boosting the Actor with Dual Critic

This paper proposes a new actor-critic-style algorithm called Dual Actor-Critic or Dual-AC, derived in a principled way from the Lagrangian dual form of the Bellman optimality equation, providing a more transparent way for learning the critic that is directly related to the objective function of the actor.

Neural Proximal/Trust Region Policy Optimization Attains Globally Optimal Policy

It is proved that a variant of PPO and TRPO equipped with overparametrized neural networks converges to the globally optimal policy at a sublinear rate.

Reinforcement Learning: An Introduction

This book provides a clear and simple account of the key ideas and algorithms of reinforcement learning, which ranges from the history of the field's intellectual foundations to the most recent developments and applications.

Adam: A Method for Stochastic Optimization

This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
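For reference, a minimal NumPy sketch of the Adam update with bias-corrected first- and second-moment estimates (an illustrative re-implementation of the published update rule, not the authors' code):

```python
import numpy as np

def adam_step(w, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; `state` holds (m, v, t)."""
    m, v, t = state
    t += 1
    m = beta1 * m + (1 - beta1) * grad        # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, (m, v, t)

# Usage: minimize f(w) = ||w||^2 (gradient 2w) from a random start.
w = np.random.default_rng(0).standard_normal(5)
state = (np.zeros_like(w), np.zeros_like(w), 0)
for _ in range(5000):
    w, state = adam_step(w, 2 * w, state, lr=1e-2)  # larger lr for this toy problem
print(np.linalg.norm(w))  # should be near zero
```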

Augmented Lagrangians and Applications of the Proximal Point Algorithm in Convex Programming

The theory of the proximal point algorithm for maximal monotone operators is applied to three algorithms for solving convex programs, one of which has not previously been formulated and is shown to have much the same convergence properties, but with some potential advantages.
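As background for the connection asserted here, a textbook sketch of the augmented Lagrangian (method of multipliers) iteration for an equality-constrained convex program, stated from general knowledge rather than from this reference:

```latex
% Problem: minimize f(x) subject to c(x) = 0, with multipliers \lambda and
% penalty parameter \rho > 0.
\begin{align*}
L_\rho(x,\lambda) &= f(x) + \lambda^{\top} c(x) + \tfrac{\rho}{2}\,\|c(x)\|^2, \\
x^{k+1} &\in \arg\min_{x}\ L_\rho(x,\lambda^{k}), \\
\lambda^{k+1} &= \lambda^{k} + \rho\, c(x^{k+1}).
\end{align*}
% Rockafellar's observation: in the convex case, this multiplier update is the
% proximal point algorithm applied to the dual problem, which is what yields the
% convergence properties mentioned above.
```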

Proximal Policy Optimization Algorithms

We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent.
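For context, the clipped variant of that surrogate objective can be sketched as follows (a minimal NumPy illustration of the standard clipped loss with clip parameter eps, not the authors' implementation):

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Clipped PPO surrogate, returned as a loss (negated, so it can be minimized).

    logp_new / logp_old: log-probabilities of the taken actions under the current
    policy and the data-collecting policy; advantages: advantage estimates.
    """
    ratio = np.exp(logp_new - logp_old)                    # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantages
    return -np.mean(np.minimum(unclipped, clipped))        # pessimistic (lower) bound

# Usage with dummy data:
rng = np.random.default_rng(0)
print(ppo_clip_loss(rng.normal(size=64), rng.normal(size=64), rng.normal(size=64)))
```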