# Variance Adjusted Actor Critic Algorithms

@article{Tamar2013VarianceAA, title={Variance Adjusted Actor Critic Algorithms}, author={Aviv Tamar and Shie Mannor}, journal={ArXiv}, year={2013}, volume={abs/1310.3697} }

We present an actor-critic framework for MDPs where the objective is the variance-adjusted expected return. Our critic uses linear function approximation, and we extend the concept of compatible features to the variance-adjusted setting. We present an episodic actor-critic algorithm and show that it converges almost surely to a locally optimal point of the objective function.
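As a rough illustration of what a variance-adjusted expected return looks like numerically, the sketch below scores a batch of sampled episodic returns as the mean minus a penalty times the variance. The function name `variance_adjusted_objective` and the penalty weight `mu` are assumptions made for this example and are not taken from the paper; the paper's algorithm estimates these quantities with a critic rather than from raw samples.

```python
def variance_adjusted_objective(returns, mu):
    """Sample estimate of E[B] - mu * Var[B] for episodic returns B.

    `mu` is a hypothetical penalty weight chosen by the user.
    """
    n = len(returns)
    mean_return = sum(returns) / n
    # Var[B] = E[B^2] - (E[B])^2, using the second moment of the return.
    second_moment = sum(r * r for r in returns) / n
    variance = second_moment - mean_return ** 2
    return mean_return - mu * variance


# Two sampled returns: mean 2.0, second moment 5.0, variance 1.0.
score = variance_adjusted_objective([1.0, 3.0], mu=0.5)
```

With `mu = 0` the score reduces to the ordinary expected return; larger values of `mu` penalize policies whose returns are more spread out.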

## 30 Citations

### Variance Penalized On-Policy and Off-Policy Actor-Critic

- Computer Science, AAAI
- 2021

This paper proposes on-policy and off-policy actor-critic algorithms that optimize a performance criterion involving both mean and variance in the return, and uses a much simpler recently proposed direct variance estimator which updates the estimates incrementally using temporal difference methods.

### Continuous‐time mean–variance portfolio selection: A reinforcement learning framework

- Computer Science, Mathematical Finance
- 2020

It is proved that the optimal feedback policy for this problem must be Gaussian, with time‐decaying variance; a policy improvement theorem is then proved, based on which an implementable RL algorithm is devised.

### Variance-constrained actor-critic algorithms for discounted and average reward MDPs

- Computer Science, Machine Learning
- 2016

This paper considers both discounted and average reward Markov decision processes and devises actor-critic algorithms that operate on three timescales: a TD critic on the fastest timescale, a policy gradient (actor) on the intermediate timescale, and a dual ascent for Lagrange multipliers on the slowest timescale.
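The three-timescale structure can be sketched with step-size schedules whose ratios vanish, so that each slower component sees the faster ones as nearly converged. The exponents, the scalar parameters, and the names `step_sizes` and `three_timescale_update` below are illustrative assumptions, not the schedules or update rules from the cited paper; `constraint_gap` is assumed to mean the estimated variance minus the allowed bound.

```python
def step_sizes(k):
    """Return (critic, actor, dual) step sizes at iteration k >= 1.

    Exponents lie in (0.5, 1], so each schedule sums to infinity while its
    squares sum to a finite value; actor/critic and dual/actor ratios go to 0.
    """
    critic = k ** -0.55  # fastest timescale: largest steps
    actor = k ** -0.80   # intermediate timescale
    dual = 1.0 / k       # slowest timescale: smallest steps
    return critic, actor, dual


def three_timescale_update(w, theta, lam, critic_dir, actor_dir, constraint_gap, k):
    """One hypothetical update of critic weight w, policy parameter theta,
    and Lagrange multiplier lam, given update directions from the caller."""
    a, b, c = step_sizes(k)
    w = w + a * critic_dir
    theta = theta + b * actor_dir
    # Dual ascent: raise lam when the constraint is violated, keep it >= 0.
    lam = max(0.0, lam + c * constraint_gap)
    return w, theta, lam


critic_step, actor_step, dual_step = step_sizes(100)
```

The projection `max(0.0, ...)` keeps the multiplier nonnegative, which is the usual requirement for an inequality constraint.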

### Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds Globally Optimal Policy

- Computer Science, ArXiv
- 2020

This work makes the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with the variance risk criteria, and proposes an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.

### Continuous-Time Mean-Variance Portfolio Selection: A Reinforcement Learning Framework

- Computer Science, SSRN Electronic Journal
- 2019

This work establishes connections between the entropy-regularized MV and the classical MV, including the solvability equivalence and the convergence as the exploration weighting parameter decays to zero, and proves a policy improvement theorem, based on which an implementable RL algorithm is devised.

### Model-Based Actor-Critic with Chance Constraint for Stochastic System

- Computer Science, 2021 60th IEEE Conference on Decision and Control (CDC)
- 2021

Experiments indicate that CCAC achieves good performance while guaranteeing safety, with a five times faster convergence rate compared with model-free RL methods, and has 100 times higher online computation efficiency than traditional safety techniques such as stochastic model predictive control.

### Continuous-Time Mean-Variance Portfolio Optimization via Reinforcement Learning

- Computer Science, ArXiv
- 2019

The PIT leads to an implementable RL algorithm that outperforms an adaptive control based method that estimates the underlying parameters in real-time and a state-of-the-art RL method that uses deep neural networks for continuous control problems by a large margin in nearly all simulations.

### Reward Constrained Policy Optimization

- Computer Science, ICLR
- 2019

This work presents a novel multi-timescale approach for constrained policy optimization, called 'Reward Constrained Policy Optimization' (RCPO), which uses an alternative penalty signal to guide the policy towards a constraint-satisfying one.

### Risk-Constrained Reinforcement Learning with Percentile Risk Criteria

- Computer Science, J. Mach. Learn. Res.
- 2017

This paper derives a formula for computing the gradient of the Lagrangian function for percentile risk-constrained Markov decision processes and devises policy gradient and actor-critic algorithms that estimate this gradient, update the policy in the descent direction, and update the Lagrange multiplier in the ascent direction.

### Directly Estimating the Variance of the λ-Return Using Temporal-Difference Methods

- Computer Science
- 2018

This paper investigates estimating the variance of a temporal-difference learning agent's update target using policy evaluation methods from reinforcement learning, contributing a method significantly simpler than prior methods that independently estimate the second moment of the λ-return.
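One way to estimate return variance with policy-evaluation machinery, sketched here for the plain return (λ = 1) only, is to treat the squared TD error as a reward and γ² as the discount; this recursion holds exactly when the value estimates are the true values. The function names and the toy problem below are assumptions for illustration and are not the cited paper's code or its exact λ-return formulation.

```python
def td_error(reward, value_s, value_next, gamma):
    """Standard one-step TD error under a given value function."""
    return reward + gamma * value_next - value_s


def variance_td_update(var_s, var_next, delta, gamma, alpha):
    """TD-style update of a variance estimate for state s.

    Uses delta**2 as the reward and gamma**2 as the discount.
    """
    target = delta ** 2 + (gamma ** 2) * var_next
    return var_s + alpha * (target - var_s)


# Toy problem: a single state that terminates immediately with reward +1 or -1.
# Its true value is 0 and the true variance of its return is 1.
var_est = 0.0
for reward in [1.0, -1.0] * 50:
    delta = td_error(reward, value_s=0.0, value_next=0.0, gamma=0.9)
    var_est = variance_td_update(var_est, var_next=0.0, delta=delta,
                                 gamma=0.9, alpha=0.1)
```

In this toy problem the squared TD error is always 1, so the estimate approaches the true variance of 1 geometrically at rate `1 - alpha`.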

## References

### Actor-Critic Algorithms for Risk-Sensitive MDPs

- Computer Science, NIPS
- 2013

This paper considers both discounted and average reward Markov decision processes, devises actor-critic algorithms for estimating the gradient and updating the policy parameters in the ascent direction, and establishes the convergence of the algorithms to locally risk-sensitive optimal policies.

### A Survey of Actor-Critic Reinforcement Learning: Standard and Natural Policy Gradients

- Computer Science, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews)
- 2012

The workings of the natural gradient, which has made its way into many actor-critic algorithms over the past few years, are described, and a review of several standard and natural actor-critic algorithms is given.

### Policy Gradient Methods for Reinforcement Learning with Function Approximation

- Computer Science, NIPS
- 1999

This paper proves for the first time that a version of policy iteration with arbitrary differentiable function approximation is convergent to a locally optimal policy.

### Simulation-based optimization of Markov reward processes

- Computer Science, Proceedings of the 37th IEEE Conference on Decision and Control (Cat. No.98CH36171)
- 1998

A simulation-based algorithm is proposed for optimizing the average reward in a Markov reward process that depends on a set of parameters, where optimization takes place within a parametrized set of policies.

### TD algorithm for the variance of return and mean-variance reinforcement learning

- Computer Science
- 2001

A TD algorithm for estimating the variance of return in Markov decision process (MDP) environments and a gradient-based reinforcement learning algorithm for the variance-penalized criterion, a typical criterion in risk-avoiding control, are presented.

### Algorithmic aspects of mean-variance optimization in Markov decision processes

- Computer Science, Eur. J. Oper. Res.
- 2013

### Neuro-Dynamic Programming

- Computer Science, Encyclopedia of Optimization
- 2009

From the Publisher:
This is the first textbook that fully explains the neuro-dynamic programming/reinforcement learning methodology, which is a recent breakthrough in the practical application of…

### Convergent multiple-timescales reinforcement learning algorithms in normal form games

- Computer Science
- 2003

Using two-timescales stochastic approximation, a model-free algorithm is introduced which is asymptotically equivalent to the smooth fictitious play algorithm, in that both result in asymptotic pseudotrajectories to the flow defined by the smooth best response dynamics.

### Actor-Critic Algorithms

- Computer Science, NIPS
- 1999

This thesis proposes and studies actor-critic algorithms which combine the above two approaches with simulation to find the best policy among a parameterized class of policies, and proves convergence of the algorithms for problems with general state and decision spaces.