• Corpus ID: 3178672

# Variance Adjusted Actor Critic Algorithms

@article{Tamar2013VarianceAA,
  title={Variance Adjusted Actor Critic Algorithms},
  author={Aviv Tamar and Shie Mannor},
  journal={ArXiv},
  year={2013},
  volume={abs/1310.3697}
}
• Published 14 October 2013
• Computer Science, Mathematics
• ArXiv
We present an actor-critic framework for MDPs where the objective is the variance-adjusted expected return. Our critic uses linear function approximation, and we extend the concept of compatible features to the variance-adjusted setting. We present an episodic actor-critic algorithm and show that it converges almost surely to a locally optimal point of the objective function.
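The episodic scheme in the abstract can be illustrated with a minimal sketch. Everything below is illustrative rather than the paper's actual algorithm: a two-action Gaussian bandit, a softmax policy, and a running first-moment estimate standing in for the critic, used to follow the score-function gradient of J(θ) = E[R] − λ·Var[R].

```python
import numpy as np

def variance_adjusted_pg(lam=1.0, alpha=0.05, episodes=5000, seed=0):
    """Episodic sketch of a variance-penalized policy gradient.

    Objective: J(theta) = E[R] - lam * Var[R] on a two-action Gaussian
    bandit (an illustrative stand-in, not the paper's MDP setting):
      action 0 -> N(1.0, 1.0)   higher mean, high variance
      action 1 -> N(0.9, 0.01)  slightly lower mean, low variance
    For lam = 1 the variance penalty makes action 1 preferable.
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(2)                  # softmax preferences
    m1 = 0.0                             # running estimate of E[R] (the "critic")
    means = np.array([1.0, 0.9])
    stds = np.array([1.0, 0.1])
    for _ in range(episodes):
        p = np.exp(theta - theta.max()); p /= p.sum()
        a = rng.choice(2, p=p)
        r = rng.normal(means[a], stds[a])
        m1 += 0.05 * (r - m1)
        # grad Var[R] = E[R^2 * score] - 2 E[R] E[R * score], so a one-sample
        # estimate of grad J uses the weight  r - lam * (r^2 - 2 * m1 * r):
        score = -p                       # grad log pi(a | theta) for softmax
        score[a] += 1.0
        theta += alpha * (r - lam * (r * r - 2.0 * m1 * r)) * score
    p = np.exp(theta - theta.max()); p /= p.sum()
    return p
```

With the penalty active, the policy concentrates on the low-variance action even though its mean reward is slightly lower.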

### Variance Penalized On-Policy and Off-Policy Actor-Critic

• Computer Science
AAAI
• 2021
This paper proposes on-policy and off-policy actor-critic algorithms that optimize a performance criterion involving both the mean and the variance of the return, using a much simpler, recently proposed direct variance estimator that updates its estimates incrementally with temporal-difference methods.

### Continuous‐time mean–variance portfolio selection: A reinforcement learning framework

• Computer Science
Mathematical Finance
• 2020
It is proved that the optimal feedback policy for this problem must be Gaussian with time-decaying variance; a policy improvement theorem is then proved, based on which an implementable RL algorithm is devised.

### Variance-constrained actor-critic algorithms for discounted and average reward MDPs

• Computer Science
Machine Learning
• 2016
This paper considers both discounted and average reward Markov decision processes and devises actor-critic algorithms that operate on three timescales: a TD critic on the fastest timescale, a policy gradient (actor) on the intermediate timescale, and a dual ascent for Lagrange multipliers on the slowest timescale.
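The three-timescale structure can be written out schematically. The symbols below are generic rather than the paper's notation: $v$ is the critic's weight vector with features $\phi$, $\theta$ the policy parameters, $\lambda$ the Lagrange multiplier, and $c$ the variance constraint level.

```latex
\begin{aligned}
v_{t+1} &= v_t + \alpha_t\,\delta_t\,\phi(s_t)
  && \text{(TD critic, fastest)}\\
\theta_{t+1} &= \theta_t + \beta_t\,\widehat{\nabla_\theta L}(\theta_t,\lambda_t)
  && \text{(actor, intermediate)}\\
\lambda_{t+1} &= \Big[\lambda_t + \gamma_t\big(\widehat{\mathrm{Var}}(R) - c\big)\Big]_{+}
  && \text{(dual ascent, slowest)}
\end{aligned}
\qquad \text{with } \frac{\beta_t}{\alpha_t}\to 0,\quad \frac{\gamma_t}{\beta_t}\to 0.
```

The step-size conditions make each slower learner see the faster ones as having already equilibrated, which is what the convergence analysis of such multi-timescale schemes relies on.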

### Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds Globally Optimal Policy

• Computer Science
ArXiv
• 2020
This work makes the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with the variance risk criteria, and proposes an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.

### Continuous-Time Mean-Variance Portfolio Selection: A Reinforcement Learning Framework

• Computer Science
SSRN Electronic Journal
• 2019
This work establishes connections between the entropy-regularized MV problem and the classical MV problem, including the solvability equivalence and the convergence as the exploration weighting parameter decays to zero, and proves a policy improvement theorem, based on which an implementable RL algorithm is devised.

### Model-Based Actor-Critic with Chance Constraint for Stochastic System

• Computer Science
2021 60th IEEE Conference on Decision and Control (CDC)
• 2021
Experiments indicate that CCAC achieves good performance while guaranteeing safety, with a five times faster convergence rate compared with model-free RL methods, and has 100 times higher online computation efficiency than traditional safety techniques such as stochastic model predictive control.

### Continuous-Time Mean-Variance Portfolio Optimization via Reinforcement Learning

• Computer Science
ArXiv
• 2019
The PIT leads to an implementable RL algorithm that outperforms an adaptive control based method that estimates the underlying parameters in real-time and a state-of-the-art RL method that uses deep neural networks for continuous control problems by a large margin in nearly all simulations.

### Reward Constrained Policy Optimization

• Computer Science
ICLR
• 2019
This work presents a novel multi-timescale approach for constrained policy optimization, called 'Reward Constrained Policy Optimization' (RCPO), which uses an alternative penalty signal to guide the policy towards a constraint-satisfying one.

### Risk-Constrained Reinforcement Learning with Percentile Risk Criteria

• Computer Science
J. Mach. Learn. Res.
• 2017
This paper derives a formula for computing the gradient of the Lagrangian function for percentile risk-constrained Markov decision processes and devise policy gradient and actor-critic algorithms that estimate such gradient, update the policy in the descent direction, and update the Lagrange multiplier in the ascent direction.

### Directly Estimating the Variance of the λ-Return Using Temporal-Difference Methods

• Computer Science
• 2018
This paper investigates estimating the variance of a temporal-difference learning agent's update target using policy evaluation methods from reinforcement learning, contributing a method significantly simpler than prior methods that independently estimate the second moment of the λ-return.
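The "direct" estimator referenced here can be sketched in a few lines: the variance of the λ-return satisfies its own Bellman-style equation whose per-step "reward" is the squared TD error and whose discount is (γλ)². The 2-step chain, step sizes, and function name below are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

def direct_variance_td(alpha=0.02, episodes=4000, seed=0):
    """TD(0) sketch of the direct return-variance estimator.

    Illustrative 2-step chain: s0 -> s1 -> terminal, reward +/-1 at each
    step, gamma = lambda = 1.  True values are V = [0, 0] and the true
    return variances are Var = [2, 1].
    """
    rng = np.random.default_rng(seed)
    V = np.zeros(3)          # value estimates; index 2 is the terminal state
    W = np.zeros(3)          # direct variance estimates
    for _ in range(episodes):
        for s in (0, 1):
            r = rng.choice([-1.0, 1.0])
            delta = r + V[s + 1] - V[s]        # ordinary TD error
            V[s] += alpha * delta
            # variance update: "reward" = delta^2, discount = (gamma*lambda)^2 = 1
            W[s] += alpha * (delta ** 2 + W[s + 1] - W[s])
    return V, W
```

Because the second learner reuses the TD error already computed for the value estimate, it avoids the separate second-moment learner that prior approaches required.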

## References

Showing 1-10 of 19 references

### Actor-Critic Algorithms for Risk-Sensitive MDPs

• Computer Science
NIPS
• 2013
This paper considers both discounted and average reward Markov decision processes, devises actor-critic algorithms for estimating the gradient and updating the policy parameters in the ascent direction, and establishes the convergence of the algorithms to locally risk-sensitive optimal policies.

### Natural Actor-Critic

• Computer Science
Neurocomputing
• 2008

### A Survey of Actor-Critic Reinforcement Learning: Standard and Natural Policy Gradients

• Computer Science
IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews)
• 2012
The workings of the natural gradient, which has made its way into many actor-critic algorithms over the past few years, are described, and a review of several standard and natural actor-critic algorithms is given.

### Policy Gradient Methods for Reinforcement Learning with Function Approximation

• Computer Science
NIPS
• 1999
This paper proves for the first time that a version of policy iteration with arbitrary differentiable function approximation is convergent to a locally optimal policy.

### Simulation-based optimization of Markov reward processes

• Computer Science
Proceedings of the 37th IEEE Conference on Decision and Control (Cat. No.98CH36171)
• 1998
A simulation-based algorithm for optimizing the average reward in a Markov reward process that depends on a set of parameters where optimization takes place within a parametrized set of policies is proposed.

### TD algorithm for the variance of return and mean-variance reinforcement learning

• Computer Science
• 2001
A TD algorithm for estimating the variance of return in MDP (Markov decision process) environments and a gradient-based reinforcement learning algorithm for the variance-penalized criterion, a typical criterion in risk-avoiding control, are presented.

### Neuro-Dynamic Programming

• D. Bertsekas
• Computer Science
Encyclopedia of Optimization
• 2009
From the Publisher: This is the first textbook that fully explains the neuro-dynamic programming/reinforcement learning methodology, which is a recent breakthrough in the practical application of …

### Convergent multiple-timescales reinforcement learning algorithms in normal form games

• Computer Science
• 2003
Using two-timescale stochastic approximation, a model-free algorithm is introduced which is asymptotically equivalent to the smooth fictitious play algorithm, in that both result in asymptotic pseudotrajectories to the flow defined by the smooth best-response dynamics.

### Actor-Critic Algorithms

• Computer Science
NIPS
• 1999
This thesis proposes and studies actor-critic algorithms which combine the above two approaches with simulation to find the best policy among a parameterized class of policies, and proves convergence of the algorithms for problems with general state and decision spaces.