Corpus ID: 3178672

Variance Adjusted Actor Critic Algorithms

@article{Tamar2013VarianceAA,
  title={Variance Adjusted Actor Critic Algorithms},
  author={Aviv Tamar and Shie Mannor},
  journal={ArXiv},
  year={2013},
  volume={abs/1310.3697}
}
We present an actor-critic framework for MDPs where the objective is the variance-adjusted expected return. Our critic uses linear function approximation, and we extend the concept of compatible features to the variance-adjusted setting. We present an episodic actor-critic algorithm and show that it converges almost surely to a locally optimal point of the objective function. 
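For concreteness, the following is a minimal, illustrative sketch of an episodic policy-gradient update for a variance-penalized objective of the form J(θ) = E[R] − μ·Var(R), using a linear softmax policy and the score-function identity ∇Var(R) = ∇E[R²] − 2·E[R]·∇E[R]. It is a stand-in written for this summary, not the compatible-features actor-critic of the paper; the policy parameterization, the penalty weight mu, and the data format are assumptions.

```python
import numpy as np

def softmax_policy(theta, features):
    """Action probabilities of a linear softmax policy; theta has shape
    (n_features, n_actions), features has shape (n_features,)."""
    logits = features @ theta
    logits -= logits.max()                       # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def grad_log_policy(theta, features, action):
    """Gradient of log pi(action | state) w.r.t. theta for the softmax policy."""
    probs = softmax_policy(theta, features)
    indicator = np.zeros_like(probs)
    indicator[action] = 1.0
    return np.outer(features, indicator - probs)

def variance_penalized_update(theta, episodes, mu=0.1, lr=0.01):
    """One batch policy-gradient step on J(theta) = E[R] - mu * Var(R).

    episodes: list of (trajectory, R) pairs, where trajectory is a list of
    (features, action) tuples and R is the episode's total return.
    """
    returns = np.array([R for _, R in episodes])
    mean_return = returns.mean()
    grad_mean = np.zeros_like(theta)             # estimate of grad E[R]
    grad_second = np.zeros_like(theta)           # estimate of grad E[R^2]
    for trajectory, R in episodes:
        score = sum(grad_log_policy(theta, f, a) for f, a in trajectory)
        grad_mean += R * score
        grad_second += (R ** 2) * score
    grad_mean /= len(episodes)
    grad_second /= len(episodes)
    grad_var = grad_second - 2.0 * mean_return * grad_mean
    return theta + lr * (grad_mean - mu * grad_var)   # gradient ascent step
```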
Variance Penalized On-Policy and Off-Policy Actor-Critic
TLDR
This work addresses one source of variability in an RL setup via mean-variance optimization and modifies the standard policy gradient objective to include a direct variance estimator, learning policies that maximize the variance-penalized return.
Variance-constrained actor-critic algorithms for discounted and average reward MDPs
TLDR
This paper considers both discounted and average reward Markov decision processes and devises actor-critic algorithms that operate on three timescales: a TD critic on the fastest timescale, a policy gradient (actor) on the intermediate timescale, and a dual ascent for Lagrange multipliers on the slowest timescale.
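As a rough illustration of the timescale separation such algorithms rely on, the learning rates simply decay at different speeds so that the critic effectively sees a frozen actor and the actor sees a frozen multiplier. The concrete schedules below are assumptions chosen for illustration, not taken from the paper.

```python
def step_sizes(t):
    """Illustrative three-timescale schedules: critic fastest, actor slower,
    Lagrange multiplier slowest (ratios of slower to faster rates vanish)."""
    alpha_critic = 1.0 / (t + 1) ** 0.55   # fastest: TD critic
    beta_actor   = 1.0 / (t + 1) ** 0.80   # intermediate: policy gradient
    eta_dual     = 1.0 / (t + 1) ** 1.00   # slowest: dual ascent on the multiplier
    return alpha_critic, beta_actor, eta_dual
```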
Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds Globally Optimal Policy
TLDR
This work makes the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with the variance risk criteria, and proposes an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
Continuous-Time Mean-Variance Portfolio Selection: A Reinforcement Learning Framework
TLDR
This work establishes connections between the entropy-regularized MV and the classical MV, including the solvability equivalence and the convergence as exploration weighting parameter decays to zero, and proves a policy improvement theorem, based on which an implementable RL algorithm is devised.
Continuous-Time Mean-Variance Portfolio Optimization via Reinforcement Learning
TLDR
The policy improvement theorem (PIT) leads to an implementable RL algorithm that outperforms, by a large margin in nearly all simulations, both an adaptive-control-based method that estimates the underlying parameters in real time and a state-of-the-art RL method that uses deep neural networks for continuous control problems.
Reward Constrained Policy Optimization
TLDR
This work presents a novel multi-timescale approach for constrained policy optimization, called 'Reward Constrained Policy Optimization' (RCPO), which uses an alternative penalty signal to guide the policy towards a constraint-satisfying one.
Risk-Constrained Reinforcement Learning with Percentile Risk Criteria
TLDR
This paper derives a formula for computing the gradient of the Lagrangian function for percentile risk-constrained Markov decision processes and devises policy gradient and actor-critic algorithms that estimate this gradient, update the policy in the descent direction, and update the Lagrange multiplier in the ascent direction.
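The update pattern described in this entry (descent on the policy, ascent on the multiplier) can be sketched generically as below; grad_cost, grad_risk, risk, and the threshold are placeholder callables and constants assumed for illustration and are not the paper's CVaR-specific estimators.

```python
def primal_dual_step(theta, lam, grad_cost, grad_risk, risk, threshold,
                     lr_theta=1e-2, lr_lam=1e-3):
    """One step on the Lagrangian L(theta, lam) = cost + lam * (risk - threshold)."""
    # Policy parameters move in the descent direction of the Lagrangian.
    theta = theta - lr_theta * (grad_cost(theta) + lam * grad_risk(theta))
    # The multiplier moves in the ascent direction, projected to stay non-negative.
    lam = max(0.0, lam + lr_lam * (risk(theta) - threshold))
    return theta, lam
```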
Directly Estimating the Variance of the λ-Return Using Temporal-Difference Methods
TLDR
This paper investigates estimating the variance of a temporal-difference learning agent's update target using policy evaluation methods from reinforcement learning, contributing a method significantly simpler than prior methods that independently estimate the second moment of the λ-return.
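To make the idea of a "direct" estimator concrete, here is a hedged sketch under linear function approximation: the value function is learned with TD(0), and a second learner treats the squared TD error as its per-step reward with discount (γλ)², so the variance is estimated without a separate second-moment estimate. The feature map, step size, and data format are assumptions for illustration, not a restatement of the paper's algorithm.

```python
import numpy as np

def direct_lambda_return_variance(transitions, phi, n_features,
                                  gamma=0.99, lam=0.9, alpha=0.05):
    """Sketch: learn V(s) with TD(0) and, in parallel, a direct estimate of
    Var(G^lambda | s) whose per-step 'reward' is the squared TD error and
    whose discount is (gamma * lam) ** 2. Assumes linear features."""
    w_v = np.zeros(n_features)      # value weights
    w_var = np.zeros(n_features)    # variance weights
    kappa = (gamma * lam) ** 2      # discount used by the variance learner
    for s, r, s_next, done in transitions:
        f, f_next = phi(s), phi(s_next)
        v_next = 0.0 if done else f_next @ w_v
        delta = r + gamma * v_next - f @ w_v           # TD error of the value learner
        var_next = 0.0 if done else f_next @ w_var
        var_target = delta ** 2 + kappa * var_next     # Bellman-style variance target
        w_v += alpha * delta * f
        w_var += alpha * (var_target - f @ w_var) * f
    return w_v, w_var
```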

References

SHOWING 1-10 OF 26 REFERENCES
Temporal Difference Methods for the Variance of the Reward To Go
TLDR
This paper proposes variants of both TD(0) and LSTD(λ) with linear function approximation, proves their convergence, and demonstrates their utility in a 4-dimensional continuous state space problem.
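As a rough sketch of the second-moment approach in this line of work, the snippet below runs TD(0)-style updates for both the value V(s) ≈ E[G | s] and the second moment M(s) ≈ E[G² | s] under linear function approximation, and recovers the variance as M − V². The feature map, step size, and data format are assumptions for illustration rather than the exact algorithm from the paper.

```python
import numpy as np

def td0_value_and_second_moment(transitions, phi, n_features,
                                gamma=0.99, alpha=0.05):
    """TD(0)-style sketch with linear features for V(s) ~ E[G | s] and
    M(s) ~ E[G^2 | s]; the variance estimate is M(s) - V(s)^2."""
    w_v = np.zeros(n_features)   # value (first moment) weights
    w_m = np.zeros(n_features)   # second-moment weights
    for s, r, s_next, done in transitions:
        f, f_next = phi(s), phi(s_next)
        v_next = 0.0 if done else f_next @ w_v
        m_next = 0.0 if done else f_next @ w_m
        v_target = r + gamma * v_next
        # Second-moment target: E[(r + gamma*G')^2]
        #   = r^2 + 2*gamma*r*E[G'] + gamma^2 * E[G'^2]
        m_target = r ** 2 + 2.0 * gamma * r * v_next + gamma ** 2 * m_next
        w_v += alpha * (v_target - f @ w_v) * f
        w_m += alpha * (m_target - f @ w_m) * f
    variance = lambda state: max(phi(state) @ w_m - (phi(state) @ w_v) ** 2, 0.0)
    return w_v, w_m, variance
```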
Actor-Critic Algorithms for Risk-Sensitive MDPs
TLDR
This paper considers both discounted and average reward Markov decision processes, devises actor-critic algorithms for estimating the gradient and updating the policy parameters in the ascent direction, and establishes the convergence of the algorithms to locally risk-sensitive optimal policies.
Natural Actor-Critic
This paper investigates a novel model-free reinforcement learning architecture, the Natural Actor-Critic. The actor updates are based on stochastic policy gradients employing Amari's natural gradient.
A Survey of Actor-Critic Reinforcement Learning: Standard and Natural Policy Gradients
TLDR
The workings of the natural gradient, which has made its way into many actor-critic algorithms over the past few years, are described, and a review of several standard and natural actor-critic algorithms is given.
Policy Gradients with Variance Related Risk Criteria
TLDR
A framework for local policy-gradient-style reinforcement learning algorithms is presented for variance-related criteria, i.e., criteria that involve both the expected cost and the variance of the cost.
Policy Gradient Methods for Reinforcement Learning with Function Approximation
TLDR
This paper proves for the first time that a version of policy iteration with arbitrary differentiable function approximation is convergent to a locally optimal policy.
TD algorithm for the variance of return and mean-variance reinforcement learning
TLDR
A TD algorithm for estimating the variance of the return in MDP (Markov decision process) environments is presented, along with a gradient-based reinforcement learning algorithm for the variance-penalized criterion, a typical criterion in risk-avoiding control.
Algorithmic aspects of mean-variance optimization in Markov decision processes
TLDR
It is proved that the complexity of computing a policy that maximizes the mean reward under a variance constraint is NP-hard for some cases, and strongly NP-hard for others.
Simulation-based optimization of Markov reward processes
TLDR
This paper proposes a simulation-based algorithm for optimizing the average reward in a finite-state Markov reward process that depends on a set of parameters and relies on the regenerative structure of finite-state Markov processes.
Neuro-Dynamic Programming
From the Publisher: This is the first textbook that fully explains the neuro-dynamic programming/reinforcement learning methodology, which is a recent breakthrough in the practical application of …