Corpus ID: 211677657

Policy-Aware Model Learning for Policy Gradient Methods

@article{Abachi2020PolicyAwareML,
  title={Policy-Aware Model Learning for Policy Gradient Methods},
  author={Romina Abachi and Mohammad Ghavamzadeh and Amir-massoud Farahmand},
  journal={ArXiv},
  year={2020},
  volume={abs/2003.00030}
}
This paper considers the problem of learning a model in model-based reinforcement learning (MBRL). We examine how the planning module of an MBRL algorithm uses the model, and propose that the model learning module should incorporate the way the planner is going to use the model. This is in contrast to conventional model learning approaches, such as those based on maximum likelihood estimation, that learn a predictive model of the environment without explicitly considering the interaction of the…
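The core idea lends itself to a compact sketch. The snippet below is a minimal PyTorch illustration of decision-aware model learning, not the paper's exact algorithm: the model is trained so that the policy gradient computed from its predictions matches the one estimated from real transitions. The one-step REINFORCE-style estimator, the Gaussian policy, and all network and variable names are illustrative assumptions.

```python
# Minimal sketch of the policy-aware idea: train the model so that the
# policy gradient it induces matches the one estimated from real data.
# The one-step estimator and all names are illustrative, not the paper's
# exact loss; returns/values are assumed given.
import torch
import torch.nn as nn

state_dim, action_dim = 4, 2
policy = nn.Linear(state_dim, action_dim)             # mean of a Gaussian policy
model = nn.Linear(state_dim + action_dim, state_dim)  # deterministic model

def log_prob(states, actions):
    # log N(a; policy(s), I), up to an additive constant
    mean = policy(states)
    return -0.5 * ((actions - mean) ** 2).sum(dim=1)

def pg_estimate(states, actions, returns):
    # REINFORCE-style policy-gradient estimate, flattened into one vector
    obj = (log_prob(states, actions) * returns).mean()
    grads = torch.autograd.grad(obj, policy.parameters(), create_graph=True)
    return torch.cat([g.reshape(-1) for g in grads])

# One batch of real transitions (placeholders for logged data)
s = torch.randn(32, state_dim)
a = torch.randn(32, action_dim)
G_real = torch.randn(32)                  # returns from the real environment

# Returns the planner would compute from model predictions; here just a
# stand-in value function V applied to the model's next-state prediction
V = nn.Linear(state_dim, 1)
s_next_hat = model(torch.cat([s, a], dim=1))
G_model = V(s_next_hat).squeeze(1)

# Policy-aware model loss: squared distance between the two gradients
g_data = pg_estimate(s, a, G_real).detach()
g_model = pg_estimate(s, a, G_model)
loss = ((g_model - g_data) ** 2).sum()
loss.backward()                           # gradients flow into the model
```

Note that the model loss differentiates through a policy gradient, so training the model is a second-order computation; this is why `create_graph=True` is needed in the sketch.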
Variational Model-based Policy Optimization
TLDR: This paper proposes model-based and model-free policy iteration (actor-critic) style algorithms for the E-step and shows how the variational distribution learned by them can be used to optimize the M-step in a fully model-based fashion.
The Value Equivalence Principle for Model-Based Reinforcement Learning
TLDR: It is argued that the limited representational resources of model-based RL agents are better used to build models that are directly useful for value-based planning, and the principle of value equivalence underlies a number of recent empirical successes in RL.
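To make the value-equivalence idea concrete, here is a tabular sketch under stated assumptions (a finite MDP with made-up sizes, and randomly generated candidate sets of policies and value functions): a model is scored purely by how well its Bellman operator agrees with the true one on those sets, rather than by predictive accuracy.

```python
# Tabular illustration of the value-equivalence principle: a model is judged
# only by whether its Bellman operator matches the true one on a chosen set
# of policies and functions. Sizes and the candidate sets are made up.
import numpy as np

n_states, n_actions, gamma = 5, 2, 0.9
rng = np.random.default_rng(0)

P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # true dynamics
r = rng.standard_normal((n_states, n_actions))                    # true rewards

def bellman(P_dyn, r_fn, pi, v):
    # (T_pi v)(s) = sum_a pi(a|s) [ r(s,a) + gamma * sum_s' P(s'|s,a) v(s') ]
    q = r_fn + gamma * P_dyn @ v             # shape (S, A)
    return (pi * q).sum(axis=1)

def value_equivalence_loss(P_hat, r_hat, policies, functions):
    # Sum of squared Bellman-operator mismatches over the chosen sets
    return sum(
        np.sum((bellman(P_hat, r_hat, pi, v) - bellman(P, r, pi, v)) ** 2)
        for pi in policies for v in functions
    )

policies = [rng.dirichlet(np.ones(n_actions), size=n_states) for _ in range(3)]
functions = [rng.standard_normal(n_states) for _ in range(3)]

P_hat = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
r_hat = rng.standard_normal((n_states, n_actions))
print(value_equivalence_loss(P_hat, r_hat, policies, functions))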
Control-Aware Representations for Model-based Reinforcement Learning
TLDR: This paper formulates an LCE model to learn representations that are suitable to be used by a policy iteration style algorithm in the latent space, and derives a loss function for CARL that has a close connection to the prediction, consistency, and curvature (PCC) principle for representation learning.
Complex Momentum for Learning in Games
TLDR: It is empirically demonstrated that complex-valued momentum can improve convergence in adversarial games (such as generative adversarial networks) by finding better solutions at an almost identical computational cost.
Forethought and Hindsight in Credit Assignment
We address the problem of credit assignment in reinforcement learning and explore fundamental questions regarding the way in which an agent can best use additional computation to propagate new…
Applications of reinforcement learning in energy systems
TLDR: The present study clearly demonstrates that even without the full utilization of RL capacity, this technique has considerable potential in resolving the continuously increasing complexity within the energy system domain.
Control-Oriented Model-Based Reinforcement Learning with Implicit Differentiation
TLDR: This work proposes an end-to-end approach for model learning which directly optimizes the expected returns using implicit differentiation, and provides theoretical and empirical evidence highlighting the benefits of this approach in the model misspecification regime compared to likelihood-based methods.
Minimax Model Learning
TLDR: A novel off-policy loss function for learning a transition model in model-based reinforcement learning that allows for greater robustness under model misspecification or distribution shift induced by learning/evaluating policies that are distinct from the data-generating policy.
Weighted model estimation for offline model-based reinforcement learning
This paper discusses model estimation in offline model-based reinforcement learning (MBRL), which is important for subsequent policy improvement using an estimated model. From the viewpoint of…
Decision-Aware Model Learning for Actor-Critic Methods: When Theory Does Not Meet Practice
Actor-Critic methods are a prominent class of modern reinforcement learning algorithms based on the classic Policy Iteration procedure. Despite many successful cases, Actor-Critic methods tend to…

References

Showing 1-10 of 85 references
Gradient-Aware Model-based Policy Search
TLDR: A novel model-based policy search approach that exploits the knowledge of the current agent policy to learn an approximate transition model, focusing on the portions of the environment that are most relevant for policy improvement.
Iterative Value-Aware Model Learning
TLDR: A new model-based reinforcement learning (MBRL) framework, called Iterative VAML, that incorporates the underlying decision problem in learning the transition model of the environment, and benefits from the structure of how the planning is performed (i.e., through approximate value iteration) to devise a simpler optimization problem.
A Survey on Policy Search for Robotics
TLDR: This work classifies model-free methods based on their policy evaluation strategy, policy update strategy, and exploration strategy, and presents a unified view on existing algorithms.
Algorithmic Framework for Model-based Reinforcement Learning with Theoretical Guarantees
TLDR: A novel algorithmic framework for designing and analyzing model-based RL algorithms with theoretical guarantees is introduced, and a meta-algorithm with a theoretical guarantee of monotone improvement to a local maximum of the expected reward is designed.
Model-Ensemble Trust-Region Policy Optimization
TLDR: This paper analyzes the behavior of vanilla model-based reinforcement learning methods when deep neural networks are used to learn both the model and the policy, and shows that the learned policy tends to exploit regions where insufficient data is available for the model to be learned, causing instability in training.
Value-Aware Loss Function for Model-based Reinforcement Learning
TLDR: This work argues that estimating a generative model that minimizes a probabilistic loss, such as the log-loss, is overkill because it does not take into account the underlying structure of the decision problem and the RL algorithm that intends to solve it.
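The contrast between a predictive loss and a value-aware loss can be shown in a few lines. This sketch assumes a deterministic model and a fixed value function V, as in the iterative variant above; all tensors are placeholders, and in practice V's parameters would be frozen while the model is fit.

```python
# Sample-based sketch contrasting a predictive (log-loss-style) objective
# with a value-aware one for a fixed value function V: the model is fit to
# reproduce V(s') rather than s' itself. All networks/data are placeholders.
import torch
import torch.nn as nn

state_dim, action_dim = 4, 2
model = nn.Linear(state_dim + action_dim, state_dim)  # predicts next state
V = nn.Linear(state_dim, 1)                           # current value estimate

s = torch.randn(64, state_dim)
a = torch.randn(64, action_dim)
s_next = torch.randn(64, state_dim)                   # observed next states

s_next_hat = model(torch.cat([s, a], dim=1))

mle_style_loss = ((s_next_hat - s_next) ** 2).mean()           # predict s'
value_aware_loss = ((V(s_next_hat) - V(s_next).detach()) ** 2).mean()
```

The value-aware loss only penalizes prediction errors that change the value of the predicted next state, which is exactly the error that matters to the planner.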
Least-Squares Policy Iteration
TLDR: The new algorithm, least-squares policy iteration (LSPI), learns the state-action value function, which allows for action selection without a model and for incremental policy improvement within a policy-iteration framework.
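A minimal sketch of the LSTD-Q solve at the heart of LSPI, assuming feature matrices for the observed state-action pairs and for the next states paired with the current policy's actions are already available (random stand-ins here):

```python
# Minimal LSTD-Q solve, the inner step of LSPI: fit Q(s,a) = phi(s,a).T @ w
# from a batch of transitions. Feature matrices here are random stand-ins.
import numpy as np

rng = np.random.default_rng(0)
n, k, gamma = 200, 8, 0.95

Phi = rng.standard_normal((n, k))        # phi(s_i, a_i)
Phi_next = rng.standard_normal((n, k))   # phi(s'_i, pi(s'_i)) under current pi
R = rng.standard_normal(n)               # rewards

A = Phi.T @ (Phi - gamma * Phi_next)
b = Phi.T @ R
w = np.linalg.solve(A, b)                # Q-weights for the current policy
```

The outer LSPI loop alternates this solve with greedification: the new policy picks the action maximizing phi(s, a).T @ w, and the solve is repeated until the policy stabilizes.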
Infinite-Horizon Policy-Gradient Estimation
TLDR: GPOMDP, a simulation-based algorithm for generating a biased estimate of the gradient of the average reward in Partially Observable Markov Decision Processes (POMDPs) controlled by parameterized stochastic policies, is introduced.
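The estimator itself is short. Below is a one-trajectory sketch of the episodic, discounted form of the GPOMDP estimator, in which each reward is credited only to the scores of actions taken at or before it; the score vectors and rewards are placeholder arrays that a real agent would compute from its policy.

```python
# One-trajectory GPOMDP-style estimate (episodic, discounted variant):
# each reward is multiplied by the running sum of score vectors of the
# actions taken so far. Scores and rewards are placeholder arrays.
import numpy as np

rng = np.random.default_rng(0)
T, d, gamma = 100, 6, 0.99

scores = rng.standard_normal((T, d))   # grad_theta log pi(a_t | s_t)
rewards = rng.standard_normal(T)

g_hat = np.zeros(d)
z = np.zeros(d)                        # eligibility: running sum of scores
for t in range(T):
    z += scores[t]
    g_hat += (gamma ** t) * rewards[t] * z
```

The original average-reward algorithm instead decays the eligibility trace (z = beta * z + score), which introduces the bias-variance trade-off the paper analyzes.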
Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models
TLDR: This paper proposes a new algorithm called probabilistic ensembles with trajectory sampling (PETS) that combines uncertainty-aware deep network dynamics models with sampling-based uncertainty propagation, which matches the asymptotic performance of model-free algorithms on several challenging benchmark tasks while requiring significantly fewer samples.
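A schematic of PETS-style planning follows, with several stated simplifications: random shooting instead of the paper's CEM optimizer, a deterministic toy "ensemble" in place of learned probabilistic networks, and a made-up quadratic reward.

```python
# Schematic of PETS-style planning: score random action sequences by rolling
# out particles through an ensemble of models, then execute the first action
# of the best sequence (MPC). Simplified: random shooting instead of CEM,
# toy linear models instead of learned probabilistic networks.
import numpy as np

rng = np.random.default_rng(0)
state_dim, action_dim = 3, 1
horizon, n_candidates, n_particles, n_models = 10, 64, 8, 5

# Toy ensemble: each member is a random linear dynamics model
ensemble = [rng.standard_normal((state_dim, state_dim + action_dim)) * 0.1
            for _ in range(n_models)]

def reward(state, action):
    return -np.sum(state ** 2) - 0.01 * np.sum(action ** 2)  # stand-in cost

def plan(state):
    plans = rng.uniform(-1, 1, size=(n_candidates, horizon, action_dim))
    returns = np.zeros(n_candidates)
    for i, actions in enumerate(plans):
        for _ in range(n_particles):
            s = state.copy()
            for a in actions:
                W = ensemble[rng.integers(n_models)]  # trajectory sampling
                s = W @ np.concatenate([s, a])
                returns[i] += reward(s, a) / n_particles
    return plans[np.argmax(returns)][0]               # MPC: first action only

print(plan(np.ones(state_dim)))
```

Resampling which ensemble member propagates each particle at each step is what lets the planner account for model uncertainty rather than a single point estimate of the dynamics.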