Concurrent Credit Assignment for Data-efficient Reinforcement Learning

@article{Dauce2022ConcurrentCA,
  title={Concurrent Credit Assignment for Data-efficient Reinforcement Learning},
  author={Emmanuel Dauc{\'e}},
  journal={ArXiv},
  year={2022},
  volume={abs/2205.12020}
}
The capability to widely sample the state and action spaces is a key ingredient toward building effective reinforcement learning algorithms. The variational optimization principles exposed in this paper emphasize the importance of an occupancy model that synthesizes the general distribution of the agent's environmental states over which it can act (defining a virtual "territory"). The occupancy model is frequently updated as exploration progresses and new states are…
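The abstract is cut off, but the "occupancy model" it describes builds on the standard notion of a discounted state-occupancy measure. A minimal background sketch of that definition follows; the notation is assumed here and may differ from the paper's own.

```latex
% Discounted state-occupancy measure of a policy \pi
% (standard textbook definition, not taken from this paper)
\rho_\pi(s) \;=\; (1-\gamma) \sum_{t=0}^{\infty} \gamma^{t} \, \Pr(s_t = s \mid \pi)
```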


References

Showing 1-10 of 35 references
Efficient Exploration via State Marginal Matching
TLDR
This work recasts exploration as a problem of State Marginal Matching (SMM) and demonstrates that agents that directly optimize the SMM objective explore faster and adapt more quickly to new tasks than prior exploration methods.
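As context for this summary, the SMM objective matches the policy's state marginal to a target distribution. A schematic form, with p^{*} an assumed target state distribution:

```latex
% State Marginal Matching (schematic): minimize the KL divergence between
% the policy's state marginal \rho_\pi and a target distribution p^*
\min_{\pi}\; D_{\mathrm{KL}}\!\left(\rho_\pi(s) \,\|\, p^{*}(s)\right)
\;=\; \max_{\pi}\; \mathbb{E}_{\rho_\pi}\!\left[\log p^{*}(s)\right] + \mathcal{H}\!\left[\rho_\pi\right]
```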
Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
TLDR
This paper proposes soft actor-critic, an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework, which achieves state-of-the-art performance on a range of continuous control benchmark tasks, outperforming prior on-policy and off-policy methods.
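The maximum-entropy objective behind soft actor-critic augments the return with a policy-entropy bonus; a sketch, with temperature \alpha trading off reward against entropy:

```latex
% Maximum-entropy RL objective (SAC): expected reward plus
% policy entropy at each visited state, weighted by \alpha
J(\pi) \;=\; \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
\!\left[\, r(s_t, a_t) + \alpha \, \mathcal{H}\!\left(\pi(\cdot \mid s_t)\right) \right]
```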
Editorial: Intrinsically Motivated Open-Ended Learning in Autonomous Robots
Making Sense of Reinforcement Learning and Probabilistic Inference
TLDR
It is demonstrated that the popular "RL as inference" approximation can perform poorly even in very basic problems, but it is shown that with a small modification the framework does yield algorithms that provably perform well, and that the resulting algorithm is equivalent to the recently proposed K-learning, which is further connected with Thompson sampling.
If MaxEnt RL is the Answer, What is the Question?
TLDR
This paper formally shows that MaxEnt RL does optimally solve certain classes of control problems with variability in the reward function, suggesting that domains with uncertainty in the task goal may be especially well-suited for MaxEnt RL methods.
Provably Efficient Maximum Entropy Exploration
TLDR
This work studies a broad class of objectives defined solely as functions of the state-visitation frequencies induced by the agent's behavior, and provides an efficient algorithm to optimize such intrinsically defined objectives when given access to a black-box planning oracle.
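A schematic form of the intrinsic objective studied in that work, written here under assumed notation: maximize the entropy of the induced state-visitation distribution.

```latex
% Maximum-entropy exploration (schematic): make the induced
% state-visitation distribution d_\pi as uniform as possible
\max_{\pi}\; \mathcal{H}\!\left(d_\pi\right),
\qquad
d_\pi(s) = (1-\gamma) \sum_{t=0}^{\infty} \gamma^{t} \, \Pr(s_t = s \mid \pi)
```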
VIREL: A Variational Inference Framework for Reinforcement Learning
TLDR
VIREL is proposed, a novel, theoretically grounded probabilistic inference framework for RL that utilises a parametrised action-value function to summarise future dynamics of the underlying MDP and it is shown that the actor-critic algorithm can be reduced to expectation-maximisation, with policy improvement equivalent to an E-step and policy evaluation to an M-step.
Maximum a Posteriori Policy Optimisation
TLDR
This work introduces a new algorithm for reinforcement learning called Maximum a Posteriori Policy Optimisation (MPO), based on coordinate ascent on a relative-entropy objective, and develops two off-policy algorithms that are competitive with the state of the art in deep reinforcement learning.
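A rough sketch of the E-step of MPO's coordinate ascent, under assumed notation (\mu a state distribution, Q the current action-value estimate): reweight the policy toward high values inside a KL trust region, before fitting the parametric policy to the result.

```latex
% MPO E-step (schematic): improve a non-parametric policy q
% under a relative-entropy (KL) constraint to the old policy
\max_{q}\; \mathbb{E}_{s \sim \mu,\, a \sim q}\!\left[Q(s, a)\right]
\quad \text{s.t.} \quad
\mathbb{E}_{s \sim \mu}\!\left[ D_{\mathrm{KL}}\!\left(q(\cdot \mid s) \,\|\, \pi_{\mathrm{old}}(\cdot \mid s)\right) \right] \le \epsilon
```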
Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review
TLDR
This article discusses how a generalization of the reinforcement learning or optimal control problem, sometimes termed maximum entropy reinforcement learning, is equivalent to exact probabilistic inference in the case of deterministic dynamics, and to variational inference in the case of stochastic dynamics.
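The equivalence described here rests on the standard "control as inference" construction, sketched below: a binary optimality variable whose likelihood is exponential in the reward (rewards assumed non-positive so the expression is a valid probability).

```latex
% Control as inference (schematic): optimality variable \mathcal{O}_t
% with reward-shaped likelihood; conditioning on \mathcal{O}_{1:T} = 1
% turns optimal control into posterior inference over trajectories
p(\mathcal{O}_t = 1 \mid s_t, a_t) \;=\; \exp\!\left(r(s_t, a_t)\right)
```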
Diversity is All You Need: Learning Skills without a Reward Function
TLDR
This work proposes DIAYN ("Diversity is All You Need"), a method for learning useful skills without a reward function, which learns skills by maximizing an information-theoretic objective using a maximum entropy policy.
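A schematic of the information-theoretic objective this summary mentions, under assumed notation (z a latent skill, q_\phi a learned discriminator): maximize the mutual information between skills and states plus policy entropy, via a variational lower bound.

```latex
% DIAYN objective (schematic) and its variational lower bound
\mathcal{F}(\theta) \;=\; I(S; Z) + \mathcal{H}[A \mid S, Z]
\;\ge\;
\mathbb{E}_{z \sim p(z),\, s \sim \pi_z}\!\left[\log q_\phi(z \mid s) - \log p(z)\right]
+ \mathcal{H}[A \mid S, Z]
```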
…