Corpus ID: 239024862

Continuous Control with Action Quantization from Demonstrations

Robert Dadashi, Léonard Hussenot, Damien Vincent, Sertan Girgin, Anton Raichuk, Matthieu Geist, Olivier Pietquin
In Reinforcement Learning (RL), discrete actions, as opposed to continuous actions, result in less complex exploration problems and allow the immediate computation of the maximum of the action-value function, which is central to dynamic programming-based methods. In this paper, we propose a novel method, Action Quantization from Demonstrations (AQuaDem), to learn a discretization of continuous action spaces by leveraging the priors of demonstrations. This dramatically reduces the exploration problem… 
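The core idea of deriving a discrete action set from demonstrations can be illustrated with a simplified, state-independent sketch. AQuaDem itself learns state-conditioned candidate actions with a neural network; here, plain k-means over the demonstration actions stands in for that learned discretization. All names and data below are illustrative, not from the paper.

```python
import numpy as np

def quantize_actions(demo_actions, n_bins, n_iters=50, seed=0):
    """Cluster continuous demonstration actions into a discrete action set
    via k-means (a simplified stand-in for AQuaDem's learned discretization)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from randomly chosen demonstration actions.
    centroids = demo_actions[rng.choice(len(demo_actions), n_bins, replace=False)]
    for _ in range(n_iters):
        # Assign each demonstration action to its nearest centroid.
        dists = np.linalg.norm(demo_actions[:, None] - centroids[None], axis=-1)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned actions.
        for k in range(n_bins):
            if (labels == k).any():
                centroids[k] = demo_actions[labels == k].mean(axis=0)
    return centroids

# Toy demonstrations: 2-D continuous actions concentrated around two modes.
rng = np.random.default_rng(1)
demos = np.concatenate([
    rng.normal([-1.0, 0.0], 0.1, size=(100, 2)),
    rng.normal([1.0, 0.5], 0.1, size=(100, 2)),
])
discrete_actions = quantize_actions(demos, n_bins=2)
print(discrete_actions.shape)  # (2, 2): a 2-action discrete space over R^2
```

A discrete-action agent (e.g. DQN) can then explore over these few candidate actions instead of the full continuous space.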


Deep Q-learning From Demonstrations
This paper presents an algorithm, Deep Q-learning from Demonstrations (DQfD), that leverages even relatively small sets of demonstration data to massively accelerate the learning process, and is able to automatically assess the necessary ratio of demonstration data while learning, thanks to a prioritized replay mechanism.
Deep Reinforcement Learning in Large Discrete Action Spaces
This paper leverages prior information about the actions to embed them in a continuous space upon which it can generalize, and uses approximate nearest-neighbor methods to allow reinforcement learning methods to be applied to large-scale learning problems previously intractable with current methods.
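The retrieval step this summary describes can be sketched as follows: an actor proposes a point in the continuous action-embedding space, and the nearest discrete actions are retrieved as candidates for the critic. Brute-force search stands in here for the approximate nearest-neighbor index used at scale; all names and data are illustrative.

```python
import numpy as np

def nearest_actions(proto_action, action_embeddings, k=3):
    """Return indices of the k discrete actions closest to the proto-action."""
    dists = np.linalg.norm(action_embeddings - proto_action, axis=1)
    return np.argsort(dists)[:k]

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 8))  # 1000 discrete actions embedded in R^8
proto = rng.normal(size=8)               # continuous output of the actor
candidates = nearest_actions(proto, embeddings, k=5)
# A critic Q(s, a) would then evaluate these 5 candidates and act greedily.
print(candidates.shape)  # (5,)
```

This keeps the per-step cost logarithmic-ish in the number of actions when an approximate index replaces the brute-force search.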
Implicitly Regularized RL with Implicit Q-Values
A theoretical analysis shows the algorithm is equivalent to a regularized version of value iteration, accounting for both entropy and Kullback-Leibler regularization, and enjoys beneficial error-propagation properties; it is evaluated on classic control tasks, where its results compete with state-of-the-art methods.
Q-Learning for Continuous Actions with Cross-Entropy Guided Policies
This work proposes a novel approach, called Cross-Entropy Guided Policies, or CGP, that aims to combine the stability and performance of iterative sampling policies with the low computational cost of a policy network.
Continuous Deep Q-Learning with Model-based Acceleration
This paper derives a continuous variant of the Q-learning algorithm, which it is called normalized advantage functions (NAF), as an alternative to the more commonly used policy gradient and actor-critic methods, and substantially improves performance on a set of simulated robotic control tasks.
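The key property of NAF is that the advantage is quadratic in the action, A(s, a) = -½ (a - μ(s))ᵀ P(s) (a - μ(s)) with P(s) positive definite, so argmaxₐ Q(s, a) = μ(s) in closed form, with no inner optimization loop. A numeric check of that property, with the network heads producing V, μ, and the Cholesky factor of P stubbed out as fixed values:

```python
import numpy as np

mu = np.array([0.3, -0.7])              # stand-in for the policy head mu(s)
L = np.array([[1.0, 0.0], [0.5, 2.0]])  # stand-in lower-triangular head output
P = L @ L.T                             # positive-definite precision matrix
V = 1.5                                 # stand-in for the state-value head V(s)

def q_value(a):
    diff = a - mu
    # Advantage is <= 0 and exactly zero at a = mu, so Q peaks at mu.
    return V - 0.5 * diff @ P @ diff

print(q_value(mu))                      # 1.5: the maximum equals V(s)
print(q_value(mu + 0.1) < q_value(mu))  # True: any offset from mu lowers Q
```

This closed-form maximum is what lets NAF run Q-learning-style updates in continuous action spaces without a separate actor optimization.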
Goal-conditioned Imitation Learning
Different approaches to incorporating demonstrations are investigated that drastically speed up convergence to a policy able to reach any goal, also surpassing the performance of agents trained with other Imitation Learning algorithms.
DDCO: Discovery of Deep Continuous Options for Robot Learning from Demonstrations
Results suggest that DDCO can take 3x fewer demonstrations to achieve the same reward as a baseline imitation learning approach; a cross-validation method is also proposed that relaxes DDO's requirement that users specify the number of options to be discovered.
Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
This paper proposes soft actor-critic, an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework, which achieves state-of-the-art performance on a range of continuous control benchmark tasks, outperforming prior on-policy and off-policy methods.
A Divergence Minimization Perspective on Imitation Learning Methods
A unified probabilistic perspective on IL algorithms based on divergence minimization is presented, conclusively identifying that IRL's state-marginal matching objective contributes most to its superior performance, and the new understanding of IL methods is applied to the problem of state-marginal matching.
What Matters in Learning from Offline Human Demonstrations for Robot Manipulation
This study analyzes the most critical challenges when learning from offline human data for manipulation and highlights opportunities for learning from human datasets, such as the ability to learn proficient policies on challenging, multi-stage tasks beyond the scope of current reinforcement learning methods.