Deep Reinforcement Learning with Feedback-based Exploration

@inproceedings{scholten2019deep,
  title={Deep Reinforcement Learning with Feedback-based Exploration},
  author={Jan Scholten and Daan Wout and Carlos Celemin and Jens Kober},
  booktitle={2019 IEEE 58th Conference on Decision and Control (CDC)},
  year={2019},
}
  • Published 14 March 2019
  • Computer Science, Mathematics
Deep Reinforcement Learning has enabled the control of increasingly complex and high-dimensional problems. However, the need for vast amounts of data before reasonable performance is attained prevents its widespread application. We employ binary corrective feedback as a general and intuitive manner to incorporate human intuition and domain knowledge in model-free machine learning. The uncertainty in the policy and the corrective feedback is combined directly in the action space as probabilistic…
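The abstract describes combining policy uncertainty and corrective feedback "directly in the action space as probabilistic" estimates. One standard way to merge two uncertain 1-D estimates is a precision-weighted product of Gaussians; the sketch below illustrates that rule only (the function name and the specific combination are illustrative assumptions, not the paper's exact algorithm):

```python
import numpy as np

def merge_gaussians(mu_p, var_p, mu_h, var_h):
    """Precision-weighted product of two 1-D Gaussians:
    the merged mean leans toward the more certain source."""
    prec_p, prec_h = 1.0 / var_p, 1.0 / var_h
    var = 1.0 / (prec_p + prec_h)
    mu = var * (prec_p * mu_p + prec_h * mu_h)
    return mu, var

# An uncertain policy proposes action 0.0; a confident corrective
# signal points toward +1.0, so the merged action shifts strongly.
mu, var = merge_gaussians(mu_p=0.0, var_p=1.0, mu_h=1.0, var_h=0.1)
```

Here the merged action lands near the human estimate because its variance is ten times smaller; as the policy grows more certain, its own proposal would dominate instead.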


Continuous control with deep reinforcement learning
This work presents an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces, and demonstrates that for many of the tasks the algorithm can learn policies end-to-end: directly from raw pixel inputs.
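The deterministic policy gradient named in that summary updates the policy parameters by chaining the critic's gradient with respect to the action through the policy. A toy sketch with a linear policy and a known quadratic critic (all names and the specific critic are illustrative assumptions, not the DDPG implementation):

```python
import numpy as np

# Toy deterministic policy gradient: linear policy a = w * s and a
# known critic Q(s, a) = -(a - 2s)^2, whose optimal action is a* = 2s.
def dpg_step(w, states, lr=0.1):
    actions = w * states                        # deterministic actions
    dq_da = -2.0 * (actions - 2.0 * states)     # critic grad w.r.t. action
    da_dw = states                              # policy grad w.r.t. parameter
    return w + lr * np.mean(dq_da * da_dw)      # chain rule, gradient ascent

w = 0.0
states = np.array([0.5, 1.0, 1.5])
for _ in range(200):
    w = dpg_step(w, states)
# w converges toward 2.0, the optimal policy parameter.
```

In DDPG proper, both the critic and the policy are deep networks and the same chain rule is applied through automatic differentiation.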
Overcoming Exploration in Reinforcement Learning with Demonstrations
This work uses demonstrations to overcome the exploration problem and successfully learn to perform long-horizon, multi-step robotics tasks with continuous control such as stacking blocks with a robot arm.
The importance of experience replay database composition in deep reinforcement learning
The potential of the Deep Deterministic Policy Gradient method for a robot control problem both in simulation and in a real setup is investigated and some requirements on the distribution over the state-action space of the experiences in the database are identified.
Leveraging Demonstrations for Deep Reinforcement Learning on Robotics Problems with Sparse Rewards
A general, model-free approach for reinforcement learning on real robots with sparse rewards, built upon the Deep Deterministic Policy Gradient algorithm and extended to use demonstrations; it outperforms plain DDPG and does not require engineered rewards.
Interactive Learning with Corrective Feedback for Policies based on Deep Neural Networks
This work proposes an alternative Interactive Machine Learning strategy for training DNN policies based on human corrective feedback, called Deep COACH (D-COACH), which combines the knowledge and insights of human teachers with the power of DNNs while requiring no reward function.
Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
This paper proposes soft actor-critic, an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework, and achieves state-of-the-art performance on a range of continuous control benchmark tasks, outperforming prior on-policy and off-policy methods.
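The maximum-entropy framework behind soft actor-critic augments the expected return with an entropy bonus, so stochastic policies are preferred when values are similar. A minimal discrete-action sketch of that soft value (the function and example values are illustrative assumptions, not the SAC implementation):

```python
import numpy as np

def soft_value(q_values, probs, alpha=0.2):
    """Entropy-regularized value: E_pi[Q] + alpha * H(pi)."""
    entropy = -np.sum(probs * np.log(probs))
    return np.sum(probs * q_values) + alpha * entropy

q = np.array([1.0, 1.0])               # both actions equally good
uniform = np.array([0.5, 0.5])         # high-entropy policy
greedy = np.array([0.999, 0.001])      # near-deterministic policy
# With equal Q-values, the higher-entropy policy has higher soft value,
# which is the exploration incentive the summary refers to.
```

The temperature alpha trades off reward against entropy; SAC learns continuous Gaussian policies, but the objective has the same shape.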
Policy Shaping: Integrating Human Feedback with Reinforcement Learning
This paper introduces Advise, a Bayesian approach that attempts to maximize the information gained from human feedback by utilizing it as direct policy labels and shows that it can outperform state-of-the-art approaches and is robust to infrequent and inconsistent human feedback.
Learning Gaussian Policies from Corrective Human Feedback
Gaussian Process Coach (GPC), where feature space engineering is avoided by employing Gaussian Processes, is introduced and it is demonstrated that the novel algorithm outperforms the current state-of-the-art in final performance, convergence rate and robustness to erroneous feedback in OpenAI Gym continuous control benchmarks.
Deep Reinforcement Learning that Matters
Challenges posed by reproducibility, proper experimental techniques, and reporting procedures are investigated and guidelines to make future results in deep RL more reproducible are suggested.
Rainbow: Combining Improvements in Deep Reinforcement Learning
This paper examines six extensions to the DQN algorithm and empirically studies their combination, showing that the combination provides state-of-the-art performance on the Atari 2600 benchmark, both in terms of data efficiency and final performance.