Corpus ID: 211126609

RL agents Implicitly Learning Human Preferences

Nevan Wichers
In the real world, RL agents should be rewarded for fulfilling human preferences. We show that RL agents implicitly learn the preferences of humans in their environment. Training a classifier to predict whether a simulated human's preferences are fulfilled, using the activations of an RL agent's neural network as input, achieves 0.93 AUC; training the same classifier on the raw environment state achieves only 0.80 AUC. Training the classifier on the RL agent's activations also does much better than training on…
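The probing setup described in the abstract can be sketched as follows. This is a minimal illustration on synthetic data, not the paper's actual code: `activations`, `raw_state`, and `fulfilled` are stand-ins for the agent's hidden-layer activations, the raw environment observation, and the preference-fulfillment label, and the gap between the two AUC scores is built into the synthetic data by construction.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-ins (illustrative only): a low-dimensional latent factor
# drives both the agent's activations and the preference label, while the
# "raw state" here is independent noise, mimicking a state representation
# that does not expose preference-relevant features linearly.
n, d_act, d_state = 2000, 64, 16
latent = rng.normal(size=(n, 4))
activations = latent @ rng.normal(size=(4, d_act)) + 0.1 * rng.normal(size=(n, d_act))
raw_state = rng.normal(size=(n, d_state))
fulfilled = (latent[:, 0] + 0.3 * rng.normal(size=n) > 0).astype(int)

def probe_auc(features, labels):
    """Fit a linear probe on features and report held-out AUC."""
    X_tr, X_te, y_tr, y_te = train_test_split(features, labels, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

print(f"activations probe AUC: {probe_auc(activations, fulfilled):.2f}")
print(f"raw-state probe AUC:   {probe_auc(raw_state, fulfilled):.2f}")
```

Under this construction the activation probe scores near-perfect AUC while the raw-state probe stays near chance; the paper's reported 0.93 vs. 0.80 gap comes from real agent activations and environment states, not from data generated this way.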

