Corpus ID: 3615386

Bridging the Gap Between Value and Policy Based Reinforcement Learning

@inproceedings{Nachum2017BridgingTG,
  title={Bridging the Gap Between Value and Policy Based Reinforcement Learning},
  author={Ofir Nachum and Mohammad Norouzi and Kelvin Xu and Dale Schuurmans},
  booktitle={NIPS},
  year={2017}
}
We establish a new connection between value and policy based reinforcement learning (RL) based on a relationship between softmax temporal value consistency and policy optimality under entropy regularization. Specifically, we show that softmax consistent action values correspond to optimal entropy regularized policy probabilities along any action sequence, regardless of provenance. From this observation, we develop a new RL algorithm, Path Consistency Learning (PCL), that minimizes a notion of…
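The consistency property described in the abstract can be checked numerically on a toy problem. The following is a minimal sketch, not the authors' implementation: it assumes deterministic transitions, a made-up three-state environment, and hypothetical names (`REWARD`, `NEXT_STATE`, `TAU`, `GAMMA`, `path_residual`). It solves for softmax-consistent state values, derives the entropy-regularized policy from them, and evaluates the residual -V(s_0) + gamma^d V(s_d) + sum_i gamma^i (r_i - tau log pi(a_i|s_i)) along an arbitrary action sequence, which should vanish at the optimum.

```python
import math

# Hypothetical toy problem: two non-terminal states (0, 1), two actions each,
# deterministic transitions, and a terminal state 2 whose value is 0.
# All numbers below are made up for illustration.
TAU = 0.5      # entropy regularization temperature (assumed)
GAMMA = 0.9    # discount factor (assumed)
REWARD = {0: [1.0, 0.0], 1: [0.5, 2.0]}
NEXT_STATE = {0: [1, 2], 1: [2, 0]}
TERMINAL = 2


def softmax_value(q_values, tau):
    """Return tau * log(sum_a exp(q_a / tau)), computed stably."""
    m = max(q_values)
    return m + tau * math.log(sum(math.exp((q - m) / tau) for q in q_values))


def action_values(v, s):
    """One-step lookahead values r(s, a) + gamma * V(s') for each action."""
    return [REWARD[s][a] + GAMMA * v[NEXT_STATE[s][a]] for a in range(2)]


def solve_soft_values(iterations=500):
    """Iterate the softmax backup until the values are (numerically) fixed."""
    v = {0: 0.0, 1: 0.0, TERMINAL: 0.0}
    for _ in range(iterations):
        for s in REWARD:
            v[s] = softmax_value(action_values(v, s), TAU)
    return v


def optimal_policy(v):
    """pi(a|s) = exp((r + gamma * V(s') - V(s)) / tau) at the fixed point."""
    return {
        s: [math.exp((q - v[s]) / TAU) for q in action_values(v, s)]
        for s in REWARD
    }


def path_residual(v, pi, start, actions):
    """Consistency residual along a path; the path must not step past TERMINAL."""
    s = start
    total = -v[s]
    discount = 1.0
    for a in actions:
        total += discount * (REWARD[s][a] - TAU * math.log(pi[s][a]))
        discount *= GAMMA
        s = NEXT_STATE[s][a]
    return total + discount * v[s]


if __name__ == "__main__":
    v = solve_soft_values()
    pi = optimal_policy(v)
    print("policy:", pi)
    print("residual:", path_residual(v, pi, 0, [0, 1, 0, 0]))
```

The action sequence passed to `path_residual` is arbitrary and need not be sampled from the policy, which mirrors the "regardless of provenance" claim. A learning algorithm in this spirit would presumably treat the squared residual as a loss over parameterized values and policies; that step is not sketched here.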
213 Citations
Relative Entropy Regularized Policy Iteration
  • 27
  • PDF
Divergence-Augmented Policy Optimization
  • 1
  • PDF
Smoothed Action Value Functions for Learning Gaussian Policies
  • 16
  • PDF
Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
  • 1,313
  • Highly Influenced
  • PDF
SMOOTHED ACTION VALUE FUNCTIONS
  • 2017
  • Highly Influenced
Direct and indirect reinforcement learning
  • 3
  • PDF
Equivalence Between Policy Gradients and Soft Q-Learning
  • 179
  • Highly Influenced
  • PDF
Merging Deterministic Policy Gradient Estimations with Varied Bias-Variance Tradeoff for Effective Deep Reinforcement Learning
  • Gang Chen
  • Computer Science, Mathematics
  • ArXiv
  • 2019
  • 1
  • PDF
Similarities between policy gradient methods (PGM) in Reinforcement learning (RL) and supervised learning (SL)
  • 1
  • PDF
