Corpus ID: 4023298

Natural Gradient Deep Q-learning

Ethan Knight and Osher Lerner
This paper presents findings for training a Q-learning reinforcement learning agent using natural gradient techniques. We compare the original deep Q-network (DQN) algorithm to its natural gradient counterpart (NGDQN), measuring NGDQN and DQN performance on classic control environments without target networks. We find that NGDQN performs favorably relative to DQN, converging to significantly better policies faster and more frequently. These results indicate that natural gradient could be used…
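The core idea the abstract describes — preconditioning the Q-learning gradient by the inverse Fisher information matrix — can be sketched for a linear Q-function. This is an illustrative assumption, not the paper's exact formulation: the function name, the empirical Gauss-Newton estimate of the Fisher matrix, and the damping term are all choices made here for clarity.

```python
import numpy as np

def natural_gradient_step(theta, phis, td_errors, lr=0.1, damping=1e-3):
    """One natural-gradient step for a linear Q-function Q(s,a) = theta . phi(s,a).

    phis: (N, d) feature matrix, td_errors: (N,) TD errors.
    """
    # Semi-gradient of the squared TD error 0.5 * mean(delta^2) w.r.t. theta.
    grad = -(phis * td_errors[:, None]).mean(axis=0)
    # Empirical Fisher / Gauss-Newton curvature estimate, damped for invertibility.
    fisher = phis.T @ phis / len(phis) + damping * np.eye(len(theta))
    # Precondition the gradient: descend along F^{-1} grad instead of grad.
    nat_grad = np.linalg.solve(fisher, grad)
    return theta - lr * nat_grad

rng = np.random.default_rng(0)
theta = np.zeros(4)
phis = rng.normal(size=(32, 4))
targets = phis @ np.array([1.0, -2.0, 0.5, 0.0])   # synthetic TD targets
td_errors = targets - phis @ theta
theta = natural_gradient_step(theta, phis, td_errors)
```

Because the curvature matrix here coincides with the Gauss-Newton matrix of the least-squares objective, a single step with a larger learning rate would jump close to the solution — the intuition behind the faster convergence the abstract reports.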


Beyond Target Networks: Improving Deep Q-learning with Functional Regularization

An alternative training method based on functional regularization which uses up-to-date parameters to estimate the target Q-values, thereby speeding up training while maintaining stability and showing empirical improvements in sample efficiency and performance across a range of Atari and simulated robotics environments.

Towards Characterizing Divergence in Deep Q-Learning

An algorithm is developed which permits stable deep Q-learning for continuous control without any of the tricks conventionally used (such as target networks, adaptive gradient optimizers, or using multiple Q functions).

Optimizing Q-Learning with K-FAC Algorithm

Considering the latest results, it is shown that DDQN with K-FAC learns more quickly than with other optimizers and improves steadily, in contrast to DDQN trained with Adam or RMSProp.

Analysis of Q-learning with Adaptation and Momentum Restart for Gradient Descent

The convergence rate of Q-AMSGrad, the Q-learning algorithm with the AMSGrad update (a commonly adopted alternative to Adam for theoretical analysis), is characterized, and a momentum restart scheme is proposed, yielding the Q-AMSGradR algorithm, which outperforms vanilla Q-learning with SGD updates.
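The AMSGrad update that Q-AMSGrad plugs into Q-learning can be written in a few lines. This is a minimal sketch of the standard optimizer (bias correction omitted for brevity); the function and variable names are illustrative, not taken from the paper.

```python
import numpy as np

def amsgrad_step(theta, grad, state, lr=1e-2, b1=0.9, b2=0.999, eps=1e-8):
    """One AMSGrad update; state = (m, v, vhat)."""
    m, v, vhat = state
    m = b1 * m + (1 - b1) * grad            # first-moment estimate
    v = b2 * v + (1 - b2) * grad**2         # second-moment estimate
    vhat = np.maximum(vhat, v)              # key change vs Adam: vhat never decreases
    theta = theta - lr * m / (np.sqrt(vhat) + eps)
    return theta, (m, v, vhat)

# Minimize ||theta||^2 for a few steps to see the update in action.
theta = np.array([1.0, -1.0])
state = (np.zeros(2), np.zeros(2), np.zeros(2))
for _ in range(100):
    grad = 2 * theta
    theta, state = amsgrad_step(theta, grad, state)
```

The monotone `vhat` gives AMSGrad the non-increasing effective step sizes that make it more tractable to analyze than Adam, which is why the paper uses it as the theoretical stand-in.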

BRPO: Batch Residual Policy Optimization

This work derives a new RL method, BRPO, which learns both the policy and the allowable deviation that jointly maximize a lower bound on policy performance, and shows that BRPO achieves state-of-the-art performance on a number of tasks.

Bridging the Gap Between Target Networks and Functional Regularization

It is demonstrated that replacing Target Networks with the more theoretically grounded Functional Regularization approach leads to better sample efficiency and performance improvements.

Direction Concentration Learning: Enhancing Congruency in Machine Learning

The experimental results show that the proposed DCL method generalizes to state-of-the-art models and optimizers, improves performance on saliency prediction, continual learning, and classification tasks, and helps mitigate catastrophic forgetting in the continual learning task.

Toward Efficient Gradient-Based Value Estimation

To resolve the adverse effect of the poor conditioning of the MSBE on gradient-based methods, a low-complexity, batch-free proximal method that approximately follows the Gauss-Newton direction and is asymptotically robust to parameterization is proposed.

Playing Atari with Deep Reinforcement Learning

This work presents the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning, which outperforms all previous approaches on six of the games and surpasses a human expert on three of them.

Deep Q-learning From Demonstrations

This paper presents an algorithm, Deep Q-learning from Demonstrations (DQfD), that leverages small sets of demonstration data to massively accelerate the learning process, and that is able to automatically assess the necessary ratio of demonstration data while learning thanks to a prioritized replay mechanism.

Natural Temporal Difference Learning

This paper presents and analyzes quadratic and linear time natural temporal difference learning algorithms, and proves that they are covariant, and suggests that the natural algorithms can match or outperform their non-natural counterparts using linear function approximation, and drastically improve upon them when using non-linear function approximation.

On-line Q-learning using connectionist systems

Simulations show that on-line learning algorithms are less sensitive to the choice of training parameters than backward replay, and that the alternative update rules of MCQ-L and Q(λ) are more robust than standard Q-learning updates.

Self-improving reactive agents based on reinforcement learning, planning and teaching

This paper compares eight reinforcement learning frameworks: adaptive heuristic critic (AHC) learning due to Sutton, Q-learning due to Watkins, and three extensions to each basic method for speeding up learning: experience replay, learning action models for planning, and teaching.

A Natural Policy Gradient

This work provides a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space and shows drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris.
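The steepest-descent direction this work proposes has a compact standard form; the notation below is the usual convention for the natural policy gradient, not quoted from the abstract.

```latex
% Natural policy gradient: precondition the vanilla gradient by the
% inverse Fisher information matrix of the policy distribution.
\tilde{\nabla} J(\theta) = F(\theta)^{-1} \nabla J(\theta),
\qquad
F(\theta) = \mathbb{E}_{s,a \sim \pi_\theta}\!\left[
  \nabla_\theta \log \pi_\theta(a \mid s)\,
  \nabla_\theta \log \pi_\theta(a \mid s)^{\top}
\right]
```

Because $F$ measures distances in the space of policy distributions rather than raw parameters, the update direction is invariant to how the policy is parameterized — the "underlying structure of the parameter space" the abstract refers to.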

Gradient temporal-difference learning algorithms

We present a new family of gradient temporal-difference (TD) learning methods with function approximation whose complexity, both in terms of memory and per-time-step computation, scales linearly with the size of the function approximation.

Reinforcement learning for robots using neural networks

This dissertation concludes that it is possible to build artificial agents that can effectively acquire complex control policies by reinforcement learning, enabling applications to complex robot-learning problems.

Prioritized Experience Replay

A framework for prioritizing experience, so as to replay important transitions more frequently, and therefore learn more efficiently, in Deep Q-Networks, a reinforcement learning algorithm that achieved human-level performance across many Atari games.
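The proportional variant of this prioritization scheme is easy to sketch: each transition is sampled with probability proportional to $(|\delta| + \epsilon)^\alpha$, where $\delta$ is its TD error. The class and method names below are assumptions, and the sketch omits the paper's sum-tree data structure and the $\beta$ annealing of the importance weights.

```python
import random

class PrioritizedReplay:
    """Simplified proportional prioritized replay (list-backed, beta = 1)."""

    def __init__(self, alpha=0.6, eps=1e-6):
        self.alpha, self.eps = alpha, eps
        self.data, self.priorities = [], []

    def add(self, transition, td_error=1.0):
        self.data.append(transition)
        # Priority p_i = (|TD error| + eps)^alpha; eps keeps every item samplable.
        self.priorities.append((abs(td_error) + self.eps) ** self.alpha)

    def sample(self, k):
        # Sample indices proportionally to priority.
        idx = random.choices(range(len(self.data)), weights=self.priorities, k=k)
        total = sum(self.priorities)
        probs = [self.priorities[i] / total for i in idx]
        # Importance-sampling weights correct the non-uniform sampling bias.
        weights = [(len(self.data) * p) ** -1 for p in probs]
        return idx, [self.data[i] for i in idx], weights

buf = PrioritizedReplay()
buf.add(("s", "a", 1.0, "s2"), td_error=10.0)
buf.add(("s", "a", 0.0, "s2"), td_error=0.01)
idx, batch, weights = buf.sample(4)
```

A production version replaces the list scan with a sum-tree so that sampling and priority updates are O(log n), which is what makes the scheme practical at Atari-scale replay sizes.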

Trust Region Policy Optimization

A method for optimizing control policies, with guaranteed monotonic improvement, by making several approximations to the theoretically-justified scheme, called Trust Region Policy Optimization (TRPO).
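The practical update behind the monotonic-improvement guarantee is usually stated as a constrained surrogate problem; the formulation below is the standard one, not quoted from the abstract.

```latex
% TRPO: maximize the importance-weighted advantage under a KL trust region.
\max_\theta \;
\mathbb{E}_{s,a \sim \pi_{\theta_{\mathrm{old}}}}\!\left[
  \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\mathrm{old}}}(a \mid s)}\,
  A^{\pi_{\theta_{\mathrm{old}}}}(s, a)
\right]
\quad \text{s.t.} \quad
\mathbb{E}_{s}\!\left[
  D_{\mathrm{KL}}\!\left(
    \pi_{\theta_{\mathrm{old}}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s)
  \right)
\right] \le \delta
```

The KL constraint plays the same role as the Fisher preconditioning in the natural-gradient methods above: it measures the step in distribution space, so small $\delta$ guarantees the new policy stays close to the old one regardless of parameterization.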