# "Oddball SGD": Novelty Driven Stochastic Gradient Descent for Training Deep Neural Networks

```bibtex
@article{Simpson2015OddballSN,
  title={"Oddball SGD": Novelty Driven Stochastic Gradient Descent for Training Deep Neural Networks},
  author={Andrew J. R. Simpson},
  journal={ArXiv},
  year={2015},
  volume={abs/1509.05765}
}
```

Stochastic Gradient Descent (SGD) is arguably the most popular of the machine learning methods applied to training deep neural networks (DNNs) today. It has recently been demonstrated that SGD can be statistically biased so that certain elements of the training set are learned more rapidly than others. In this article, we place SGD into a feedback loop whereby the probability of selection is proportional to error magnitude. This provides a novelty-driven oddball SGD process that learns more…
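The mechanism the abstract describes, selection probability proportional to per-example error magnitude, can be sketched in a few lines. This is a minimal illustration in plain NumPy, not the paper's implementation; the function name and toy error values are hypothetical:

```python
import numpy as np

def oddball_select(errors, rng, n=1):
    """Sample training-set indices with probability proportional to |error|."""
    errors = np.abs(np.asarray(errors, dtype=float))
    probs = errors / errors.sum()  # normalise error magnitudes into a distribution
    return rng.choice(len(errors), size=n, p=probs)

rng = np.random.default_rng(0)
errors = [0.1, 0.1, 0.1, 5.0]  # the 4th example is the high-error "oddball"
picks = oddball_select(errors, rng, n=1000)
```

Because selection probability tracks error, the poorly learned example dominates the draw, which is the feedback loop the abstract refers to.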

## 4 Citations

### Uniform Learning in a Deep Neural Network via "Oddball" Stochastic Gradient Descent

- Computer Science, ArXiv
- 2015

Using a deep neural network to encode a video, it is shown that oddball SGD can be used to enforce uniform error across the training set.

### Online Batch Selection for Faster Training of Neural Networks

- Computer Science, ArXiv
- 2015

This work investigates online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam, and proposes a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank.
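The rank-based strategy summarised above (rank all datapoints by latest known loss; selection probability decays exponentially with rank) can be sketched as follows. A hypothetical NumPy illustration with an assumed decay factor, not the authors' code:

```python
import numpy as np

def rank_based_probs(losses, decay=0.9):
    """Selection probabilities that decay exponentially with loss rank
    (rank 0 = highest latest-known loss)."""
    losses = np.asarray(losses, dtype=float)
    order = np.argsort(-losses)           # indices sorted by descending loss
    ranks = np.empty_like(order)
    ranks[order] = np.arange(len(losses)) # rank of each datapoint
    weights = decay ** ranks              # exponential decay over ranks
    return weights / weights.sum()

losses = [0.2, 1.5, 0.7, 0.1]  # latest known loss per datapoint
p = rank_based_probs(losses)
```

Ranking (rather than using raw loss values, as in oddball SGD) makes the selection pressure insensitive to the scale of the loss, at the cost of an extra sort.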

### Online Batch Selection for Faster Training of Neural Networks

- Computer Science
- 2016

This work investigates online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam, and proposes a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank.

### Trust Region Methods for Training Neural Networks

- Computer Science
- 2017

It is found that stochastic subsampling methods can, in some cases, reduce the CPU time required to reach a reasonable solution when compared to the classical trust region method but this was not consistent across all datasets.

## References


### Parallel Dither and Dropout for Regularising Deep Neural Networks

- Computer Science, ArXiv
- 2015

This article shows that dither and dropout fail without batch averaging and introduces a new, parallel regularisation method that may be used without batch averaging and is substantially better than what is possible with batch-SGD.

### Taming the ReLU with Parallel Dither in a Deep Neural Network

- Computer Science, ArXiv
- 2015

It is argued that ReLUs are useful because they are ideal demodulators, but that this fast learning comes at the expense of serious nonlinear distortion products (decoy features). It is shown that parallel dither acts to suppress these decoy features, preventing overfitting and leaving the true features cleanly demodulated for rapid, reliable learning.

### Improving neural networks by preventing co-adaptation of feature detectors

- Computer Science, ArXiv
- 2012

When a large feedforward neural network is trained on a small training set, it typically performs poorly on held-out test data. This "overfitting" is greatly reduced by randomly omitting half of the…

### A Fast Learning Algorithm for Deep Belief Nets

- Computer Science, Neural Computation
- 2006

A fast, greedy algorithm is derived that can learn deep, directed belief networks one layer at a time, provided the top two layers form an undirected associative memory.

### Abstract Learning via Demodulation in a Deep Neural Network

- Computer Science, ArXiv
- 2015

This work demonstrates that DNNs learn abstract representations by a process of demodulation, and introduces a biased sigmoid activation function, using it to show that DNNs learn and perform better when optimised for demodulation.

### Use it or Lose it: Selective Memory and Forgetting in a Perpetual Learning Machine

- Computer Science, Psychology, ArXiv
- 2015

By simulating the process of practice, this work demonstrates both selective memory and selective forgetting when statistical recall biases are introduced during PSGD.

### Gradient-based learning applied to document recognition

- Computer Science, Proc. IEEE
- 1998

This paper reviews various methods applied to handwritten character recognition and compares them on a standard handwritten digit recognition task; convolutional neural networks are shown to outperform all other techniques.

### Selective Adaptation to “Oddball” Sounds by the Human Auditory System

- Biology, The Journal of Neuroscience
- 2014

It is shown that human listeners selectively adapt to novel sounds within scenes unfolding over minutes, providing the first evidence of enhanced coding of oddball sounds in a human auditory discrimination task and suggesting the existence of an adaptive mechanism that tracks the long-term statistics of sounds and deploys coding resources accordingly.

### Dither is Better than Dropout for Regularising Deep Neural Networks

- Computer Science, ArXiv
- 2015

It is demonstrated that dither provides a more effective regulariser than dropout for deep neural networks, an effect cast through the prism of signal processing theory.