Improving neural networks by preventing co-adaptation of feature detectors


When a large feedforward neural network is trained on a small training set, it typically performs poorly on held-out test data. This "overfitting" is greatly reduced by randomly omitting half of the feature detectors on each training case. This prevents complex co-adaptations in which a feature detector is only helpful in the context of several other specific feature detectors. Instead, each neuron learns to detect a feature that is generally helpful for producing the correct answer given the combinatorially large variety of internal contexts in which it must operate. Random "dropout" gives big improvements on many benchmark tasks and sets new records for speech and object recognition.

A feedforward, artificial neural network uses layers of non-linear "hidden" units between its inputs and its outputs. By adapting the weights on the incoming connections of these hidden units it learns feature detectors that enable it to predict the correct output when given an input vector (1). If the relationship between the input and the correct output is complicated and the network has enough hidden units to model it accurately, there will typically be many different settings of the weights that can model the training set almost perfectly, especially if there is only a limited amount of labeled training data. Each of these weight vectors will make different predictions on held-out test data and almost all of them will do worse on the test data than on the training data because the feature detectors have been tuned to work well together on the training data but not on the test data.

Overfitting can be reduced by using "dropout" to prevent complex co-adaptations on the training data. On each presentation of each training case, each hidden unit is randomly omitted from the network with a probability of 0.5, so a hidden unit cannot rely on other hidden units being present.
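
The omission step described above can be illustrated with a minimal sketch in plain Python. It assumes logistic hidden units and a single hidden layer. The function names (`hidden_activations`, `dropout_mask`, `apply_dropout`), the layer size, and the toy input are hypothetical choices made for illustration and are not taken from the paper. The sketch covers only the training-time masking: a fresh mask is drawn on every presentation, so a unit cannot count on any particular other unit being active.

```python
import math
import random


def hidden_activations(x, weights, biases):
    """Logistic hidden units: h_j = sigmoid(sum_i w_ji * x_i + b_j).

    The logistic non-linearity is an assumption of this sketch.
    """
    return [
        1.0 / (1.0 + math.exp(-(sum(w * xi for w, xi in zip(row, x)) + b)))
        for row, b in zip(weights, biases)
    ]


def dropout_mask(n_units, rng, p_drop=0.5):
    """Draw a 0/1 mask; each unit is omitted independently with probability p_drop."""
    return [0 if rng.random() < p_drop else 1 for _ in range(n_units)]


def apply_dropout(hidden, mask):
    """Zero the activations of the omitted units and leave the rest unchanged."""
    return [h * m for h, m in zip(hidden, mask)]


rng = random.Random(0)                      # fixed seed so the run is repeatable
x = [0.2, -1.0, 0.5]                        # toy input vector (hypothetical)
weights = [[rng.gauss(0.0, 0.1) for _ in x] for _ in range(8)]
biases = [0.0] * 8

h = hidden_activations(x, weights, biases)

# Present the same training case three times: a new mask is drawn each time,
# so the set of hidden units that are present differs between presentations.
for presentation in range(3):
    mask = dropout_mask(len(h), rng)
    dropped = apply_dropout(h, mask)
    print(presentation, mask)
```

Because the mask is resampled per presentation rather than fixed per unit, every hidden unit is trained inside many different random subsets of the other units.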
Another way to view the dropout procedure is as a very efficient way of performing model averaging with neural networks. A good way to reduce the error on the test set is to average the predictions produced by a very large number of different networks. The standard

References

IEEE Transactions on Audio, Speech, and Language Processing

  • G Dahl, D Yu, L Deng, A Acero
  • 2012

Artificial Intelligence and Statistics

  • R R Salakhutdinov, G E Hinton
  • 2009


  • J D A Livnat, C Papadimitriou, M W Feldman
  • 2008


  • G E Hinton, R Salakhutdinov
  • 2006

Neural Computation

  • G E Hinton
  • 2002

Machine Learning

  • L Breiman
  • 2001