• Corpus ID: 125617073

# Gaussian Error Linear Units (GELUs)

@article{Hendrycks2016GaussianEL,
  title={Gaussian Error Linear Units (GELUs)},
  author={Dan Hendrycks and Kevin Gimpel},
  journal={arXiv: Learning},
  year={2016}
}
• Published 27 June 2016
• Computer Science
• arXiv: Learning
We propose the Gaussian Error Linear Unit (GELU), a high-performing neural network activation function. The GELU activation function is $x\Phi(x)$, where $\Phi(x)$ is the standard Gaussian cumulative distribution function. The GELU nonlinearity weights inputs by their value, rather than gating inputs by their sign as in ReLUs ($x\mathbf{1}_{x>0}$). We perform an empirical evaluation of the GELU nonlinearity against the ReLU and ELU activations and find performance improvements across all considered…
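The abstract's definition can be sketched directly from the formula $x\Phi(x)$; a minimal implementation using the exact Gaussian CDF (via the error function), alongside the commonly used tanh approximation and the ReLU for comparison:

```python
import math

def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF,
    computed as 0.5 * (1 + erf(x / sqrt(2)))."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    """Widely used tanh approximation of the GELU."""
    return 0.5 * x * (1.0 + math.tanh(
        math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def relu(x: float) -> float:
    """ReLU gates by sign: x * 1_{x > 0}."""
    return x if x > 0.0 else 0.0
```

Unlike the ReLU, the GELU is smooth and can output small negative values for moderately negative inputs, since $\Phi(x)$ is never exactly zero.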
## 896 Citations
Symmetrical Gaussian Error Linear Units (SGELUs)
• Computer Science
ArXiv
• 2019
A novel neural network activation function, called Symmetrical Gaussian Error Linear Unit (SGELU), is proposed to obtain high performance by effectively integrating the property of the stochastic regularizer in the GELU with the symmetrical characteristics.
Two-argument activation functions learn soft XOR operations like cortical neurons
• Computer Science, Biology
ArXiv
• 2021
This work emulates more biologically realistic neurons by learning canonical activation functions with two input arguments, analogous to basal and apical dendrites, in a network-in-network architecture where each neuron is modeled as a multilayer perceptron with two inputs and a single output.
SinLU: Sinu-Sigmoidal Linear Unit
• Computer Science
Mathematics
• 2022
The proposed SinLU incorporates a sine wave, adding functionality beyond traditional linear-unit activations; two trainable parameters control the contribution of the sinusoidal component and help make the function easily trainable and fast converging.
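Assuming the form reported for SinLU, $(x + a\sin(bx))\,\sigma(x)$ with trainable $a$ and $b$, a hedged sketch of the function looks like this (the exact parameterization should be checked against the paper):

```python
import math

def sinlu(x: float, a: float = 1.0, b: float = 1.0) -> float:
    """SinLU sketch (assumed form): (x + a*sin(b*x)) * sigmoid(x).
    a and b are the trainable parameters controlling the sinusoidal part;
    with a = 0 this reduces to the SiLU, x * sigmoid(x)."""
    sigmoid = 1.0 / (1.0 + math.exp(-x))
    return (x + a * math.sin(b * x)) * sigmoid
```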
ErfAct and PSerf: Non-monotonic smooth trainable Activation Functions
• Computer Science
ArXiv
• 2021
This work proposes two novel non-monotonic smooth trainable activation functions, called ErfAct and PSerf, and shows that the proposed functions improve network performance compared to widely used activations like ReLU, Swish, and Mish.
An Efficient Asymmetric Nonlinear Activation Function for Deep Neural Networks
• Computer Science
Symmetry
• 2022
To demonstrate the effectiveness of this function in the field of object detection, the proposed activation function is compared with several state-of-the-art activation functions on typical backbone networks such as ResNet and DSPDarkNet.
Learning a Single Neuron for Non-monotonic Activation Functions
This work establishes learnability of a single neuron $x \mapsto \sigma(w^T x)$ trained with gradient descent (GD), without assuming monotonicity of $\sigma$, when the input distribution is the standard Gaussian, and shows that mild conditions on $\sigma$ suffice to guarantee learnability in polynomial time with polynomially many samples.
LiSHT: Non-Parametric Linearly Scaled Hyperbolic Tangent Activation Function for Neural Networks
• Computer Science
ArXiv
• 2019
The proposed LiSHT activation function is an attempt to scale the non-linear Hyperbolic Tangent (Tanh) function by a linear function and tackle the dying gradient problem.
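The LiSHT construction described above, scaling tanh by the identity, can be sketched in a few lines; note the function is non-negative and symmetric, so negative inputs still produce nonzero activations and gradients:

```python
import math

def lisht(x: float) -> float:
    """LiSHT sketch: f(x) = x * tanh(x).
    Symmetric (f(-x) = f(x)) and non-negative, which is how it
    addresses the dying-gradient problem for negative inputs."""
    return x * math.tanh(x)
```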
Introducing the DOME Activation Functions
• Computer Science, Mathematics
ArXiv
• 2021
A novel non-linear activation function that spontaneously induces class-compactness and regularization in the embedding space of neural networks and it is shown that models using the function exhibit extra robustness against adversarial attacks.
An Analysis of State-of-the-art Activation Functions For Supervised Deep Neural Network
• Computer Science
2021 International Conference on System Science and Engineering (ICSSE)
• 2021
This paper provides an analysis of state-of-the-art activation functions with respect to supervised classification with deep neural networks. These activation functions comprise the Rectified Linear…
Linear approximability of two-layer neural networks: A comprehensive analysis based on spectral decay
• Computer Science
ArXiv
• 2021
It is proved that for a family of non-smooth activation functions, including ReLU, approximating any single neuron with random features suffers from the curse of dimensionality, providing an explicit separation of expressiveness between neural networks and random feature models.

## References

Showing 1-10 of 26 references
Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)
• Computer Science
ICLR
• 2016
The "exponential linear unit" (ELU) speeds up learning in deep neural networks and leads to higher classification accuracies and significantly better generalization performance than ReLUs and LReLUs on networks with more than 5 layers.
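The ELU referenced here follows the identity for positive inputs and saturates smoothly toward $-\alpha$ for negative ones, which pushes mean activations closer to zero; a minimal sketch:

```python
import math

def elu(x: float, alpha: float = 1.0) -> float:
    """ELU: x for positive inputs, alpha * (exp(x) - 1) otherwise.
    Saturates to -alpha for large negative x, giving nonzero gradients
    and near-zero mean activations."""
    return x if x > 0.0 else alpha * (math.exp(x) - 1.0)
```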
Adaptive dropout for training deep neural networks
• Computer Science
NIPS
• 2013
A method called 'standout' is described, in which a binary belief network is overlaid on a neural network and used to regularize its hidden units by selectively setting activities to zero; it achieves lower classification error rates than other feature learning methods, including standard dropout, denoising auto-encoders, and restricted Boltzmann machines.
Rectified Linear Units Improve Restricted Boltzmann Machines
• Computer Science
ICML
• 2010
Replacing the binary stochastic hidden units of restricted Boltzmann machines with rectified linear units yields features that are better for object recognition on the NORB dataset and face verification on the Labeled Faces in the Wild dataset.
Natural Neural Networks
• Computer Science
NIPS
• 2015
A simple and efficient reparametrization of the neural network weights is employed, implicitly whitening the representation obtained at each layer while preserving the feed-forward computation of the network.
Residual Networks are Exponential Ensembles of Relatively Shallow Networks
• Computer Science
ArXiv
• 2016
This work introduces a novel interpretation of residual networks showing they are exponential ensembles, and suggests that in addition to describing neural networks in terms of width and depth, there is a third dimension: multiplicity, the size of the implicit ensemble.
Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations
• Computer Science
ICLR
• 2017
This work proposes zoneout, a novel method for regularizing RNNs that uses random noise to train a pseudo-ensemble, improving generalization; an empirical investigation of various RNN regularizers finds that zoneout gives significant performance improvements across tasks.
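Zoneout's core update rule is simple: during training, each hidden unit keeps its previous value with probability $p$ (instead of being zeroed, as in dropout), and at test time the expectation is used. A per-unit sketch, assuming element-wise application (names here are illustrative, not from the paper):

```python
import random

def zoneout_step(h_prev, h_new, p=0.15, training=True, rng=random):
    """Zoneout sketch: each unit keeps its previous value with probability
    p during training; at test time the deterministic expectation
    p * h_prev + (1 - p) * h_new is used."""
    if training:
        return [hp if rng.random() < p else hn
                for hp, hn in zip(h_prev, h_new)]
    return [p * hp + (1.0 - p) * hn for hp, hn in zip(h_prev, h_new)]
```

Because some units carry their state forward unchanged, gradients flow through identity connections across time steps, which is the mechanism the paper credits for the regularization benefit.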
Dropout: a simple way to prevent neural networks from overfitting
• Computer Science
J. Mach. Learn. Res.
• 2014
It is shown that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
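The dropout procedure this reference describes zeroes each unit independently during training; the common "inverted" variant scales survivors by $1/(1-p)$ so no rescaling is needed at test time. A minimal sketch:

```python
import random

def dropout(xs, p=0.5, training=True, rng=random):
    """Inverted dropout sketch: zero each unit with probability p during
    training and scale survivors by 1/(1-p), so the expected activation
    matches the test-time (identity) behavior."""
    if not training or p == 0.0:
        return list(xs)
    keep = 1.0 - p
    return [x / keep if rng.random() >= p else 0.0 for x in xs]
```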
Neural networks and physical systems with emergent collective computational abilities.
• J. Hopfield
• Computer Science
Proceedings of the National Academy of Sciences of the United States of America
• 1982
A model of a system having a large number of simple equivalent components, based on aspects of neurobiology but readily adapted to integrated circuits, produces a content-addressable memory which correctly yields an entire memory from any subpart of sufficient size.
Deep Residual Networks with Exponential Linear Unit
• Computer Science
ArXiv
• 2016
This paper proposes to replace the combination of ReLU and Batch Normalization with the Exponential Linear Unit (ELU) in Residual Networks, and shows that this not only speeds up learning in Residual Networks, but also improves classification performance as the depth increases.