# Gaussian Error Linear Units (GELUs)

@article{Hendrycks2016GaussianEL, title={Gaussian Error Linear Units (GELUs)}, author={Dan Hendrycks and Kevin Gimpel}, journal={arXiv: Learning}, year={2016} }

We propose the Gaussian Error Linear Unit (GELU), a high-performing neural network activation function. The GELU activation function is $x\Phi(x)$, where $\Phi(x)$ is the standard Gaussian cumulative distribution function. The GELU nonlinearity weights inputs by their value, rather than gating inputs by their sign as in ReLUs ($x\mathbf{1}_{x>0}$). We perform an empirical evaluation of the GELU nonlinearity against the ReLU and ELU activations and find performance improvements across all considered…
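The exact GELU can be computed via the error function, since $\Phi(x) = \tfrac{1}{2}\left(1 + \mathrm{erf}(x/\sqrt{2})\right)$. A minimal sketch contrasting it with the sign-gating ReLU:

```python
import math

def gelu(x: float) -> float:
    # Exact GELU: x * Phi(x), with Phi the standard Gaussian CDF,
    # expressed via the error function.
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def relu(x: float) -> float:
    # ReLU gates inputs by their sign: x * 1_{x > 0}.
    return x if x > 0 else 0.0

print(gelu(0.0))   # 0.0
print(gelu(-0.5))  # small negative value, where ReLU would give 0.0
```

Unlike ReLU, GELU is smooth and passes small negative values through with reduced weight, approaching the identity for large positive inputs and zero for large negative ones.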

## 896 Citations

Symmetrical Gaussian Error Linear Units (SGELUs)

- Computer Science, ArXiv
- 2019

A novel neural network activation function, called the Symmetrical Gaussian Error Linear Unit (SGELU), is proposed to obtain high performance by effectively integrating the property of the stochastic regularizer in the GELU with symmetrical characteristics.

Two-argument activation functions learn soft XOR operations like cortical neurons

- Computer Science, Biology, ArXiv
- 2021

This work emulates more biologically realistic neurons by learning canonical activation functions with two input arguments, analogous to basal and apical dendrites, in a network-in-network architecture where each neuron is modeled as a multilayer perceptron with two inputs and a single output.

SinLU: Sinu-Sigmoidal Linear Unit

- Computer Science, Mathematics
- 2022

The proposed SinLU incorporates a sine wave, enabling functionality beyond traditional linear-unit activations; two trainable parameters control the contribution of the sinusoidal component, yielding an easily trainable, fast-converging function.
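A minimal sketch of the idea, assuming SinLU takes the form $(x + a\sin(bx))\cdot\sigma(x)$ with sigmoid $\sigma$ and trainable scalars $a$ and $b$ (the parameter names here are illustrative):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def sinlu(x: float, a: float = 1.0, b: float = 1.0) -> float:
    # Sinu-sigmoidal linear unit sketch: a scales the sine's amplitude,
    # b its frequency; both are trainable in the paper's formulation.
    return (x + a * math.sin(b * x)) * sigmoid(x)
```

With $a = 0$ the sinusoidal term vanishes and the function reduces to the familiar sigmoid-weighted linear unit $x\cdot\sigma(x)$.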

ErfAct and PSerf: Non-monotonic smooth trainable Activation Functions

- Computer Science, ArXiv
- 2021

This work proposes two novel non-monotonic smooth trainable activation functions, called ErfAct and PSerf, and suggests that the proposed functions improve network performance compared to widely used activations like ReLU, Swish, and Mish.

An Efficient Asymmetric Nonlinear Activation Function for Deep Neural Networks

- Computer Science, Symmetry
- 2022

To demonstrate the effectiveness of this function in the field of object detection, the proposed activation function is compared with several state-of-the-art activation functions on typical backbone networks such as ResNet and DSPDarkNet.

Learning a Single Neuron for Non-monotonic Activation Functions

- Computer Science
- 2022

This work establishes learnability of a single neuron $x \mapsto \sigma(w^\top x)$ with gradient descent (GD), without assuming monotonicity, when the input distribution is the standard Gaussian, and shows that mild conditions on $\sigma$ are enough to guarantee learnability in polynomial time with polynomially many samples.

LiSHT: Non-Parametric Linearly Scaled Hyperbolic Tangent Activation Function for Neural Networks

- Computer Science, ArXiv
- 2019

The proposed LiSHT activation function is an attempt to scale the non-linear Hyperbolic Tangent (Tanh) function by a linear function and tackle the dying gradient problem.
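The scaling described here multiplies Tanh by the identity, giving a non-negative, symmetric activation; a minimal sketch:

```python
import math

def lisht(x: float) -> float:
    # Linearly Scaled Hyperbolic Tangent: scale tanh(x) by the linear
    # function x. The output is non-negative and symmetric about zero,
    # so large-magnitude negative inputs still produce nonzero gradients,
    # unlike ReLU's dead region.
    return x * math.tanh(x)
```

For large $|x|$, $\tanh(x) \to \pm 1$, so the function approaches $|x|$, while near zero it behaves like $x^2$.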

Introducing the DOME Activation Functions

- Computer Science, Mathematics, ArXiv
- 2021

A novel non-linear activation function is proposed that spontaneously induces class-compactness and regularization in the embedding space of neural networks; models using the function are shown to exhibit extra robustness against adversarial attacks.

An Analysis of State-of-the-art Activation Functions For Supervised Deep Neural Network

- Computer Science, 2021 International Conference on System Science and Engineering (ICSSE)
- 2021

This paper provides an analysis of state-of-the-art activation functions with respect to supervised classification with deep neural networks. These activation functions comprise Rectified Linear…

Linear approximability of two-layer neural networks: A comprehensive analysis based on spectral decay

- Computer Science, ArXiv
- 2021

It is proved that for a family of non-smooth activation functions, including ReLU, approximating any single neuron with random features suffers from the curse of dimensionality, providing an explicit separation of expressiveness between neural networks and random feature models.

## References

Showing 1-10 of 26 references

Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)

- Computer Science, ICLR
- 2016

The "exponential linear unit" (ELU) speeds up learning in deep neural networks and leads to higher classification accuracies and significantly better generalization performance than ReLUs and LReLUs on networks with more than 5 layers.
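The ELU is the identity for positive inputs and saturates smoothly to $-\alpha$ for negative inputs, which pushes mean activations toward zero; a minimal sketch:

```python
import math

def elu(x: float, alpha: float = 1.0) -> float:
    # ELU: identity for x > 0; for x <= 0, alpha * (exp(x) - 1)
    # saturates smoothly toward -alpha, unlike ReLU's hard zero.
    return x if x > 0 else alpha * (math.exp(x) - 1.0)
```

The smooth negative saturation gives nonzero gradients for moderately negative inputs while bounding how negative the activation can get.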

Adaptive dropout for training deep neural networks

- Computer Science, NIPS
- 2013

A method called 'standout' is described, in which a binary belief network is overlaid on a neural network and used to regularize its hidden units by selectively setting activities to zero; it achieves lower classification error rates than other feature learning methods, including standard dropout, denoising auto-encoders, and restricted Boltzmann machines.

Rectified Linear Units Improve Restricted Boltzmann Machines

- Computer Science, ICML
- 2010

Restricted Boltzmann machines were developed using binary stochastic hidden units; replacing these with rectified linear units learns features that are better for object recognition on the NORB dataset and face verification on the Labeled Faces in the Wild dataset.

Natural Neural Networks

- Computer Science, NIPS
- 2015

This work presents a simple and efficient reparametrization of neural network weights that implicitly whitens the representation obtained at each layer, while preserving the feed-forward computation of the network.

Residual Networks are Exponential Ensembles of Relatively Shallow Networks

- Computer Science, ArXiv
- 2016

This work introduces a novel interpretation of residual networks showing they are exponential ensembles, and suggests that in addition to describing neural networks in terms of width and depth, there is a third dimension: multiplicity, the size of the implicit ensemble.

Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations

- Computer Science, ICLR
- 2017

This work proposes zoneout, a novel method for regularizing RNNs that uses random noise to train a pseudo-ensemble, improving generalization. An empirical investigation of various RNN regularizers finds that zoneout gives significant performance improvements across tasks.
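Zoneout's core update can be sketched per hidden unit: at train time, with some probability each unit keeps its previous value instead of its freshly computed one; at test time the update is the expected mixture. A minimal sketch with an illustrative signature:

```python
import random

def zoneout(h_prev, h_new, p: float, training: bool = True):
    # Per-unit zoneout sketch: with probability p, a unit is "zoned out"
    # and keeps its previous value h_prev instead of the new value h_new.
    if not training:
        # Test time: deterministic expected mixture of old and new states.
        return [p * hp + (1.0 - p) * hn for hp, hn in zip(h_prev, h_new)]
    return [hp if random.random() < p else hn
            for hp, hn in zip(h_prev, h_new)]
```

Unlike dropout, units are never zeroed; they are frozen at their previous value, so information and gradients still flow through the preserved units across time steps.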

Dropout: a simple way to prevent neural networks from overfitting

- Computer Science, J. Mach. Learn. Res.
- 2014

It is shown that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
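Dropout's mechanism can be sketched as "inverted" dropout, the variant commonly used in practice (the paper's original formulation instead scales weights at test time):

```python
import random

def dropout(xs, p: float, training: bool = True):
    # Inverted dropout sketch: at train time, zero each unit with
    # probability p and scale survivors by 1/(1-p) so the expected
    # activation is unchanged; at test time it is the identity.
    if not training or p == 0.0:
        return list(xs)
    keep = 1.0 - p
    return [0.0 if random.random() < p else x / keep for x in xs]
```

Randomly dropping units during training prevents co-adaptation of feature detectors and amounts to averaging over an implicit ensemble of thinned networks.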

Neural networks and physical systems with emergent collective computational abilities.

- Computer Science, Proceedings of the National Academy of Sciences of the United States of America
- 1982

A model of a system having a large number of simple equivalent components, based on aspects of neurobiology but readily adapted to integrated circuits, produces a content-addressable memory which correctly yields an entire memory from any subpart of sufficient size.

Deep Residual Networks with Exponential Linear Unit

- Computer Science, ArXiv
- 2016

This paper proposes to replace the combination of ReLU and Batch Normalization in Residual Networks with the Exponential Linear Unit (ELU), and shows that this not only speeds up learning in Residual Networks but also improves classification performance as the depth increases.