# Bridging Nonlinearities and Stochastic Regularizers with Gaussian Error Linear Units

@article{Hendrycks2016BridgingNA, title={Bridging Nonlinearities and Stochastic Regularizers with Gaussian Error Linear Units}, author={Dan Hendrycks and Kevin Gimpel}, journal={ArXiv}, year={2016}, volume={abs/1606.08415} }

We propose the Gaussian Error Linear Unit (GELU), a high-performing neural network activation function. The GELU nonlinearity is the expected transformation of a stochastic regularizer which randomly applies the identity or zero map, combining the intuitions of dropout and zoneout while respecting neuron values. This connection suggests a new probabilistic understanding of nonlinearities. We perform an empirical evaluation of the GELU nonlinearity against the ReLU and ELU activations and find…
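The "expected transformation" idea in the abstract can be sketched numerically. The abstract itself does not give a formula; the sketch below assumes the form from the full paper, where an input x is kept with probability Φ(x) (the standard normal CDF) and zeroed otherwise, so the expectation is x·Φ(x). The function names are illustrative only and are not taken from the authors' code.

```python
import math
import random


def std_normal_cdf(x: float) -> float:
    """Phi(x), the CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu(x: float) -> float:
    """Deterministic GELU, assumed here to be x * Phi(x)."""
    return x * std_normal_cdf(x)


def stochastic_identity_or_zero(x: float, rng: random.Random) -> float:
    """Apply the identity map with probability Phi(x), else the zero map."""
    return x if rng.random() < std_normal_cdf(x) else 0.0


def monte_carlo_mean(x: float, n_samples: int = 200_000, seed: int = 0) -> float:
    """Average the stochastic map over many draws to estimate its expectation."""
    rng = random.Random(seed)
    total = sum(stochastic_identity_or_zero(x, rng) for _ in range(n_samples))
    return total / n_samples


if __name__ == "__main__":
    # The Monte Carlo mean should approach the deterministic GELU value.
    for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
        print(f"x={x:+.1f}  gelu={gelu(x):+.4f}  mc_mean={monte_carlo_mean(x):+.4f}")
```

Unlike dropout, the keep probability here depends on the input value itself, which is one way to read the abstract's phrase "respecting neuron values".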

## 280 Citations

What can linearized neural networks actually say about generalization?

- Computer Science, NeurIPS
- 2021

It is shown that the linear approximations can indeed rank the learning complexity of certain tasks for neural networks, even when they achieve very different performances, and that networks overfit to these tasks mostly due to the evolution of their kernel during training, thus revealing a new type of implicit bias.

Piecewise Linear Units Improve Deep Neural Networks

- Computer Science, ArXiv
- 2021

Across a distribution of 30 experiments, it is shown that for the same model architecture, hyperparameters, and pre-processing, PiLU significantly outperforms ReLU: reducing classification error by 18.53% on CIFAR-10 and 13.13% on TSP, for a minor increase in the number of neurons.

LiSHT: Non-Parametric Linearly Scaled Hyperbolic Tangent Activation Function for Neural Networks

- Computer Science, ArXiv
- 2019

The proposed LiSHT activation function is an attempt to scale the non-linear Hyperbolic Tangent (Tanh) function by a linear function and tackle the dying gradient problem.

GLU Variants Improve Transformer

- Mathematics, ArXiv
- 2020

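Gated Linear Units (GLU) consist of the component-wise product of two linear projections, one of which is first passed through a sigmoid function. Variants that replace the sigmoid with other nonlinear (or linear) functions are tested in the feed-forward sublayers of the Transformer, and it is found that some of them yield quality improvements over the typically-used ReLU or GELU activations.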

Squareplus: A Softplus-Like Algebraic Rectifier

- Computer Science, ArXiv
- 2021

The specific non-linearity applied at each layer of a neural network influences training dynamics and test-time accuracy, and is a critical tool when designing architectures whose outputs must lie within some range.

Natural-Logarithm-Rectified Activation Function in Convolutional Neural Networks

- Computer Science, 2019 IEEE 5th International Conference on Computer and Communications (ICCC)
- 2019

An improved activation function, the natural-logarithm-rectified linear unit (NLReLU), is proposed. It uses a parametric natural logarithmic transform to improve ReLU, reducing the bias shift effect and the heteroscedasticity of neuron data distributions among network layers in order to accelerate the learning process.

Beta and Alpha Regularizers of Mish Activation Functions for Machine Learning Applications in Deep Neural Networks

- Computer Science
- 2022

A two-factor non-saturating activation function known as Bea-Mish is proposed for machine learning applications in deep neural networks, and empirical results show that this approach outperforms native Mish using a SqueezeNet backbone with an average precision of 2.51% and top-1 accuracy of 1.20%.

On the Selection of Initialization and Activation Function for Deep Neural Networks

- Computer Science, ArXiv
- 2018

This analysis identifies a class of activation functions, including the Swish activation, that improve information propagation over ReLU-like functions, which provides a theoretical grounding for the excellent empirical performance of $\phi_{swish}$ observed in earlier empirical work.

Generalizing and Improving Weight Initialization

- Computer Science, ArXiv
- 2016

A new weight initialization suited for arbitrary nonlinearities is proposed by generalizing previous weight initializations; it enables improved accuracy over previous initializations and allows for training highly regularized neural networks where previous initializations lead to poor convergence.

Constrained Block Nonlinear Neural Dynamical Models

- Computer Science, 2021 American Control Conference (ACC)
- 2021

This work explores a novel formulation for data-efficient learning of deep control-oriented nonlinear dynamical models. Local model structure and constraints are embedded by encoding neural network blocks that represent input, state, and output dynamics, with constraints placed on the network weights and system variables.

## References

SHOWING 1-10 OF 24 REFERENCES

Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations

- Computer Science, ICLR
- 2017

This work proposes zoneout, a novel method for regularizing RNNs that uses random noise to train a pseudo-ensemble, improving generalization. It also performs an empirical investigation of various RNN regularizers and finds that zoneout gives significant performance improvements across tasks.

Exact solutions to the nonlinear dynamics of learning in deep linear neural networks

- Computer Science, ICLR
- 2014

It is shown that deep linear networks exhibit nonlinear learning phenomena similar to those seen in simulations of nonlinear networks, including long plateaus followed by rapid transitions to lower error solutions, and faster convergence from greedy unsupervised pretraining initial conditions than from random initial conditions.

Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)

- Computer Science, ICLR
- 2016

The "exponential linear unit" (ELU) is introduced, which speeds up learning in deep neural networks and leads to higher classification accuracies and significantly better generalization performance than ReLUs and LReLUs on networks with more than 5 layers.

Rectifier Nonlinearities Improve Neural Network Acoustic Models

- Computer Science
- 2013

This work explores the use of deep rectifier networks as acoustic models for the 300 hour Switchboard conversational speech recognition task, and analyzes hidden layer representations to quantify differences in how ReL units encode inputs as compared to sigmoidal units.

Adaptive dropout for training deep neural networks

- Computer Science, NIPS
- 2013

A method called 'standout' is described in which a binary belief network is overlaid on a neural network and is used to regularize its hidden units by selectively setting activities to zero, which achieves lower classification error rates than other feature learning methods, including standard dropout, denoising auto-encoders, and restricted Boltzmann machines.

Residual Networks are Exponential Ensembles of Relatively Shallow Networks

- Computer Science, ArXiv
- 2016

This work introduces a novel interpretation of residual networks showing they are exponential ensembles, and suggests that in addition to describing neural networks in terms of width and depth, there is a third dimension: multiplicity, the size of the implicit ensemble.

Learning with Pseudo-Ensembles

- Computer Science, NIPS
- 2014

A novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it is presented, which naturally extends to the semi-supervised setting, where it produces state-of-the-art results.

Generalizing and Improving Weight Initialization

- Computer Science, ArXiv
- 2016

A new weight initialization suited for arbitrary nonlinearities is proposed by generalizing previous weight initializations; it enables improved accuracy over previous initializations and allows for training highly regularized neural networks where previous initializations lead to poor convergence.

Fast dropout training

- Computer Science, ICML
- 2013

This work shows how to do fast dropout training by sampling from or integrating a Gaussian approximation, instead of doing Monte Carlo optimization of this objective, which gives an order of magnitude speedup and more stability.

Rectified Linear Units Improve Restricted Boltzmann Machines

- Computer Science, ICML
- 2010

Restricted Boltzmann machines were developed using binary stochastic hidden units; replacing these with noisy rectified linear units yields features that are better for object recognition on the NORB dataset and face verification on the Labeled Faces in the Wild dataset.