Is it Time to Swish? Comparing Deep Learning Activation Functions Across NLP tasks

Steffen Eger, Paul Youssef, Iryna Gurevych
Activation functions play a crucial role in neural networks because they are the nonlinearities which have been attributed to the success story of deep learning. While most works compare newly proposed activation functions on few tasks (usually from image classification) and against few competitors (usually ReLU), we perform the first large-scale comparison of 21 activation functions across eight different NLP tasks. We find that a largely unknown activation function performs most stably across…
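The title's "swish" and the "penalized tanh" singled out further down this list (see "Revise Saturated Activation Functions") are both one-line formulas. A minimal numpy sketch, assuming the standard definitions (swish(x) = x · sigmoid(βx); penalized tanh scales the negative branch of tanh by a factor, commonly 0.25):

```python
import numpy as np

def swish(x, beta=1.0):
    # swish(x) = x * sigmoid(beta * x); beta=1 gives the common "swish-1" variant
    return x / (1.0 + np.exp(-beta * x))

def penalized_tanh(x, a=0.25):
    # tanh whose negative branch is scaled down by a (0.25 is the usual choice)
    return np.where(x > 0, np.tanh(x), a * np.tanh(x))
```

The parameter names `beta` and `a` are illustrative defaults, not values taken from the paper's experiments.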

Figures and Tables from this paper

How important are activation functions in regression and classification? A survey, performance comparison, and future directions

This work surveys the activation functions that have been employed in the past as well as the current state of the art, and presents various developments in activation functions over the years along with the advantages, disadvantages, and limitations of these activation functions.

Activation functions in deep learning: A comprehensive survey and benchmark

Comparison and Combination of Activation Functions in Broad Learning System

  • Lili Xu, C. L. P. Chen
  • Computer Science
    2020 IEEE International Conference on Systems, Man, and Cybernetics (SMC)
  • 2020
Among all selected activation functions, sigmoid leads to a faster training process and greater approximation capability than the others in general tasks, and the combined activation functions achieve better performance than the corresponding base activation functions in standard BLS.

A Comprehensive Survey and Performance Analysis of Activation Functions in Deep Learning

A comprehensive overview and survey is presented for AFs in neural networks for deep learning, covering different classes of AFs such as Logistic Sigmoid and Tanh based, ReLU based, ELU based, and Learning based.

Activation Functions for Generalized Learning Vector Quantization - A Performance Comparison

This paper investigates successful candidates of activation functions known for MLPs for application in GLVQ and their influence on the performance.

Investigation of Activation Functions for Generalized Learning Vector Quantization

It is shown that the GLVQ classifier function can also be interpreted as a generalized perceptron, and it is investigated whether successful candidates of activation functions for MLPs also perform well for GLVQ.

Neuroevolution based hierarchical activation function for long short-term model network

A differential evolution algorithm (DEA)-based hierarchical combined activation is proposed to replace the default activation functions of the LSTM cell, discovering an optimal combination of functions for the LSTM network.

Interpretable Deep Learning Methods for Classifying Folktales According to the Aarne-Thompson-Uther Scheme

This work proposes to evaluate the use of a cross-language neural network approach based on the previously proposed Hierarchical Attention Network to classify multi-lingual folktales, as well as to explain predictions by generating visualizations, and demonstrates the usefulness of neural attention as a method for generating intuitive visualizations of results.

A Novel Deep Learning Approach Using Contextual Embeddings for Toponym Resolution

This article describes a novel approach for toponym resolution with deep neural networks. The proposed approach does not involve matching references in the text against entries in a gazetteer…

Study of the Effect of Combining Activation Functions in a Convolutional Neural Network

Nine new activation functions based on combinations of classical functions such as ReLU and sigmoid are presented and it is demonstrated that the accuracy of a CNN could be increased by 1.18% with the new proposed activation functions.

Empirical Evaluation of Rectified Activations in Convolutional Network

The experiments suggest that incorporating a non-zero slope for the negative part of rectified activation units could consistently improve the results, and cast doubt on the common belief that sparsity is the key to the good performance of ReLU.
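The "non-zero slope for the negative part" described above is leaky ReLU. A minimal sketch, assuming the widely used default slope of 0.01 (the paper compares several slope settings, including randomized ones):

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    # identity for x >= 0, a small non-zero slope on the negative part
    return np.where(x >= 0, x, slope * x)
```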

Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning

Clear empirical evidence that training with residual connections accelerates the training of Inception networks significantly is given and several new streamlined architectures for both residual and non-residual Inception Networks are presented.

Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification

This work proposes a Parametric Rectified Linear Unit (PReLU) that generalizes the traditional rectified unit and derives a robust initialization method that particularly considers the rectifier nonlinearities.
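PReLU generalizes leaky ReLU by making the negative slope a learned parameter, and the same paper derives the "He" initialization with standard deviation sqrt(2 / fan_in). A sketch under those definitions (in practice `a` is learned per channel during training; here it is a plain scalar for illustration):

```python
import numpy as np

def prelu(x, a):
    # PReLU: identity for x > 0, learned slope a on the negative part
    return np.maximum(0.0, x) + a * np.minimum(0.0, x)

def he_init(fan_in, fan_out, seed=0):
    # He initialization for layers followed by rectifiers: std = sqrt(2 / fan_in)
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))
```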

Self-Normalizing Neural Networks

Self-normalizing neural networks (SNNs) are introduced to enable high-level abstract representations and it is proved that activations close to zero mean and unit variance that are propagated through many network layers will converge towards zero mean and unit variance -- even under the presence of noise and perturbations.
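The self-normalizing property rests on the SELU activation, a scaled ELU whose two constants are fixed by the paper's derivation so that zero mean and unit variance are attracting fixed points. A sketch using those published constants:

```python
import numpy as np

# Fixed constants from the SNN paper's fixed-point derivation
_LAMBDA = 1.0507009873554805
_ALPHA = 1.6732632423543772

def selu(x):
    # scaled ELU: lambda * x for x > 0, lambda * alpha * (exp(x) - 1) otherwise
    return _LAMBDA * np.where(x > 0, x, _ALPHA * (np.exp(x) - 1.0))
```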

Revise Saturated Activation Functions

It is shown that "penalized tanh" is comparable and even outperforms the state-of-the-art non-saturated functions including ReLU and leaky ReLU on deep convolution neural networks.

Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)

The "exponential linear unit" (ELU) is introduced, which speeds up learning in deep neural networks and leads to higher classification accuracies and significantly better generalization performance than ReLUs and LReLUs on networks with more than 5 layers.
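Unlike leaky ReLU's linear negative part, ELU saturates smoothly toward -alpha for negative inputs, which pushes mean activations toward zero. A sketch assuming the standard definition with alpha = 1:

```python
import numpy as np

def elu(x, alpha=1.0):
    # identity for x > 0; smooth saturation toward -alpha for x <= 0
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))
```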

Supervised Learning of Universal Sentence Representations from Natural Language Inference Data

It is shown how universal sentence representations trained using the supervised data of the Stanford Natural Language Inference datasets can consistently outperform unsupervised methods like SkipThought vectors on a wide range of transfer tasks.

Reporting Score Distributions Makes a Difference: Performance Study of LSTM-networks for Sequence Tagging

It is shown that reporting a single performance score is insufficient to compare non-deterministic approaches and proposed to compare score distributions based on multiple executions, and network architectures are presented that produce both superior performance as well as are more stable with respect to the remaining hyperparameters.

Deep Pyramid Convolutional Neural Networks for Text Categorization

A low-complexity word-level deep convolutional neural network architecture for text categorization that can efficiently represent long-range associations in text and outperforms the previous best models on six benchmark datasets for sentiment classification and topic categorization.

Taming the waves: sine as activation function in deep neural networks

This paper formally characterizes why deep neural networks can indeed often be difficult to train even in very simple scenarios, and describes how the presence of infinitely many shallow local minima emerges from the architecture.