Corpus ID: 202598604

FastBERT: Speeding up Self-attentions

@inproceedings{Majumder2019FastBERTS,
  title={FastBERT: Speeding up Self-attentions},
  author={Bodhisattwa Prasad Majumder and Huanru Henry Mao and Khalil Mrini},
  year={2019}
}
Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018) has achieved state-of-the-art performance on many natural language understanding tasks in the GLUE benchmark (Wang et al., 2018). The model architecture consists of 12 stacked Transformer blocks, each with 12-head self-attention followed by a position-wise feed-forward network with 3072 hidden units, totaling 110 million parameters. Such a large model is useful when computational power is abundant, but it is too heavy to deploy on mobile…
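As a rough sanity check on the 110-million-parameter figure, the back-of-the-envelope count below assumes standard BERT-base hyperparameters (hidden size 768, 12 layers, 3072 feed-forward units, a ~30K WordPiece vocabulary); the exact total depends on which auxiliary parameters are included.

# Rough parameter count for a BERT-base-like encoder (sketch, not the paper's code).
hidden = 768          # hidden size
layers = 12           # stacked Transformer blocks
ffn = 3072            # feed-forward inner units
vocab = 30522         # WordPiece vocabulary size (assumed)
max_pos = 512         # maximum position embeddings (assumed)

embeddings = (vocab + max_pos + 2) * hidden            # token + position + segment embeddings
per_layer = (
    4 * (hidden * hidden + hidden)                     # Q, K, V, and attention output projections
    + (hidden * ffn + ffn) + (ffn * hidden + hidden)   # position-wise feed-forward network
    + 4 * hidden                                       # two LayerNorms (scale and bias)
)
pooler = hidden * hidden + hidden                      # pooler on the [CLS] token

total = embeddings + layers * per_layer + pooler
print(f"~{total / 1e6:.0f}M parameters")               # prints roughly 109M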

References


BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
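To make the "one additional output layer" concrete, here is a minimal fine-tuning sketch (hypothetical names and shapes, not the authors' code): BERT's pooled [CLS] representation is mapped to task logits by a single linear layer, the only new task-specific parameters.

import numpy as np

def classification_head(pooled_cls, W, b):
    """Single task-specific output layer added for fine-tuning.
    pooled_cls: (batch, hidden); W: (hidden, num_labels); b: (num_labels,)."""
    return pooled_cls @ W + b  # task logits; softmax and cross-entropy are applied during training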

Attention is All you Need

A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.
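The scaled dot-product attention at the core of the Transformer is the kind of operation the FastBERT abstract refers to when it discusses speeding up self-attention. A minimal single-head sketch (illustrative, not the paper's implementation):

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head scaled dot-product attention (Vaswani et al., 2017).
    Q, K, V: (seq_len, d_k). The (seq_len x seq_len) score matrix is the
    quadratic cost that motivates speeding up self-attention."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # pairwise query-key similarities
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                              # weighted sum of value vectors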

Universal Transformers

The Universal Transformer (UT), a parallel-in-time self-attentive recurrent sequence model which can be cast as a generalization of the Transformer model and which addresses issues of parallelizability and global receptive field, is proposed.

An Improved Relative Self-Attention Mechanism for Transformer with Application to Music Generation

In experiments on symbolic music, relative self-attention substantially improves sample quality for unconditioned generation and can generate sequences longer than those in the training set, making it possible to train on much longer sequences and achieve faster convergence.

FitNets: Hints for Thin Deep Nets

This paper extends the idea of a student network that could imitate the soft output of a larger teacher network or ensemble of networks, using not only the outputs but also the intermediate representations learned by the teacher as hints to improve the training process and final performance of the student.
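The summary above describes hint-based training; a minimal sketch of such a hint loss (illustrative names and shapes, not the FitNets code) is an L2 penalty between the teacher's intermediate representation and the student's guided layer, mapped through a learned regressor so the dimensions match.

import numpy as np

def hint_loss(student_hidden, teacher_hidden, W_regressor):
    """L2 hint objective: project the student's guided-layer activations
    into the teacher's hidden dimension, then penalize the distance to the
    teacher's 'hint' representation."""
    projected = student_hidden @ W_regressor  # (batch, d_student) -> (batch, d_teacher)
    return 0.5 * np.mean(np.sum((projected - teacher_hidden) ** 2, axis=-1))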

Distilling the Knowledge in a Neural Network

This work shows that it can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model and introduces a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse.
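A minimal sketch of the soft-target part of this distillation objective (names are illustrative; the temperature and T^2 scaling follow the common formulation):

import numpy as np

def softened_probs(logits, T):
    """Softmax with temperature T >= 1; higher T exposes more of the
    teacher's relative confidences across wrong classes."""
    z = logits / T
    z -= z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Cross-entropy between temperature-softened teacher and student
    distributions, scaled by T^2 so its gradient magnitude stays comparable
    to the hard-label term (omitted here for brevity)."""
    p_teacher = softened_probs(teacher_logits, T)
    p_student = softened_probs(student_logits, T)
    return -(p_teacher * np.log(p_student + 1e-12)).sum(axis=-1).mean() * T * T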

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

A benchmark of nine diverse NLU tasks, an auxiliary dataset for probing models' understanding of specific linguistic phenomena, and an online platform for evaluating and comparing models are presented; the benchmark favors models that can represent linguistic knowledge in a way that facilitates sample-efficient learning and effective knowledge transfer across tasks.

The Lottery Ticket Hypothesis: Training Pruned Neural Networks

The lottery ticket hypothesis and its connection to pruning are a step toward developing architectures, initializations, and training strategies that make it possible to solve the same problems with much smaller networks.

Hidden factors and hidden topics: understanding rating dimensions with review text

This paper aims to combine latent rating dimensions (such as those of latent-factor recommender systems) with latent review topics (such as those learned by topic models like LDA); the combined model more accurately predicts product ratings by harnessing the information present in review text.

Bag of Tricks for Efficient Text Classification

A simple and efficient baseline for text classification is explored, showing that the fast text classifier fastText is often on par with deep learning classifiers in terms of accuracy while being many orders of magnitude faster for training and evaluation.
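The core of such a baseline can be sketched in a few lines (illustrative shapes, not the fastText implementation): average the embeddings of the tokens or n-grams in a document and apply a single linear classifier.

import numpy as np

def bag_of_features_logits(token_ids, E, W, b):
    """fastText-style classifier sketch: E is an embedding table of shape
    (vocab, embed_dim); W and b form a linear layer over the averaged
    document representation."""
    doc_repr = E[token_ids].mean(axis=0)  # (embed_dim,) averaged bag of tokens/n-grams
    return doc_repr @ W + b               # (num_classes,) class logits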