Revealing the Dark Secrets of BERT

@article{Kovaleva2019RevealingTD,
  title={Revealing the Dark Secrets of BERT},
  author={Olga Kovaleva and Alexey Romanov and Anna Rogers and Anna Rumshisky},
  journal={ArXiv},
  year={2019},
  volume={abs/1908.08593}
}
BERT-based architectures currently give state-of-the-art performance on many NLP tasks, but little is known about the exact mechanisms that contribute to its success. In the current work, we focus on the interpretation of self-attention, which is one of the fundamental underlying components of BERT. Using a subset of GLUE tasks and a set of handcrafted features-of-interest, we propose the methodology and carry out a qualitative and quantitative analysis of the information encoded by the… 
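The self-attention analysis described in the abstract can be reproduced at small scale with the HuggingFace transformers library. The snippet below is a minimal sketch, not the paper's exact pipeline: the model name, example sentence, and the [CLS]/[SEP] probe are illustrative choices.

```python
# Minimal sketch: extract BERT self-attention maps for inspection.
# Assumes the HuggingFace `transformers` library; the model name and
# example sentence are illustrative, not taken from the paper.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

inputs = tokenizer("The cat sat on the mat.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one tensor per layer,
# each of shape (batch, num_heads, seq_len, seq_len).
attentions = outputs.attentions
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

# Example probe: how much attention does each layer put on [CLS] and [SEP]?
for layer_idx, layer_attn in enumerate(attentions):
    to_cls = layer_attn[0, :, :, 0].mean()
    to_sep = layer_attn[0, :, :, tokens.index("[SEP]")].mean()
    print(f"layer {layer_idx}: to-[CLS]={to_cls:.3f}, to-[SEP]={to_sep:.3f}")
```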
Transformers: "The End of History" for NLP?
TLDR
This work sheds some light on some important theoretical limitations of pre-trained BERT-style models that are inherent in the general Transformer architecture and demonstrates in practice on two general types of tasks and four datasets that these limitations are indeed harmful and that addressing them can yield sizable improvements over vanilla RoBERTa and XLNet.
BERT Busters: Outlier LayerNorm Dimensions that Disrupt BERT
TLDR
It is demonstrated that pretrained Transformer encoders are surprisingly fragile to the removal of a very small number of scaling factors and biases in the output layer normalization, and the results suggest that layer normalization plays a much more important role than usually assumed.
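As a rough illustration of the kind of ablation this entry describes, the sketch below zeroes the LayerNorm scaling factor and bias in a single dimension of every encoder layer of a pretrained BERT. The module paths assume the HuggingFace BertModel implementation, and the dimension index is an arbitrary placeholder, whereas the paper identifies specific outlier dimensions.

```python
# Sketch: disable one output-LayerNorm dimension in every BERT layer.
# Module paths assume the HuggingFace BertModel implementation; the
# dimension index is an arbitrary placeholder, not a reported outlier.
import torch
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
dim_to_disable = 42  # placeholder index, for illustration only

with torch.no_grad():
    for layer in model.encoder.layer:
        ln = layer.output.LayerNorm      # LayerNorm after the FFN sublayer
        ln.weight[dim_to_disable] = 0.0  # scaling factor
        ln.bias[dim_to_disable] = 0.0    # bias

# Evaluating `model` on a downstream task would now show how sensitive
# the encoder is to losing this single scaling factor / bias pair.
```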
Emergent Properties of Finetuned Language Representation Models
TLDR
This work shows empirical evidence that the [CLS] embedding in BERT contains highly redundant information, and can be compressed with minimal loss of accuracy, especially for finetuned models, dovetailing into open threads in the field about the role of over-parameterization in learning.
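A minimal way to test this kind of redundancy claim is to fit a low-rank projection on a matrix of [CLS] embeddings and check how much variance a handful of components retain. The sketch below uses scikit-learn's PCA on randomly generated stand-in embeddings; real use would replace them with [CLS] vectors collected from a fine-tuned model.

```python
# Sketch: compress [CLS] embeddings with PCA and check retained variance.
# The embeddings here are random stand-ins; in practice they would be
# collected from a (fine-tuned) BERT model over a dataset.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
cls_embeddings = rng.normal(size=(1000, 768))  # (num_examples, hidden_size)

pca = PCA(n_components=64)                     # compress 768 -> 64 dims
compressed = pca.fit_transform(cls_embeddings)

print(compressed.shape)                        # (1000, 64)
print("variance retained:", pca.explained_variance_ratio_.sum())
```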
Poor Man's BERT: Smaller and Faster Transformer Models
TLDR
A number of memory-light model reduction strategies that do not require model pre-training from scratch are explored, which are able to prune BERT, RoBERTa and XLNet models by up to 40%, while maintaining up to 98% of their original performance.
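One of the simplest reduction strategies in this family is to drop the top encoder layers of an already-pretrained model before fine-tuning. The sketch below keeps the bottom 6 of 12 layers of a HuggingFace BertModel; it illustrates the idea rather than the paper's exact recipe, which compares several layer-dropping strategies.

```python
# Sketch: keep only the bottom k encoder layers of a pretrained BERT.
# Illustrates top-layer dropping; the paper evaluates several strategies.
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
k = 6  # keep the bottom 6 of 12 layers

model.encoder.layer = model.encoder.layer[:k]  # nn.ModuleList slice
model.config.num_hidden_layers = k             # keep the config consistent

print(sum(p.numel() for p in model.parameters()), "parameters remain")
# The truncated model is then fine-tuned on the downstream task as usual.
```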
Investigating Learning Dynamics of BERT Fine-Tuning
TLDR
It is concluded that BERT fine-tuning mainly changes the attention mode of the last layers and modifies the feature extraction modes of the intermediate and last layers.
HiddenCut: Simple Data Augmentation for Natural Language Understanding with Better Generalization
TLDR
A simple yet effective data augmentation technique to better regularize the model and encourage it to learn more generalizable features, HiddenCut, which outperforms the state-of-the-art augmentation methods on the GLUE benchmark and consistently exhibits superior generalization performance on out-of-distribution and challenging counterexamples.
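The core operation behind this style of augmentation, masking a contiguous span of hidden representations during training, can be sketched in a few lines. The function below is a generic illustration of span dropping on a hidden-state tensor and does not reproduce the paper's attention-guided span selection.

```python
# Sketch: zero out a random contiguous span of hidden states per example.
# Generic span dropping for illustration; HiddenCut additionally uses
# attention information to choose which spans to cut.
import torch

def hidden_cut(hidden_states: torch.Tensor, cut_ratio: float = 0.1) -> torch.Tensor:
    """hidden_states: (batch, seq_len, hidden_size)."""
    batch, seq_len, _ = hidden_states.shape
    span = max(1, int(seq_len * cut_ratio))
    out = hidden_states.clone()
    for i in range(batch):
        start = torch.randint(0, seq_len - span + 1, (1,)).item()
        out[i, start:start + span, :] = 0.0
    return out

# Example: applied to a dummy batch (during training only).
h = torch.randn(4, 16, 768)
h_aug = hidden_cut(h, cut_ratio=0.2)
```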
Transformers: "The End of History" for Natural Language Processing?
TLDR
It is demonstrated in practice on two general types of tasks—segmentation and segment labeling—and on four datasets that these limitations of pre-trained BERT-style models that are inherent in the general Transformer architecture are indeed harmful and that addressing them can yield sizable improvements over vanilla RoBERTa and XLNet models.
A Closer Look at How Fine-tuning Changes BERT
TLDR
This work studies the English BERT family and uses two probing techniques to analyze how fine-tuning changes the space, and shows that fine-tuning improves performance because it pushes points associated with a label away from other labels.
Automatic Mixed-Precision Quantization Search of BERT
TLDR
This paper proposes an automatic mixed-precision quantization framework designed for BERT that can conduct quantization and pruning simultaneously; it leverages Differentiable Neural Architecture Search to assign scale and precision for parameters in each sub-group automatically, while at the same time pruning out redundant groups of parameters.
Pruning a BERT-based Question Answering Model
TLDR
This work starts from models trained for SQuAD 2.0 and introduces gates that allow selected parts of transformers to be individually eliminated and finds that a combination of pruning attention heads and the feed-forward layer almost doubles the decoding speed.
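HuggingFace models expose a structural head-pruning hook that makes the attention-head part of such experiments easy to reproduce. The sketch below removes a few heads from a question-answering model; the layer and head indices are arbitrary placeholders, whereas the paper learns which gates to close, and feed-forward pruning would require modifying the intermediate layers directly.

```python
# Sketch: structurally remove selected attention heads from a QA model.
# The layer/head indices are arbitrary placeholders; the paper learns
# per-component gates on a model trained for SQuAD 2.0.
from transformers import AutoModelForQuestionAnswering

model = AutoModelForQuestionAnswering.from_pretrained("bert-base-uncased")

# Mapping: layer index -> list of head indices to prune in that layer.
heads_to_prune = {0: [2, 5], 3: [0], 11: [7, 8, 9]}
model.prune_heads(heads_to_prune)

# The pruned model has fewer parameters and decodes faster; it is then
# re-evaluated (or lightly re-trained) on the QA task.
```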

References

Showing 1-10 of 29 references.
Are Sixteen Heads Really Better than One?
TLDR
The surprising observation is made that even if models have been trained using multiple heads, in practice a large percentage of attention heads can be removed at test time without significantly impacting performance.
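A quick way to approximate this kind of test-time ablation with HuggingFace models is the `head_mask` argument of the forward pass, which multiplies each head's output by a 0/1 gate. The sketch below disables all but one head in every layer of a BERT encoder; the gate values are illustrative only.

```python
# Sketch: mask out attention heads at test time via `head_mask`.
# Keeps only head 0 in every layer; purely illustrative gate values.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased").eval()

num_layers = model.config.num_hidden_layers    # 12
num_heads = model.config.num_attention_heads   # 12

head_mask = torch.zeros(num_layers, num_heads)
head_mask[:, 0] = 1.0                          # keep a single head per layer

inputs = tokenizer("Masking heads at test time.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, head_mask=head_mask)
# Comparing task metrics with and without the mask estimates how much
# the removed heads actually contribute.
```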
Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned
TLDR
It is found that the most important and confident heads play consistent and often linguistically-interpretable roles and when pruning heads using a method based on stochastic gates and a differentiable relaxation of the L0 penalty, it is observed that specialized heads are last to be pruned.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TLDR
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
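The "one additional output layer" recipe corresponds directly to the sequence-classification wrappers in HuggingFace transformers. The sketch below instantiates such a model; the label count and example input are illustrative placeholders, and standard supervised fine-tuning of all weights follows.

```python
# Sketch: BERT plus a single classification layer for fine-tuning.
# Label count and example input are illustrative placeholders.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

batch = tokenizer(["a sentence to classify"], return_tensors="pt")
labels = torch.tensor([1])

outputs = model(**batch, labels=labels)  # loss over the added output layer
outputs.loss.backward()                  # all BERT weights are fine-tuned too
```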
What Does BERT Learn about the Structure of Language?
TLDR
This work provides novel support for the possibility that BERT networks capture structural information about language by performing a series of experiments to unpack the elements of English language structure learned by BERT.
Attention is All you Need
TLDR
A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
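For reference, the scaled dot-product attention at the heart of the Transformer can be written in a few lines. The sketch below is the standard formulation softmax(QK^T / sqrt(d_k)) V in PyTorch, without the multi-head projections or masking.

```python
# Sketch: scaled dot-product attention, the Transformer's core operation.
# Single head, no masking or output projection, for illustration.
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """q, k, v: (batch, seq_len, d_k); returns (batch, seq_len, d_k)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (batch, seq, seq)
    weights = torch.softmax(scores, dim=-1)            # attention distribution
    return weights @ v

q = k = v = torch.randn(2, 8, 64)
out = scaled_dot_product_attention(q, k, v)            # (2, 8, 64)
```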
SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems
TLDR
A new benchmark styled after GLUE is presented, comprising a new set of more difficult language understanding tasks, a software toolkit, and a public leaderboard.
Pay Less Attention with Lightweight and Dynamic Convolutions
TLDR
It is shown that a very lightweight convolution can perform competitively to the best reported self-attention results, and dynamic convolutions are introduced which are simpler and more efficient than self-attention.
Rethinking Complex Neural Network Architectures for Document Classification
TLDR
In a large-scale reproducibility study of several recent neural models, it is found that a simple BiLSTM architecture with appropriate regularization yields accuracy and F1 that are either competitive or exceed the state of the art on four standard benchmark datasets.
How transferable are features in deep neural networks?
TLDR
This paper quantifies the generality versus specificity of neurons in each layer of a deep convolutional neural network and reports a few surprising results, including that initializing a network with transferred features from almost any number of layers can produce a boost to generalization that lingers even after fine-tuning to the target dataset.
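The transfer experiments in this line of work amount to copying pretrained weights for the first n layers, freezing (or not) those layers, and training the rest on the target task. The sketch below shows the freezing step for the lower layers of a BERT encoder, as a rough analogue of the original convolutional-network setup rather than a reproduction of it.

```python
# Sketch: transfer by freezing the lower encoder layers and training the rest.
# Rough analogue of the layer-transfer experiments, applied to BERT.
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
n_frozen = 8  # transfer and freeze the bottom 8 layers

for param in model.embeddings.parameters():
    param.requires_grad = False
for layer in model.encoder.layer[:n_frozen]:
    for param in layer.parameters():
        param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")
# The remaining layers (and any task head) are then trained on the target task.
```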
The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks
TLDR
This work finds that dense, randomly-initialized, feed-forward networks contain subnetworks ("winning tickets") that - when trained in isolation - reach test accuracy comparable to the original network in a similar number of iterations, and articulate the "lottery ticket hypothesis".
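A single round of the pruning-and-rewinding procedure behind the lottery ticket hypothesis can be sketched compactly: save the initialization, train, prune the smallest-magnitude weights, and reset the survivors to their initial values. The toy model, data, and pruning rate below are illustrative placeholders.

```python
# Sketch: one round of lottery-ticket style magnitude pruning with rewinding.
# Toy model, data, and pruning rate are illustrative placeholders.
import copy
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(100, 300), nn.ReLU(), nn.Linear(300, 10))
init_state = copy.deepcopy(model.state_dict())  # save the original init

# (Stand-in for full training.)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(64, 100), torch.randint(0, 10, (64,))
for _ in range(100):
    opt.zero_grad()
    nn.functional.cross_entropy(model(x), y).backward()
    opt.step()

# Prune the 80% smallest-magnitude weights in each Linear layer and
# rewind the surviving weights to their initial values.
masks = {}
for name, module in model.named_modules():
    if isinstance(module, nn.Linear):
        w = module.weight.detach().abs()
        threshold = torch.quantile(w.flatten(), 0.8)
        masks[name] = (w > threshold).float()

model.load_state_dict(init_state)            # rewind to init
with torch.no_grad():
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            module.weight *= masks[name]     # apply the winning-ticket mask

# Retraining `model` with the mask kept applied tests the hypothesis.
```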