Corpus ID: 237941032

On the Prunability of Attention Heads in Multilingual BERT

Aakriti Budhraja, Madhura Pande, Pratyush Kumar, Mitesh M. Khapra
Large multilingual models, such as mBERT, have shown promise in crosslingual transfer. In this work, we employ pruning to quantify the robustness of mBERT and to interpret its layer-wise importance. On four GLUE tasks, the relative drops in accuracy due to pruning are almost identical for mBERT and BERT, suggesting that the reduced attention capacity of the multilingual models does not affect robustness to pruning. For the crosslingual task XNLI, we report higher drops in accuracy with pruning…
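
The abstract compares models by their relative drop in accuracy after attention heads are pruned. The sketch below shows one way such a comparison could be set up. It is an illustration under stated assumptions, not the authors' procedure: random head selection stands in for whichever pruning strategy the paper uses, relative drop is assumed to mean the accuracy lost as a fraction of the unpruned accuracy, and the accuracy values are invented. The helper names `random_head_mask` and `relative_drop` are hypothetical. The 12-layer, 12-head grid matches the base-size BERT and mBERT architecture.

```python
import random


def random_head_mask(num_layers, num_heads, prune_fraction, seed=0):
    """Return a num_layers x num_heads grid of 1 (keep) and 0 (prune).

    Assumption: heads are chosen uniformly at random; the paper's own
    selection strategy may differ.
    """
    total = num_layers * num_heads
    num_pruned = round(total * prune_fraction)
    rng = random.Random(seed)
    pruned = set(rng.sample(range(total), num_pruned))
    return [
        [0 if layer * num_heads + head in pruned else 1
         for head in range(num_heads)]
        for layer in range(num_layers)
    ]


def relative_drop(baseline_accuracy, pruned_accuracy):
    """Accuracy lost to pruning, as a fraction of the unpruned accuracy.

    Assumption: this is how "relative drop" is defined here.
    """
    if baseline_accuracy <= 0:
        raise ValueError("baseline_accuracy must be positive")
    return (baseline_accuracy - pruned_accuracy) / baseline_accuracy


# Base-size BERT and mBERT have 12 layers with 12 attention heads each.
mask = random_head_mask(num_layers=12, num_heads=12, prune_fraction=0.5)
kept_heads = sum(sum(row) for row in mask)

# Hypothetical accuracies, for illustration only (not results from the paper).
drop = relative_drop(baseline_accuracy=0.80, pruned_accuracy=0.76)
print(f"heads kept: {kept_heads}, relative drop: {drop:.1%}")
```

A mask of this shape could then be applied to a model so that pruned heads contribute nothing to the attention output. Comparing `relative_drop` across models rather than raw accuracy keeps the comparison fair when their unpruned accuracies differ.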

Probing Structured Pruning on Multilingual Pre-trained Models: Settings, Algorithms, and Efficiency
This work investigates three aspects of structured pruning on multilingual pre-trained language models (settings, algorithms, and efficiency) and presents Dynamic Sparsification, a simple approach that allows training the model once and adapting it to different model sizes at inference.


Beto, Bentz, Becas: The Surprising Cross-Lingual Effectiveness of BERT
This paper explores the broader cross-lingual potential of mBERT (multilingual BERT) as a zero-shot language transfer model on 5 NLP tasks covering a total of 39 languages from various language families: NLI, document classification, NER, POS tagging, and dependency parsing.
On the Weak Link between Importance and Prunability of Attention Heads
It is found that a large fraction of the attention heads can be randomly pruned with limited effect on accuracy, and the results suggest that interpretation of attention heads does not strongly inform pruning.
Unsupervised Cross-lingual Representation Learning at Scale
It is shown that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks, and the possibility of multilingual modeling without sacrificing per-language performance is shown for the first time.
How Multilingual is Multilingual BERT?
It is concluded that M-BERT does create multilingual representations, but that these representations exhibit systematic deficiencies affecting certain language pairs, and that the model can find translation pairs.
Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned
It is found that the most important and confident heads play consistent and often linguistically interpretable roles. When heads are pruned using a method based on stochastic gates and a differentiable relaxation of the L0 penalty, the specialized heads are observed to be the last to be pruned.
Are Sixteen Heads Really Better than One?
The surprising observation is made that, even if models have been trained using multiple heads, in practice a large percentage of attention heads can be removed at test time without significantly impacting performance.
Poor Man's BERT: Smaller and Faster Transformer Models
A number of memory-light model reduction strategies that do not require model pre-training from scratch are explored, which are able to prune BERT, RoBERTa and XLNet models by up to 40%, while maintaining up to 98% of their original performance.
Linguistic Knowledge and Transferability of Contextual Representations
It is found that linear models trained on top of frozen contextual representations are competitive with state-of-the-art task-specific models in many cases, but fail on tasks requiring fine-grained linguistic knowledge.
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
This work presents two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT, and uses a self-supervised loss that focuses on modeling inter-sentence coherence.
The heads hypothesis: A unifying statistical approach towards understanding multi-headed attention in BERT
A simple yet effective score is formalized that generalizes to all the roles of attention heads, and hypothesis testing is applied to this score for robust inference. This provides a lens for systematically analyzing attention heads and for commenting with confidence on many commonly posed questions about the BERT model.