EarlyBERT: Efficient BERT Training via Early-bird Lottery Tickets

@inproceedings{Chen2021EarlyBERTEB,
  title={EarlyBERT: Efficient BERT Training via Early-bird Lottery Tickets},
  author={Xiaohan Chen and Yu Cheng and Shuohang Wang and Zhe Gan and Zhangyang Wang and Jingjing Liu},
  booktitle={ACL},
  year={2021}
}
Heavily overparameterized language models such as BERT, XLNet and T5 have achieved impressive success in many NLP tasks. However, their high model complexity requires enormous computational resources and extremely long training time for both pre-training and fine-tuning. Many works have studied model compression for large NLP models, but they focus only on reducing inference time while still requiring an expensive training process. Other works use extremely large batch sizes to shorten the pre…
Robust Lottery Tickets for Pre-trained Language Models
TLDR
This work proposes a novel method based on learning binary weight masks to identify robust tickets hidden in the original PLMs, and designs an adversarial loss objective to guide the search for robust tickets and ensure that the tickets perform well both in accuracy and robustness.
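One common way to realize "learning binary weight masks" over a frozen pre-trained layer is a straight-through estimator on real-valued mask scores. The sketch below illustrates only that general idea; the module name, initialization, and threshold are assumptions, and the paper's adversarial objective is not shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinaryMaskedLinear(nn.Module):
    """Frozen pre-trained linear layer with a learnable binary mask (illustrative)."""

    def __init__(self, pretrained: nn.Linear, init_score: float = 0.0):
        super().__init__()
        # Pre-trained weights stay frozen; only the mask scores are trained.
        self.register_buffer("weight", pretrained.weight.detach().clone())
        self.register_buffer(
            "bias",
            pretrained.bias.detach().clone() if pretrained.bias is not None else None,
        )
        self.scores = nn.Parameter(torch.full_like(self.weight, init_score))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        soft = torch.sigmoid(self.scores)         # relaxed mask in (0, 1)
        hard = (soft > 0.5).float()               # binary mask used in the forward pass
        mask = soft + (hard - soft).detach()      # straight-through gradient estimator
        return F.linear(x, self.weight * mask, self.bias)
```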
DSEE: Dually Sparsity-embedded Efficient Tuning of Pre-trained Language Models
TLDR
This work proposes a framework for resource- and parameter-efficient fine-tuning that leverages the sparsity prior in both the weight updates and the final model weights, exploiting unstructured and structured sparse patterns in pre-trained language models via magnitude-based pruning and ℓ1 sparse regularization.
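One way to picture the "sparsity in weight updates" direction is to freeze the pre-trained weights and learn only an additive delta that an L1 penalty pushes toward sparsity. A minimal sketch under that reading; the module and the penalty weight are illustrative and only cover the sparse-update part, not the full method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseDeltaLinear(nn.Module):
    """Frozen pre-trained linear layer plus a learnable, L1-regularized update."""

    def __init__(self, pretrained: nn.Linear, l1_weight: float = 1e-4):
        super().__init__()
        # Keep the pre-trained parameters fixed; only the delta is trained.
        self.register_buffer("weight", pretrained.weight.detach().clone())
        self.register_buffer(
            "bias",
            pretrained.bias.detach().clone() if pretrained.bias is not None else None,
        )
        self.delta = nn.Parameter(torch.zeros_like(self.weight))
        self.l1_weight = l1_weight

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.linear(x, self.weight + self.delta, self.bias)

    def sparsity_penalty(self) -> torch.Tensor:
        # Added to the task loss to push the weight update toward a sparse pattern.
        return self.l1_weight * self.delta.abs().sum()
```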
Structured Pruning Learns Compact and Accurate Models
TLDR
This work proposes a task-specific structured pruning method CoFi (Coarse- and Fine-grained Pruning), which delivers highly parallelizable subnetworks and matches the distillation methods in both accuracy and latency, without resorting to any unlabeled data.
On the Compression of Natural Language Models
TLDR
It was shown that typical dense neural networks contain a small sparse sub-network that can be trained to reach similar test accuracy in an equal number of steps, and the goal of this work is to assess whether such a trainable subnetwork exists for natural language models (NLMs).
Towards Structured Dynamic Sparse Pre-Training of BERT
TLDR
This work develops and studies a straightforward, dynamic always-sparse pre-training approach for the BERT language modeling task, which leverages periodic compression steps based on magnitude pruning followed by random parameter re-allocation to achieve Pareto improvements in terms of the number of floating-point operations.
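The described compression step can be read as: periodically drop the smallest-magnitude active weights and re-allocate the freed parameters at random positions, keeping the overall sparsity roughly fixed. A rough sketch of one such step on a single 2D weight matrix; the function name, drop fraction, and regrowth rule are assumptions for illustration.

```python
import torch

def prune_and_reallocate(weight: torch.Tensor,
                         mask: torch.Tensor,
                         drop_fraction: float = 0.1) -> torch.Tensor:
    """One dynamic-sparsity step: prune smallest active weights, regrow at random.

    weight: dense 2D parameter storage; mask: 0/1 tensor of the same shape
    marking active parameters. The number of active weights is (ties aside) preserved.
    """
    active = mask.bool()
    n_drop = max(1, int(active.sum().item() * drop_fraction))

    # Prune: deactivate the n_drop smallest-magnitude active weights.
    thresh = torch.kthvalue(weight[active].abs(), n_drop).values
    new_mask = mask.clone()
    new_mask[active & (weight.abs() <= thresh)] = 0.0

    # Regrow: randomly activate the same number of previously inactive positions.
    candidates = (~active).nonzero(as_tuple=False)
    pick = candidates[torch.randperm(candidates.size(0))[:n_drop]]
    new_mask[pick[:, 0], pick[:, 1]] = 1.0
    return new_mask
```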
Data-Efficient Double-Win Lottery Tickets from Robust Pre-training
TLDR
This paper formulates a more rigorous concept, Double-Win Lottery Tickets, in which a subnetwork located in a pre-trained model can be independently transferred to diverse downstream tasks and reach BOTH the same standard and robust generalization, under BOTH standard and adversarial training regimes, as the full pre-trained model.
Ultra-Data-Efficient GAN Training: Drawing A Lottery Ticket First, Then Training It Toughly
TLDR
This work decomposes the data-hungry GAN training into two sequential sub-problems: identifying the lottery ticket from the original GAN; then training the found sparse subnetwork with aggressive data and feature augmentations, effectively stabilizing training and improving convergence.
A Unified Lottery Ticket Hypothesis for Graph Neural Networks
TLDR
A unified GNN sparsification (UGS) framework is presented that simultaneously prunes the graph adjacency matrix and the model weights to effectively accelerate GNN inference on large-scale graphs, and the recently popular lottery ticket hypothesis is generalized to GNNs for the first time.
Attribution-based Task-specific Pruning for Multi-task Language Models
TLDR
Experimental results on the six widely-used datasets show that the proposed pruning method significantly outperforms baseline compression methods and is extended to be applicable in a low-resource setting, where the number of labeled datasets is insufficient.
Learning to Win Lottery Tickets in BERT Transfer via Task-agnostic Mask Training
TLDR
This paper discovers that the success of magnitude pruning can be attributed to the preserved pre-training performance, which correlates with the downstream transferability, and proposes to directly optimize the subnetwork structure towards the pre-training objectives, which can better preserve the pre-training performance.
...

References

Showing 1-10 of 46 references
The Lottery Tickets Hypothesis for Supervised and Self-supervised Pre-training in Computer Vision Models
TLDR
This paper examines supervised and self-supervised pre-trained models through the lens of the lottery ticket hypothesis (LTH) and concludes that the core LTH observations remain generally relevant in the pre-training paradigm of computer vision, although more nuanced discussion is needed in some cases.
When BERT Plays the Lottery, All Tickets Are Winning
TLDR
It is shown that the "bad" subnetworks can be fine-tuned separately to achieve only slightly worse performance than the "good" ones, indicating that most weights in the pre-trained BERT are potentially useful.
Reducing Transformer Depth on Demand with Structured Dropout
TLDR
LayerDrop, a form of structured dropout, is explored; it has a regularization effect during training and allows for efficient pruning at inference time, and the work shows that sub-networks of any depth can be selected from one large network without fine-tuning them and with limited impact on performance.
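Structured dropout over layers can be pictured as skipping whole residual layers at random during training, so that at inference any subset of layers still forms a usable model. A minimal sketch, assuming a list of residual layers and a uniform drop probability (both illustrative, not the paper's exact schedule).

```python
import torch
import torch.nn as nn

class LayerDropStack(nn.Module):
    """Applies a stack of residual layers, randomly skipping each one during training."""

    def __init__(self, layers: nn.ModuleList, p_drop: float = 0.2):
        super().__init__()
        self.layers = layers
        self.p_drop = p_drop

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            if self.training and torch.rand(()) < self.p_drop:
                continue          # skip this layer entirely (structured dropout)
            x = layer(x)
        return x
```

At inference time, pruning then amounts to simply dropping a fixed subset of layers (for example every other one) from the stack.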
Learning Efficient Convolutional Networks through Network Slimming
TLDR
The approach, called network slimming, takes wide and large networks as input models; during training, insignificant channels are automatically identified and then pruned, yielding thin and compact models with comparable accuracy.
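Network slimming is usually summarized as adding an L1 penalty on BatchNorm scaling factors during training and then pruning the channels whose factors shrink toward zero. A minimal sketch of both steps in the original CNN/BatchNorm setting; the penalty weight and pruning ratio are chosen only for illustration.

```python
import torch
import torch.nn as nn

def bn_l1_penalty(model: nn.Module, lam: float = 1e-4):
    """L1 penalty on BatchNorm scaling factors, added to the training loss."""
    return lam * sum(m.weight.abs().sum()
                     for m in model.modules()
                     if isinstance(m, nn.BatchNorm2d))

def slimming_masks(model: nn.Module, prune_ratio: float = 0.5):
    """Per-layer channel masks dropping the channels with the smallest scaling factors."""
    masks = {}
    for name, m in model.named_modules():
        if isinstance(m, nn.BatchNorm2d):
            gamma = m.weight.detach().abs()
            k = max(1, int(gamma.numel() * (1.0 - prune_ratio)))
            thresh = torch.topk(gamma, k).values.min()
            masks[name] = (gamma >= thresh).float()
    return masks
```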
Drawing early-bird tickets: Towards more efficient training of deep networks
TLDR
This paper discovers for the first time that winning tickets can be identified at a very early training stage, which it terms early-bird (EB) tickets, via low-cost training schemes at large learning rates, consistent with recently reported observations that the key connectivity patterns of neural networks emerge early.
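The early-bird criterion is typically described as monitoring how much the pruning mask changes between consecutive epochs and drawing the ticket once the mask stabilizes. Below is a minimal, hedged sketch of that idea; the distance threshold, patience, and the use of channel scores as pruning signals are illustrative assumptions.

```python
import torch

def channel_mask(scores: torch.Tensor, prune_ratio: float) -> torch.Tensor:
    """Binary mask keeping the largest-|score| channels (e.g., BatchNorm gammas)."""
    k = max(1, int(scores.numel() * (1.0 - prune_ratio)))
    thresh = torch.topk(scores.abs(), k).values.min()
    return (scores.abs() >= thresh).float()

def mask_distance(m1: torch.Tensor, m2: torch.Tensor) -> float:
    """Normalized Hamming distance between two binary masks."""
    return (m1 != m2).float().mean().item()

def is_early_bird(mask_history, eps: float = 0.1, patience: int = 3) -> bool:
    """Declare an early-bird ticket once recent masks stop changing.

    eps and patience are illustrative hyperparameters, not values from the paper.
    """
    if len(mask_history) < patience + 1:
        return False
    recent = mask_history[-(patience + 1):]
    dists = [mask_distance(a, b) for a, b in zip(recent[:-1], recent[1:])]
    return max(dists) < eps
```

In use, one would append the mask computed at the end of each epoch (for instance from scaling factors, as in the slimming sketch above) and stop the search as soon as is_early_bird returns True.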
MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices
TLDR
MobileBERT is a thin version of BERT_LARGE equipped with bottleneck structures and a carefully designed balance between self-attention and feed-forward networks; it can be generically applied to various downstream NLP tasks via simple fine-tuning.
Playing Lottery Tickets with Vision and Language
TLDR
This work uses UNITER, one of the best-performing V+L models, as the testbed, and conducts the first empirical study to assess whether trainable subnetworks also exist in pre-trained V+L models.
Ultra-Data-Efficient GAN Training: Drawing A Lottery Ticket First, Then Training It Toughly
TLDR
This work decomposes the data-hungry GAN training into two sequential sub-problems: identifying the lottery ticket from the original GAN; then training the found sparse subnetwork with aggressive data and feature augmentations, effectively stabilizing training and improving convergence.
Contrastive Distillation on Intermediate Representations for Language Model Compression
TLDR
CoDIR is proposed, a principled knowledge distillation framework where the student is trained to distill knowledge through intermediate layers of the teacher via a contrastive objective, and achieves superb performance on the GLUE benchmark, outperforming state-of-the-art compression methods.
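A contrastive objective over intermediate representations can be sketched as pulling a student layer's output toward the matched teacher layer's output for the same example while pushing it away from other examples in the batch. A minimal, hedged sketch of such an InfoNCE-style loss; the pooling, temperature, and function name are illustrative, not the authors' code.

```python
import torch
import torch.nn.functional as F

def contrastive_distill_loss(student_h: torch.Tensor,
                             teacher_h: torch.Tensor,
                             temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE-style loss between student and teacher intermediate representations.

    student_h, teacher_h: (batch, hidden) pooled representations from matched layers.
    The positive pair is the same example's teacher representation; the other
    examples in the batch act as negatives.
    """
    s = F.normalize(student_h, dim=-1)
    t = F.normalize(teacher_h, dim=-1)
    logits = s @ t.t() / temperature                      # (batch, batch) similarities
    targets = torch.arange(s.size(0), device=s.device)    # diagonal entries are positives
    return F.cross_entropy(logits, targets)
```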
Hopfield Networks is All You Need
TLDR
A new PyTorch layer called "Hopfield" is provided, which allows deep learning architectures to be equipped with modern Hopfield networks as a new powerful concept comprising pooling, memory, and attention.
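The retrieval step behind such a layer is often written as a softmax-weighted lookup over stored patterns, essentially an attention-style update. A minimal sketch of that update rule only, not the API of the published "Hopfield" layer.

```python
import torch

def hopfield_retrieve(patterns: torch.Tensor,
                      query: torch.Tensor,
                      beta: float = 1.0,
                      steps: int = 1) -> torch.Tensor:
    """Modern Hopfield update: xi <- X^T softmax(beta * X xi).

    patterns: (num_patterns, dim) stored patterns X.
    query:    (dim,) state vector xi, refined toward a stored pattern.
    """
    xi = query
    for _ in range(steps):
        attn = torch.softmax(beta * patterns @ xi, dim=0)  # similarity to each pattern
        xi = patterns.t() @ attn                           # weighted recombination
    return xi
```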
...