AdapterDrop: On the Efficiency of Adapters in Transformers

@inproceedings{Rckl2021AdapterDropOT,
  title={AdapterDrop: On the Efficiency of Adapters in Transformers},
  author={Andreas R{\"u}ckl{\'e} and Gregor Geigle and Max Glockner and Tilman Beck and Jonas Pfeiffer and Nils Reimers and Iryna Gurevych},
  booktitle={EMNLP},
  year={2021}
}
Transformer models are expensive to fine-tune, slow for inference, and have large storage requirements. Recent approaches tackle these shortcomings by training smaller models, dynamically reducing the model size, and by training light-weight adapters. In this paper, we propose AdapterDrop, removing adapters from lower transformer layers during training and inference, which incorporates concepts from all three directions. We show that AdapterDrop can dynamically reduce the computational overhead… 
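As a rough illustration of the idea (not the authors' implementation), the sketch below assumes a standard bottleneck adapter attached to each transformer layer and simply skips the adapters in the lowest n layers; the layer stand-ins and hyperparameters are hypothetical.

import torch
import torch.nn as nn

class Adapter(nn.Module):
    # Standard bottleneck adapter: down-project, nonlinearity, up-project, residual.
    def __init__(self, hidden_size, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)

    def forward(self, h):
        return h + self.up(torch.relu(self.down(h)))

class AdapterDropEncoder(nn.Module):
    # Transformer stack in which the adapters of the lowest `n_drop` layers are skipped.
    def __init__(self, layers, hidden_size, n_drop=0):
        super().__init__()
        self.layers = layers                                   # frozen pretrained layers
        self.adapters = nn.ModuleList([Adapter(hidden_size) for _ in layers])
        self.n_drop = n_drop                                   # layers 0 .. n_drop-1 run without adapters

    def forward(self, h):
        for i, (layer, adapter) in enumerate(zip(self.layers, self.adapters)):
            h = layer(h)
            if i >= self.n_drop:                               # adapter is dropped from the lower layers
                h = adapter(h)
        return h

Because the lowest layers then carry no task-specific computation, their activations can be shared when running inference for several tasks at once, which is one source of the speedups reported in the paper.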
Citations

On the Effectiveness of Adapter-based Tuning for Pretrained Language Model Adaptation
TLDR
It is demonstrated that 1) adapter-based tuning outperforms fine-tuning on low-resource and cross-lingual tasks; 2) it is more robust to overfitting and less sensitive to changes in learning rates.
AdapterBias: Parameter-efficient Token-dependent Representation Shift for Adapters in NLP Tasks
TLDR
The proposed AdapterBias adds a token-dependent shift to the hidden output of transformer layers, adapting to downstream tasks with only a vector and a linear layer and thereby dramatically reducing the number of trainable parameters.
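A minimal sketch of that shift, assuming (as the description suggests) that each token's shift is a shared vector scaled by a weight produced from the token's hidden state by a single linear layer; names and shapes are illustrative.

import torch
import torch.nn as nn

class AdapterBiasShift(nn.Module):
    # Token-dependent representation shift: one shared vector, scaled per token.
    def __init__(self, hidden_size):
        super().__init__()
        self.v = nn.Parameter(torch.zeros(hidden_size))   # shared shift vector
        self.alpha = nn.Linear(hidden_size, 1)            # produces one scaling weight per token

    def forward(self, h):                                 # h: (batch, seq_len, hidden_size)
        scale = self.alpha(h)                             # (batch, seq_len, 1)
        return h + scale * self.v                         # broadcasts to a per-token shift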
Compacter: Efficient Low-Rank Hypercomplex Adapter Layers
TLDR
This work proposes COMPACTER, a method for fine-tuning large-scale language models with a better trade-off between task performance and the number of trainable parameters than prior work, and accomplishes this by building on top of ideas from adapters, low-rank optimization, and parameterized hypercomplex multiplication layers.
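A sketch of the kind of parameterized hypercomplex multiplication (PHM) layer Compacter builds its adapters from: the weight is a sum of Kronecker products, and the second factor of each product is itself low-rank. The shapes, initialization, and absence of cross-layer parameter sharing here are simplifications, not the paper's exact design.

import torch
import torch.nn as nn

class PHMLinear(nn.Module):
    # Linear map whose weight is W = sum_i kron(A_i, B_i), with each B_i = s_i @ t_i (low rank).
    def __init__(self, in_features, out_features, n=4, rank=1):
        super().__init__()
        assert in_features % n == 0 and out_features % n == 0
        self.A = nn.Parameter(torch.randn(n, n, n) * 0.01)                    # n small (n x n) factors
        self.s = nn.Parameter(torch.randn(n, in_features // n, rank) * 0.01)  # low-rank halves of B_i
        self.t = nn.Parameter(torch.randn(n, rank, out_features // n) * 0.01)
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x):                                  # x: (..., in_features)
        B = self.s @ self.t                                # (n, in/n, out/n)
        W = sum(torch.kron(self.A[i], B[i]) for i in range(self.A.shape[0]))
        return x @ W + self.bias                           # W: (in_features, out_features)

In the paper, layers of this kind replace the down- and up-projections inside bottleneck adapters, with the small factors shared across layers; the sketch above omits that sharing.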
AdapterHub Playground: Simple and Flexible Few-Shot Learning with Adapters
TLDR
The AdapterHub Playground provides an intuitive interface that allows adapters to be used for prediction, training, and analysis of textual data across a variety of NLP tasks; the paper presents the tool’s architecture and demonstrates its advantages with prototypical use cases.
Multilingual Domain Adaptation for NMT: Decoupling Language and Domain Information with Adapters
TLDR
This work studies the compositionality of language and domain adapters in the context of machine translation, aiming at parameter-efficient adaptation to multiple domains and languages simultaneously and at cross-lingual transfer in domains where parallel data is unavailable for certain language pairs.
MAD-G: Multilingual Adapter Generation for Efficient Cross-Lingual Transfer
TLDR
MAD-G (Multilingual ADapter Generation), which contextually generates language adapters from language representations based on typological features, offers substantial benefits for low-resource languages, particularly on the NER task in low-resource African languages.
Communication-Efficient Federated Learning for Neural Machine Translation
TLDR
This paper explores how to efficiently build NMT models in an FL setup by proposing a novel solution to reduce the communication overhead, and notes that models equipped with Controllers perform on par with those trained in a central, non-FL setting.
Training Mixed-Domain Translation Models via Federated Learning
TLDR
This work demonstrates that, with slight modifications to the training process, neural machine translation (NMT) engines can easily be adapted when FL-based aggregation is applied to fuse different domains, and it proposes a novel technique to dynamically control the communication bandwidth by selecting impactful parameters during FL updates.
Revisiting Pretraining with Adapters
TLDR
This work explores alternatives to full-scale task-specific pretraining of language models through the use of adapter modules, a parameter-efficient approach to transfer learning, and finds that adapter-based pretraining achieves results comparable to task-specific pretraining while using a fraction of the overall trainable parameters.

References

Showing 1-10 of 39 references.
Reducing Transformer Depth on Demand with Structured Dropout
TLDR
LayerDrop, a form of structured dropout, has a regularization effect during training and allows for efficient pruning at inference time; the work shows that sub-networks of any depth can be selected from one large network without fine-tuning them and with limited impact on performance.
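A sketch of the structured-dropout idea, assuming whole transformer layers are skipped with probability p during training so that a shallower sub-network can be selected at inference time without fine-tuning; the every-k-th-layer pruning rule below is only one illustrative choice.

import torch
import torch.nn as nn

class LayerDropStack(nn.Module):
    # Runs a stack of transformer layers, randomly skipping whole layers while training.
    def __init__(self, layers, p_drop=0.2):
        super().__init__()
        self.layers = layers
        self.p_drop = p_drop

    def forward(self, h, keep_every=1):
        for i, layer in enumerate(self.layers):
            if self.training and torch.rand(()) < self.p_drop:
                continue                                  # drop the whole layer during training
            if not self.training and i % keep_every != 0:
                continue                                  # prune to a shallower sub-network at inference
            h = layer(h)
        return h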
DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference
TLDR
This work proposes a simple but effective method, DeeBERT, to accelerate BERT inference, which allows samples to exit earlier without passing through the entire model, and provides new ideas to efficiently apply deep transformer-based models to downstream tasks.
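A sketch of entropy-based early exiting of the kind DeeBERT describes, assuming one classification "off-ramp" per layer and a single example per forward pass; the exit criterion, threshold, and use of the first token's representation are illustrative.

import torch
import torch.nn as nn

def entropy(probs):
    return -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1)

class EarlyExitEncoder(nn.Module):
    # One classification "off-ramp" per layer; exit as soon as the prediction looks confident.
    def __init__(self, layers, hidden_size, num_labels):
        super().__init__()
        self.layers = layers
        self.ramps = nn.ModuleList([nn.Linear(hidden_size, num_labels) for _ in layers])

    @torch.no_grad()
    def forward(self, h, threshold=0.1):                  # h: (1, seq_len, hidden), one example
        for layer, ramp in zip(self.layers, self.ramps):
            h = layer(h)
            probs = ramp(h[:, 0]).softmax(dim=-1)         # classify from the first token's state
            if entropy(probs).item() < threshold:
                return probs                              # confident enough: exit early
        return probs                                      # otherwise fall through to the last layer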
Depth-Adaptive Transformer
TLDR
This paper trains Transformer models which can make output predictions at different stages of the network and investigates different ways to predict how much computation is required for a particular sequence.
MAD-G: Multilingual Adapter Generation for Efficient Cross-Lingual Transfer
TLDR
MAD-G (Multilingual ADapter Generation), which contextually generates language adapters from language representations based on typological features, offers substantial benefits for low-resource languages, particularly on the NER task in low-resource African languages.
Parameter-Efficient Transfer Learning for NLP
TLDR
To demonstrate the adapters' effectiveness, the recently proposed BERT Transformer model is transferred to 26 diverse text classification tasks, including the GLUE benchmark, and adapters attain near state-of-the-art performance whilst adding only a few parameters per task.
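A sketch of the bottleneck adapter design introduced here, with small trainable modules inserted after both the attention and feed-forward sub-layers of each (otherwise frozen) transformer layer; the sub-layer stand-ins and bottleneck size are illustrative.

import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    # Adapter: project down, nonlinearity, project up, skip connection.
    def __init__(self, hidden, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden, bottleneck)
        self.up = nn.Linear(bottleneck, hidden)

    def forward(self, h):
        return h + self.up(torch.relu(self.down(h)))

class AdaptedTransformerLayer(nn.Module):
    # One transformer layer with adapters after the attention and feed-forward sub-layers.
    def __init__(self, attn, ffn, hidden):
        super().__init__()
        self.attn, self.ffn = attn, ffn                   # frozen pretrained sub-layers
        self.adapter_attn = Bottleneck(hidden)            # trainable
        self.adapter_ffn = Bottleneck(hidden)             # trainable
        self.ln1, self.ln2 = nn.LayerNorm(hidden), nn.LayerNorm(hidden)

    def forward(self, h):
        h = self.ln1(h + self.adapter_attn(self.attn(h)))
        h = self.ln2(h + self.adapter_ffn(self.ffn(h)))
        return h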
MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices
TLDR
MobileBERT is a thin version of BERT_LARGE equipped with bottleneck structures and a carefully designed balance between self-attention and feed-forward networks; it can be generically applied to various downstream NLP tasks via simple fine-tuning.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TLDR
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Orthogonal Language and Task Adapters in Zero-Shot Cross-Lingual Transfer
TLDR
This work proposes orthogonal language and task adapters (dubbed orthoadapters) for cross-lingual transfer that are trained to encode language- and task-specific information that is complementary to the knowledge already stored in the pretrained transformer's parameters.
BERT and PALs: Projected Attention Layers for Efficient Adaptation in Multi-Task Learning
TLDR
Using new adaptation modules, PALs or 'projected attention layers', this work matches the performance of separately fine-tuned models on the GLUE benchmark with roughly 7 times fewer parameters, and obtains state-of-the-art results on the Recognizing Textual Entailment dataset.
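A rough sketch of a projected attention layer: task-specific multi-head attention applied in a low-dimensional projection of the hidden states and added in parallel to the frozen pretrained layer. The dimensions, head count, and lack of cross-layer sharing of the projections are illustrative simplifications.

import torch
import torch.nn as nn

class ProjectedAttentionLayer(nn.Module):
    # Task-specific attention in a low-dimensional projection: PAL(h) = up(MultiHeadAttn(down(h))).
    def __init__(self, hidden=768, low_dim=204, heads=12):
        super().__init__()
        self.down = nn.Linear(hidden, low_dim, bias=False)   # project into the small space
        self.attn = nn.MultiheadAttention(low_dim, heads, batch_first=True)
        self.up = nn.Linear(low_dim, hidden, bias=False)     # project back up

    def forward(self, h):                                    # h: (batch, seq_len, hidden)
        z = self.down(h)
        z, _ = self.attn(z, z, z)                            # self-attention in the projected space
        return self.up(z)

# Used additively alongside the frozen pretrained layer, e.g.
#   out = layer_norm(h + self_attention(h) + pal(h))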
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
TLDR
This work proposes a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can be fine-tuned with good performance on a wide range of tasks like its larger counterparts, and introduces a triple loss combining language modeling, distillation, and cosine-distance losses.
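A sketch of the triple loss named above, combining masked-language-modeling cross-entropy, soft-target distillation against the teacher's logits, and a cosine loss aligning student and teacher hidden states; the weights and temperature are illustrative, not the released training configuration.

import torch
import torch.nn.functional as F

def distilbert_style_loss(student_logits, teacher_logits, labels,
                          student_hidden, teacher_hidden,
                          temperature=2.0, weights=(5.0, 2.0, 1.0)):
    # Masked-language-modeling loss on the hard labels (unmasked positions marked with -100).
    loss_mlm = F.cross_entropy(student_logits.reshape(-1, student_logits.size(-1)),
                               labels.reshape(-1), ignore_index=-100)
    # Distillation loss: match the teacher's softened output distribution.
    loss_ce = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                       F.softmax(teacher_logits / temperature, dim=-1),
                       reduction="batchmean") * temperature ** 2
    # Cosine loss: align the directions of student and teacher hidden states.
    s = student_hidden.reshape(-1, student_hidden.size(-1))
    t = teacher_hidden.reshape(-1, teacher_hidden.size(-1))
    loss_cos = F.cosine_embedding_loss(s, t, torch.ones(s.size(0), device=s.device))
    w_ce, w_mlm, w_cos = weights
    return w_ce * loss_ce + w_mlm * loss_mlm + w_cos * loss_cos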