LV-BERT: Exploiting Layer Variety for BERT

  title={LV-BERT: Exploiting Layer Variety for BERT},
  author={Weihao Yu and Zihang Jiang and Fei Chen and Qibin Hou and Jiashi Feng},
Modern pre-trained language models are mostly built upon backbones stacking self-attention and feed-forward layers in an interleaved order. In this paper, beyond this stereotyped layer pattern, we aim to improve pre-trained models by exploiting layer variety from two aspects: the layer type set and the layer order. Specifically, besides the original self-attention and feed-forward layers, we introduce convolution into the layer type set, which is experimentally found beneficial to pre-trained… Expand

Figures and Tables from this paper


BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks. Expand
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
This work presents two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT, and uses a self-supervised loss that focuses on modeling inter-sentence coherence. Expand
MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices
MobileBERT is a thin version of BERT_LARGE, while equipped with bottleneck structures and a carefully designed balance between self-attentions and feed-forward networks that can be generically applied to various downstream NLP tasks via simple fine-tuning. Expand
Attention is All you Need
A new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely is proposed, which generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data. Expand
SMASH: One-Shot Model Architecture Search through HyperNetworks
A technique to accelerate architecture selection by learning an auxiliary HyperNet that generates the weights of a main model conditioned on that model's architecture is proposed, achieving competitive performance with similarly-sized hand-designed networks. Expand
Single Path One-Shot Neural Architecture Search with Uniform Sampling
A Single Path One-Shot model is proposed to construct a simplified supernet, where all architectures are single paths so that weight co-adaption problem is alleviated. Expand
XLNet: Generalized Autoregressive Pretraining for Language Understanding
XLNet is proposed, a generalized autoregressive pretraining method that enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and overcomes the limitations of BERT thanks to its autore progressive formulation. Expand
Language Modeling with Gated Convolutional Networks
A finite context approach through stacked convolutions, which can be more efficient since they allow parallelization over sequential tokens, is developed and is the first time a non-recurrent approach is competitive with strong recurrent models on these large scale language tasks. Expand
Improving Transformer Models by Reordering their Sublayers
This work proposes a new transformer pattern that adheres to this property, the sandwich transformer, and shows that it improves perplexity on multiple word-level and character-level language modeling benchmarks, at no cost in parameters, memory, or training time. Expand
ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
The contextual representations learned by the proposed replaced token detection pre-training task substantially outperform the ones learned by methods such as BERT and XLNet given the same model size, data, and compute. Expand