ERNIE-SPARSE: Learning Hierarchical Efficient Transformer Through Regularized Self-Attention
@article{Liu2022ERNIESPARSELH,
  title   = {ERNIE-SPARSE: Learning Hierarchical Efficient Transformer Through Regularized Self-Attention},
  author  = {Yang Liu and Jiaxiang Liu and Li Jie Chen and Yuxiang Lu and Shi Feng and Zhidan Feng and Yu Sun and Hao Tian and Huancheng Wu and Hai-feng Wang},
  journal = {ArXiv},
  year    = {2022},
  volume  = {abs/2203.12276}
}
The Sparse Transformer has recently attracted a lot of attention owing to its ability to reduce the quadratic dependency on the sequence length. We argue that two factors, information bottleneck sensitivity and inconsistency between different attention topologies, could affect the performance of the Sparse Transformer. This paper proposes a well-designed model named ERNIE-SPARSE. It consists of two distinctive parts: (i) Hierarchical Sparse Transformer…
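The abstract is cut off above, but the two ingredients it names (a hierarchical sparse attention pattern and a regularizer over attention topologies) follow a pattern common to sparse Transformers. As a rough illustration only, and not the paper's released code, the sketch below builds a generic block-local attention mask augmented with a few globally visible tokens; the function name and the exact topology are assumptions made for illustration.

```python
import torch

def block_local_global_mask(seq_len: int, block_size: int, num_global: int) -> torch.Tensor:
    """Boolean attention mask: tokens attend within their own block, and the
    first `num_global` tokens attend to (and are attended by) every position.
    Generic sparse-attention pattern; not necessarily ERNIE-SPARSE's exact topology."""
    block_ids = torch.arange(seq_len) // block_size
    mask = block_ids[:, None] == block_ids[None, :]   # block-diagonal local attention
    mask[:num_global, :] = True                       # global tokens see everything
    mask[:, :num_global] = True                       # everyone sees the global tokens
    return mask

# Example: 16 tokens, blocks of 4, 2 global tokens -> a (16, 16) boolean mask
mask = block_local_global_mask(seq_len=16, block_size=4, num_global=2)
```

In practice a mask like this is handed to the attention layer so that disallowed query/key pairs are set to a large negative value before the softmax.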
3 Citations
An Exploration of Hierarchical Attention Transformers for Efficient Long Document Classification
- Computer Science · ArXiv
- 2022
This work develops and releases fully pre-trained HAT models that use segment-wise followed by cross-segment encoders and compares them with Longformer models and partially pre-trained HATs, finding that HATs perform best with cross-segment contextualization throughout the model rather than with alternative configurations that implement either early or late cross-segment contextualization.
Recurrent Memory Transformer
- Computer Science · ArXiv
- 2022
The Recurrent Memory Transformer is a promising architecture for applications that require learning of long-term dependencies and general-purpose memory processing, such as algorithmic tasks and reasoning.
On Learning the Transformer Kernel
- Computer Science · ArXiv
- 2021
KL-Transformer is introduced, a generic, scalable, data-driven framework for learning the kernel function in Transformers that approximates the Transformer kernel as a dot product between spectral feature maps and learns the kernel by learning the spectral distribution.
References
SHOWING 1-10 OF 51 REFERENCES
RoBERTa: A Robustly Optimized BERT Pretraining Approach
- Computer Science · ArXiv
- 2019
It is found that BERT was significantly undertrained and, with better training, can match or exceed the performance of every model published after it; the best model achieves state-of-the-art results on GLUE, RACE and SQuAD.
ListOps: A Diagnostic Dataset for Latent Tree Learning
- Computer Science · NAACL
- 2018
It is shown that the current leading latent tree models are unable to learn to parse and succeed at ListOps, a toy dataset created to study the parsing ability of latent tree models.
Constructing Datasets for Multi-hop Reading Comprehension Across Documents
- Computer Science · TACL
- 2018
A novel task to encourage the development of models for text understanding across multiple documents and to investigate the limits of existing methods, in which a model learns to seek and combine evidence, effectively performing multi-hop (i.e., multi-step) inference.
R-Drop: Regularized Dropout for Neural Networks
- Computer Science · NeurIPS
- 2021
A simple consistency training strategy to regularize dropout, namely R-Drop, which forces the output distributions of different sub-models generated by dropout to be consistent with each other.
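Since R-Drop is one of the consistency-regularization techniques this line of work builds on, a minimal sketch of its loss may help; this is a paraphrase of the published idea, not the authors' implementation, and `kl_weight` is an arbitrary placeholder.

```python
import torch.nn.functional as F

def r_drop_loss(model, inputs, labels, kl_weight=1.0):
    """Sketch of the R-Drop objective: two stochastic forward passes of the same
    batch (dropout makes them differ) plus a bidirectional KL consistency term."""
    logits1 = model(inputs)  # model must be in train mode so dropout is active
    logits2 = model(inputs)
    ce = 0.5 * (F.cross_entropy(logits1, labels) + F.cross_entropy(logits2, labels))
    log_p1 = F.log_softmax(logits1, dim=-1)
    log_p2 = F.log_softmax(logits2, dim=-1)
    kl = 0.5 * (F.kl_div(log_p1, log_p2, reduction="batchmean", log_target=True)
                + F.kl_div(log_p2, log_p1, reduction="batchmean", log_target=True))
    return ce + kl_weight * kl
```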
Longformer: The Long-Document Transformer
- Computer Science · ArXiv
- 2020
Following prior work on long-sequence transformers, Longformer is evaluated on character-level language modeling, achieving state-of-the-art results on text8 and enwik8; it is also pretrained and finetuned on a variety of downstream tasks.
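Longformer's headline component, sliding-window attention, amounts to a banded attention mask. The snippet below is a minimal, self-contained illustration of that pattern (not Longformer's optimized kernels); `window` is an arbitrary example value.

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask where position i may attend to positions j with |i - j| <= window."""
    idx = torch.arange(seq_len)
    return (idx[None, :] - idx[:, None]).abs() <= window

# Example: 8 tokens with a window of 2 -> a banded (8, 8) mask
mask = sliding_window_mask(seq_len=8, window=2)
```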
ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
- Computer Science · ICLR
- 2020
The contextual representations learned by the proposed replaced token detection pre-training task substantially outperform the ones learned by methods such as BERT and XLNet given the same model size, data, and compute.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- Computer Science · NAACL
- 2019
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Long Range Arena: A Benchmark for Efficient Transformers
- Computer Science · ICLR
- 2021
A systematic and unified benchmark, LRA, specifically focused on evaluating model quality under long-context scenarios, paves the way towards better understanding this class of efficient Transformer models, facilitates more research in this direction, and presents new challenging tasks to tackle.
ETC: Encoding Long and Structured Data in Transformers
- Computer Science · ArXiv
- 2020
A new family of Transformer models is presented, which is called the Extended Transformer Construction (ETC), that allows for significant increases in input sequence length by introducing a new global-local attention mechanism between a global memory and the standard input tokens.
Blockwise Self-Attention for Long Document Understanding
- Computer Science · Findings of EMNLP
- 2020
This model extends BERT by introducing sparse block structures into the attention matrix to reduce both memory consumption and training/inference time, which also enables attention heads to capture either short- or long-range contextual information.
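The memory saving from block-diagonal sparsity is easy to check with back-of-the-envelope arithmetic; the helper below is illustrative only and ignores heads, dtype, and implementation details.

```python
def attention_entries(seq_len: int, block_size: int):
    """Count attention-matrix entries: full self-attention vs. block-diagonal sparsity.
    Illustrative arithmetic only, not tied to any particular implementation."""
    dense = seq_len * seq_len         # full attention scales as n^2
    blockwise = seq_len * block_size  # n/b blocks, each of size b*b
    return dense, blockwise

# Example: seq_len=4096, block_size=512 -> 16,777,216 vs. 2,097,152 entries (8x fewer)
print(attention_entries(4096, 512))
```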