ERNIE-SPARSE: Learning Hierarchical Efficient Transformer Through Regularized Self-Attention

Yang Liu, Jiaxiang Liu, Li Jie Chen, Yuxiang Lu, Shi Feng, Zhidan Feng, Yu Sun, Hao Tian, Huancheng Wu, and Hai-feng Wang

Sparse Transformer has recently attracted a lot of attention owing to its ability to reduce the quadratic dependency on sequence length. We argue that two factors, information bottleneck sensitivity and inconsistency between different attention topologies, could affect the performance of the Sparse Transformer. This paper proposes a well-designed model named ERNIE-SPARSE. It consists of two distinctive parts: (i) Hierarchical Sparse Transformer…

An Exploration of Hierarchical Attention Transformers for Efficient Long Document Classification

This work develops and releases fully pre-trained HAT models that use segment-wise followed by cross-segment encoders, and compares them with Longformer models and partially pre-trained HATs, finding that HATs perform best with cross-segment contextualization throughout the model rather than alternative configurations that implement either early or late cross-segment contextualization.

On Learning the Transformer Kernel

KL-TRANSFORMER is introduced, a generic, scalable, data-driven framework for learning the kernel function in Transformers; it approximates the Transformer kernel as a dot product between spectral feature maps and learns the kernel by learning the spectral distribution.
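The "kernel as a dot product between spectral feature maps" idea can be illustrated with generic random Fourier features; this is a minimal sketch of that general construction, not the paper's exact parameterization, and all names (`spectral_feature_map`, `omega`) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def spectral_feature_map(x, omega):
    """Random-Fourier-style features: the kernel k(q, k') is approximated
    as phi(q) . phi(k'). Making the spectral samples `omega` learnable
    amounts to learning the kernel itself."""
    proj = x @ omega.T
    d = omega.shape[0]
    return np.concatenate([np.cos(proj), np.sin(proj)], axis=-1) / np.sqrt(d)

q = rng.normal(size=(4, 8))       # 4 queries, head dim 8
k = rng.normal(size=(5, 8))       # 5 keys
omega = rng.normal(size=(16, 8))  # samples from the (learnable) spectral distribution

phi_q = spectral_feature_map(q, omega)
phi_k = spectral_feature_map(k, omega)
# Kernel matrix via feature maps; with feature maps, attention can be
# rearranged to cost linear (not quadratic) in sequence length.
approx_kernel = phi_q @ phi_k.T   # shape (4, 5)
```

Because cos² + sin² = 1, each feature vector has unit self-similarity, which is a quick sanity check on the normalization.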

RoBERTa: A Robustly Optimized BERT Pretraining Approach

It is found that BERT was significantly undertrained, and can match or exceed the performance of every model published after it, and the best model achieves state-of-the-art results on GLUE, RACE and SQuAD.

ListOps: A Diagnostic Dataset for Latent Tree Learning

It is shown that the current leading latent tree models are unable to learn to parse and succeed at ListOps, a toy dataset created to study the parsing ability of latent tree models.
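For concreteness, ListOps examples are bracketed prefix expressions over single digits with operators such as MIN, MAX, MED, and SM (sum modulo 10). The following toy evaluator, a sketch assuming space-separated tokens in that style, shows what solving an instance entails:

```python
import statistics

def eval_listops(expr):
    """Evaluate a ListOps-style prefix expression,
    e.g. '[MAX 2 9 [MIN 4 7 ] 0 ]' evaluates to 9."""
    ops = {
        "[MIN": min,
        "[MAX": max,
        "[MED": lambda xs: int(statistics.median(xs)),
        "[SM": lambda xs: sum(xs) % 10,   # sum modulo 10
    }
    tokens = expr.split()

    def parse(i):
        # An operator token opens a sub-list; collect args until ']'.
        if tokens[i] in ops:
            op, args, i = ops[tokens[i]], [], i + 1
            while tokens[i] != "]":
                val, i = parse(i)
                args.append(val)
            return op(args), i + 1
        return int(tokens[i]), i + 1      # a digit leaf

    value, _ = parse(0)
    return value

print(eval_listops("[MAX 2 9 [MIN 4 7 ] 0 ]"))  # → 9
```

A latent tree model succeeds only if it recovers this nesting structure from the token sequence alone.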

Constructing Datasets for Multi-hop Reading Comprehension Across Documents

A novel task to encourage the development of models for text understanding across multiple documents and to investigate the limits of existing methods, in which a model learns to seek and combine evidence, effectively performing multi-hop (alias multi-step) inference.

R-Drop: Regularized Dropout for Neural Networks

A simple consistency training strategy to regularize dropout, namely R-Drop, which forces the output distributions of different sub-models generated by dropout to be consistent with each other.
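The R-Drop idea is compact enough to sketch: run the same input through the model twice (two dropout masks give two sub-models) and penalize the KL divergence between the two output distributions. A minimal numpy sketch, with a toy linear-softmax "model" standing in for a real network:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(x, w, drop_rate=0.1):
    """Toy model: inverted dropout on the input, then linear + softmax.
    Each call samples a fresh dropout mask, yielding a different sub-model."""
    mask = rng.random(x.shape) >= drop_rate
    h = (x * mask) / (1.0 - drop_rate)
    logits = h @ w
    e = np.exp(logits - logits.max())
    return e / e.sum()

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

x = rng.normal(size=4)
w = rng.normal(size=(4, 3))

# Two stochastic forward passes over the SAME input.
p1 = forward(x, w)
p2 = forward(x, w)

# R-Drop adds a symmetric KL term to the usual task loss so the two
# dropout sub-models agree on their output distribution.
consistency_loss = 0.5 * (kl(p1, p2) + kl(p2, p1))
```

In training, this term is scaled by a coefficient and added to the cross-entropy of both passes.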

Longformer: The Long-Document Transformer

Following prior work on long-sequence transformers, the Longformer is evaluated on character-level language modeling, achieving state-of-the-art results on text8 and enwik8; the authors also pretrain Longformer and finetune it on a variety of downstream tasks.
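Longformer's efficiency comes from its attention pattern: a sliding window of local attention plus a few tokens with global attention. A minimal numpy sketch of that mask (the names `window` and `global_idx` are illustrative, not the library's API):

```python
import numpy as np

def sliding_window_mask(seq_len, window, global_idx=()):
    """Boolean attention mask: True where attention is allowed.
    Each token attends to neighbors within `window` positions; tokens in
    `global_idx` attend to, and are attended by, every position."""
    i = np.arange(seq_len)
    mask = np.abs(i[:, None] - i[None, :]) <= window
    for g in global_idx:
        mask[g, :] = True   # global token sees everything
        mask[:, g] = True   # everything sees the global token
    return mask

# 8 tokens, window of 1, token 0 (e.g. [CLS]) is global.
m = sliding_window_mask(8, window=1, global_idx=(0,))
```

The number of allowed pairs grows as O(seq_len × window) rather than O(seq_len²), which is what removes the quadratic dependency.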

ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

The contextual representations learned by the proposed replaced token detection pre-training task substantially outperform the ones learned by methods such as BERT and XLNet given the same model size, data, and compute.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

Long Range Arena: A Benchmark for Efficient Transformers

A systematic and unified benchmark, LRA, specifically focused on evaluating model quality under long-context scenarios, paves the way towards better understanding this class of efficient Transformer models, facilitates more research in this direction, and presents new challenging tasks to tackle.

ETC: Encoding Long and Structured Data in Transformers

A new family of Transformer models is presented, called the Extended Transformer Construction (ETC), that allows for significant increases in input sequence length by introducing a new global-local attention mechanism between a global memory and the standard input tokens.

Blockwise Self-Attention for Long Document Understanding

This model extends BERT by introducing sparse block structures into the attention matrix to reduce both memory consumption and training/inference time, which also enables attention heads to capture either short- or long-range contextual information.
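The "sparse block structures" can be pictured as a block-permuted attention mask: the sequence is split into blocks, and each query block attends to a single key block chosen by a permutation. A minimal sketch under that reading (the `perm` parameterization is illustrative):

```python
import numpy as np

def block_mask(seq_len, block, perm):
    """Blockwise attention mask: query block i may attend only to key
    block perm[i]. The identity permutation gives local (short-range)
    attention; a shifted permutation gives a head long-range context."""
    n = seq_len // block
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(n):
        j = perm[i]
        mask[i * block:(i + 1) * block, j * block:(j + 1) * block] = True
    return mask

local = block_mask(6, 2, perm=[0, 1, 2])    # each block attends to itself
shifted = block_mask(6, 2, perm=[1, 2, 0])  # each block attends to the next
```

Only one block per row is nonzero, so memory drops from O(seq_len²) to O(seq_len × block), and different permutations across heads let the model mix short- and long-range information.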