Corpus ID: 219531210

DeBERTa: Decoding-enhanced BERT with Disentangled Attention

@article{He2021DeBERTaDB,
  title={DeBERTa: Decoding-enhanced BERT with Disentangled Attention},
  author={Pengcheng He and Xiaodong Liu and Jianfeng Gao and Weizhu Chen},
  journal={ArXiv},
  year={2021},
  volume={abs/2006.03654}
}
Recent progress in pre-trained neural language models has significantly improved the performance of many natural language processing (NLP) tasks. In this paper we propose a new model architecture DeBERTa (Decoding-enhanced BERT with disentangled attention) that improves the BERT and RoBERTa models using two novel techniques. The first is the disentangled attention mechanism, where each word is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions, respectively. The second is an enhanced mask decoder that incorporates absolute positions in the decoding layer to predict the masked tokens in model pre-training.
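
The disentangled attention described in the abstract decomposes each attention score into content-to-content, content-to-position, and position-to-content terms. Below is a minimal sketch of that decomposition for a single attention head; the variable names, the relative-distance index construction, and the 1/sqrt(3d) scaling are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of disentangled attention logits (single head).
# Illustrative only: names, shapes, and the relative-index construction
# are assumptions, not the DeBERTa codebase.
import math
import torch

def disentangled_attention_scores(Hq, Hk, Pq, Pk, rel_idx):
    """Hq, Hk: content projections of queries/keys, shape (L, d).
    Pq, Pk: projections of relative-position embeddings, shape (2*k, d),
            one row per clipped relative distance.
    rel_idx: (L, L) long tensor; rel_idx[i, j] is the clipped relative
             distance bucket of position j with respect to position i.
    """
    c2c = Hq @ Hk.t()                                # content-to-content
    c2p = torch.gather(Hq @ Pk.t(), 1, rel_idx)      # content-to-position
    p2c = torch.gather(Hk @ Pq.t(), 1, rel_idx).t()  # position-to-content
    d = Hq.size(-1)
    return (c2c + c2p + p2c) / math.sqrt(3 * d)      # (L, L) attention logits
```

In the full model these logits would be softmax-normalized per head and applied to content values; the enhanced mask decoder, which injects absolute positions before masked-token prediction, is not shown here.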

Citations

A Simple but Tough-to-Beat Data Augmentation Approach for Natural Language Understanding and Generation
TLDR
A set of simple yet effective data augmentation strategies dubbed cutoff is proposed, in which part of the information within an input sentence is erased to yield restricted views of it during the fine-tuning stage; cutoff consistently outperforms adversarial training and achieves state-of-the-art results on the IWSLT2014 German-English dataset.
Finetuning Pretrained Transformers into RNNs
TLDR
This work proposes a swap-then-finetune procedure: in an off-the-shelf pretrained transformer, the softmax attention is replaced with a linear-complexity recurrent alternative and the model is then finetuned, which provides an improved tradeoff between efficiency and accuracy over the standard transformer and other recurrent variants.
XTREME-R: Towards More Challenging and Nuanced Multilingual Evaluation
TLDR
This paper analyzes the current state of cross-lingual transfer learning, summarizes some lessons learned, and provides a massively multilingual diagnostic suite (MULTICHECKLIST) and fine-grained multi-dataset evaluation capabilities through an interactive public leaderboard to gain a better understanding of such models.
COCO-LM: Correcting and Contrasting Text Sequences for Language Model Pretraining
TLDR
COCO-LM outperforms recent pretraining approaches in various pretraining settings and few-shot evaluations, with higher pretraining efficiency, and its advantages come from its challenging training signals, more contextualized token representations, and regularized sequence representations.
ZJUKLAB at SemEval-2021 Task 4: Negative Augmentation with Language Model for Reading Comprehension of Abstract Meaning
TLDR
This paper presents systems for the three subtasks of SemEval-2021 Task 4: Reading Comprehension of Abstract Meaning (ReCAM), and proposes a simple yet effective technique, namely negative augmentation with a language model.
Fully-Explored Masked Language Model
The Masked Language Model (MLM) framework has been widely adopted for self-supervised language pre-training. In this paper, we argue that randomly sampled masks in MLM would lead to undesirably large gradient variance.
Robustly Optimized and Distilled Training for Natural Language Understanding
TLDR
This paper uses multi-task learning (MTL) enhanced representations across several natural language understanding tasks to improve performance and generalization, incorporates knowledge distillation (KD) in MTL to further boost performance, and devises a KD variant that learns effectively from multiple teachers.
Interpreting A Pre-trained Model Is A Key For Model Architecture Optimization: A Case Study On Wav2Vec 2.0
TLDR
An innovative perspective for analyzing attention patterns is proposed: summarize block-level patterns and assume that abnormal patterns contribute a negative influence; this analysis identifies avoiding abnormal patterns as the main contributor to the performance boost.
Using Prior Knowledge to Guide BERT’s Attention in Semantic Textual Matching Tasks
TLDR
This work directly injects knowledge into BERT’s multi-head attention mechanism and is able to consistently improve semantic textual matching performance over the original BERT model, and the performance benefit is most salient when training data is scarce.
Improving Self-supervised Pre-training via a Fully-Explored Masked Language Model
TLDR
It is proved, from a theoretical perspective, that the gradients derived from this new masking schema have a smaller variance and can lead to more efficient self-supervised training.

References

Showing 1-10 of 53 references
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
TLDR
This work presents two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT, and uses a self-supervised loss that focuses on modeling inter-sentence coherence.
Multi-Task Deep Neural Networks for Natural Language Understanding
TLDR
A Multi-Task Deep Neural Network (MT-DNN) is proposed for learning representations across multiple natural language understanding (NLU) tasks; it allows domain adaptation with substantially fewer in-domain labels than the pre-trained BERT representations.
StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding
TLDR
Inspired by the linearization exploration work of Elman, BERT is extended to a new model, StructBERT, by incorporating language structures into pre-training, and the new model is adapted to different levels of language understanding required by downstream tasks.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TLDR
A new language representation model, BERT, is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; it can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Attention is All you Need
TLDR
A new, simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.
Unified Language Model Pre-training for Natural Language Understanding and Generation
TLDR
A new Unified pre-trained Language Model (UniLM) that can be fine-tuned for both natural language understanding and generation tasks is proposed; it compares favorably with BERT on the GLUE benchmark and on the SQuAD 2.0 and CoQA question answering tasks.
Self-Attention with Relative Position Representations
TLDR
This work presents an alternative approach, extending the self-attention mechanism to efficiently consider representations of the relative positions, or distances between sequence elements, on the WMT 2014 English-to-German and English-to-French translation tasks. (A simplified sketch of this relative-position attention idea is given after this reference list.)
ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
TLDR
The contextual representations learned by the proposed replaced token detection pre-training task substantially outperform the ones learned by methods such as BERT and XLNet given the same model size, data, and compute.
Language Models are Unsupervised Multitask Learners
TLDR
It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
XLNet: Generalized Autoregressive Pretraining for Language Understanding
TLDR
XLNet is proposed, a generalized autoregressive pretraining method that enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and overcomes the limitations of BERT thanks to its autoregressive formulation.
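
As a companion to the "Self-Attention with Relative Position Representations" reference above, the following is a minimal sketch of self-attention augmented with learned relative-position embeddings on the keys and values. The shapes, clipping scheme, and variable names are assumptions made for illustration; this shows the general idea rather than the cited paper's code.

```python
# Minimal sketch of self-attention with relative position representations.
# Illustrative assumptions: single head, learned embeddings rel_k/rel_v indexed
# by clipped relative distance, no masking or dropout.
import torch
import torch.nn.functional as F

def relative_self_attention(Q, K, V, rel_k, rel_v, rel_idx):
    """Q, K, V: (L, d) query/key/value projections.
    rel_k, rel_v: (2*k + 1, d) embeddings for clipped relative distances.
    rel_idx: (L, L) long tensor, rel_idx[i, j] = clip(j - i, -k, k) + k.
    """
    d = Q.size(-1)
    # content score plus a relative-position term on the keys
    scores = Q @ K.t() + torch.einsum('id,ijd->ij', Q, rel_k[rel_idx])
    attn = F.softmax(scores / d ** 0.5, dim=-1)
    # weighted sum of values plus a relative-position term on the values
    return attn @ V + torch.einsum('ij,ijd->id', attn, rel_v[rel_idx])
```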