Transformers: State-of-the-Art Natural Language Processing

@inproceedings{Wolf2020TransformersSN,
  title={Transformers: State-of-the-Art Natural Language Processing},
  author={Thomas Wolf and Lysandre Debut and Victor Sanh and Julien Chaumond and Clement Delangue and Anthony Moi and Pierric Cistac and Tim Rault and R{\'e}mi Louf and Morgan Funtowicz and Jamie Brew},
  booktitle={EMNLP},
  year={2020}
}
Recent progress in natural language processing has been driven by advances in both model architecture and model pretraining. Transformer architectures have facilitated building higher-capacity models and pretraining has made it possible to effectively utilize this capacity for a wide variety of tasks. Transformers is an open-source library with the goal of opening up these advances to the wider machine learning community. The library consists of carefully engineered state-of-the-art Transformer…
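
As a rough illustration of the library described in the abstract, the sketch below loads a pretrained model through the high-level pipeline API; the specific checkpoint name is an assumption, and any compatible hosted checkpoint would work.

```python
# Minimal, illustrative usage of the Transformers library (not taken from the
# paper); the checkpoint name is an assumption, any compatible one would work.
from transformers import pipeline

# Download a pretrained model and tokenizer and wrap them in a task pipeline.
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

print(classifier("Transformers makes pretrained models easy to reuse."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```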

Citations

AMMUS : A Survey of Transformer-based Pretrained Models in Natural Language Processing

TLDR
This comprehensive survey explains core concepts such as pretraining, pretraining methods, pretraining tasks, embeddings, and downstream adaptation methods; presents a new taxonomy of transformer-based pretrained language models (T-PTLMs); and gives a brief overview of various benchmarks, both intrinsic and extrinsic.

Database Tuning using Natural Language Processing

TLDR
These advances motivate new use cases for NLP methods in the context of databases, and large language models that have been pre-trained on generic tasks are used as a starting point.

Pre-Training a Graph Recurrent Network for Language Representation

TLDR
A graph recurrent network for language model pre-training is proposed, which builds a graph structure for each sequence with local token-level communications together with a sentence-level representation decoupled from other tokens, and which generates more diverse outputs with less contextualized feature redundancy than existing attention-based models.

Efficient Sparsely Activated Transformers

TLDR
A novel system named PLANER is introduced that takes an existing Transformer-based network and a user-defined latency target and produces an optimized, sparsely-activated version of the original network that tries to meet the latency target while maintaining baseline accuracy.

Explaining transformer-based models for automatic short answer grading

TLDR
This work proposes a framework by which this decision can be made, and assesses several popular transformer-based models with various explainability methods on the widely used benchmark dataset from SemEval-2013.

Utilizing Bidirectional Encoder Representations from Transformers for Answer Selection

TLDR
This paper adopts the pre-trained Bidirectional Encoder Representations from Transformers (BERT) language model, shows that fine-tuning BERT for the answer selection task is very effective, and observes a maximum improvement on the QA and CQA datasets compared to previous state-of-the-art models.

Sequence-to-Sequence Lexical Normalization with Multilingual Transformers

TLDR
This work proposes a sentence-level sequence-to-sequence model based on mBART, which frames the problem as a machine translation problem, and improves performance on extrinsic, downstream tasks through normalization compared to models operating on raw, unprocessed social media text.

AutoTinyBERT: Automatic Hyper-parameter Optimization for Efficient Pre-trained Language Models

TLDR
This paper carefully designs one-shot learning techniques and the search space to provide an adaptive and efficient way of developing tiny PLMs for various latency constraints, and proposes a development method that is even faster than developing a single PLM.

Exploring Transformers in Natural Language Generation: GPT, BERT, and XLNet

TLDR
Three major Transformer-based models, namely GPT, BERT, and XLNet, are explored, all of which carry significant implications for the field of Natural Language Generation.

Long-Span Summarization via Local Attention and Content Selection

TLDR
This work exploits large pre-trained transformer-based models and addresses long-span dependencies in abstractive summarization using two methods, local self-attention and explicit content selection, which can achieve results comparable to or better than existing approaches.
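
A minimal sketch of the local self-attention idea named in this summary (not the paper's implementation): each query position is only allowed to attend to keys within a fixed window, expressed here as an additive attention mask; the window size and masking convention are assumptions.

```python
# Banded additive mask for local self-attention: positions farther apart than
# `window` receive -inf and are ignored by the softmax. Sizes are illustrative.
import torch

def local_attention_mask(seq_len: int, window: int) -> torch.Tensor:
    idx = torch.arange(seq_len)
    allowed = (idx[None, :] - idx[:, None]).abs() <= window
    mask = torch.zeros(seq_len, seq_len)
    mask.masked_fill_(~allowed, float("-inf"))  # add this to attention scores
    return mask

print(local_attention_mask(seq_len=6, window=2))
```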
...

References

Showing 1-10 of 48 references

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

TLDR
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
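
A minimal sketch of the "one additional output layer" fine-tuning recipe, written with the Transformers library this page documents; the toy batch, labels, and settings are placeholders, not the paper's setup.

```python
# Fine-tuning sketch: a fresh classification head on top of pretrained BERT.
# The toy batch and labels below are placeholders, not the paper's data.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # adds one randomly initialized output layer
)

batch = tokenizer(["a great movie", "a dull movie"],
                  padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])

outputs = model(**batch, labels=labels)  # cross-entropy loss on the new head
outputs.loss.backward()                  # gradients flow through all BERT layers
```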

AllenNLP: A Deep Semantic Natural Language Processing Platform

TLDR
AllenNLP, a library for applying deep learning methods to NLP research, is described; it provides easy-to-use command-line tools, declarative configuration-driven experiments, and modular NLP abstractions.

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

TLDR
This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.

Reformer: The Efficient Transformer

TLDR
This work replaces dot-product attention with one that uses locality-sensitive hashing, and uses reversible residual layers instead of standard residuals, which allows storing activations only once during training instead of several times, making the model much more memory-efficient and much faster on long sequences.
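
A minimal sketch of the reversible-residual idea mentioned above: because the layer inputs can be recomputed exactly from its outputs, activations do not have to be stored for the backward pass. F and G stand in for the attention and feed-forward sub-layers and are placeholders here.

```python
# Reversible residual couple: forward is y1 = x1 + F(x2), y2 = x2 + G(y1);
# the inverse recovers x1, x2 exactly, so activations need not be stored.
import torch

def reversible_forward(x1, x2, F, G):
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def reversible_inverse(y1, y2, F, G):
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

F = torch.nn.Linear(8, 8)  # placeholder for the attention sub-layer
G = torch.nn.Linear(8, 8)  # placeholder for the feed-forward sub-layer
x1, x2 = torch.randn(4, 8), torch.randn(4, 8)
y1, y2 = reversible_forward(x1, x2, F, G)
r1, r2 = reversible_inverse(y1, y2, F, G)
assert torch.allclose(x1, r1, atol=1e-5) and torch.allclose(x2, r2, atol=1e-5)
```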

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

TLDR
This work proposes a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can be fine-tuned with good performance on a wide range of tasks like its larger counterparts, and introduces a triple loss combining language modeling, distillation, and cosine-distance losses.
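
A minimal sketch of the triple loss named in this summary, combining a masked-LM term, a soft-target distillation term, and a cosine term on the hidden states; the temperature and the equal loss weights are illustrative assumptions, not the paper's values.

```python
# Triple loss sketch: masked-LM + temperature-scaled distillation + cosine
# alignment of hidden states. Weights and temperature are assumptions.
import torch
import torch.nn.functional as F

def triple_loss(student_logits, teacher_logits, student_hidden, teacher_hidden,
                mlm_labels, temperature=2.0):
    vocab = student_logits.size(-1)
    # Hard-label masked-LM loss (positions labelled -100 are ignored).
    mlm = F.cross_entropy(student_logits.view(-1, vocab),
                          mlm_labels.view(-1), ignore_index=-100)
    # Soft-target loss: KL between temperature-scaled student/teacher distributions.
    kd = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                  F.softmax(teacher_logits / temperature, dim=-1),
                  reduction="batchmean") * temperature ** 2
    # Cosine loss aligning the direction of student and teacher hidden states.
    d = student_hidden.size(-1)
    target = torch.ones(student_hidden.numel() // d)
    cos = F.cosine_embedding_loss(student_hidden.view(-1, d),
                                  teacher_hidden.view(-1, d), target)
    return mlm + kd + cos  # equal weighting is an assumption
```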

SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems

TLDR
A new benchmark styled after GLUE is presented, comprising a new set of more difficult language understanding tasks, a software toolkit, and a public leaderboard.

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

TLDR
This work presents two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT, and uses a self-supervised loss that focuses on modeling inter-sentence coherence.
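
A minimal sketch of one of the parameter-reduction techniques ALBERT is known for, factorized embedding parameterization: tokens are first embedded into a small size E and then projected up to the hidden size H, shrinking the embedding table. The sizes below are illustrative assumptions.

```python
# Factorized embedding: vocab -> E -> H uses far fewer parameters than vocab -> H.
import torch.nn as nn

vocab_size, E, H = 30000, 128, 768  # illustrative sizes

factorized = nn.Sequential(
    nn.Embedding(vocab_size, E),   # vocab_size * E parameters
    nn.Linear(E, H, bias=False),   # E * H parameters
)
dense = nn.Embedding(vocab_size, H)  # vocab_size * H parameters

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(factorized), "vs", count(dense))  # ~3.9M vs ~23.0M parameters
```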

FLAIR: An Easy-to-Use Framework for State-of-the-Art NLP

TLDR
The core idea of the FLAIR framework is to present a simple, unified interface for conceptually very different types of word and document embeddings, which effectively hides all embedding-specific engineering complexity and allows researchers to “mix and match” various embeddings with little effort.

FlauBERT: Unsupervised Language Model Pre-training for French

TLDR
This paper introduces and shares FlauBERT, a model learned on a very large and heterogeneous French corpus, applies it to diverse NLP tasks, and shows that it outperforms other pre-training approaches most of the time.

Attention is All you Need

TLDR
A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.
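
A minimal sketch of the scaled dot-product attention at the core of the Transformer, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V; multi-head projections, masking, and dropout are omitted, and the tensor shapes are illustrative.

```python
# Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
import math
import torch

def scaled_dot_product_attention(q, k, v):
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (..., len_q, len_k)
    weights = torch.softmax(scores, dim=-1)            # rows sum to 1
    return weights @ v                                  # (..., len_q, d_v)

q = torch.randn(2, 5, 64)  # (batch, query positions, d_k)
k = torch.randn(2, 7, 64)  # (batch, key positions, d_k)
v = torch.randn(2, 7, 64)  # (batch, key positions, d_v)
print(scaled_dot_product_attention(q, k, v).shape)  # torch.Size([2, 5, 64])
```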