• Corpus ID: 204509627

# HuggingFace's Transformers: State-of-the-art Natural Language Processing

@article{Wolf2019HuggingFacesTS,
title={HuggingFace's Transformers: State-of-the-art Natural Language Processing},
author={Thomas Wolf and Lysandre Debut and Victor Sanh and Julien Chaumond and Clement Delangue and Anthony Moi and Pierric Cistac and Tim Rault and R{\'e}mi Louf and Morgan Funtowicz and Jamie Brew},
journal={ArXiv},
year={2019},
volume={abs/1910.03771}
}
• Published 9 October 2019
• Computer Science
• ArXiv
Recent progress in natural language processing has been driven by advances in both model architecture and model pretraining. Transformer architectures have facilitated building higher-capacity models, and pretraining has made it possible to effectively utilize this capacity for a wide variety of tasks. *Transformers* is an open-source library with the goal of opening up these advances to the wider machine learning community. The library consists of carefully engineered state-of-the art…
3,006 Citations

## Citations

### Poor Man's BERT: Smaller and Faster Transformer Models

• Computer Science
ArXiv
• 2020
A number of memory-light model reduction strategies that do not require model pre-training from scratch are explored, which are able to prune BERT, RoBERTa and XLNet models by up to 40%, while maintaining up to 98% of their original performance.

### Entity Matching with Transformer Architectures - A Step Forward in Data Integration

• Computer Science
EDBT
• 2020
This paper empirically compares the capability of transformer architectures and transfer-learning on the task of EM and shows that transformer architectures outperform classical deep learning methods in EM by an average margin of 27.5%.

• Computer Science
AIED
• 2020
This work trains the newest and most powerful transformers, as ranked by the GLUE benchmark, on the SemEval-2013 dataset, and shows that models trained with knowledge distillation are feasible for use in short answer grading.

### Stress Test Evaluation of Transformer-based Models in Natural Language Understanding Tasks

• Computer Science
LREC
• 2020
An evaluation of Transformer-based models on Natural Language Inference (NLI) and Question Answering (QA) tasks shows that RoBERTa, XLNet and BERT are more robust to stress tests than recurrent neural network models on both tasks, while also revealing that there is still room for future improvement in this field.

### Directed Beam Search: Plug-and-Play Lexically Constrained Language Generation

• Computer Science
ArXiv
• 2020
Directed Beam Search is proposed, a plug-and-play method for lexically constrained language generation that can be applied to any language model, is easy to implement and can be used for general language generation.

### GMAT: Global Memory Augmentation for Transformers

• Computer Science
ArXiv
• 2020
This work proposes to augment sparse Transformer blocks with a dense attention-based *global memory* of length $M$ ($\ll L$) which provides an aggregate global view of the entire input sequence to each position, and empirically shows that this method leads to substantial improvement on a range of tasks.

• Computer Science
CHR
• 2020
This paper confirms results from a recent study that continuing pretraining on the domain and the task data substantially improves task performance, and finds that training a model from scratch using ELECTRA is not competitive for the authors' data sets.

### RuSentEval: Linguistic Source, Encoder Force!

• Linguistics
BSNLP
• 2021
RuSentEval is introduced, an enhanced set of 14 probing tasks for Russian, including ones that have not been explored yet, to explore the distribution of various linguistic properties in five multilingual transformers for two typologically contrasting languages.

### TransQuest: Translation Quality Estimation with Cross-lingual Transformers

• Computer Science
COLING
• 2020
A simple QE framework based on cross-lingual transformers is proposed, and it is used to implement and evaluate two different neural architectures, achieving state-of-the-art results outperforming current open-source quality estimation frameworks when trained on datasets from WMT.

### Adaptation of Deep Bidirectional Transformers for Afrikaans Language

The results show that AfriBERT improves the current state of the art in most of the tasks the authors considered, and that transfer learning from a multilingual to a monolingual model can yield a significant performance improvement on downstream tasks.

## References

### BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

• Computer Science
NAACL
• 2019
A new language representation model, BERT, is introduced. It is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, and it can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

### AllenNLP: A Deep Semantic Natural Language Processing Platform

• Computer Science
ArXiv
• 2018
AllenNLP is described, a library for applying deep learning methods to NLP research that addresses issues with easy-to-use command-line tools, declarative configuration-driven experiments, and modular NLP abstractions.

### Reformer: The Efficient Transformer

• Computer Science
ICLR
• 2020
This work replaces dot-product attention by one that uses locality-sensitive hashing and uses reversible residual layers instead of the standard residuals, which allows storing activations only once in the training process instead of several times, making the model much more memory-efficient and much faster on long sequences.
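
To illustrate the reversible residual layers mentioned in this summary, here is a minimal sketch of the general reversible-residual idea. It is not Reformer's actual code: the functions `f` and `g` are hypothetical stand-ins for the attention and feed-forward sub-layers, and plain Python lists stand in for tensors. The point it shows is that a layer's inputs can be recomputed exactly from its outputs, so intermediate activations need not be stored for the backward pass.

```python
import math


def f(x):
    # Hypothetical stand-in for the attention sub-layer (assumption).
    return [math.tanh(v) for v in x]


def g(x):
    # Hypothetical stand-in for the feed-forward sub-layer (assumption).
    return [0.5 * v for v in x]


def reversible_forward(x1, x2):
    # y1 = x1 + f(x2), y2 = x2 + g(y1)
    y1 = [a + b for a, b in zip(x1, f(x2))]
    y2 = [a + b for a, b in zip(x2, g(y1))]
    return y1, y2


def reversible_inverse(y1, y2):
    # Undo the two residual additions in reverse order.
    x2 = [a - b for a, b in zip(y2, g(y1))]
    x1 = [a - b for a, b in zip(y1, f(x2))]
    return x1, x2


x1, x2 = [0.1, -0.2], [0.3, 0.4]
y1, y2 = reversible_forward(x1, x2)
r1, r2 = reversible_inverse(y1, y2)
```

Here `r1` and `r2` match `x1` and `x2` up to floating-point rounding, which is what lets a training loop discard the inputs and rebuild them on demand.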

### Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

• Computer Science
J. Mach. Learn. Res.
• 2020
This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.

### DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

• Computer Science
ArXiv
• 2019
This work proposes a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can be fine-tuned with good performance on a wide range of tasks like its larger counterparts, and introduces a triple loss combining language modeling, distillation and cosine-distance losses.
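
The triple loss named in this summary can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not DistilBERT's implementation: the temperature value, the loss weights, and all function names are hypothetical, the language modeling loss is taken as a precomputed number, and plain Python lists stand in for tensors.

```python
import math


def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax, shifted by the max for numerical stability.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Cross-entropy of the student against the teacher's softened distribution.
    # The temperature of 2.0 is an assumed value for illustration.
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher, student))


def cosine_distance(u, v):
    # 1 - cosine similarity between two hidden-state vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def triple_loss(lm_loss, student_logits, teacher_logits,
                student_hidden, teacher_hidden, weights=(1.0, 1.0, 1.0)):
    # Weighted sum of the three terms; the weights are hypothetical.
    w_lm, w_distil, w_cos = weights
    return (w_lm * lm_loss
            + w_distil * distillation_loss(student_logits, teacher_logits)
            + w_cos * cosine_distance(student_hidden, teacher_hidden))
```

The distillation term is smallest when the student's softened distribution matches the teacher's, and the cosine term is zero when the two hidden-state vectors point in the same direction.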

### ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

• Computer Science
ICLR
• 2020
This work presents two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT, and uses a self-supervised loss that focuses on modeling inter-sentence coherence.

### SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems

• Computer Science
NeurIPS
• 2019
A new benchmark styled after GLUE is presented, comprising a new set of more difficult language understanding tasks, a software toolkit, and a public leaderboard.

### Transformer-XL: Attentive Language Models beyond a Fixed-Length Context

• Computer Science
ACL
• 2019
This work proposes a novel neural architecture, Transformer-XL, which consists of a segment-level recurrence mechanism and a novel positional encoding scheme, and which enables learning dependencies beyond a fixed length without disrupting temporal coherence.

### exBERT: A Visual Analysis Tool to Explore Learned Representations in Transformer Models

• Computer Science
ACL
• 2020
ExBERT provides insights into the meaning of the contextual representations and attention by matching a human-specified input to similar contexts in large annotated datasets, and can quickly replicate findings from literature and extend them to previously not analyzed models.

### Transfer Learning in Natural Language Processing

• Computer Science
NAACL
• 2019
An overview of modern transfer learning methods in NLP is presented, covering how models are pre-trained, what information their learned representations capture, and examples and case studies of how these models can be integrated and adapted in downstream NLP tasks.