BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova

We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7 point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement), and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).
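BERT's pre-training corrupts input text and asks the model to recover it. A minimal sketch of that corruption scheme, following the masking recipe described in the paper (15% of positions selected; of those, 80% become [MASK], 10% a random token, 10% left unchanged). The names `mask_tokens` and `VOCAB` are illustrative, not from the authors' code:

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "cat", "sat", "on", "mat"]  # toy vocabulary for illustration

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """BERT-style masking: each selected position becomes [MASK] 80% of the
    time, a random vocabulary token 10%, and stays unchanged 10%. Returns
    the corrupted sequence and the prediction targets (None = no loss)."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    corrupted, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            r = rng.random()
            if r < 0.8:
                corrupted.append(MASK)
            elif r < 0.9:
                corrupted.append(rng.choice(VOCAB))
            else:
                corrupted.append(tok)
            targets.append(tok)   # model must recover the original token
        else:
            corrupted.append(tok)
            targets.append(None)  # position excluded from the MLM loss
    return corrupted, targets
```

The loss is computed only at the selected positions, which is what pushes the model toward deep bidirectional context rather than left-to-right prediction.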

Syntax-Infused Transformer and BERT models for Machine Translation and Natural Language Understanding

This work shows that the syntax-infused Transformer with multiple features achieves an improvement of 0.7 BLEU when trained on the full WMT'14 English-to-German translation dataset, and finds that incorporating syntax into BERT fine-tuning outperforms the baseline on a number of downstream tasks from the GLUE benchmark.

TRANS-BLSTM: Transformer with Bidirectional LSTM for Language Understanding

TRANS-BLSTM is proposed as a joint modeling framework combining the Transformer and bidirectional LSTM (BLSTM); experiments on GLUE and SQuAD 1.1 show that TRANS-BLSTM models consistently improve accuracy over BERT baselines.

Span Selection Pre-training for Question Answering

This paper introduces a new pre-training task inspired by reading comprehension that shifts pre-training from memorization toward understanding, and shows strong empirical evidence for the proposed model: it obtains state-of-the-art results on Natural Questions, a new benchmark MRC dataset, outperforming BERT-LARGE by 3 F1 points on short-answer prediction.

Real-time Inference in Multi-sentence Tasks with Deep Pretrained Transformers

A new architecture, the Poly-encoder, is designed to approach the accuracy of Cross-encoders while maintaining reasonable computation time, and achieves state-of-the-art results on both dialogue tasks.

Utilizing Bidirectional Encoder Representations from Transformers for Answer Selection

This paper adopts the pre-trained Bidirectional Encoder Representations from Transformers (BERT) language model, shows that fine-tuning BERT for the answer selection task is very effective, and observes maximum improvements over previous state-of-the-art models on the QA and CQA datasets.

BERT for Question Answering on SQuAD 2.0

This project fine-tunes the BERT model with additional task-specific layers to improve its performance on the Stanford Question Answering Dataset (SQuAD 2.0).

TinyBERT: Distilling BERT for Natural Language Understanding

A novel Transformer distillation method, specially designed for knowledge distillation (KD) of Transformer-based models, is proposed; by leveraging this new KD method, the rich knowledge encoded in a large "teacher" BERT can be effectively transferred to a small "student" TinyBERT.
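The core mechanism behind any such student-teacher setup is training the student to match the teacher's softened output distribution. A minimal sketch of that generic soft-label KD loss in pure Python; note that TinyBERT's actual objective goes further, also distilling embeddings, attention matrices, and hidden states layer by layer, none of which is shown here:

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; higher T flattens the distribution."""
    exps = [math.exp(x / T) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Soft-target cross-entropy H(p_teacher, p_student) at temperature T:
    the student is trained to reproduce the teacher's softened outputs."""
    p = softmax(teacher_logits, T)  # teacher's soft labels
    q = softmax(student_logits, T)  # student's predictions
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))
```

Since H(p, q) is minimized exactly when q = p, the loss is smallest when the student's logits reproduce the teacher's distribution.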

Incorporating BERT into Neural Machine Translation

A new algorithm named BERT-fused model is proposed, in which BERT is first used to extract representations for an input sequence, and then the representations are fused with each layer of the encoder and decoder of the NMT model through attention mechanisms.

Improving SQuAD 2.0 Performance using BERT + X

The complexity of the BERT model is increased to make it more specific to the question-answering task by incorporating highway networks, additional transformer layers, and Bidirectional Attention Flow (BiDAF).

What the [MASK]? Making Sense of Language-Specific BERT Models

The current state of the art in language-specific BERT models is presented, providing an overall picture along different dimensions (i.e., architectures, data domains, and tasks) and an immediate, straightforward overview of their commonalities and differences.

Attention is All you Need

A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, being applied successfully to English constituency parsing with both large and limited training data.
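The building block the Transformer relies on is scaled dot-product attention, softmax(QK^T / sqrt(d)) V. A dependency-free sketch, with Q, K, V given as lists of d-dimensional row vectors (a real implementation would batch this as matrix multiplications):

```python
import math

def attention(Q, K, V):
    """Scaled dot-product attention: for each query, score every key by a
    dot product scaled by sqrt(d), softmax the scores into weights, and
    return the weighted sum of the value vectors."""
    d = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in K]
        m = max(scores)                      # subtract max for stability
        exps = [math.exp(s - m) for s in scores]
        Z = sum(exps)
        weights = [e / Z for e in exps]      # attention distribution
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

Because the weights sum to 1, each output row is a convex combination of the value vectors, tilted toward the keys most similar to the query.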

Dissecting Contextual Word Embeddings: Architecture and Representation

There is a tradeoff between speed and accuracy, but all architectures learn high-quality contextual representations that outperform word embeddings on four challenging NLP tasks, suggesting that unsupervised biLMs, independent of architecture, learn much more about the structure of language than previously appreciated.

Semi-supervised sequence tagging with bidirectional language models

A general semi-supervised approach for adding pretrained context embeddings from bidirectional language models to NLP systems is presented and applied to sequence labeling tasks, surpassing previous systems that use other forms of transfer or joint learning with additional labeled data and task-specific gazetteers.

Character-Level Language Modeling with Deeper Self-Attention

This paper shows that a deep (64-layer) transformer model with fixed context outperforms RNN variants by a large margin, achieving state of the art on two popular benchmarks: 1.13 bits per character on text8 and 1.06 on enwik8.

QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension

A new Q&A architecture called QANet is proposed, which does not require recurrent networks: its encoder consists exclusively of convolution and self-attention, where convolution models local interactions and self-attention models global interactions.
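The "local interactions" half of that design is ordinary 1-D convolution over the sequence. A toy scalar-valued sketch with same-padding (deep-learning convention, no kernel flip); QANet itself uses depthwise-separable convolutions over vector-valued features, which this deliberately simplifies away:

```python
def conv1d(seq, kernel):
    """'Same'-padded 1-D convolution over a scalar sequence: each output
    position mixes a local window of its neighbors, so information flows
    only locally, unlike self-attention's global mixing."""
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(seq) + [0.0] * pad
    return [sum(kernel[j] * padded[i + j] for j in range(k))
            for i in range(len(seq))]
```

For example, a window-sum kernel `[1, 1, 1]` over `[1, 2, 3]` yields each element plus its zero-padded neighbors.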

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

A benchmark of nine diverse NLU tasks, an auxiliary dataset for probing models for understanding of specific linguistic phenomena, and an online platform for evaluating and comparing models, which favors models that can represent linguistic knowledge in a way that facilitates sample-efficient learning and effective knowledge-transfer across tasks.

Skip-Thought Vectors

We describe an approach for unsupervised learning of a generic, distributed sentence encoder. Using the continuity of text from books, we train an encoder-decoder model that tries to reconstruct the surrounding sentences of an encoded passage.

MaskGAN: Better Text Generation via Filling in the ______

This work introduces an actor-critic conditional GAN that fills in missing text conditioned on the surrounding context and shows qualitative and quantitative evidence that this produces more realistic conditional and unconditional text samples than a maximum-likelihood-trained model.

Multi-Granularity Hierarchical Attention Fusion Networks for Reading Comprehension and Question Answering

A novel hierarchical attention network for reading-comprehension-style question answering, which aims to answer questions for a given narrative paragraph, achieves state-of-the-art results on both the SQuAD and TriviaQA Wiki leaderboards as well as on two adversarial SQuAD datasets.

Deep Contextualized Word Representations

A new type of deep contextualized word representation is introduced that models both complex characteristics of word use and how these uses vary across linguistic contexts, allowing downstream models to mix different types of semi-supervision signals.