BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

@article{Devlin2019BERTPO,
  title={BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding},
  author={Jacob Devlin and Ming-Wei Chang and Kenton Lee and Kristina Toutanova},
  journal={ArXiv},
  year={2019},
  volume={abs/1810.04805}
}
We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. [...] It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).
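As a concrete illustration of the fine-tuning recipe behind the GLUE and MultiNLI numbers above (a single task-specific output layer on top of the pre-trained encoder), here is a minimal sketch of sentence-pair classification. The Hugging Face transformers API, the bert-base-uncased checkpoint, the example pair, and the hyperparameters are assumptions for illustration, not the paper's original TensorFlow code.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Pre-trained encoder plus a randomly initialized 3-way classification head
# (e.g. entailment / neutral / contradiction for MultiNLI).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)

# One premise/hypothesis pair; label 0 stands in for "entailment" here.
batch = tokenizer(
    ["A man is playing a guitar."],
    ["A person is making music."],
    padding=True, truncation=True, return_tensors="pt",
)
labels = torch.tensor([0])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
outputs = model(**batch, labels=labels)  # cross-entropy loss over the pooled [CLS] state
outputs.loss.backward()
optimizer.step()
```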
Syntax-Infused Transformer and BERT models for Machine Translation and Natural Language Understanding
TLDR
This work shows that the syntax-infused Transformer with multiple features achieves an improvement of 0.7 BLEU when trained on the full WMT'14 English-to-German translation dataset, and finds that incorporating syntax into BERT fine-tuning outperforms the baseline on a number of downstream tasks from the GLUE benchmark.
Extending Answer Prediction for Deep Bi-directional Transformers
TLDR
This work investigates alternative ways to interpret and process BERT encoding outputs, including Pointer-Net, the Dynamic Pointing Decoder, and the Dynamic Chunk Reader; the best-performing model, the Dynamic Decoder, uses pre-trained BERT encodings and improves F1 over the baseline BiDAF on the test set.
TRANS-BLSTM: Transformer with Bidirectional LSTM for Language Understanding
TLDR
It is shown that TRANS-BLSTM models consistently improve accuracy over BERT baselines in GLUE and SQuAD 1.1 experiments, and TRANS-BLSTM is proposed as a joint modeling framework for the transformer and BLSTM.
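For concreteness, here is a rough sketch of the joint transformer + bidirectional-LSTM idea: a BiLSTM runs over the contextual token states produced by a pre-trained transformer encoder, and a classifier reads the pooled BLSTM states. The use of Hugging Face's BertModel, the layer sizes, and mean pooling are illustrative assumptions, not the paper's exact TRANS-BLSTM variants.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class TransBLSTMClassifier(nn.Module):
    def __init__(self, num_labels: int, hidden: int = 768, lstm_hidden: int = 384):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.blstm = nn.LSTM(hidden, lstm_hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * lstm_hidden, num_labels)

    def forward(self, input_ids, attention_mask):
        # Contextual token states from the transformer: (batch, seq_len, hidden)
        token_states = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        # Bidirectional LSTM over those states: (batch, seq_len, 2 * lstm_hidden)
        lstm_out, _ = self.blstm(token_states)
        # Mean-pool over the sequence and classify
        return self.classifier(lstm_out.mean(dim=1))
```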
Span Selection Pre-training for Question Answering
TLDR
This paper introduces a new pre-training task inspired by reading comprehension to better shift pre-training from memorization toward understanding, and shows strong empirical evidence for the proposed model, which obtains SOTA results on Natural Questions, a new benchmark MRC dataset, outperforming BERT-LARGE by 3 F1 points on short answer prediction.
Real-time Inference in Multi-sentence Tasks with Deep Pretrained Transformers
TLDR
A new architecture, the Poly-encoder, is developed that approaches the performance of the Cross-encoder while maintaining reasonable computation time, and it achieves state-of-the-art results on both dialogue tasks.
Utilizing Bidirectional Encoder Representations from Transformers for Answer Selection
TLDR
This paper adopts the pre-trained Bidirectional Encoder Representations from Transformers (BERT) language model, shows that fine-tuning BERT for the answer selection task is very effective, and observes maximum improvements on the QA and CQA datasets compared to the previous state-of-the-art models.
BERT for Question Answering on SQuAD 2.0
TLDR
This project takes the BERT model and fine-tunes it with additional task-specific layers to improve its performance on the Stanford Question Answering Dataset (SQuAD 2.0).
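For reference, the standard span-prediction head used when fine-tuning BERT on SQuAD-style data looks roughly like the sketch below: a linear layer maps each token representation to start and end logits, and the highest-scoring span is returned as the answer. The extra task-specific layers the project experiments with are not reproduced here; the class and dimension names are illustrative.

```python
import torch
import torch.nn as nn

class SpanHead(nn.Module):
    def __init__(self, hidden: int = 768):
        super().__init__()
        self.qa_outputs = nn.Linear(hidden, 2)  # one score per token for start, one for end

    def forward(self, token_states):            # token_states: (batch, seq_len, hidden)
        logits = self.qa_outputs(token_states)  # (batch, seq_len, 2)
        start_logits, end_logits = logits.split(1, dim=-1)
        return start_logits.squeeze(-1), end_logits.squeeze(-1)
```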
TinyBERT: Distilling BERT for Natural Language Understanding
TLDR
A novel Transformer distillation method, specially designed for knowledge distillation (KD) of Transformer-based models, is proposed; by leveraging this new KD method, the rich knowledge encoded in a large “teacher” BERT can be effectively transferred to a small “student” TinyBERT.
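A simplified sketch of a distillation objective in this spirit: the student matches the teacher's output distribution via a temperature-scaled KL term and the teacher's hidden states via an MSE term through a learned projection (the student is typically narrower). The loss weighting, temperature, and projection are illustrative assumptions, not TinyBERT's full layer-to-layer mapping.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_hidden, teacher_hidden, proj, T=1.0):
    # Soft-label term: KL divergence between teacher and student distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hidden-state term: project the narrower student states to the teacher width.
    hidden = F.mse_loss(proj(student_hidden), teacher_hidden)
    return soft + hidden

# Example wiring (dimensions are illustrative): a 312-wide student distilled
# from a 768-wide teacher would use proj = torch.nn.Linear(312, 768).
```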
Incorporating BERT into Neural Machine Translation
TLDR
A new model, called the BERT-fused model, is proposed, in which BERT is first used to extract representations for an input sequence, and these representations are then fused with each layer of the encoder and decoder of the NMT model through attention mechanisms.
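A sketch of the fusion idea: each NMT encoder layer attends over the BERT representation of the source sentence in addition to its own self-attention, and the two streams are averaged before the usual residual and feed-forward sublayers. Module names, dimensions, and the simple averaging are assumptions; the paper's full BERT-fused model also covers the decoder side and additional training tricks.

```python
import torch
import torch.nn as nn

class BertFusedEncoderLayer(nn.Module):
    def __init__(self, d_model: int = 512, bert_dim: int = 768, nhead: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        # Attention whose keys/values are the (wider) BERT representations.
        self.bert_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True,
                                               kdim=bert_dim, vdim=bert_dim)
        self.ffn = nn.Sequential(nn.Linear(d_model, 2048), nn.ReLU(), nn.Linear(2048, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, bert_states):
        self_out, _ = self.self_attn(x, x, x)                      # attend over the layer's own states
        bert_out, _ = self.bert_attn(x, bert_states, bert_states)  # attend over BERT's states
        x = self.norm1(x + 0.5 * (self_out + bert_out))            # average the two attention streams
        return self.norm2(x + self.ffn(x))
```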
Improving SQUAD 2.0 Performance using BERT + X
TLDR
The complexity of the BERT model is increased to make it more specific to the question-answering task by incorporating highway networks, additional transformer layers, and Bidirectional Attention Flow (BiDAF).
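Of the components listed, the highway network is the simplest to show; a minimal sketch follows, in which a learned gate interpolates between a nonlinear transform of the input and the input itself. The dimensions and how the report wires this onto BERT's outputs are assumptions.

```python
import torch
import torch.nn as nn

class Highway(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.transform = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)

    def forward(self, x):
        t = torch.sigmoid(self.gate(x))    # transform gate in [0, 1]
        h = torch.relu(self.transform(x))  # candidate transform
        return t * h + (1.0 - t) * x       # carry the rest of the input through unchanged
```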
...

References

Showing 1-10 of 60 references
Attention is All you Need
TLDR
A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, applying successfully to English constituency parsing with both large and limited training data.
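The core operation behind the Transformer is scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. Below is a small NumPy sketch for illustration; the full model adds learned projections, multiple heads, and masking.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)       # (..., seq_q, seq_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over the key positions
    return weights @ V                                   # (..., seq_q, d_v)

# 4 query positions attending over 6 key/value positions of width 8.
Q, K, V = np.random.randn(4, 8), np.random.randn(6, 8), np.random.randn(6, 8)
print(scaled_dot_product_attention(Q, K, V).shape)       # (4, 8)
```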
Dissecting Contextual Word Embeddings: Architecture and Representation
TLDR
There is a tradeoff between speed and accuracy, but all architectures learn high quality contextual representations that outperform word embeddings for four challenging NLP tasks, suggesting that unsupervised biLMs, independent of architecture, are learning much more about the structure of language than previously appreciated.
Semi-Supervised Sequence Modeling with Cross-View Training
TLDR
Cross-View Training (CVT), a semi-supervised learning algorithm that improves the representations of a Bi-LSTM sentence encoder using a mix of labeled and unlabeled data, is proposed and evaluated, achieving state-of-the-art results.
Semi-supervised sequence tagging with bidirectional language models
TLDR
A general semi-supervised approach is presented for adding pretrained context embeddings from bidirectional language models to NLP systems and applied to sequence labeling tasks, surpassing previous systems that use other forms of transfer or joint learning with additional labeled data and task-specific gazetteers.
Character-Level Language Modeling with Deeper Self-Attention
TLDR
This paper shows that a deep (64-layer) transformer model with fixed context outperforms RNN variants by a large margin, achieving state of the art on two popular benchmarks: 1.13 bits per character on text8 and 1.06 on enwik8.
QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension
TLDR
A new Q&A architecture called QANet is proposed, which does not require recurrent networks; its encoder consists exclusively of convolution and self-attention, where convolution models local interactions and self-attention models global interactions.
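A sketch of such a recurrence-free encoder block: depthwise-separable 1-D convolutions capture local interactions, followed by self-attention for global interactions. Kernel size, layer count, and normalization placement are illustrative assumptions rather than QANet's exact configuration.

```python
import torch
import torch.nn as nn

class ConvSelfAttnBlock(nn.Module):
    def __init__(self, dim: int = 128, kernel: int = 7, nhead: int = 8):
        super().__init__()
        self.depthwise = nn.Conv1d(dim, dim, kernel, padding=kernel // 2, groups=dim)
        self.pointwise = nn.Conv1d(dim, dim, 1)
        self.attn = nn.MultiheadAttention(dim, nhead, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):  # x: (batch, seq_len, dim)
        # Local interactions: depthwise-separable convolution along the sequence.
        c = self.pointwise(self.depthwise(x.transpose(1, 2))).transpose(1, 2)
        x = self.norm1(x + torch.relu(c))
        # Global interactions: self-attention across all positions.
        a, _ = self.attn(x, x, x)
        return self.norm2(x + a)
```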
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
TLDR
A benchmark of nine diverse NLU tasks, an auxiliary dataset for probing models for understanding of specific linguistic phenomena, and an online platform for evaluating and comparing models, which favors models that can represent linguistic knowledge in a way that facilitates sample-efficient learning and effective knowledge-transfer across tasks.
Skip-Thought Vectors
We describe an approach for unsupervised learning of a generic, distributed sentence encoder. Using the continuity of text from books, we train an encoder-decoder model that tries to reconstruct the surrounding sentences of an encoded passage.
MaskGAN: Better Text Generation via Filling in the ______
TLDR
This work introduces an actor-critic conditional GAN that fills in missing text conditioned on the surrounding context and shows qualitatively and quantitatively, evidence that this produces more realistic conditional and unconditional text samples compared to a maximum likelihood trained model.
Contextual String Embeddings for Sequence Labeling
TLDR
This paper proposes to leverage the internal states of a trained character language model to produce a novel type of word embedding, referred to as contextual string embeddings, which fundamentally model words as sequences of characters and are contextualized by their surrounding text.
...