BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

@article{Devlin2018BERTPO,
  title={BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding},
  author={Jacob Devlin and Ming-Wei Chang and Kenton Lee and Kristina Toutanova},
  journal={ArXiv},
  year={2018},
  volume={abs/1810.04805}
}

We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such…
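
The fine-tuning recipe described in the abstract, adding a single task-specific output layer on top of the pre-trained bidirectional encoder, can be sketched as follows. This is a minimal illustration and not code from the paper: it assumes the Hugging Face transformers library (v4+) and PyTorch, and the model name, label count, and example sentence are placeholders.

import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

# Load a pre-trained bidirectional encoder and its tokenizer (placeholder checkpoint name).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")

# The "one additional output layer": a linear classifier over the [CLS] hidden state.
# Two labels is an arbitrary placeholder for a sentence-level task.
classifier = nn.Linear(encoder.config.hidden_size, 2)

inputs = tokenizer("BERT conditions on both left and right context.",
                   return_tensors="pt")
outputs = encoder(**inputs)
cls_state = outputs.last_hidden_state[:, 0]   # final hidden state of the [CLS] token
logits = classifier(cls_state)                # task-specific scores

# Fine-tuning updates the encoder and the new output layer jointly, end to end.
loss = nn.functional.cross_entropy(logits, torch.tensor([1]))
loss.backward()

Using the final hidden state of the [CLS] token as the sentence representation follows the paper's setup for sentence-level classification; token-level tasks such as SQuAD instead attach the output layer to every token's final hidden state.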

Key Quantitative Results

  • It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).
  • Both BERT-Base and BERT-Large outperform all systems on all tasks by a substantial margin, obtaining 4.5% and 7.0% respective average accuracy improvement over the prior state of the art.

Citations

Publications citing this paper (showing 1–10 of 973).

BAM! Born-Again Multi-Task Networks for Natural Language Understanding

  • ACL, 2019
  • Cites methods; highly influenced

75 Languages, 1 Model: Parsing Universal Dependencies Universally

  • Cites methods and background; highly influenced

A Bi-directional Transformer for Musical Chord Recognition

Jonggwon Park, Kyoyun Choi, Sungwook Jeon, Do Kyun Kim, Jonghun Park
  • ArXiv, 2019
  • Cites background; highly influenced

A Hybrid Neural Network Model for Commonsense Reasoning

  • ArXiv, 2019
  • Cites methods; highly influenced

A Lightweight Recurrent Network for Sequence Modeling

  • Cites methods; highly influenced

An Adversarial Winograd Schema Challenge at Scale

  • Cites methods and background; highly influenced

Analyzing the Structure of Attention in a Transformer Language Model

  • Cites methods and background; highly influenced

Beto, Bentz, Becas: The Surprising Cross-Lingual Effectiveness of BERT

  • Cites background and methods; highly influenced

Citation Statistics

  • 389 highly influenced citations
  • Averaged 322 citations per year from 2017 through 2019
  • 1,436% increase in citations per year in 2019 over 2018

References

Publications referenced by this paper (showing 1–10 of 54).

Deep contextualized word representations

  • Highly influential

Improving language understanding with unsupervised learning

Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever
  • Technical report, OpenAI, 2018
  • Highly influential

Attention Is All You Need

  • Highly influential
