Publications (sorted by influence)
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Language model pretraining has led to significant performance gains but careful comparison between different approaches is challenging. Training is computationally expensive, often done on private…
  • 1,417 citations · 461 highly influential citations
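The visible excerpt is about the difficulty of comparing pretraining recipes, but the resulting RoBERTa checkpoints can be probed directly for masked-token prediction. A minimal sketch, assuming the Hugging Face transformers library and the public roberta-base checkpoint (neither is part of the paper itself):

```python
# Minimal sketch (not from the paper): probing a pretrained RoBERTa
# checkpoint for masked-token prediction with Hugging Face transformers.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="roberta-base")

# RoBERTa's mask token is "<mask>".
for candidate in fill_mask("The capital of France is <mask>."):
    print(candidate["token_str"], round(candidate["score"], 3))
```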
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
For natural language understanding (NLU) technology to be maximally useful, both practically and as a scientific object of study, it must be general: it must be able to process language in a way that…
  • 860 citations · 222 highly influential citations
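For a concrete sense of what a GLUE task looks like, here is a minimal sketch that loads one of the benchmark's tasks (SST-2) with the Hugging Face datasets library; the library and dataset identifiers are assumptions, not part of the paper:

```python
# Minimal sketch (an assumption, not from the paper): loading one GLUE
# task with the Hugging Face datasets library to inspect its format.
from datasets import load_dataset

sst2 = load_dataset("glue", "sst2")            # single-sentence sentiment task
print(sst2["train"][0])                        # {'sentence': ..., 'label': ..., 'idx': ...}
print(sst2["train"].features["label"].names)   # ['negative', 'positive']
```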
Neural Word Embedding as Implicit Matrix Factorization
We analyze skip-gram with negative-sampling (SGNS), a word embedding method introduced by Mikolov et al., and show that it is implicitly factorizing a word-context matrix, whose cells are the…
  • 1,206 citations · 134 highly influential citations
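The paper's central observation is that SGNS with k negative samples implicitly factorizes a word-context matrix of pointwise mutual information shifted by log k. Below is a toy numpy sketch of the explicit alternative the paper evaluates: a shifted positive PMI (SPPMI) matrix factorized with SVD. The counts and dimensions are illustrative assumptions.

```python
# Toy sketch of the explicit construction the paper compares against SGNS:
# a shifted positive PMI (SPPMI) matrix, PMI(w, c) - log k clipped at 0,
# factorized with SVD. Counts and dimensions are illustrative assumptions.
import numpy as np

counts = np.array([[10., 2., 0.],   # co-occurrence counts: rows = words,
                   [ 3., 8., 1.],   # columns = contexts
                   [ 0., 1., 6.]])
k = 5                               # number of SGNS negative samples

total = counts.sum()
p_w = counts.sum(axis=1, keepdims=True) / total    # word marginals
p_c = counts.sum(axis=0, keepdims=True) / total    # context marginals
p_wc = counts / total                              # joint distribution

with np.errstate(divide="ignore"):
    pmi = np.log(p_wc / (p_w * p_c))
sppmi = np.maximum(pmi - np.log(k), 0.0)           # shift by log k, clip at 0

d = 2                                              # embedding dimensionality
U, S, Vt = np.linalg.svd(sppmi)
word_vecs = U[:, :d] * np.sqrt(S[:d])              # symmetric SVD factorization
context_vecs = Vt[:d, :].T * np.sqrt(S[:d])
print(word_vecs)
```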
Improving Distributional Similarity with Lessons Learned from Word Embeddings
Recent trends suggest that neural-network-inspired word embedding models outperform traditional count-based distributional models on word similarity and analogy detection tasks. We reveal that much…
  • 932 citations · 117 highly influential citations
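One of the hyperparameters the paper transfers from word2vec to count-based models is context distribution smoothing: raising context counts to a power α = 0.75 (as in word2vec's negative-sampling distribution) before computing PMI. Roughly, in the paper's notation:

```latex
% Context distribution smoothing (CDS) applied to PMI, with the
% word2vec-style exponent alpha = 0.75:
\[
  \widehat{P}_{\alpha}(c) = \frac{\#(c)^{\alpha}}{\sum_{c'} \#(c')^{\alpha}},
  \qquad
  \mathrm{PMI}_{\alpha}(w, c) = \log \frac{P(w, c)}{P(w)\,\widehat{P}_{\alpha}(c)}.
\]
```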
Dependency-Based Word Embeddings
While continuous word embeddings are gaining popularity, current models are based solely on linear contexts. In this work, we generalize the skip-gram model with negative sampling introduced by…
  • 693 citations · 63 highly influential citations
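The generalization is to replace linear bag-of-words contexts with syntactic ones: each word is paired with its dependency neighbors, typed by the relation. A rough sketch of that pair extraction using spaCy's parser (an assumption: the paper used a different parser, and its preposition-collapsing step is omitted here):

```python
# Rough sketch of dependency-based (word, context) pair extraction in the
# spirit of the paper, using spaCy's parser (an assumption; details such as
# preposition collapsing are omitted).
import spacy

nlp = spacy.load("en_core_web_sm")     # assumes the small English model is installed
doc = nlp("Australian scientist discovers star with telescope")

pairs = []
for token in doc:
    if token.head is token:            # skip the root's self-loop
        continue
    # the dependent's context: its head, typed by the relation
    pairs.append((token.text, f"{token.head.text}/{token.dep_}"))
    # the head's inverse context
    pairs.append((token.head.text, f"{token.text}/{token.dep_}-1"))

for word, context in pairs:
    print(word, context)
```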
Annotation Artifacts in Natural Language Inference Data
Large-scale datasets for natural language inference are created by presenting crowd workers with a sentence (premise), and asking them to generate three new sentences (hypotheses) that it entails,…
  • 297 citations · 61 highly influential citations
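The artifacts the paper documents are detectable with a premise-blind baseline: a classifier that sees only the hypothesis. A hedged sketch of that style of baseline with scikit-learn; the toy hypotheses and labels are placeholders for a real NLI training set such as SNLI:

```python
# Hedged sketch of the premise-blind ("hypothesis-only") style of baseline
# used to expose annotation artifacts: a bag-of-words classifier that never
# sees the premise. The toy hypotheses/labels below are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

hypotheses = [
    "A man is sleeping.",
    "Nobody is outside.",
    "Some people are eating outdoors.",
]
labels = ["contradiction", "contradiction", "entailment"]

hypothesis_only = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
hypothesis_only.fit(hypotheses, labels)

# If this model beats the majority-class baseline on held-out data, the
# hypotheses alone leak label information, i.e. an annotation artifact.
print(hypothesis_only.predict(["A woman is not sleeping."]))
```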
word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method
The word2vec software of Tomas Mikolov and colleagues (this https URL) has gained a lot of traction lately, and provides state-of-the-art word embeddings. The learning models behind the software are…
  • 834 citations · 57 highly influential citations
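The quantity the note derives is the negative-sampling objective for a single word-context pair (w, c), with k negative contexts drawn from a noise distribution P_n:

```latex
% The per-pair negative-sampling objective derived in the note: maximize
% the score of an observed (w, c) pair while pushing down k "negative"
% contexts sampled from the noise distribution P_n:
\[
  \ell(w, c) = \log \sigma\!\bigl(\vec{c} \cdot \vec{w}\bigr)
  + k \cdot \mathbb{E}_{c_N \sim P_n}\!\bigl[\log \sigma\bigl(-\vec{c}_N \cdot \vec{w}\bigr)\bigr],
  \qquad \sigma(x) = \frac{1}{1 + e^{-x}}.
\]
```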
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
We present BART, a denoising autoencoder for pretraining sequence-to-sequence models. BART is trained by (1) corrupting text with an arbitrary noising function, and (2) learning a model to…
  • 221 citations · 52 highly influential citations
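Two of the noising functions the paper combines are sentence permutation and text infilling, where each sampled span is replaced by a single mask token and span lengths follow a Poisson(λ=3) distribution. A simplified sketch over whitespace tokens (the real model corrupts subword-tokenized text):

```python
# Simplified sketch of two BART noising functions: sentence permutation and
# text infilling (each sampled span is replaced by a single <mask> token,
# span lengths ~ Poisson(lambda=3)). Whitespace tokenization is a
# simplification of the real subword-level corruption.
import numpy as np

rng = np.random.default_rng(0)

def permute_sentences(sentences):
    order = rng.permutation(len(sentences))
    return [sentences[i] for i in order]

def text_infill(tokens, mask_ratio=0.3, lam=3.0, mask_token="<mask>"):
    tokens = list(tokens)
    budget = int(len(tokens) * mask_ratio)
    while budget > 0:
        span = min(int(rng.poisson(lam)), budget, len(tokens))
        start = int(rng.integers(0, len(tokens) - span + 1))
        tokens[start:start + span] = [mask_token]   # a length-0 span inserts a mask
        budget -= max(span, 1)
    return tokens

print(permute_sentences(["BART is a denoising autoencoder.",
                         "It reconstructs the original text from a corrupted copy."]))
print(text_infill("the model learns to map corrupted text back to the original".split()))
```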
Linguistic Regularities in Sparse and Explicit Word Representations
Recent work has shown that neural-embedded word representations capture many relational similarities, which can be recovered by means of vector arithmetic in the embedded space. We show that Mikolov…
  • 512 citations · 48 highly influential citations
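The recovery method in question is vector arithmetic scored by cosine similarity; the paper contrasts the additive objective (3CosAdd, "king − man + woman") with a multiplicative one (3CosMul). A toy numpy sketch of both, with random placeholder vectors standing in for trained or explicit PPMI-based representations:

```python
# Toy sketch of the additive (3CosAdd) and multiplicative (3CosMul) analogy
# objectives discussed in the paper. The random vectors are placeholders
# for real (neural or explicit PPMI-based) representations.
import numpy as np

def normalize(m):
    return m / np.linalg.norm(m, axis=-1, keepdims=True)

def three_cos_add(vocab, a, a_star, b, exclude):
    sims = vocab @ (b - a + a_star)          # vocab rows are unit vectors
    sims[exclude] = -np.inf                  # never return the query words
    return int(np.argmax(sims))

def three_cos_mul(vocab, a, a_star, b, exclude, eps=1e-3):
    def pos_cos(v):
        return (vocab @ v + 1.0) / 2.0       # shift cosines into (0, 1]
    sims = pos_cos(b) * pos_cos(a_star) / (pos_cos(a) + eps)
    sims[exclude] = -np.inf
    return int(np.argmax(sims))

words = ["king", "queen", "man", "woman"]
vocab = normalize(np.random.default_rng(0).normal(size=(len(words), 50)))
a, a_star, b = vocab[2], vocab[3], vocab[0]  # man : woman :: king : ?
print(words[three_cos_mul(vocab, a, a_star, b, exclude=[0, 2, 3])])
```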
SpanBERT: Improving Pre-training by Representing and Predicting Spans
We present SpanBERT, a pre-training method that is designed to better represent and predict spans of text. Our approach extends BERT by (1) masking contiguous random spans, rather than random tokens,…
  • 202 citations · 44 highly influential citations
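The span-masking scheme can be summarized as: sample span lengths from a clipped geometric distribution and mask contiguous spans until roughly 15% of tokens are covered. A simplified sketch (whole-word/subword handling and the paper's span boundary objective are omitted):

```python
# Simplified sketch of SpanBERT-style span masking: span lengths follow a
# geometric distribution (p=0.2) clipped at 10, and contiguous spans are
# masked until about 15% of tokens are covered.
import numpy as np

rng = np.random.default_rng(0)

def sample_span_length(p=0.2, max_len=10):
    return min(int(rng.geometric(p)), max_len)

def span_mask(tokens, mask_ratio=0.15, mask_token="[MASK]"):
    budget = max(1, int(len(tokens) * mask_ratio))
    masked = set()
    while len(masked) < budget:
        length = min(sample_span_length(), budget - len(masked))
        start = int(rng.integers(0, len(tokens) - length + 1))
        masked.update(range(start, start + length))
    return [mask_token if i in masked else tok for i, tok in enumerate(tokens)]

print(span_mask("SpanBERT masks contiguous random spans rather than random tokens".split()))
```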