Publications
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
TLDR
We introduce the General Language Understanding Evaluation benchmark (GLUE), a tool for evaluating and analyzing the performance of models across a diverse range of existing NLU tasks.
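For practical illustration only (not part of the paper's own toolkit), the following minimal sketch loads one GLUE task with the Hugging Face datasets library; the choice of the MRPC task is an arbitrary assumption.

# Minimal sketch: load one GLUE task (MRPC) via the Hugging Face "datasets" library.
# The library and the task choice are illustrative assumptions, not the paper's toolkit.
from datasets import load_dataset

mrpc = load_dataset("glue", "mrpc")              # DatasetDict with train/validation/test splits
example = mrpc["train"][0]
print(example["sentence1"], example["sentence2"], example["label"])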
SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems
TLDR
We present SuperGLUE, a new benchmark styled after GLUE with a new set of more difficult language understanding tasks, a software toolkit, and a public leaderboard.
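SuperGLUE tasks can be pulled the same way; a minimal sketch, again assuming the Hugging Face datasets library, with the BoolQ task chosen arbitrarily.

# Minimal sketch: load one SuperGLUE task (BoolQ); the available fields differ per task.
from datasets import load_dataset

boolq = load_dataset("super_glue", "boolq")
item = boolq["validation"][0]
print(item["question"], item["passage"][:80], item["label"])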
What do you learn from context? Probing for sentence structure in contextualized word representations
TLDR
We probe word-level contextual representations from four recent models and investigate how they encode sentence structure across a range of syntactic, semantic, local, and long-range phenomena.
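For intuition only, here is a generic probing sketch: a linear classifier fit on frozen token representations to predict a syntactic label. The random arrays are placeholders for real encoder outputs and annotations, and the setup is much simpler than the paper's edge-probing architecture.

# Generic probing sketch (placeholder data, not the paper's edge-probing code):
# fit a linear probe on frozen contextual token vectors to predict a token-level label.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 768))    # stand-in for frozen 768-d contextual token vectors
y = rng.integers(0, 5, size=1000)   # stand-in for token-level labels (e.g. 5 POS tags)

probe = LogisticRegression(max_iter=1000).fit(X[:800], y[:800])
print("probe accuracy:", probe.score(X[800:], y[800:]))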
On Measuring Social Biases in Sentence Encoders
TLDR
Building on the Word Embedding Association Test, which showed that GloVe and word2vec embeddings exhibit human-like implicit biases based on gender, race, and other social constructs, we extend these association tests to the sentence level and use them to measure social biases in sentence encoders.
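The association-test machinery referenced here fits in a few lines; below is a sketch of the WEAT-style effect size (difference in mean cosine association, normalized by a pooled standard deviation), with random placeholder vectors standing in for encoder outputs.

# Sketch of a WEAT/SEAT-style effect size; the vectors here are random placeholders,
# whereas the paper applies the test to sentence-encoder representations.
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def association(w, A, B):
    # mean similarity to attribute set A minus mean similarity to attribute set B
    return np.mean([cosine(w, a) for a in A]) - np.mean([cosine(w, b) for b in B])

def effect_size(X, Y, A, B):
    s_x = [association(x, A, B) for x in X]
    s_y = [association(y, A, B) for y in Y]
    return (np.mean(s_x) - np.mean(s_y)) / np.std(s_x + s_y, ddof=1)

rng = np.random.default_rng(0)
X, Y, A, B = (rng.normal(size=(8, 300)) for _ in range(4))   # placeholder embeddings
print(effect_size(X, Y, A, B))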
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
For natural language understanding (NLU) technology to be maximally useful, both practically and as a scientific object of study, it must be general: it must be able to process language in a way that …
Asking and Answering Questions to Evaluate the Factual Consistency of Summaries
TLDR
We propose an automatic evaluation protocol called QAGS (pronounced "kags") that is designed to identify factual inconsistencies in a generated summary.
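The protocol reduces to a simple loop: generate questions about the summary, answer each question from both the source and the summary, and average the answer overlap. The sketch below shows only that scoring scaffold; generate_questions and answer are hypothetical placeholders for the learned question-generation and question-answering models.

# Sketch of the QA-based consistency loop behind QAGS. generate_questions() and answer()
# are hypothetical stand-ins for learned QG and QA models, not the authors' components.
def token_f1(pred, gold):
    p, g = pred.lower().split(), gold.lower().split()
    overlap = sum(min(p.count(t), g.count(t)) for t in set(p))
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

def consistency_score(source, summary, generate_questions, answer):
    """Average overlap between answers conditioned on the summary vs. the source."""
    questions = generate_questions(summary)
    scores = [token_f1(answer(q, summary), answer(q, source)) for q in questions]
    return sum(scores) / len(scores) if scores else 0.0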
BERT has a Mouth, and It Must Speak: BERT as a Markov Random Field Language Model
TLDR
We show that BERT (Devlin et al., 2018) is a Markov random field language model.
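One consequence of this view is that text can be generated from BERT by iteratively resampling positions with its masked-LM head. The sketch below is a simplified Gibbs-style sampler using the Hugging Face transformers library as an assumed dependency; it is not the authors' implementation.

# Simplified Gibbs-style sampling from a masked LM, in the spirit of treating BERT as an MRF.
# Hugging Face transformers is an assumed dependency; this is not the authors' code.
import torch
from transformers import BertForMaskedLM, BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

length, steps = 10, 50
ids = tok(" ".join([tok.mask_token] * length), return_tensors="pt")["input_ids"]

with torch.no_grad():
    for _ in range(steps):
        pos = torch.randint(1, ids.shape[1] - 1, (1,)).item()   # skip [CLS] and [SEP]
        ids[0, pos] = tok.mask_token_id
        probs = torch.softmax(model(ids).logits[0, pos], dim=-1)
        ids[0, pos] = torch.multinomial(probs, 1).item()

print(tok.decode(ids[0], skip_special_tokens=True))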
Can You Tell Me How to Get Past Sesame Street? Sentence-Level Pretraining Beyond Language Modeling
TLDR
We conduct the first large-scale systematic study of candidate pretraining tasks, comparing 19 different tasks both as alternatives and complements to language modeling.
Fast Detection of Maximum Common Subgraph via Deep Q-Learning
TLDR
We propose RLMCS, a graph neural network-based model for maximum common subgraph (MCS) detection through reinforcement learning.
Probing What Different NLP Tasks Teach Machines about Function Word Comprehension
TLDR
This paper investigates how different pretraining objectives for sentence encoders (e.g., CCG supertagging and natural language inference) affect their capacity to understand function words.