• Publications
Universal Dependencies 2.1
The annotation scheme is based on (universal) Stanford dependencies, Google universal part-of-speech tags, and the Interset interlingua for morphosyntactic tagsets.
BBQ: A hand-built bias benchmark for question answering
The Bias Benchmark for QA (BBQ) is introduced: a dataset of author-constructed question sets that highlight attested social biases against people belonging to protected classes, covering nine social dimensions relevant to U.S. English-speaking contexts.
Crowdsourcing Beyond Annotation: Case Studies in Benchmark Data Collection
This tutorial exposes NLP researchers to data collection crowdsourcing methods and principles that were carefully designed to achieve data with specific properties, for example to require logical inference, grounded reasoning or conversational understanding.
What Ingredients Make for an Effective Crowdsourcing Protocol for Difficult NLU Data Collection Tasks?
It is found that asking workers to write explanations for their examples is, on its own, an ineffective strategy for boosting NLU example difficulty. In contrast, training crowdworkers and then iterating a cycle of collecting data, sending feedback, and qualifying workers based on expert judgments is an effective way to collect challenging data.
Comparing Test Sets with Item Response Theory
Quoref, HellaSwag, and MC-TACO are best suited for distinguishing among state-of-the-art models, while SNLI, MNLI, and CommitmentBank seem to be saturated for current strong models.
QuALITY: Question Answering with Long Input Texts, Yes!
QuALITY, a multiple-choice QA dataset with English context passages averaging about 5,000 tokens — much longer than typical current models can process — is introduced to enable building and testing models for long-document comprehension.
NOPE: A Corpus of Naturally-Occurring Presuppositions in English
This work introduces the Naturally-Occurring Presuppositions in English (NOPE) Corpus to investigate the context-sensitivity of 10 different types of presupposition triggers and to evaluate machine learning models’ ability to predict human inferences.
The Dangers of Underclaiming: Reasons for Caution When Reporting How NLP Systems Fail
Researchers in NLP often frame and discuss research results in ways that serve to deemphasize the field’s successes, often in response to the field’s widespread hype. Though well-meaning, this has yielded many misleading or false claims about the limits of the best current technology.