Publications
CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge
TLDR
This work presents CommonsenseQA: a challenging new dataset for commonsense question answering, which extracts from ConceptNet multiple target concepts that have the same semantic relation to a single source concept.
ATOMIC: An Atlas of Machine Commonsense for If-Then Reasoning
TLDR
Experimental results demonstrate that multitask models that incorporate the hierarchical structure of if-then relation types lead to more accurate inference compared to models trained in isolation, as measured by both automatic and human evaluation.
Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics
TLDR
The results indicate that a shift in focus from quantity to quality of data could lead to more robust models and improved out-of-distribution generalization; the work also contributes a model-based tool to characterize and diagnose datasets.
Extracting Scientific Figures with Distantly Supervised Neural Networks
TLDR
This paper induces high-quality training labels for the task of figure extraction in a large number of scientific documents, with no human intervention, and uses this dataset to train a deep neural network for end-to-end figure detection, yielding a model that can be more easily extended to new domains compared to previous work.
UNICORN on RAINBOW: A Universal Commonsense Reasoning Model on a New Multitask Benchmark
TLDR
A new multitask benchmark, RAINBOW, is proposed to promote research on commonsense models that generalize well over multiple tasks and datasets, along with a novel evaluation, the cost equivalent curve, which sheds new insight on how the choice of source datasets, pretrained language models, and transfer learning methods impacts performance and data efficiency.
Learning from Task Descriptions
TLDR
This work introduces a framework for developing NLP systems that solve new tasks after reading their descriptions, synthesizing prior work in this area, and instantiates it with a new English language dataset, ZEST, structured for task-oriented evaluation on unseen tasks.
GENIE: A Leaderboard for Human-in-the-Loop Evaluation of Text Generation
TLDR
This work introduces GENIE, an extensible human evaluation leaderboard, which brings the ease of leaderboards to text generation tasks and provides formal granular evaluation metrics and identifies areas for future research.
Scruples: A Corpus of Community Ethical Judgments on 32,000 Real-Life Anecdotes
TLDR
This work introduces Scruples, the first large-scale dataset with 625,000 ethical judgments over 32,000 real-life anecdotes; it also presents a new method to estimate the best possible performance on tasks with inherently diverse label distributions and explores likelihood functions that separate intrinsic from model uncertainty.
Writing Code for NLP Research
TLDR
This tutorial aims to share best practices for writing code for NLP research, drawing on the instructors' experience designing the recently released AllenNLP toolkit, a PyTorch-based library for deep learning NLP research.
Findings of the 2021 Conference on Machine Translation (WMT21)
This paper presents the results of the news translation task, the multilingual low-resource translation task for Indo-European languages, the triangular translation task, and the automatic post-editing task.