Publications
BLiMP: A Benchmark of Linguistic Minimal Pairs for English
TLDR
The Benchmark of Linguistic Minimal Pairs, a challenge set for evaluating the linguistic knowledge of language models (LMs) on major grammatical phenomena in English, finds that state-of-the-art models reliably identify morphological contrasts related to agreement but struggle with some subtle semantic and syntactic phenomena.
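The minimal-pair setup lends itself to a simple evaluation loop: score both members of each pair with a language model and count the pair as correct when the acceptable sentence gets the higher probability. Below is a minimal sketch of that loop using GPT-2 via Hugging Face transformers; the model choice and the example pair are illustrative assumptions, not details taken from the paper.

```python
# Sketch of BLiMP-style evaluation: the model "gets the pair right" if it
# assigns higher total log-probability to the acceptable sentence than to
# the minimally different unacceptable one.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def log_prob(sentence: str) -> float:
    """Total log-probability of the sentence under the LM."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # labels=ids gives the mean cross-entropy over the predicted tokens,
        # so multiply by the number of predictions to get a total log-prob.
        loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.shape[1] - 1)

good = "The cats that the dog chases are hungry."   # illustrative pair
bad = "The cats that the dog chases is hungry."
print("correct" if log_prob(good) > log_prob(bad) else "incorrect")
```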
Intermediate-Task Transfer Learning with Pretrained Language Models: When and Why Does It Work?
TLDR
It is observed that intermediate tasks requiring high-level inference and reasoning abilities tend to work best and that target task performance is strongly correlated with higher-level abilities such as coreference resolution, but more granular correlations between probing and target task performance are not observed.
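The intermediate-task recipe itself is a two-stage pipeline: fine-tune a pretrained model on the intermediate task first, then continue fine-tuning on the target task with a freshly initialized classification head. The sketch below outlines the two stages with Hugging Face transformers; the model, the toy examples, and the single "training step" per stage are stand-ins for the full fine-tuning runs, not the paper's exact configuration.

```python
# Minimal sketch of intermediate-task (sequential fine-tuning) transfer,
# assuming a RoBERTa encoder and two toy classification tasks.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "roberta-base"
tokenizer = AutoTokenizer.from_pretrained(name)

def one_training_step(model, texts, labels):
    """A single supervised step; stands in for a full fine-tuning run."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
    batch = tokenizer(texts, padding=True, return_tensors="pt")
    loss = model(**batch, labels=torch.tensor(labels)).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Stage 1: fine-tune on the intermediate task (a toy 3-way NLI stand-in;
# real NLI would use premise-hypothesis pairs and a full dataset).
intermediate = AutoModelForSequenceClassification.from_pretrained(name, num_labels=3)
one_training_step(intermediate, ["A man is sleeping. A person is resting."], [0])
intermediate.save_pretrained("tmp_intermediate")

# Stage 2: reload the tuned encoder with a fresh head for a 2-way target task.
# ignore_mismatched_sizes reinitializes the size-mismatched classifier.
target = AutoModelForSequenceClassification.from_pretrained(
    "tmp_intermediate", num_labels=2, ignore_mismatched_sizes=True
)
one_training_step(target, ["This movie was great."], [1])
```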
Learning Which Features Matter: RoBERTa Acquires a Preference for Linguistic Generalizations (Eventually)
TLDR
A new English-language diagnostic set called MSGS (the Mixed Signals Generalization Set), consisting of 20 ambiguous binary classification tasks that test whether a pretrained model prefers linguistic or surface generalizations during fine-tuning, is introduced; the study finds that models can learn to represent linguistic features with little pretraining data, but require far more data to learn to prefer linguistic generalizations over surface ones.
Investigating BERT’s Knowledge of Language: Five Analysis Methods with NPIs
TLDR
It is concluded that a variety of methods is necessary to reveal all relevant aspects of a model’s grammatical knowledge in a given domain.
jiant: A Software Toolkit for Research on General-Purpose Text Understanding Models
TLDR
jiant, an open-source toolkit for conducting multitask and transfer learning experiments on English NLU tasks, is introduced, and it is demonstrated that jiant reproduces published performance on a variety of tasks and models.
English Intermediate-Task Training Improves Zero-Shot Cross-Lingual Transfer Too
TLDR
This work evaluates intermediate-task transfer in a zero-shot cross-lingual setting on the XTREME benchmark, and finds that MNLI, SQuAD, and HellaSwag achieve the best overall results as intermediate tasks, while multi-task intermediate training offers small additional improvements.
Counterfactually-Augmented SNLI Training Data Does Not Yield Better Generalization Than Unaugmented Data
TLDR
It is found that models trained on a counterfactually-augmented SNLI dataset do not generalize better than models trained on unaugmented datasets of similar size, and that counterfactual augmentation can hurt performance, yielding models that are less robust to challenge examples.
Precise Task Formalization Matters in Winograd Schema Evaluations
TLDR
This work performs an ablation on two Winograd Schema datasets that interpolates between the formalizations used before and after the recent surge in reported performance, and finds that framing the task as multiple choice improves performance by 2-6 points and that several additional techniques can mitigate the model's extreme sensitivity to hyperparameters.
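Concretely, the multiple-choice framing substitutes each candidate referent for the pronoun and lets the model score the resulting sentences jointly. The sketch below shows how one Winograd-style example can be encoded for a multiple-choice head in Hugging Face transformers; the untuned roberta-base checkpoint and the example sentence are illustrative assumptions (in practice the head would be fine-tuned on task data first), not the paper's exact setup.

```python
# Minimal sketch of the multiple-choice framing for a Winograd-style example.
import torch
from transformers import AutoModelForMultipleChoice, AutoTokenizer

name = "roberta-base"
tokenizer = AutoTokenizer.from_pretrained(name)
# The multiple-choice head here is freshly initialized, so predictions are
# arbitrary until fine-tuned; the point is the input formulation.
model = AutoModelForMultipleChoice.from_pretrained(name).eval()

sentence = "The trophy didn't fit in the suitcase because it was too big."
candidates = ["the trophy", "the suitcase"]
choices = [sentence.replace(" it ", f" {c} ", 1) for c in candidates]

enc = tokenizer(choices, padding=True, return_tensors="pt")
# Multiple-choice models expect shape (batch, num_choices, seq_len).
inputs = {k: v.unsqueeze(0) for k, v in enc.items()}
with torch.no_grad():
    logits = model(**inputs).logits  # shape (1, num_choices)
print(candidates[logits.argmax(-1).item()])
```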
Retrieving Relevant and Diverse Image from Social Media Images
TLDR
The basic idea is to remove irrelevant images and then obtain a diverse set of images using a greedy strategy; experimental results show the method can retrieve diverse images with moderate relevance to the topic.
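One way to make such a greedy strategy concrete is a relevance-versus-redundancy trade-off: repeatedly pick the candidate that is most relevant while least similar to what has already been selected. Below is a minimal MMR-style sketch over precomputed scores; the function name, the 0.5 trade-off, and the toy data are illustrative assumptions, not the paper's actual formulation.

```python
# Greedy relevance-plus-diversity selection over precomputed scores.
def greedy_diverse_select(relevance, similarity, k, trade_off=0.5):
    """Pick k items, each time taking the one that is most relevant and
    least similar to anything already selected."""
    selected = []
    candidates = set(range(len(relevance)))
    while candidates and len(selected) < k:
        def score(i):
            # Redundancy = similarity to the closest already-selected item.
            redundancy = max((similarity[i][j] for j in selected), default=0.0)
            return trade_off * relevance[i] - (1 - trade_off) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy usage: 4 images, where items 0 and 1 are near-duplicates.
relevance = [0.9, 0.85, 0.6, 0.4]
similarity = [[1.0, 0.95, 0.1, 0.0],
              [0.95, 1.0, 0.1, 0.0],
              [0.1, 0.1, 1.0, 0.2],
              [0.0, 0.0, 0.2, 1.0]]
print(greedy_diverse_select(relevance, similarity, k=3))  # [0, 2, 3]
```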