• Publications
Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference
The HANS dataset, which contains many examples where these heuristics fail, shows that there is substantial room for improvement in NLI systems and can motivate and measure progress in this area.
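One heuristic the paper diagnoses is lexical overlap: assuming a premise entails any hypothesis built from its words. A minimal sketch of that heuristic, with hypothetical example sentences in the style of HANS items (not drawn from the dataset itself), is:

```python
def lexical_overlap_heuristic(premise: str, hypothesis: str) -> bool:
    """Predict 'entailment' whenever every word of the hypothesis
    also appears in the premise -- a shallow syntactic heuristic."""
    premise_words = set(premise.lower().split())
    return all(w in premise_words for w in hypothesis.lower().split())

# A case where the prediction happens to be right:
print(lexical_overlap_heuristic("the actor danced", "the actor danced"))    # True

# A HANS-style case where full overlap does NOT imply entailment:
print(lexical_overlap_heuristic("the doctor was paid by the actor",
                                "the doctor paid the actor"))               # True, yet the gold label is non-entailment
```

A model relying on this shortcut scores well on standard test sets but fails systematically on such overlap-without-entailment examples.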
Assessing the Ability of LSTMs to Learn Syntax-Sensitive Dependencies
It is concluded that LSTMs can capture a non-trivial amount of grammatical structure given targeted supervision, but that stronger architectures may be required to further reduce errors; furthermore, the language-modeling signal alone is insufficient for capturing syntax-sensitive dependencies and should be supplemented with more direct supervision when such dependencies must be captured.
Colorless green recurrent networks dream hierarchically
The results support the hypothesis that RNNs are not just shallow-pattern extractors but also acquire deeper grammatical competence: they make reliable predictions about long-distance agreement and do not lag far behind human performance.
BERTs of a feather do not generalize together: Large variability in generalization across models with similar test set performance
This work fine-tuned 100 instances of BERT on the Multi-genre Natural Language Inference dataset and evaluated them on the HANS dataset, which evaluates syntactic generalization in natural language inference.
Targeted Syntactic Evaluation of Language Models
In an experiment using this data set, an LSTM language model performed poorly on many of the constructions, and a large gap remained between its performance and the accuracy of human participants recruited online.
Revisiting the poverty of the stimulus: hierarchical generalization without a hierarchical bias in recurrent neural networks
Using recurrent neural networks (RNNs) to simulate the acquisition of question formation, a hierarchical transformation, in an artificial language modeled after English, it is found that some RNN architectures consistently learn the correct hierarchical rule.
COGS: A Compositional Generalization Challenge Based on Semantic Interpretation
In experiments with Transformers and LSTMs, it is found that in-distribution accuracy on the COGS test set was near-perfect, but generalization accuracy was substantially lower and highly sensitive to random seed.
Lexical Preactivation in Basic Linguistic Phrases
Many previous studies have shown that predictable words are read faster and lead to reduced neural activation, consistent with a model of reading in which words are activated in advance of being encountered.
Issues in evaluating semantic spaces using word analogies
  • Tal Linzen
  • Computer Science
  • 24 June 2016
It is shown that the offset method's reliance on cosine similarity conflates offset consistency with largely irrelevant neighborhood structure, and simple baselines are proposed to improve the method's utility in vector-space evaluation.
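The offset method the summary critiques solves analogies of the form a:b :: c:? by searching for the word closest in cosine similarity to b - a + c. A minimal sketch, using toy vectors invented purely for illustration (not real embeddings):

```python
import numpy as np

# Toy word vectors (hypothetical values, for illustration only).
vecs = {
    "king":  np.array([0.8, 0.7, 0.1]),
    "queen": np.array([0.8, 0.1, 0.7]),
    "man":   np.array([0.9, 0.8, 0.0]),
    "woman": np.array([0.9, 0.2, 0.6]),
    "apple": np.array([0.1, 0.9, 0.9]),
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def offset_analogy(a: str, b: str, c: str, vocab: dict) -> str:
    """Solve a:b :: c:? via the vector-offset method: return the word
    whose vector is most cosine-similar to b - a + c, excluding the
    three query words themselves (as standard evaluations do)."""
    target = vocab[b] - vocab[a] + vocab[c]
    candidates = {w: v for w, v in vocab.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

print(offset_analogy("man", "king", "woman", vecs))  # → queen (in this toy space)
```

The paper's point is that excluding the query words and using cosine similarity lets the local neighborhood structure of the space, rather than offset consistency alone, drive such results.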
Human few-shot learning of compositional instructions
This work studies the compositional skills of people through language-like instruction learning tasks, showing that people can learn and use novel functional concepts from very few examples, and compose concepts in complex ways that go beyond the provided demonstrations.