Refining Targeted Syntactic Evaluation of Language Models

Benjamin A. Newman, Kai-Siang Ang, Julia Gong, John Hewitt
Targeted syntactic evaluation of subject-verb number agreement in English (TSE) evaluates language models’ syntactic knowledge using hand-crafted minimal pairs of sentences that differ only in the main verb’s conjugation. The method evaluates whether language models rate each grammatical sentence as more likely than its ungrammatical counterpart. We identify two distinct goals for TSE. First, evaluating the systematicity of a language model’s syntactic knowledge: given a sentence, can it…
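The pairwise comparison at the heart of TSE can be sketched as follows. The sentences and scores here are illustrative stand-ins, not the paper's models or evaluation data; a real run would obtain each sentence's log-probability from a language model:

```python
def tse_accuracy(pairs, log_prob):
    """Fraction of minimal pairs where the grammatical sentence
    is scored as more likely than its ungrammatical counterpart."""
    correct = sum(log_prob(good) > log_prob(bad) for good, bad in pairs)
    return correct / len(pairs)

# Illustrative stand-in scores; a real evaluation would sum the
# language model's per-token log-probabilities for each sentence.
toy_scores = {
    "The keys are on the table.": -12.3,
    "The keys is on the table.": -15.1,
    "The author writes well.": -10.4,
    "The author write well.": -13.8,
}

pairs = [
    ("The keys are on the table.", "The keys is on the table."),
    ("The author writes well.", "The author write well."),
]
print(tse_accuracy(pairs, toy_scores.__getitem__))  # 1.0
```

Because only the sign of the score difference within each pair matters, the comparison is insensitive to how sentence likelihoods are normalized across pairs.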


Frequency Effects on Syntactic Rule Learning in Transformers
Pre-trained language models perform well on a variety of linguistic tasks that require symbolic reasoning, raising the question of whether such models implicitly represent abstract symbols and rules.
On the Limits of Minimal Pairs in Contrastive Evaluation
Minimal sentence pairs are frequently used to analyze the behavior of language models. It is often assumed that model behavior on contrastive pairs is predictive of model behavior at large. We argue…


Targeted Syntactic Evaluation of Language Models
In an experiment using this data set, an LSTM language model performed poorly on many of the constructions, and a large gap remained between its performance and the accuracy of human participants recruited online.
Word Frequency Does Not Predict Grammatical Knowledge in Language Models
Focusing on subject-verb agreement and reflexive anaphora, it is found that certain nouns are systematically understood better than others, an effect which is robust across grammatical tasks and different language models.
A Systematic Assessment of Syntactic Generalization in Neural Language Models
A systematic evaluation of the syntactic knowledge of neural language models, testing 20 combinations of model types and data sizes on a set of 34 English-language syntactic test suites, finds substantial differences in syntactic generalization performance by model architecture.
Single-Stage Prediction Models Do Not Explain the Magnitude of Syntactic Disambiguation Difficulty
Single-stage prediction models predict a linear effect of surprisal: the garden-path effect should be proportional to the difference in word surprisal between the ultimately correct and ultimately incorrect interpretations. It is concluded that a full explanation of syntactic disambiguation difficulty may require recovery mechanisms beyond predictability.
Assessing the Ability of LSTMs to Learn Syntax-Sensitive Dependencies
It is concluded that LSTMs can capture a non-trivial amount of grammatical structure given targeted supervision, but stronger architectures may be required to further reduce errors; furthermore, the language modeling signal is insufficient for capturing syntax-sensitive dependencies, and should be supplemented with more direct supervision if such dependencies need to be captured.
Neural language models as psycholinguistic subjects: Representations of syntactic state
Experimental methodologies originally developed in psycholinguistics to study syntactic representation in the human mind are employed to examine neural network model behavior on sets of artificial sentences containing a variety of complex syntactic structures.
Assessing Composition in Sentence Vector Representations
This work introduces a specialized sentence generation system that produces large, annotated sentence sets meeting specified syntactic, semantic and lexical constraints, and finds that the method is able to extract useful information about the differing capacities of these models.
LSTMs Can Learn Syntax-Sensitive Dependencies Well, But Modeling Structure Makes Them Better
It is found that the mere presence of syntactic information does not improve accuracy, but when model architecture is determined by syntax, number agreement is improved: top-down construction outperforms left-corner and bottom-up variants in capturing non-local structural dependencies.
Distinct patterns of syntactic agreement errors in recurrent networks and humans
It is concluded that at least in some respects the syntactic representations acquired by RNNs are fundamentally different from those used by humans.
On the Predictive Power of Neural Language Models for Human Real-Time Comprehension Behavior
Testing over two dozen models on how well their next-word expectations predict human reading time behavior on naturalistic text corpora finds that across model architectures and training dataset sizes the relationship between word log-probability and reading time is (near-)linear.