BERT & Family Eat Word Salad: Experiments with Text Understanding

Ashim Gupta, Giorgi Kvernadze, Vivek Srikumar
In this paper, we study the response of large models from the BERT family to incoherent inputs that should confuse any model that claims to understand natural language. We define simple heuristics to construct such examples. Our experiments show that state-of-the-art models consistently fail to recognize them as ill-formed, and instead produce high confidence predictions on them. As a consequence of this phenomenon, models trained on sentences with randomly permuted word order perform close to… 
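The paper's exact heuristics are not reproduced here; a minimal sketch of one such "word salad" perturbation, randomly permuting a sentence's word order (the function name and fixed seed are illustrative, not the paper's construction), could look like:

```python
import random

def word_salad(sentence: str, seed: int = 0) -> str:
    """Return the sentence with its word order randomly permuted.

    A model that truly tracked syntax should treat the output as
    ill-formed; the cited experiments suggest BERT-family models
    often do not.
    """
    words = sentence.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)

print(word_salad("the quick brown fox jumps over the lazy dog"))
```

The permutation preserves the bag of words, so any purely lexical (order-insensitive) features of the input are unchanged.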

Penalizing Confident Predictions on Largely Perturbed Inputs Does Not Improve Out-of-Distribution Generalization in Question Answering

To prevent models from making confident predictions on heavily perturbed inputs, prior work maximizes prediction entropy on such inputs; however, models trained to be sensitive to a certain perturbation type are often insensitive to unseen types of perturbations, and researchers should pay attention to this side effect of entropy maximization.

Local Structure Matters Most: Perturbation Study in NLU

It is empirically shown that neural models, regardless of their inductive biases, pretraining scheme, or choice of tokenization, mostly rely on the local structure of text to build understanding and make limited use of the global structure.

Demystifying Neural Language Models' Insensitivity to Word-Order

The insensitivity of neural language models to word order is investigated by quantifying perturbations and analysing their effect on model performance on language understanding tasks in the GLUE benchmark; it is found that neural language models (pretrained and non-pretrained Transformers, LSTMs, and convolutional architectures) require the local ordering of tokens more than the global ordering.
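The local/global contrast can be sketched as two perturbation functions: one scrambles tokens inside small windows (destroying local order while keeping the global layout), the other shuffles the order of intact chunks (destroying global order while keeping local n-grams). The window size, seed, and function names are illustrative, not the paper's exact protocol:

```python
import random

def perturb_local(words, window=3, seed=0):
    """Scramble tokens within each fixed window: local order is
    destroyed, but each word stays near its original position."""
    rng = random.Random(seed)
    out = []
    for i in range(0, len(words), window):
        chunk = list(words[i:i + window])
        rng.shuffle(chunk)
        out.extend(chunk)
    return out

def perturb_global(words, window=3, seed=0):
    """Shuffle the order of intact chunks: global order is destroyed,
    but local n-grams inside each chunk survive."""
    chunks = [list(words[i:i + window]) for i in range(0, len(words), window)]
    random.Random(seed).shuffle(chunks)
    return [w for c in chunks for w in c]
```

The cited finding is that perturbations of the first kind hurt model performance far more than perturbations of the second kind.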

Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little

This paper pre-trains MLMs on sentences with randomly shuffled word order and shows that these models still achieve high accuracy after fine-tuning on many downstream tasks, including tasks specifically designed to be challenging for models that ignore word order.

Rissanen Data Analysis: Examining Dataset Characteristics via Description Length

The method is called Rissanen Data Analysis (RDA) after the father of MDL, and its applicability on a wide variety of settings in NLP is showcased, ranging from evaluating the utility of generating subquestions before answering a question, to analyzing the value of rationales and explanations, to investigating the importance of different parts of speech, and uncovering dataset gender bias.

Local Structure Matters Most in Most Languages

This work replicates a study on the importance of local structure, and the relative unimportance of global structure, in a multilingual setting and finds that the phenomenon observed in English broadly extends to over 120 languages, with a few caveats.

Detecting Languages Unintelligible to Multilingual Models through Local Structure Probes

A general approach is proposed that requires only unlabelled text to detect which languages are not well understood by a cross-lingual model, derived from the hypothesis that if a model's understanding is insensitive to perturbations of text in a language, it is likely to have a limited understanding of that language.

Why is Winoground Hard? Investigating Failures in Visuolinguistic Compositionality

This paper identifies the dataset’s main challenges through a suite of experiments on related tasks (probing task, image retrieval task), data augmentation, and manual inspection, and suggests that a main challenge in visuolinguistic models may lie in fusing visual and textual representations, rather than in compositional language understanding.

BECEL: Benchmark for Consistency Evaluation of Language Models

This paper proposes the idea of LM consistency based on behavioural consistency and establishes a taxonomy that classifies previously studied consistencies into several sub-categories, and creates a new benchmark that allows for a more precise evaluation.

Do Language Models Make Human-like Predictions about the Coreferents of Italian Anaphoric Zero Pronouns?

Some languages allow arguments to be omitted in certain contexts. Yet human language comprehenders reliably infer the intended referents of these zero pronouns, in part because they construct …

Syntactic Data Augmentation Increases Robustness to Inference Heuristics

The best-performing augmentation method, subject/object inversion, improved BERT’s accuracy on controlled examples that diagnose sensitivity to word order from 0.28 to 0.73, suggesting that augmentation causes BERT to recruit abstract syntactic representations.

Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data

It is argued that a system trained only on form has a priori no way to learn meaning, and a clear understanding of the distinction between form and meaning will help guide the field towards better science around natural language understanding.

Annotation Artifacts in Natural Language Inference Data

It is shown that a simple text categorization model can correctly classify the hypothesis alone in about 67% of SNLI and 53% of MultiNLI, and that specific linguistic phenomena such as negation and vagueness are highly correlated with particular inference classes.
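The kind of artifact reported can be illustrated with a toy sketch (the data below is invented for illustration, not drawn from SNLI): if negation words in hypotheses correlate with the contradiction label, a rule that never looks at the premise can still classify well.

```python
# Invented toy data: (hypothesis, gold label) pairs in which negation
# cues happen to co-occur with the "contradiction" label.
toy_data = [
    ("a man is sleeping", "entailment"),
    ("a man is not awake", "contradiction"),
    ("nobody is outside", "contradiction"),
    ("a dog runs in the park", "entailment"),
    ("the child is not smiling", "contradiction"),
    ("two people are talking", "entailment"),
]

NEGATION_CUES = {"not", "no", "nobody", "never", "nothing"}

def hypothesis_only_predict(hypothesis: str) -> str:
    """Predict from the hypothesis alone, ignoring the premise entirely."""
    tokens = set(hypothesis.split())
    return "contradiction" if tokens & NEGATION_CUES else "entailment"

correct = sum(hypothesis_only_predict(h) == y for h, y in toy_data)
print(f"{correct}/{len(toy_data)} correct without ever seeing a premise")
```

On this contrived sample the cue-based rule is perfect, which is exactly the sort of shortcut the paper warns that trained models can exploit.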

Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference

There is substantial room for improvement in NLI systems, and the HANS dataset, which contains many examples where the heuristics fail, can motivate and measure progress in this area.

Thieves on Sesame Street! Model Extraction of BERT-based APIs

This work highlights an exploit only made feasible by the shift towards transfer learning methods within the NLP community: for a query budget of a few hundred dollars, an attacker can extract a model that performs only slightly worse than the victim model.

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

This work presents two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT, and uses a self-supervised loss that focuses on modeling inter-sentence coherence.

A large annotated corpus for learning natural language inference

The Stanford Natural Language Inference corpus is introduced, a new, freely available collection of labeled sentence pairs, written by humans doing a novel grounded task based on image captioning, which allows a neural network-based model to perform competitively on natural language inference benchmarks for the first time.

Hypothesis Only Baselines in Natural Language Inference

This approach, which is referred to as a hypothesis-only model, is able to significantly outperform a majority-class baseline across a number of NLI datasets and suggests that statistical irregularities may allow a model to perform NLI in some datasets beyond what should be achievable without access to the context.

Deep Contextualized Word Representations

A new type of deep contextualized word representation is introduced that models both complex characteristics of word use and how these uses vary across linguistic contexts, allowing downstream models to mix different types of semi-supervision signals.

Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping

This work investigates how the performance of the best-found model varies as a function of the number of fine-tuning trials, and examines two factors influenced by the choice of random seed: weight initialization and training data order.