How effective is BERT without word ordering? Implications for language understanding and data privacy

@inproceedings{Hessel2021HowEI,
  title={How effective is BERT without word ordering? Implications for language understanding and data privacy},
  author={Jack Hessel and Alexandra Schofield},
  booktitle={ACL},
  year={2021}
}
Ordered word sequences contain the rich structures that define language. However, it’s often not clear if or how modern pretrained language models utilize these structures. We show that the token representations and self-attention activations within BERT are surprisingly resilient to shuffling the order of input tokens, and that for several GLUE language understanding tasks, shuffling only minimally degrades performance, e.g., by 4% for QNLI. While bleak from the perspective of language… 
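
The experiment the abstract describes can be pictured with a minimal sketch (not the authors' code): run BERT on a sentence and on a randomly permuted copy of the same tokens, then compare the resulting representations. The `bert-base-uncased` checkpoint, the example sentence, and the use of the [CLS] vector are illustrative choices; the paper itself examines token representations and self-attention activations across GLUE tasks.

```python
# Minimal sketch (not the authors' code): compare BERT's [CLS] representation
# for an input sentence against the same sentence with shuffled word order.
# Assumes the HuggingFace `transformers` library and PyTorch are installed.
import random
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def cls_embedding(words):
    """Encode a list of words and return the [CLS] token's hidden state."""
    inputs = tokenizer(" ".join(words), return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[0, 0]

sentence = "the chef chopped the onion and stirred the soup".split()
shuffled = sentence[:]
random.shuffle(shuffled)

cos = torch.nn.functional.cosine_similarity(
    cls_embedding(sentence), cls_embedding(shuffled), dim=0
)
print(f"cosine similarity, ordered vs. shuffled input: {cos.item():.3f}")
```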

Citations

When classifying arguments, BERT doesn’t care about word order...except when it matters

While contextual embedding models are often praised for capturing rich grammatical structure, a spate of recent work has shown that they are surprisingly invariant to scrambling word order (Sinha et al., …)

When classifying grammatical role, BERT doesn’t care about word order... except when it matters

Because meaning can often be inferred from lexical semantics alone, word order is often a redundant cue in natural language. For example, the words chopped, chef, and onion are more likely used to …

Structural Persistence in Language Models: Priming as a Window into Abstract Language Representations

TLDR
This study finds that Transformer models indeed show evidence of structural priming, but also that the generalizations they learn are to some extent modulated by semantic information; the representations acquired by the models may not only encode abstract sequential structure but also involve a certain level of hierarchical syntactic information.

A Customised Text Privatisation Mechanism with Differential Privacy

TLDR
A Customized differentially private Text privatization mechanism (CusText) is proposed that assigns each input token a customized output set to provide more advanced adaptive privacy protection at the token level and overcomes the limitation on similarity metrics caused by the dχ-privacy notion.
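
The "customized output set" idea can be illustrated with a generic token-level exponential-mechanism sketch (an assumption for illustration, not the CusText algorithm): each input token gets its own candidate set, and a replacement is sampled with probability that grows with a similarity score and with the per-token privacy budget. The function name, candidate sets, and similarity values below are hypothetical.

```python
# Hypothetical sketch of a token-level exponential mechanism over a per-token
# output set; not the CusText implementation.
import numpy as np

def privatize_token(token, candidates, similarity, epsilon):
    """Sample a replacement for `token` from its customized candidate set.

    candidates: candidate output tokens assigned to this input token
    similarity: candidate -> similarity score with `token` (precomputed)
    epsilon:    per-token privacy budget (larger = closer to the original)
    """
    scores = np.array([similarity[c] for c in candidates], dtype=float)
    # Exponential mechanism: Pr[c] proportional to exp(eps * score / 2),
    # assuming the similarity score has sensitivity 1.
    logits = epsilon * scores / 2.0
    logits -= logits.max()          # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return np.random.choice(candidates, p=probs)

# Example with made-up values.
print(privatize_token(
    "doctor",
    candidates=["doctor", "physician", "nurse"],
    similarity={"doctor": 1.0, "physician": 0.9, "nurse": 0.6},
    epsilon=2.0,
))
```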

Revisiting Generative Commonsense Reasoning: A Pre-Ordering Approach

TLDR
It is argued that a pretrained model's (PTM's) inherent ability for generative commonsense reasoning is underestimated due to the order-agnostic property of its input, and a pre-ordering approach is proposed to carefully manipulate the order of the given concepts before generation.

Experimentally measuring the redundancy of grammatical cues in transitive clauses

Grammatical cues are sometimes redundant with word meanings in natural language. For instance, English word order rules constrain the word order of a sentence like “The dog chewed the bone” even …

Compositional Evaluation on Japanese Textual Entailment and Similarity

TLDR
JSICK, a Japanese NLI/STS dataset manually translated from the English SICK dataset, is introduced, along with a stress-test dataset for compositional inference created by transforming the syntactic structures of JSICK sentences to investigate whether language models are sensitive to word order and case particles.

References

Showing 1-10 of 64 references

Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little

TLDR
This paper pre-trains MLMs on sentences with randomly shuffled word order and shows that these models still achieve high accuracy after fine-tuning on many downstream tasks, including tasks specifically designed to be challenging for models that ignore word order.

Out of Order: How important is the sequential order of words in a sentence in Natural Language Understanding tasks?

TLDR
This work suggests that many GLUE tasks do not challenge machines to understand the meaning of a sentence, and that encouraging models to capture word-order information improves performance on most GLUE tasks and SQuAD 2.0.

BERT & Family Eat Word Salad: Experiments with Text Understanding

TLDR
It is shown that if models are explicitly trained to recognize invalid (word-salad) inputs, they can be made robust to such inputs without a drop in performance.

What BERT Is Not: Lessons from a New Suite of Psycholinguistic Diagnostics for Language Models

TLDR
A suite of diagnostics drawn from human language experiments is introduced, allowing targeted questions about the information language models use when generating predictions in context; the diagnostics are applied to the popular BERT model.

What Does BERT Learn about the Structure of Language?

TLDR
This work provides novel support for the possibility that BERT networks capture structural information about language by performing a series of experiments to unpack the elements of English language structure learned by BERT.

What do you learn from context? Probing for sentence structure in contextualized word representations

TLDR
A novel edge probing task design is introduced, and a broad suite of sub-sentence tasks derived from the traditional structured NLP pipeline is constructed to investigate how sentence structure is encoded across a range of syntactic, semantic, local, and long-range phenomena.

UnNatural Language Inference

TLDR
It is found that state-of-the-art Natural Language Inference (NLI) models assign the same labels to permuted examples as they do to the original, i.e. they are invariant to random word-order permutations.

Linguistic Knowledge and Transferability of Contextual Representations

TLDR
It is found that linear models trained on top of frozen contextual representations are competitive with state-of-the-art task-specific models in many cases, but fail on tasks requiring fine-grained linguistic knowledge.

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

TLDR
A benchmark of nine diverse NLU tasks, an auxiliary dataset for probing models' understanding of specific linguistic phenomena, and an online platform for evaluating and comparing models are presented; the benchmark favors models that can represent linguistic knowledge in a way that facilitates sample-efficient learning and effective knowledge transfer across tasks.

Analyzing Compositionality-Sensitivity of NLI Models

TLDR
This work proposes a compositionality-sensitivity testing setup that analyzes models on natural examples from existing datasets that cannot be solved via lexical features alone, hence revealing the models' actual compositionality awareness.
...