Corpus ID: 245877621

How Does Data Corruption Affect Natural Language Understanding Models? A Study on GLUE datasets

Aarne Talman, Marianna Apidianaki, Stergios Chatzikyriakidis, Jörg Tiedemann
A central question in natural language understanding (NLU) research is whether high performance demonstrates the models' strong reasoning capabilities. We present an extensive series of controlled experiments where pre-trained language models are exposed to data that have undergone specific corruption transformations. These involve removing instances of specific word classes and often lead to nonsensical sentences. Our results show that performance remains high on most GLUE tasks when the…
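The corruption transformation described above (removing all instances of a given word class) can be sketched as follows. This is an illustrative sketch, not the paper's actual pipeline: the tiny hardcoded POS lexicon stands in for a real tagger (e.g. NLTK or spaCy), and the function name is hypothetical.

```python
# Illustrative sketch of a word-class-removal corruption, assuming a toy
# POS lexicon in place of a real part-of-speech tagger.
TOY_POS = {
    "the": "DET", "a": "DET",
    "cat": "NOUN", "mat": "NOUN", "dog": "NOUN",
    "sat": "VERB", "chased": "VERB",
    "on": "ADP",
    "lazy": "ADJ", "small": "ADJ",
}

def remove_word_class(sentence: str, word_class: str) -> str:
    """Drop every token whose (toy) POS tag matches word_class."""
    kept = [tok for tok in sentence.split()
            if TOY_POS.get(tok.lower()) != word_class]
    return " ".join(kept)

print(remove_word_class("the cat sat on the mat", "NOUN"))
# -> "the sat on the"
```

Removing nouns here yields "the sat on the", a nonsensical sentence of the kind the paper feeds to fine-tuned models to test whether task performance actually depends on sentence meaning.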



NLI Data Sanity Check: Assessing the Effect of Data Corruption on Model Performance
A new diagnostic test suite is proposed which makes it possible to assess whether a dataset constitutes a good testbed for evaluating models' meaning understanding capabilities, and applies controlled corruption transformations to widely used benchmarks.
Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little
This paper pre-trains MLMs on sentences with randomly shuffled word order and shows that these models still achieve high accuracy after fine-tuning on many downstream tasks, including tasks specifically designed to be challenging for models that ignore word order.
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
A benchmark of nine diverse NLU tasks, an auxiliary dataset for probing models for understanding of specific linguistic phenomena, and an online platform for evaluating and comparing models, which favors models that can represent linguistic knowledge in a way that facilitates sample-efficient learning and effective knowledge-transfer across tasks.
Annotation Artifacts in Natural Language Inference Data
It is shown that a simple text categorization model can correctly classify the hypothesis alone in about 67% of SNLI and 53% of MultiNLI, and that specific linguistic phenomena such as negation and vagueness are highly correlated with certain inference classes.
Hypothesis Only Baselines in Natural Language Inference
This approach, which is referred to as a hypothesis-only model, is able to significantly outperform a majority-class baseline across a number of NLI datasets and suggests that statistical irregularities may allow a model to perform NLI in some datasets beyond what should be achievable without access to the context.
Testing the Generalization Power of Neural Network Models across NLI Benchmarks
It is argued that most of the current neural network models are not able to generalize well in the task of natural language inference, and it is found that using large pre-trained language models helps with transfer learning when the datasets are similar enough.
Assessing the Benchmarking Capacity of Machine Reading Comprehension Datasets
Results suggest that most of the questions already answered correctly by the model do not necessarily require grammatical and complex reasoning, and therefore, MRC datasets will need to take extra care in their design to ensure that questions can correctly evaluate the intended skills.
Shaking Syntactic Trees on the Sesame Street: Multilingual Probing with Controllable Perturbations
It is found that syntactic sensitivity depends on the language and the model's pre-training objectives, and that sensitivity grows across layers as the perturbation granularity increases; the models barely use positional information to induce syntactic trees from their intermediate self-attention and contextualized representations.
What BERT Is Not: Lessons from a New Suite of Psycholinguistic Diagnostics for Language Models
A suite of diagnostics drawn from human language experiments are introduced, which allow us to ask targeted questions about information used by language models for generating predictions in context, and the popular BERT model is applied.
Out of Order: How important is the sequential order of words in a sentence in Natural Language Understanding tasks?
This work suggests that many GLUE tasks do not challenge machines to understand the meaning of a sentence, and that encouraging models to capture word order information improves performance on most GLUE tasks and SQuAD 2.0.