Memorization vs. Generalization : Quantifying Data Leakage in NLP Performance Evaluation

Aparna Elangovan, Jiayuan He, Karin M. Verspoor
Public datasets are often used to evaluate the efficacy and generalizability of state-of-the-art methods for many tasks in natural language processing (NLP). However, overlap between the train and test datasets can lead to inflated results, inadvertently evaluating the model’s ability to memorize and interpreting it as the ability to generalize. In addition, such datasets may not provide an effective indicator of the performance of these methods in real-world scenarios. We…
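As a rough illustration of the train-test overlap described above, the sketch below scores each test instance by its highest bag-of-words cosine similarity to any training instance and reports the fraction of test instances above a threshold. The function names, the whitespace tokenization, and the 0.9 threshold are assumptions made for this example; it is not presented as the authors' method.

```python
import math
from collections import Counter


def bow(text):
    """Lowercased, whitespace-tokenized bag of words (an assumed, simplistic tokenizer)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def max_train_similarity(train_texts, test_texts):
    """For each test text, the highest similarity to any training text."""
    train_vecs = [bow(t) for t in train_texts]
    return [
        max((cosine(bow(x), tv) for tv in train_vecs), default=0.0)
        for x in test_texts
    ]


def leaked_fraction(train_texts, test_texts, threshold=0.9):
    """Fraction of test texts whose best training match meets the (assumed) threshold."""
    sims = max_train_similarity(train_texts, test_texts)
    if not sims:
        return 0.0
    return sum(s >= threshold for s in sims) / len(sims)
```

Reporting test accuracy separately for the high-similarity and low-similarity subsets would then give a crude view of how much a score depends on near-duplicates of training data.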

An Empirical Study of Memorization in NLP

It is demonstrated that top-ranked memorized training instances are likely atypical, and removing the top-memorization training instances leads to a more serious drop in test accuracy compared with removing training instances randomly, and an attribution method is developed to better understand why a training instance is memorized.

Does it Really Generalize Well on Unseen Data? Systematic Evaluation of Relational Triple Extraction Methods

It is shown that although existing extraction models are able to easily memorize and recall already seen triples, they cannot generalize effectively for unseen triples, and that a simple yet effective augmentation technique can significantly increase the generalization performance of existing models.

Impact of Pretraining Term Frequencies on Few-Shot Reasoning

Overall, although LMs exhibit strong performance at few-shot numerical reasoning tasks, the results raise the question of how much models actually generalize beyond pretraining data, and researchers are encouraged to take the pretraining data into account when interpreting evaluation results.

Decoupling Knowledge from Memorization: Retrieval-augmented Prompt Learning

This work develops RetroPrompt with the motivation of decoupling knowledge from memorization to help the model strike a balance between generalization and memorization, which can reduce the reliance of language models on memorization and improve generalization for downstream tasks.

How not to Lie with a Benchmark: Rearranging NLP Leaderboards

An examination of popular NLP benchmarks’ overall scoring methods, with models re-ranked by geometric and harmonic mean (appropriate for averaging rates) according to their reported results, shows that, e.g., human level on GLUE is still not reached and there is still room for improvement for current models.
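To make the effect of the averaging method concrete, the sketch below ranks two models by arithmetic, geometric, and harmonic mean of their per-task scores. The model names and scores are hypothetical values invented for this example, not results from any leaderboard, and all scores are assumed to be strictly positive.

```python
import math


def arithmetic_mean(scores):
    return sum(scores) / len(scores)


def geometric_mean(scores):
    # Computed in log space; requires strictly positive scores.
    return math.exp(sum(math.log(s) for s in scores) / len(scores))


def harmonic_mean(scores):
    # Requires strictly positive scores.
    return len(scores) / sum(1.0 / s for s in scores)


def rank_models(results, mean_fn):
    """Model names sorted from best to worst under the given averaging function."""
    return sorted(results, key=lambda name: mean_fn(results[name]), reverse=True)


# Hypothetical per-task scores: model_a is strong on two tasks and weak on one,
# model_b is uniformly moderate.
results = {
    "model_a": [0.95, 0.95, 0.40],
    "model_b": [0.75, 0.75, 0.75],
}

for mean_fn in (arithmetic_mean, geometric_mean, harmonic_mean):
    print(mean_fn.__name__, rank_models(results, mean_fn))
```

With these made-up numbers the arithmetic mean places model_a first, while the geometric and harmonic means place model_b first, because both penalize a single weak task more heavily.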

Challenging the Transformer-based models with a Classical Arabic dataset: Quran and Hadith

Monolingual, bilingual, and multilingual state-of-the-art models are evaluated to detect relatedness between the Quran and the Hadith, which are complex classical Arabic texts with underlying meanings that require deep human understanding.

State-of-the-art generalisation research in NLP: a taxonomy and review

A taxonomy for characterising and understanding generalisation research in NLP is presented, the taxonomy is used to present a comprehensive map of published generalisation studies, and recommendations for which areas might deserve attention in the future are made.

Dataset Debt in Biomedical Language Modeling

A crowdsourced curation of datasheets for 167 biomedical datasets finds that only 13% of datasets are available via programmatic access and 30% lack any documentation on licensing and permitted reuse.

Cut the CARP: Fishing for zero-shot story evaluation

A strong correlation between human evaluation of stories and that of CARP is shown, and CARP's outputs correlate more significantly with the corresponding human input than those of language-model-based methods that use finetuning or prompt engineering.



GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

A benchmark of nine diverse NLU tasks, an auxiliary dataset for probing models for understanding of specific linguistic phenomena, and an online platform for evaluating and comparing models, which favors models that can represent linguistic knowledge in a way that facilitates sample-efficient learning and effective knowledge-transfer across tasks.

TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

It is shown that, in comparison to other recently introduced large-scale datasets, TriviaQA has relatively complex, compositional questions, has considerable syntactic and lexical variability between questions and corresponding answer-evidence sentences, and requires more cross sentence reasoning to find answers.

SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation

The STS Benchmark is introduced as a new shared training and evaluation set carefully selected from the corpus of English STS shared task data (2012-2017), providing insight into the limitations of existing models.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

Entity-Enriched Neural Models for Clinical Question Answering

We explore state-of-the-art neural models for question answering on electronic medical records and improve their ability to generalize better on previously unseen (paraphrased) questions at test

Question and Answer Test-Train Overlap in Open-Domain Question Answering Datasets

A detailed study of the test sets of three popular open-domain benchmark datasets finds that 30% of test-set questions have a near-duplicate paraphrase in their corresponding train sets, and that simple nearest-neighbor models outperform a BART closed-book QA model.

Learning and Memorization

This work explores whether it is possible to generalize by memorizing alone, and finds that introducing depth in the form of a network of support-limited lookup tables leads to generalization that is significantly above chance and closer to that obtained by standard learning algorithms on several tasks derived from MNIST and CIFAR-10.

Semantic Parsing on Freebase from Question-Answer Pairs

This paper trains a semantic parser that scales up to Freebase and, despite not having annotated logical forms, outperforms the state-of-the-art parser of Cai and Yates (2013) on their dataset.

Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank

A Sentiment Treebank is introduced that includes fine-grained sentiment labels for 215,154 phrases in the parse trees of 11,855 sentences and presents new challenges for sentiment compositionality, together with the Recursive Neural Tensor Network.

Towards Knowledge-Based, Robust Question Answering

TransINT, a novel and interpretable knowledge graph embedding method that isomorphically preserves the implication ordering among relations in the embedding space, is introduced, along with methods to train sequence-to-sequence semantic parsing models that are robust to unseen paraphrases.