Memorization vs. Generalization: Quantifying Data Leakage in NLP Performance Evaluation

@article{Elangovan2021MemorizationVG,
  title={Memorization vs. Generalization: Quantifying Data Leakage in NLP Performance Evaluation},
  author={Aparna Elangovan and Jiayuan He and Karin M. Verspoor},
  journal={ArXiv},
  year={2021},
  volume={abs/2102.01818}
}
Public datasets are often used to evaluate the efficacy and generalizability of state-of-the-art methods for many tasks in natural language processing (NLP). However, the presence of overlap between the train and test datasets can lead to inflated results, inadvertently evaluating the model’s ability to memorize while interpreting it as the ability to generalize. In addition, such datasets may not provide an effective indicator of the performance of these methods in real-world scenarios. We… 
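
A minimal sketch of the kind of overlap check this concern implies is shown below. It is an illustration only, not the authors' leakage-quantification method: it flags test instances whose unigram Jaccard overlap with any training instance exceeds a threshold, and the tokenizer and threshold are assumptions.

# Minimal sketch: flag test instances whose unigram Jaccard overlap with any
# training instance exceeds a threshold. Illustrative only; the paper's own
# leakage measure may differ.
def tokens(text):
    return set(text.lower().split())

def leakage_rate(train_texts, test_texts, threshold=0.8):
    train_sets = [tokens(t) for t in train_texts]
    leaked = 0
    for test_text in test_texts:
        ts = tokens(test_text)
        for tr in train_sets:
            union = ts | tr
            if union and len(ts & tr) / len(union) >= threshold:
                leaked += 1
                break
    return leaked / max(len(test_texts), 1)

# Toy usage: the first test sentence nearly duplicates a training sentence.
train = ["the drug reduced blood pressure", "patients reported mild headaches"]
test = ["the drug reduced the blood pressure", "no adverse events were observed"]
print(leakage_rate(train, test, threshold=0.7))  # 0.5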

An Empirical Study of Memorization in NLP

TLDR
It is demonstrated that top-ranked memorized training instances are likely atypical and that removing the most-memorized training instances leads to a more serious drop in test accuracy than removing training instances at random; an attribution method is also developed to better understand why a training instance is memorized.

Does it Really Generalize Well on Unseen Data? Systematic Evaluation of Relational Triple Extraction Methods

TLDR
It is shown that although existing extraction models can easily memorize and recall already seen triples, they cannot generalize effectively to unseen triples, and that a simple yet effective augmentation technique can significantly increase the generalization performance of existing models.

Impact of Pretraining Term Frequencies on Few-Shot Reasoning

TLDR
Overall, although LMs exhibit strong performance at few-shot numerical reasoning tasks, the results raise the question of how much models actually generalize beyond pretraining data, and researchers are encouraged to take the pretraining data into account when interpreting evaluation results.

Decoupling Knowledge from Memorization: Retrieval-augmented Prompt Learning

TLDR
This work develops RetroPrompt with the motivation of decoupling knowledge from memorization to help the model strike a balance between generalization and memorization, which can reduce the reliance of language models on memorization and improve generalization for downstream tasks.

How not to Lie with a Benchmark: Rearranging NLP Leaderboards

TLDR
Examining popular NLP benchmarks’ overall scoring methods and rearranging the models by geometric and harmonic mean (appropriate for averaging rates) according to their reported results shows that, for example, human level on GLUE has still not been reached, and there is still room for improvement for the current models.
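
As a toy illustration of why the averaging choice matters, the snippet below uses invented scores to show a model that wins on the arithmetic mean but loses on the geometric and harmonic means, which penalize a single weak task more heavily.

# Toy illustration: the choice of mean can reorder leaderboard entries.
# All scores below are invented for illustration only.
from statistics import geometric_mean, harmonic_mean, mean

models = {
    "model_a": [0.95, 0.92, 0.45],  # strong on two tasks, weak on one
    "model_b": [0.77, 0.76, 0.75],  # uniformly moderate
}

for name, scores in models.items():
    print(name,
          round(mean(scores), 3),
          round(geometric_mean(scores), 3),
          round(harmonic_mean(scores), 3))
# model_a has the higher arithmetic mean (0.773 vs 0.760), but model_b has
# the higher geometric mean (0.760 vs 0.733) and harmonic mean (0.760 vs 0.688).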

Challenging the Transformer-based models with a Classical Arabic dataset: Quran and Hadith

TLDR
Monolingual, bilingual, and multilingual state-of-the-art models are evaluated to detect relatedness between the Quran and the Hadith, which are complex classical Arabic texts with underlying meanings that require deep human understanding.

Dataset Debt in Biomedical Language Modeling

TLDR
A crowdsourced curation of datasheets for 167 biomedical datasets finds that only 13% of datasets are available via programmatic access and 30% lack any documentation on licensing and permitted reuse.

Cut the CARP: Fishing for zero-shot story evaluation

TLDR
A strong correlation between human evaluation of stories and that of CARP is shown, and model outputs correlate more significantly with the corresponding human input than those of language-model-based methods that use finetuning or prompt-engineering approaches.

How Emotionally Stable is ALBERT? Testing Robustness with Stochastic Weight Averaging on a Sentiment Analysis Task

TLDR
It is hypothesized that SWA is more stable because it ensembles model snapshots taken along the gradient descent trajectory, and that SWA reduces error rates in general; yet the models still suffer from their own distinct biases.
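
A minimal sketch of the weight-averaging idea behind SWA is given below; it simply averages parameter snapshots element-wise and is not the paper's ALBERT fine-tuning setup.

# Minimal sketch of stochastic weight averaging (SWA): average parameter
# snapshots taken along the training trajectory. Toy illustration only.
import numpy as np

def swa_average(snapshots):
    """Element-wise average of a list of parameter dictionaries."""
    keys = snapshots[0].keys()
    return {k: np.mean([s[k] for s in snapshots], axis=0) for k in keys}

# Three (fake) snapshots of a tiny two-parameter model.
snapshots = [
    {"w": np.array([0.9, -1.2]), "b": np.array([0.10])},
    {"w": np.array([1.1, -1.0]), "b": np.array([0.05])},
    {"w": np.array([1.0, -1.1]), "b": np.array([0.15])},
]
print(swa_average(snapshots))  # averages to w = [1.0, -1.1], b = [0.1]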

References


GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

TLDR
A benchmark of nine diverse NLU tasks, an auxiliary dataset for probing models for understanding of specific linguistic phenomena, and an online platform for evaluating and comparing models are presented; the benchmark favors models that can represent linguistic knowledge in a way that facilitates sample-efficient learning and effective knowledge transfer across tasks.

SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation

TLDR
The STS Benchmark is introduced as a new shared training and evaluation set carefully selected from the corpus of English STS shared task data (2012-2017), providing insight into the limitations of existing models.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

TLDR
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

Entity-Enriched Neural Models for Clinical Question Answering

We explore state-of-the-art neural models for question answering on electronic medical records and improve their ability to generalize to previously unseen (paraphrased) questions at test time.

Question and Answer Test-Train Overlap in Open-Domain Question Answering Datasets

TLDR
A detailed study of the test sets of three popular open-domain benchmark datasets finds that 30% of test-set questions have a near-duplicate paraphrase in their corresponding train sets, and that simple nearest-neighbor models outperform a BART closed-book QA model.
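
The nearest-neighbour baseline referred to above can be sketched roughly as follows; this is an assumed TF-IDF variant for illustration, not necessarily the exact baseline used in the paper.

# Sketch of a simple nearest-neighbour baseline: answer a test question with
# the answer of the most lexically similar training question.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

train_questions = ["who wrote hamlet", "what is the capital of france"]
train_answers = ["william shakespeare", "paris"]
test_questions = ["who was the author of hamlet"]

vec = TfidfVectorizer().fit(train_questions + test_questions)
train_m = vec.transform(train_questions)
test_m = vec.transform(test_questions)

for i, question in enumerate(test_questions):
    nearest = cosine_similarity(test_m[i], train_m).argmax()
    print(question, "->", train_answers[nearest])  # -> william shakespeare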

Learning and Memorization

TLDR
This work explores whether it is possible to generalize by memorizing alone, and finds that introducing depth in the form of a network of support-limited lookup tables leads to generalization that is significantly above chance and closer to that obtained by standard learning algorithms on several tasks derived from MNIST and CIFAR-10.
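
A toy version of a single memorizing lookup-table unit is sketched below; the paper composes many such tables into a deep network, which this sketch deliberately omits.

# Toy "learning by memorization": store the majority label seen for each
# input pattern, and fall back to the overall majority label for unseen
# patterns. A single table only; the paper builds networks of such tables.
from collections import Counter, defaultdict

class LookupTableUnit:
    def fit(self, inputs, labels):
        buckets = defaultdict(Counter)
        for x, y in zip(inputs, labels):
            buckets[tuple(x)][y] += 1
        self.table = {x: c.most_common(1)[0][0] for x, c in buckets.items()}
        self.default = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, x):
        return self.table.get(tuple(x), self.default)

unit = LookupTableUnit().fit([(0, 1), (1, 1), (0, 1)], [1, 0, 1])
print(unit.predict((0, 1)), unit.predict((1, 0)))  # 1 (memorized), 1 (fallback)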

Semantic Parsing on Freebase from Question-Answer Pairs

TLDR
This paper trains a semantic parser that scales up to Freebase and outperforms the state-of-the-art parser of Cai and Yates (2013) on their dataset, despite not having annotated logical forms.

Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank

TLDR
A Sentiment Treebank is presented that includes fine-grained sentiment labels for 215,154 phrases in the parse trees of 11,855 sentences and poses new challenges for sentiment compositionality, and the Recursive Neural Tensor Network is introduced.

Evaluation Methods for Statistically Dependent Text

TLDR
When the statistical dependence of text messages published in social media is ignored, standard cross-validation can result in misleading conclusions in a machine learning task; this work explores alternative evaluation methods that explicitly deal with statistical dependence in text.
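
One simple way to respect such dependence when splitting social-media text is to keep all messages from the same author in the same fold; the sketch below uses scikit-learn's GroupKFold for this and is only an illustration, not necessarily the evaluation protocol proposed in the paper.

# Group-aware splitting: messages from one author never straddle a
# train/test boundary, unlike plain k-fold over individual messages.
from sklearn.model_selection import GroupKFold

messages = ["msg%d" % i for i in range(8)]
labels = [0, 1, 0, 1, 0, 1, 0, 1]
authors = ["a", "a", "b", "b", "c", "c", "d", "d"]  # dependence within author

for train_idx, test_idx in GroupKFold(n_splits=4).split(messages, labels, groups=authors):
    # No author appears on both sides of the split.
    assert not {authors[i] for i in train_idx} & {authors[i] for i in test_idx}
    print(test_idx)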

emrQA: A Large Corpus for Question Answering on Electronic Medical Records

TLDR
A novel methodology to generate domain-specific large-scale question answering (QA) datasets by re-purposing existing annotations for other NLP tasks is proposed, and an instance of this methodology is demonstrated in generating a large-scale QA dataset for electronic medical records.