When is memorization of irrelevant training data necessary for high-accuracy learning?

  • Gavin Brown, Mark Bun, Vitaly Feldman, Adam M. Smith, Kunal Talwar
  • Published 11 December 2020
  • Computer Science
  • Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing
Modern machine learning models are complex and frequently encode surprising amounts of information about individual inputs. In extreme cases, complex models appear to memorize entire input examples, including seemingly irrelevant information (social security numbers from text, for example). In this paper, we aim to understand whether this sort of memorization is necessary for accurate learning. We describe natural prediction problems in which every sufficiently accurate training algorithm must… 

Figures and Tables from this paper

Strong Memory Lower Bounds for Learning Natural Models
This paper gives lower bounds on the amount of memory required by one-pass streaming algorithms for several natural learning problems; these are the first such bounds, applying over a large range of input sizes, for problems of the type commonly seen in recent learning applications.
Privacy Analysis in Language Models via Training Data Leakage Report
This paper introduces a methodology for identifying the user content in the training data that could be leaked under a strong and realistic threat model, and proposes two metrics that quantify user-level data leakage by measuring a model's ability to reproduce unique sentence fragments from its training data.
The Privacy Onion Effect: Memorization is Relative
This paper demonstrates and analyzes an "onion effect" of memorization: removing the "layer" of outlier points that are most vulnerable to a privacy attack exposes a new layer of previously safe points to the same attack.
Privacy Leakage in Text Classification: A Data Extraction Approach
This work studies potential privacy leakage in the text classification domain by investigating unintended memorization of training data that is not pertinent to the learning task, and proposes an algorithm to extract the missing tokens of a partial text by exploiting the class-label likelihoods provided by the model.
Offline Reinforcement Learning with Differential Privacy
This work designs offline RL algorithms with provable privacy guarantees that enjoy strong instance-dependent learning bounds in both tabular and linear Markov decision process (MDP) settings, and finds that for a medium-sized dataset the privacy guarantee comes at (almost) no drop in utility compared to the non-private counterpart.
Differentially Private Decoding in Large Language Models
This work proposes a simple, easy-to-interpret, and computationally lightweight perturbation mechanism applied to an already-trained model at the decoding stage; the mechanism is model-agnostic and can be used in conjunction with any LLM.
Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models
It is shown that larger models can memorize a larger portion of the data before overfitting and tend to forget less throughout training, and that larger language models memorize training data faster across all settings.
Memory Bounds for Continual Learning
This work establishes that any continual learner, even an improper one, needs memory that grows linearly with the number of tasks k, strongly suggesting that the problem is intractable, and provides an algorithm based on multiplicative weights updates whose memory requirement scales well.
Detecting Unintended Memorization in Language-Model-Fused ASR
This work designs a framework for detecting memorization of random textual sequences (which the authors call canaries) in the LM training data when one has only black-box (query) access to an LM-fused speech recognizer, as opposed to direct access to the LM.
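The canary methodology referenced in the two entries above can be sketched in a model-agnostic way: plant random sequences in the training data, then rank the planted sequence's score against random alternatives of the same length. A minimal sketch in Python follows; the `score` callable is a stand-in assumption for whatever black-box likelihood the system exposes, and this is not any paper's actual implementation:

```python
import random
import string

def random_canary(length=8, rng=None):
    """Generate a random lowercase sequence to plant in training data."""
    rng = rng or random.Random()
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))

def exposure_rank(score, canary, num_references=999, seed=1):
    """Rank of the canary's score among random reference sequences.

    `score` is a placeholder for any black-box likelihood the system
    exposes (e.g. a language-model log-probability). Rank 1 means the
    canary scores above every reference, which is evidence the model
    memorized it; a rank near the middle is what a non-memorizing
    model would produce on average.
    """
    rng = random.Random(seed)
    references = [random_canary(len(canary), rng) for _ in range(num_references)]
    canary_score = score(canary)
    better = sum(1 for r in references if score(r) > canary_score)
    return better + 1
```

For example, a toy scorer that assigns its maximum score only to the planted canary yields rank 1, the signature of memorization under this metric.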
This work proposes a semantic synthesis strategy that enables realistic simulation without naturally partitioned data, indicating that the dataset synthesis strategy can be important for realistic simulations of generalization in federated learning.


Learners that Use Little Information
This paper discusses an approach for upper-bounding the amount of information that algorithms reveal about their inputs, and provides a lower bound by exhibiting a simple concept class for which every empirical risk minimizer must reveal a lot of information.
A Direct Sum Result for the Information Complexity of Learning
A direct sum result for information complexity is proved in this context; roughly speaking, the information complexity sums when combining several classes.
When is Memorization of Irrelevant Training Data Necessary for High-Accuracy Learning
  • 2020
Does learning require memorization? a short tale about a long tail
The model makes it possible to quantify the effect of not fitting the training data on the generalization performance of the learned classifier, demonstrates that memorization is necessary whenever frequencies are long-tailed, and establishes a formal link between these empirical phenomena.
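The long-tail claim in this entry has a concrete back-of-the-envelope illustration: when class frequencies follow a Zipf-like law, a sizable fraction of training examples are the sole representative of their class, so fitting them at all amounts to memorizing them individually. A toy simulation in Python (illustrative only; the distribution and parameters are assumptions, not drawn from the paper):

```python
import random
from collections import Counter

def singleton_fraction(num_classes=10000, sample_size=10000, exponent=1.0, seed=0):
    """Fraction of sampled examples whose class appears exactly once.

    Classes are drawn from a Zipf-like distribution p(k) proportional to
    1/k**exponent. Under long-tailed frequencies, many examples are the
    only representative of their class in the sample, so an accurate
    learner has no choice but to fit (memorize) them one by one.
    """
    rng = random.Random(seed)
    weights = [1.0 / (k ** exponent) for k in range((1), num_classes + 1)]
    sample = rng.choices(range(num_classes), weights=weights, k=sample_size)
    counts = Counter(sample)
    return sum(1 for c in sample if counts[c] == 1) / sample_size
```

Running this with the defaults shows a substantial singleton fraction, matching the intuition that long-tailed data forces per-example memorization.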
Communication complexity of estimating correlations
The results imply an Ω(n) lower bound on the information complexity of the Gap-Hamming problem; the proof techniques rely on symmetric strong data-processing inequalities and various tensorization techniques from information-theoretic interactive common-randomness extraction.
Sample Complexity Bounds on Differentially Private Learning via Communication Complexity
It is shown that the sample complexity of learning with (pure) differential privacy can be arbitrarily higher than the sample complexity of learning without the privacy constraint, or of learning with approximate differential privacy.
Extracting Training Data from Large Language Models
This paper demonstrates that in such settings, an adversary can perform a training data extraction attack to recover individual training examples by querying the language model, and finds that larger models are more vulnerable than smaller models.
What Neural Networks Memorize and Why: Discovering the Long Tail via Influence Estimation
The experiments demonstrate the significant benefits of memorization for generalization on several standard benchmarks and provide quantitative and visually compelling evidence for the theory put forth in Feldman (2019), which proposes a theoretical explanation for this phenomenon.
The limits of pan privacy and shuffle privacy for learning and estimation
This work proves the first non-trivial lower bounds for high-dimensional learning and estimation in both the pan-private model and the general multi-message shuffle model.
A Limitation of the PAC-Bayes Framework
An easy learning task that is not amenable to a PAC-Bayes analysis is demonstrated, and it is shown that for any algorithm that learns 1-dimensional linear classifiers there exists a (realizable) distribution for which the PAC-Bayes bound is arbitrarily large.