When is memorization of irrelevant training data necessary for high-accuracy learning?

@inproceedings{Brown2021WhenIM,
  title={When is memorization of irrelevant training data necessary for high-accuracy learning?},
  author={Gavin Brown and Mark Bun and Vitaly Feldman and Adam Smith and Kunal Talwar},
  booktitle={Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing (STOC)},
  year={2021}
}
  • Gavin Brown, Mark Bun, Vitaly Feldman, Adam Smith, Kunal Talwar
  • Published 11 December 2020
  • Computer Science
  • Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing
Modern machine learning models are complex and frequently encode surprising amounts of information about individual inputs. In extreme cases, complex models appear to memorize entire input examples, including seemingly irrelevant information (social security numbers from text, for example). In this paper, we aim to understand whether this sort of memorization is necessary for accurate learning. We describe natural prediction problems in which every sufficiently accurate training algorithm must… 
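
A sketch of the shape of these results, as a reading aid (my paraphrase of the abstract, not the authors' exact theorem statement): memorization is made precise via the mutual information between the training sample and the learned model, and the lower bounds say that any near-optimal learner must retain essentially all of the bits of many training examples:

\[
I\big(S;\, M(S)\big) \;=\; \Omega(n \cdot d),
\]

where S = (X₁, …, Xₙ) is the training sample, each example carries d bits of task-irrelevant information, M(S) is the trained model, and I(·;·) denotes mutual information. The precise problem families and constants are in the paper.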

Citations

Privacy Analysis in Language Models via Training Data Leakage Report
TLDR
Introduces a methodology for identifying user content in the training data that could be leaked under a strong, realistic threat model, and proposes two metrics that quantify user-level data leakage by measuring a model's ability to reproduce unique sentence fragments from its training data.
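
For intuition, here is a minimal sketch of a fragment-based leakage metric in the spirit of the summary above (hypothetical helper names, not the paper's code; `generate` stands in for black-box next-token access to the model):

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def leakage_rate(user_docs, other_docs, generate, n=5):
    """Fraction of a user's *unique* training n-grams that the model
    reproduces when prompted with the first n-1 tokens of the fragment."""
    user_frags = {g for doc in user_docs for g in ngrams(doc, n)}
    other_frags = {g for doc in other_docs for g in ngrams(doc, n)}
    unique = user_frags - other_frags  # fragments seen only in this user's data

    leaked = 0
    for frag in unique:
        prompt, target = list(frag[:-1]), frag[-1]
        if generate(prompt) == target:  # model completes the fragment verbatim
            leaked += 1
    return leaked / max(len(unique), 1)

A high leakage_rate for real users relative to held-out controls would indicate user-level memorization; the paper's actual metrics are defined against a concrete threat model rather than this toy.
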
A Theory of PAC Learnability of Partial Concept Classes
TLDR
The classical theory of PAC learning is extended in a way that allows modeling a rich variety of practical learning tasks where the data satisfy special properties that ease the learning process, and it is shown that the ERM principle fails spectacularly in explaining learnability of partial concept classes.
Datamodels: Predicting Predictions from Training Data
TLDR
It is shown that even simple linear datamodels can successfully predict model outputs and give rise to a variety of applications: accurately predicting the effect of dataset counterfactuals; identifying brittle predictions; finding semantically similar examples; quantifying train-test leakage; and embedding data into a well-behaved and feature-rich representation space.
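
The datamodels idea is easy to sketch (a minimal illustration with synthetic data, assuming scikit-learn; in the paper the masks and outputs come from thousands of retraining runs): fit a sparse linear map from "which training points were included" to the trained model's output on one fixed test example.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_runs, n_train = 2000, 500

# Each row: 0/1 mask of which training points one retraining run used.
masks = rng.integers(0, 2, size=(n_runs, n_train)).astype(float)

# Synthetic stand-in for "model output (e.g., margin) on one test example":
# a few training points genuinely influence it, plus noise.
true_influence = rng.normal(0, 1, n_train) * (rng.random(n_train) < 0.05)
margins = masks @ true_influence + rng.normal(0, 0.1, n_runs)

# The datamodel: sparse linear regression from inclusion masks to outputs.
datamodel = Lasso(alpha=0.01, max_iter=5000).fit(masks, margins)
top = np.argsort(-np.abs(datamodel.coef_))[:10]
print("estimated most influential training points:", top)

The learned coefficients act as per-example influence scores, which is what powers the counterfactual and train-test-leakage applications listed above.
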
Deletion Inference, Reconstruction, and Compliance in Machine (Un)Learning
TLDR
Inspired by cryptographic definitions and the differential privacy framework, this work formally studies the privacy implications of machine unlearning and formalizes deletion inference and deletion reconstruction attacks, in which the adversary aims either to identify which record was deleted or to reconstruct (perhaps part of) the deleted records.
Detecting Unintended Memorization in Language-Model-Fused ASR
TLDR
This work designs a framework for detecting memorization of random textual sequences (which the authors call canaries) in the LM training data when one has only black-box (query) access to the LM-fused speech recognizer, as opposed to direct access to the LM.
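
A common way to score such canaries, sketched here in the exposure style of earlier memorization work (my simplification, not the paper's method; `sequence_loss` is an assumed black-box scoring function, since the setting only permits query access to the fused recognizer):

import math

def exposure(canary, candidates, sequence_loss):
    """Rank the canary's loss among random same-distribution sequences;
    an unusually low loss (high exposure) suggests memorization."""
    canary_loss = sequence_loss(canary)
    rank = 1 + sum(1 for c in candidates if sequence_loss(c) < canary_loss)
    return math.log2(len(candidates) + 1) - math.log2(rank)

Exposure near log₂(len(candidates)) means the canary outscores nearly all random candidates, i.e., it was likely memorized.
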
What Do We Mean by Generalization in Federated Learning?
TLDR
This work proposes a semantic synthesis strategy that enables realistic simulation without naturally partitioned data, indicating that the dataset synthesis strategy can be important for realistic simulations of generalization in federated learning.
Memorize to Generalize: on the Necessity of Interpolation in High Dimensional Linear Regression
TLDR
It is characterized how prediction (test) error necessarily scales with training error in this setting: any estimator that incurs training error at least cσ⁴, for some constant c, is necessarily suboptimal, with excess prediction error growing at least linearly in the training error.
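
The regime in question is easy to reproduce numerically. A toy sketch (plain overparameterized least squares, not the paper's exact setting): the minimum-norm interpolator drives training error to zero yet can still predict reasonably, and the result above says that stopping well short of interpolation is provably costly.

import numpy as np

rng = np.random.default_rng(1)
n, d, sigma = 50, 400, 0.5                  # overparameterized: d >> n
beta = rng.normal(0, 1 / np.sqrt(d), d)     # true signal
X = rng.normal(size=(n, d))
y = X @ beta + sigma * rng.normal(size=n)

beta_hat = np.linalg.pinv(X) @ y            # minimum-norm interpolator
print("train error:", np.mean((X @ beta_hat - y) ** 2))   # essentially 0

X_test = rng.normal(size=(2000, d))
y_test = X_test @ beta + sigma * rng.normal(size=2000)
print("test error:", np.mean((X_test @ beta_hat - y_test) ** 2))
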
Memory Bounds for Continual Learning
TLDR
It is established that any continual learner, even an improper one, needs memory that grows linearly with the number of tasks k, strongly suggesting that the problem is intractable in full generality; an algorithm based on multiplicative weights updates whose memory requirement scales well is also provided.
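
For reference, the multiplicative weights update alluded to, in its generic form (a sketch of the standard method, not the paper's continual-learning algorithm):

import numpy as np

def multiplicative_weights(losses, eta=0.1):
    """Maintain a distribution over experts; downweight each expert
    exponentially in its accumulated loss. `losses` has shape
    (rounds, experts) with entries in [0, 1]."""
    weights = np.ones(losses.shape[1])
    for round_losses in losses:
        weights *= (1 - eta) ** round_losses
    return weights / weights.sum()

Note the connection to memory: the state of such a learner is just one weight per expert, which is plausibly the kind of compact summary a memory-efficient upper bound exploits.
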
Reconstructing Training Data with Informed Adversaries
TLDR
This work provides an effective reconstruction attack that model developers can use to assess memorization of individual points in general settings beyond those considered in previous works, and demonstrates that standard models have the capacity to store enough information to enable high-fidelity reconstruction of training data points.
Covariance-Aware Private Mean Estimation Without Private Covariance Estimation
TLDR
Two sample-efficient differentially private mean estimators for d-dimensional (sub)Gaussian distributions with unknown covariance are presented; the sample complexity guarantees hold more generally for sub-Gaussian distributions, albeit with a slightly worse dependence on the privacy parameter.
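
The core idea lends itself to a one-line sketch (my paraphrase of the shape of such mechanisms, omitting the stability and truncation steps the actual estimators need): shape the privacy noise by the empirical covariance, so that accuracy is naturally measured in Mahalanobis distance.

\[
\hat{\mu} \;=\; \bar{X} \;+\; \frac{c(\varepsilon, \delta)}{n}\, \hat{\Sigma}^{1/2} Z,
\qquad Z \sim \mathcal{N}(0, I_d),
\]

where X̄ and Σ̂ are the empirical mean and covariance and c(ε, δ) is a privacy-dependent scale. The difficulty, per the title, is achieving this without privately estimating the covariance itself.
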

References

Showing 1-10 of 51 references
When is Memorization of Irrelevant Training Data Necessary for High-Accuracy Learning
  • 2020
A Direct Sum Result for the Information Complexity of Learning
TLDR
A direct sum result for information complexity is proved in this context; roughly speaking, information complexity is additive when several classes are combined.
Learners that Use Little Information
TLDR
An approach that yields upper bounds on the amount of information algorithms reveal about their inputs is discussed, and a lower bound is provided by exhibiting a simple concept class for which every empirical risk minimizer must reveal a lot of information.
Communication complexity of estimating correlations
TLDR
The results imply an Ω(n) lower bound on the information complexity of the Gap-Hamming problem; the proofs rely on symmetric strong data-processing inequalities and various tensorization techniques from information-theoretic interactive common-randomness extraction.
Does learning require memorization? a short tale about a long tail
TLDR
The model makes it possible to quantify the effect of not fitting the training data on the generalization performance of the learned classifier, demonstrates that memorization is necessary whenever data frequencies are long-tailed, and establishes a formal link between these empirical phenomena.
Sample Complexity Bounds on Differentially Private Learning via Communication Complexity
TLDR
It is shown that the sample complexity of learning with (pure) differential privacy can be arbitrarily higher than the sample complexity of learning without the privacy constraint, or of learning with approximate differential privacy.
Extracting Training Data from Large Language Models
TLDR
This paper demonstrates that an adversary can perform a training data extraction attack, recovering individual training examples by querying the language model, and finds that larger models are more vulnerable than smaller models.
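
The basic recipe is simple enough to sketch end-to-end (a simplified version assuming the Hugging Face transformers GPT-2 API; the paper strengthens the ranking with perplexity ratios and other membership scores):

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def perplexity(text):
    ids = tok(text, return_tensors="pt").input_ids
    return float(torch.exp(model(ids, labels=ids).loss))

@torch.no_grad()
def sample(n=50, length=64):
    start = tok("<|endoftext|>", return_tensors="pt").input_ids
    outs = model.generate(start, do_sample=True, top_k=40, max_length=length,
                          num_return_sequences=n,
                          pad_token_id=tok.eos_token_id)
    return [tok.decode(o, skip_special_tokens=True) for o in outs]

# Generate freely, then rank generations by the model's own confidence;
# abnormally low-perplexity outputs are candidates for memorized text.
candidates = [c for c in sample() if c.strip()]
candidates.sort(key=perplexity)
print(candidates[:5])   # inspect the most-confident generations by hand
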
The limits of pan privacy and shuffle privacy for learning and estimation
TLDR
This work proves the first non-trivial lower bounds for high-dimensional learning and estimation in both the pan-private model and the general multi-message shuffle model.
A Limitation of the PAC-Bayes Framework
TLDR
An easy learning task that is not amenable to a PAC-Bayes analysis is demonstrated, and it is shown that for any algorithm that learns 1-dimensional linear classifiers there exists a (realizable) distribution for which the PAC-Bayes bound is arbitrarily large.
Communication Complexity: and Applications