Phase Transitions in Transfer Learning for High-Dimensional Perceptrons

Oussama Dhifallah and Yue M. Lu
Transfer learning seeks to improve the generalization performance of a target task by exploiting the knowledge learned from a related source task. Central questions include deciding what information one should transfer and when transfer can be beneficial. The latter question is related to the so-called negative transfer phenomenon, where the transferred source information actually reduces the generalization performance of the target task. This happens when the two tasks are sufficiently… 

On the Inherent Regularization Effects of Noise Injection During Training
This paper provides a precise asymptotic characterization of the training and generalization errors of such randomly perturbed learning problems on a random feature model and shows that Gaussian noise injection in the training process is equivalent to introducing a weighted ridge regularization, when the number of noise injections tends to infinity.
Gaussian Universality of Linear Classifiers with Random Labels in High-Dimension
This paper gives a rigorous proof that data coming from a range of generative models in high dimensions has the same minimum training loss as Gaussian data with the corresponding data covariance, and shows that this universality property is observed in practice with real datasets and random labels.
Maslow's Hammer for Catastrophic Forgetting: Node Re-Use vs Node Activation
This paper theoretically analyses both a synthetic teacher-student framework and a real data setup to explain the trade-off between node activation and node re-use that results in the worst forgetting in the intermediate regime.
Probing transfer learning with a model of synthetic correlated datasets
Focusing on the problem of training two-layer networks in a binary classification setting, this work rethinks a solvable model of synthetic data as a framework for modeling correlation between datasets and shows that this model can capture a range of salient features of transfer learning with real data.
A Farewell to the Bias-Variance Tradeoff? An Overview of the Theory of Overparameterized Machine Learning
This paper provides a succinct overview of the emerging theory of overparameterized ML (henceforth abbreviated as TOPML), which explains these recent findings from a statistical signal processing perspective, and emphasizes the unique aspects that define the TOPML research area as a subfield of modern ML theory.
Continual Learning in the Teacher-Student Setup: Impact of Task Similarity
This work extends previous analytical work on two-layer networks in the teacher-student setup to multiple teachers and finds a complex interplay between both types of similarity, initial transfer/forgetting rates, maximum transfer/forgetting, and long-term transfer/forgetting.
The Common Intuition to Transfer Learning Can Win or Lose: Case Studies for Linear Regression
It is demonstrated that transfer learning can beat the minimum mean square error (MMSE) solution of the independent target task, thereby yielding an improved MMSE solution.


Double Double Descent: On Generalization Errors in Transfer Learning between Linear Regression Tasks
The non-asymptotic analysis shows that the generalization error of the target task follows a two-dimensional double descent trend (with respect to the number of free parameters in each of the tasks) that is controlled by the transfer learning factors.
An analytic theory of generalization dynamics and transfer learning in deep linear networks
An analytic theory of the nonlinear dynamics of generalization in deep linear networks, both within and across tasks, is developed; it reveals that knowledge transfer depends sensitively, but computably, on the SNRs and input feature alignments of pairs of tasks.
Direct Transfer of Learned Information Among Neural Networks
By transferring weights from smaller networks trained on subtasks, this paper achieved speedups of up to an order of magnitude compared with training starting with random weights, even taking into account the time to train the smaller networks.
Solvable Model for Inheriting the Regularization through Knowledge Distillation
A statistical physics framework is introduced that allows an analytic characterization of the properties of knowledge distillation (KD) in shallow neural networks and it is shown that, through KD, the regularization properties of the larger teacher model can be inherited by the smaller student.
A Survey on Transfer Learning
The relationship between transfer learning and other related machine learning techniques, such as domain adaptation, multitask learning, sample selection bias, and covariate shift, is discussed.
Task Clustering and Gating for Bayesian Multitask Learning
A Bayesian approach is adopted in which some of the model parameters are shared and others more loosely connected through a joint prior distribution that can be learned from the data to combine the best parts of both the statistical multilevel approach and the neural network machinery.
Transfer of Learning
Findings from various sources suggest that transfer happens by way of two rather different mechanisms, and conventional educational practices often fail to establish the conditions either for reflexive or mindful transfer.
To transfer or not to transfer
One challenge for transfer learning research is to develop approaches that detect and avoid negative transfer using very little data from the target task.
Discriminability-Based Transfer between Neural Networks
A new algorithm, called Discriminability-Based Transfer (DBT), is presented, which uses an information measure to estimate the utility of hyperplanes defined by source weights in the target network, and rescales transferred weight magnitudes accordingly.
Exploiting Task Relatedness for Multiple Task Learning
This work offers an alternative approach to multiple task learning, defining relatedness of tasks on the basis of similarity between the example-generating distributions that underlie these tasks.