How catastrophic can catastrophic forgetting be in linear regression?

Itay Evron, Edward Moroshko, Rachel A. Ward, Nati Srebro, Daniel Soudry
To better understand catastrophic forgetting, we study fitting an overparameterized linear model to a sequence of tasks with different input distributions. We analyze how much the model forgets the true labels of earlier tasks after training on subsequent tasks, obtaining exact expressions and bounds. We establish connections between continual learning in the linear setting and two other research areas – alternating projections and the Kaczmarz method. In specific settings, we highlight… 
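
The setting described in the abstract can be illustrated with a small numerical sketch. This is not the authors' code. It assumes that each task is fit exactly, and that fitting a task moves the weights as little as possible, so each step is an orthogonal projection onto that task's solution set. That assumption is what ties the procedure to alternating projections. The helper names (`fit_task`, `forgetting`), the Gaussian inputs, and the shared teacher vector `w_star` are choices made for this example only.

```python
import numpy as np

def fit_task(w, X, y):
    """Project w onto {v : X v = y}, i.e. the smallest change to w that fits the task.
    (Assumed training rule for this sketch.)"""
    return w + np.linalg.pinv(X) @ (y - X @ w)

def forgetting(w, tasks):
    """Average squared error of w over the given (already seen) tasks."""
    return float(np.mean([np.mean((X @ w - y) ** 2) for X, y in tasks]))

rng = np.random.default_rng(0)
d, n, T = 50, 5, 4                 # dimension, samples per task, number of tasks
w_star = rng.standard_normal(d)    # hypothetical teacher: all tasks are jointly realizable

# Each task has its own random inputs but labels from the same teacher.
tasks = []
for _ in range(T):
    X = rng.standard_normal((n, d))
    tasks.append((X, X @ w_star))

w = np.zeros(d)
history = []                               # forgetting on tasks 1..t after fitting task t
dists = [float(np.linalg.norm(w - w_star))]  # distance to the teacher after each step
for t, (X, y) in enumerate(tasks, start=1):
    w = fit_task(w, X, y)
    history.append(forgetting(w, tasks[:t]))
    dists.append(float(np.linalg.norm(w - w_star)))

print("forgetting after each task:", history)
print("distance to teacher:", dists)
```

Under these assumptions the most recent task is always fit exactly, earlier tasks generally are not, and the distance to the teacher never increases because each step is a projection onto a set that contains it. With a single sample per task (`n = 1`) the same update reduces to a Kaczmarz step.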

Matrix Analysis and Applied Linear Algebra
A textbook that combines linear equations, vector spaces, and matrix algebra with insights into eigenvalues and eigenvectors, and presents the Perron-Frobenius theory of nonnegative matrices.
Continual Learning in the Teacher-Student Setup: Impact of Task Similarity
This work extends previous analytical work on two-layer networks in the teacher-student setup to multiple teachers and finds a complex interplay between both types of similarity, initial transfer/forgetting rates, maximum transfer/forgetting, and long-term transfer/forgetting.
A Theoretical Analysis of Catastrophic Forgetting through the NTK Overlap Matrix
This paper introduces a measure of task similarity, the NTK overlap matrix, which is at the core of catastrophic forgetting (CF), and proposes a variant of Orthogonal Gradient Descent (OGD) that leverages the structure of the data through Principal Component Analysis (PCA).
Block-iterative methods for consistent and inconsistent linear equations
An application is given to the linear system that arises from reconstruction of a two-dimensional object by its one-dimensional projections.
Error bounds for the method of alternating projections
The method of alternating projections produces a sequence that converges to the orthogonal projection onto the intersection of the subspaces, and the sharpest known upper bound for more than two subspaces is obtained.
Principal Values and Principal Subspaces of Two Subspaces of Vector Spaces with Inner Product
This paper studies the problem of the angle between two subspaces of arbitrary dimensions in n-dimensional Euclidean space. It is proven that the angle between two subspaces is equal to the angle…
Majorization for Changes in Angles Between Subspaces, Ritz Values, and Graph Laplacian Spectra
The result for the squares of cosines can be viewed as a bound on the change in the Ritz values of an orthogonal projector, and the conjecture that the root two factor in the earlier estimate may be eliminated is proved.
Generalisation Guarantees for Continual Learning with Orthogonal Gradient Descent
This work derives the first generalisation guarantees for the algorithm OGD for continual learning, for overparameterized neural networks, and proves that it is robust to catastrophic forgetting across an arbitrary number of tasks, and that it verifies tighter generalisation bounds.
Understanding the Role of Training Regimes in Continual Learning
This work hypothesizes that the geometrical properties of the local minima found for each task play an important role in the overall degree of forgetting, and studies the effect of dropout, learning rate decay, and batch size on forming training regimes that widen the tasks' local minima and consequently help the network avoid catastrophic forgetting.
Nonasymptotic convergence of stochastic proximal point methods for constrained convex optimization
This work introduces a new variant of the SPP method for solving stochastic convex problems subject to an (in)finite intersection of constraints satisfying a linear regularity condition, and proves new nonasymptotic convergence results for convex Lipschitz continuous objective functions.