Corpus ID: 239998692

Learning where to learn: Gradient sparsity in meta and continual learning

@inproceedings{vonoswald2021learning,
  title={Learning where to learn: Gradient sparsity in meta and continual learning},
  author={Johannes von Oswald and Dominic Zhao and Seijin Kobayashi and Simon Schug and Massimo Caccia and Nicolas Zucchet and Jo{\~a}o Sacramento},
  booktitle={Neural Information Processing Systems}
}
Finding neural network weights that generalize well from small datasets is difficult. A promising approach is to learn a weight initialization such that a small number of weight changes results in low generalization error. We show that this form of meta-learning can be improved by letting the learning algorithm decide which weights to change, i.e., by learning where to learn. We find that patterned sparsity emerges from this process, with the pattern of sparsity varying on a problem-by-problem… 
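As a rough illustration of the core idea, a binary mask can gate which parameters the inner loop is allowed to update. The sketch below uses plain numpy on a toy regression task; the mask and initialization are fixed by hand here, whereas in the paper both are meta-learned:

```python
import numpy as np

def loss(w, X, y):
    # squared error of a linear model y ≈ X @ w
    return 0.5 * np.mean((X @ w - y) ** 2)

def grad(w, X, y):
    return X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
w_true = np.array([1.0, -2.0, 0.0, 0.0, 3.0])
y = X @ w_true

w_init = np.zeros(5)                        # stands in for a meta-learned initialization
mask = np.array([1.0, 1.0, 0.0, 0.0, 1.0])  # stands in for a meta-learned sparsity pattern

# inner-loop adaptation: only the masked ("where to learn") weights change
w = w_init.copy()
for _ in range(100):
    w = w - 0.1 * mask * grad(w, X, y)
```

Weights outside the mask remain exactly at their initialization, so adaptation is confined to a learned subset of parameters.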


Meta-Learning via Classifier(-free) Guidance

This work takes inspiration from recent advances in generative modeling and language-conditioned image synthesis to propose meta-learning techniques that use natural language guidance to achieve higher zero-shot performance compared to the state-of-the-art.

Meta-ticket: Finding optimal subnetworks for few-shot learning within randomly initialized neural networks

This work proposes a novel meta-learning approach, called Meta-ticket, to find optimal sparse subnetworks for few-shot learning within randomly initialized NNs; it achieves superior meta-generalization compared to MAML-based methods, especially with large NNs.

Meta-Learning with Self-Improving Momentum Target

This work proposes a simple yet effective method, coined Self-improving Momentum Target (SiMT), which generates the target model by adapting from the temporal ensemble of the meta-learner, i.e., the momentum network, and demonstrates that SiMT brings a significant performance gain when combined with a wide range of meta-learning methods under various applications.

Robust Meta-learning with Sampling Noise and Label Noise via Eigen-Reptile

Eigen-Reptile (ER) is presented, which updates the meta-parameters with the main direction of historical task-specific parameters to alleviate sampling and label noise; it outperforms or achieves highly competitive performance compared with other gradient-based methods, with or without noisy labels.

Continuous-Time Meta-Learning with Forward Mode Differentiation

This work introduces Continuous-Time Meta-Learning (COMLN), a meta-learning algorithm where adaptation follows the dynamics of a gradient vector field, and devise an efficient algorithm based on forward mode differentiation, whose memory requirements do not scale with the length of the learning trajectory, thus allowing longer adaptation in constant memory.

Continual Feature Selection: Spurious Features in Continual Learning

A way of understanding the performance decrease in continual learning is presented, highlighting the influence of (local) spurious features on the capabilities of algorithms.

New Insights on Reducing Abrupt Representation Change in Online Continual Learning

This work focuses on the change in representations of observed data that arises when previously unobserved classes appear in the incoming data stream, and new classes must be distinguished from previous ones, and shows that using an asymmetric update rule pushes new classes to adapt to the older ones (rather than the reverse), which is more effective especially at task boundaries.

MetaNODE: Prototype Optimization as a Neural ODE for Few-Shot Learning

This paper proposes a novel meta-learning-based prototype optimization framework to rectify prototypes: it introduces a meta-optimizer that regards the gradient and its flow as meta-knowledge, and proposes a novel Neural Ordinary Differential Equation (ODE)-based meta-optimizer, called MetaNODE, to polish prototypes.

MetaFaaS: learning-to-learn on serverless

This work proposes MetaFaaS, a function-as-a-service (FaaS) paradigm on public cloud to build a scalable and cost-performance optimal deployment framework for gradient-based meta-learning architectures, and proposes an analytical model to predict the cost and training time on cloud for a given workload.

Remember the Past: Distilling Datasets into Addressable Memories for Neural Networks

We propose an algorithm that compresses the critical information of a large dataset into compact addressable memories. These memories can then be recalled to quickly re-train a neural network and…

Crossprop: Learning Representations by Stochastic Meta-Gradient Descent in Neural Networks

This paper introduces a new incremental learning algorithm called crossprop, which learns incoming weights of hidden units based on the meta-gradient descent approach previously introduced by Sutton (1992) and Schraudolph (1999) for learning step-sizes.

Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks

We propose an algorithm for meta-learning that is model-agnostic, in the sense that it is compatible with any model trained with gradient descent and applicable to a variety of different learning problems.
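The update can be sketched in a few lines of numpy: each sampled task takes an inner gradient step from a shared initialization, which is then nudged using the post-adaptation gradient. This is the first-order approximation of MAML, and the task family and step sizes below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

def loss_grad(w, X, y):
    err = X @ w - y
    return 0.5 * np.mean(err ** 2), X.T @ err / len(y)

def sample_task():
    # hypothetical task family: 1-D regression with slopes clustered around 2
    slope = 2.0 + 0.1 * rng.normal()
    X = rng.normal(size=(10, 1))
    return X, X[:, 0] * slope

alpha, beta = 0.1, 0.01   # inner and outer step sizes
w0 = np.zeros(1)          # the meta-learned initialization

for _ in range(500):
    X, y = sample_task()
    _, g = loss_grad(w0, X, y)
    w_task = w0 - alpha * g                 # inner loop: adapt to the task
    _, g_task = loss_grad(w_task, X, y)
    w0 = w0 - beta * g_task                 # outer loop: first-order MAML update
```

After meta-training, the initialization sits close to the center of the task family, so a single inner step suffices to fit any one task.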

Learning to Learn without Forgetting By Maximizing Transfer and Minimizing Interference

This work proposes a new conceptualization of the continual learning problem in terms of a temporally symmetric trade-off between transfer and interference that can be optimized by enforcing gradient alignment across examples, and introduces a new algorithm, Meta-Experience Replay, that directly exploits this view by combining experience replay with optimization based meta-learning.

Meta-Learning Representations for Continual Learning

It is shown that it is possible to learn naturally sparse representations that are more effective for online updating and it is demonstrated that a basic online updating strategy on representations learned by OML is competitive with rehearsal based methods for continual learning.

Meta-Learning with Warped Gradient Descent

WarpGrad meta-learns an efficiently parameterised preconditioning matrix that facilitates gradient descent across the task distribution and is computationally efficient, easy to implement, and can scale to arbitrarily large meta-learning problems.
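The effect of preconditioning can be seen on a toy quadratic: multiplying the gradient by a well-chosen matrix P before each update removes ill-conditioning. In this sketch P is simply the exact inverse Hessian of a known loss, whereas WarpGrad meta-learns an efficiently parameterised P across tasks:

```python
import numpy as np

A = np.diag([10.0, 0.1])   # Hessian of an ill-conditioned quadratic 0.5 * w @ A @ w
w = np.array([1.0, 1.0])
P = np.linalg.inv(A)       # idealized preconditioner (meta-learned in WarpGrad)

for _ in range(5):
    g = A @ w              # gradient of the quadratic
    w = w - P @ g          # preconditioned update, step size 1
```

With the exact inverse Hessian, the first step already lands on the optimum; plain gradient descent on this loss would need a tiny step size and many iterations along the flat direction.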

Meta-Learning With Differentiable Convex Optimization

The objective is to learn feature embeddings that generalize well under a linear classification rule for novel categories; this work exploits two properties of linear classifiers: implicit differentiation of the optimality conditions of the convex problem and the dual formulation of the optimization problem.

Meta-Learning via Hypernetworks

This work proposes a soft row-sharing hypernetwork architecture and shows that training the hypernetwork with a variant of MAML is tightly linked to meta-learning a curvature matrix used to condition gradients during fast adaptation, and empirically shows that hypernetworks do leverage the inner loop optimization for better adaptation.

Learning Feature Relevance Through Step Size Adaptation in Temporal-Difference Learning

This paper examines an instance of meta-learning in which feature relevance is learned by adapting step size parameters of stochastic gradient descent, and extends IDBD to temporal-difference learning---a form of learning which is effective in sequential, non i.i.d. problems.

La-MAML: Look-ahead Meta Learning for Continual Learning

This work proposes Look-ahead MAML (La-MAML), a fast optimisation-based meta-learning algorithm for online continual learning, aided by a small episodic memory; the proposed modulation of per-parameter learning rates in this update provides a more flexible and efficient way to mitigate catastrophic forgetting compared to conventional prior-based methods.

Meta-SGD: Learning to Learn Quickly for Few Shot Learning

Meta-SGD, an SGD-like, easily trainable meta-learner that can initialize and adapt any differentiable learner in just one step, shows highly competitive performance for few-shot learning on regression, classification, and reinforcement learning.
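A minimal numpy sketch of the Meta-SGD adaptation step: the update is a single elementwise gradient step whose per-parameter step sizes are meta-learned alongside the initialization. Here both are placeholder values fixed by hand rather than meta-trained:

```python
import numpy as np

def loss(w, X, y):
    return 0.5 * np.mean((X @ w - y) ** 2)

def grad(w, X, y):
    return X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true

w0 = np.zeros(3)                    # meta-learned initialization (placeholder)
alpha = np.array([0.9, 0.9, 0.9])   # meta-learned per-parameter step sizes (placeholder)

# adaptation to a new task is a single elementwise gradient step
w_new = w0 - alpha * grad(w0, X, y)
```

Because alpha is a vector rather than a scalar, meta-training can both scale and effectively switch off updates per parameter, which is what makes one-step adaptation workable.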