Corpus ID: 235352709

Self-Attention Between Datapoints: Going Beyond Individual Input-Output Pairs in Deep Learning

Jannik Kossen, Neil Band, Clare Lyle, Aidan N. Gomez, Tom Rainforth, Yarin Gal
We challenge a common assumption underlying most supervised deep learning: that a model makes a prediction depending only on its parameters and the features of a single input. To this end, we introduce a general-purpose deep learning architecture that takes as input the entire dataset instead of processing one datapoint at a time. Our approach uses self-attention to reason about relationships between datapoints explicitly, which can be seen as realizing non-parametric models using parametric…
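The core mechanism described above, attending across datapoints rather than across the features of a single input, can be sketched as a single attention head applied over the rows of a dataset. This is a minimal illustration in NumPy with random projections standing in for learned parameters; it is not the paper's actual architecture.

```python
import numpy as np

def datapoint_attention(X, d_k=8, seed=0):
    """Single-head self-attention across the *rows* (datapoints) of a
    dataset X of shape (n, d): each datapoint's new representation is a
    similarity-weighted combination of all datapoints in the dataset."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Random projections stand in for learned weight matrices.
    W_q = rng.normal(size=(d, d_k))
    W_k = rng.normal(size=(d, d_k))
    W_v = rng.normal(size=(d, d))
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(d_k)              # (n, n) datapoint-to-datapoint
    scores -= scores.max(axis=1, keepdims=True)  # for numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)            # softmax over datapoints
    return A @ V                                 # each output row mixes all rows

X = np.random.default_rng(1).normal(size=(5, 4))
out = datapoint_attention(X)
print(out.shape)  # (5, 4)
```

Note that the attention matrix here is (n, n): its cost grows quadratically with dataset size, which is exactly the bottleneck that follow-up work such as SPIN (below) addresses.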
Semi-Parametric Deep Neural Networks in Linear Time and Memory
This work introduces SPIN, a general-purpose semi-parametric neural architecture whose computational cost is linear in the size and dimensionality of the data and which improves state-of-the-art performance on an important practical problem: genotype imputation.
BatchFormerV2: Exploring Sample Relationships for Dense Representation Learning
BatchFormerV2, a more general batch Transformer module, enables exploring sample relationships for dense representation learning and consistently improves current DETR-based detection methods by over 1.3%.
On Embeddings for Numerical Features in Tabular Deep Learning
It is argued that embeddings for numerical features are an underexplored degree of freedom in tabular DL, which allows constructing more powerful DL models and competing with GBDT on some traditionally GBDT-friendly benchmarks.
Hopular: Modern Hopfield Networks for Tabular Data
Hopular is a novel deep learning architecture for medium- and small-sized datasets in which each layer is equipped with continuous modern Hopfield networks; it surpasses Gradient Boosting, Random Forests, SVMs, and in particular several deep learning methods on tabular data.
Revisiting a kNN-based Image Classification System with High-capacity Storage
This paper investigates a system that stores knowledge for image classification, such as image feature maps, labels, and original images, not in model parameters but in external high-capacity storage, and consults that storage like a database when classifying input images.
Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling
Tip-Adapter does not require any back-propagation for training the adapter; instead, it creates the weights via a key-value cache model constructed from the few-shot training set, acquiring well-performing adapter weights without any training, which is both efficient and effective.
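The training-free key-value cache idea summarized above can be illustrated in a few lines. This is a hedged sketch with made-up two-dimensional features, not the paper's CLIP-based pipeline: cached keys are few-shot feature vectors, cached values are their one-hot labels, and a query is classified by similarity-weighted voting over the cache.

```python
import numpy as np

def cache_classify(keys, labels, query, num_classes, beta=5.0):
    """Training-free key-value cache classifier: keys are L2-normalized
    few-shot features, values are one-hot labels; the query's class
    scores are a similarity-weighted sum of the cached values."""
    values = np.eye(num_classes)[labels]             # (n, C) one-hot labels
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    affinity = np.exp(-beta * (1.0 - k @ q))         # (n,) cosine affinities
    return affinity @ values                         # (C,) class scores

keys = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
labels = np.array([0, 0, 1])
scores = cache_classify(keys, labels, np.array([0.95, 0.05]), num_classes=2)
print(scores.argmax())  # class 0
```

No gradient step is ever taken: the "adapter weights" are simply the cached features and labels, which is what makes the approach training-free.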
Augmenting Message Passing by Retrieving Similar Graphs
This paper proposes a non-parametric scheme called GraphRetrieval, in which similar training graphs associated with their ground-truth labels are retrieved to be jointly utilized with the input graph representation to complete various graph-based predictive tasks.
Multivariate Time Series Forecasting with Latent Graph Inference
A new approach for multivariate time series forecasting that jointly infers and leverages relations among time series; its modularity allows it to be integrated with current univariate methods.


Learning Intra-Batch Connections for Deep Metric Learning
This work proposes an approach based on message passing networks that takes all the relations in a mini-batch of samples into account, refining embedding vectors by exchanging messages among all samples in a given batch and allowing the training process to be aware of the overall structure.
Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
We propose an algorithm for meta-learning that is model-agnostic, in the sense that it is compatible with any model trained with gradient descent and applicable to a variety of different learning problems.
Revisiting Unreasonable Effectiveness of Data in Deep Learning Era
It is found that performance on vision tasks increases logarithmically with the volume of training data, and it is shown that representation learning (or pre-training) still holds a lot of promise.
Why Does Unsupervised Pre-training Help Deep Learning?
The results suggest that unsupervised pre-training guides the learning towards basins of attraction of minima that support better generalization from the training data set; the evidence from these results supports a regularization explanation for the effect of pre-training.
Matching Networks for One Shot Learning
This work employs ideas from metric learning based on deep neural features and from recent advances that augment neural networks with external memories to learn a network that maps a small labelled support set and an unlabelled example to its label, obviating the need for fine-tuning to adapt to new class types.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Big Transfer (BiT): General Visual Representation Learning
By combining a few carefully selected components, and transferring using a simple heuristic, Big Transfer achieves strong performance on over 20 datasets and performs well across a surprisingly wide range of data regimes -- from 1 example per class to 1M total examples.
Perceiver: General Perception with Iterative Attention
This paper introduces the Perceiver – a model that builds upon Transformers and hence makes few architectural assumptions about the relationship between its inputs, but that also scales to hundreds of thousands of inputs, like ConvNets.
Object-Centric Learning with Slot Attention
An architectural component is presented that interfaces with perceptual representations, such as the output of a convolutional neural network, and produces a set of task-dependent abstract representations; these slots are exchangeable and can bind to any object in the input by specializing through a competitive procedure over multiple rounds of attention.
Attentive Neural Processes
Attention is incorporated into NPs, allowing each input location to attend to the relevant context points for the prediction, which greatly improves the accuracy of predictions, results in noticeably faster training, and expands the range of functions that can be modelled.