• Corpus ID: 239050171

CLOOB: Modern Hopfield Networks with InfoLOOB Outperform CLIP

  title={CLOOB: Modern Hopfield Networks with InfoLOOB Outperform CLIP},
  author={Andreas F{\"u}rst and Elisabeth Rumetshofer and Viet-Hung Tran and Hubert Ramsauer and Fei Tang and Johannes Lehner and David P. Kreil and Michael Kopp and G{\"u}nter Klambauer and Angela Bitto-Nemling and Sepp Hochreiter},
CLIP yielded impressive results on zero-shot transfer learning tasks and is considered a foundation model like BERT or GPT-3. CLIP vision models that have a rich representation are pre-trained using the InfoNCE objective and natural language supervision before they are fine-tuned on particular tasks. Though CLIP excels at zero-shot transfer learning, it suffers from an explaining-away problem, that is, it focuses on one or a few features while neglecting other relevant features. This problem is…
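The difference between the two objectives can be sketched numerically. Below is a minimal NumPy sketch, not the paper's code: `info_nce` keeps the positive pair's term in its own denominator, while InfoLOOB (a "leave one out" bound) removes it. Function names and the temperature value are illustrative.

```python
import numpy as np

def info_nce(sim, tau=0.1):
    """InfoNCE: the positive pair stays in its own denominator."""
    logits = sim / tau
    pos = np.diag(logits)
    return np.mean(np.log(np.exp(logits).sum(axis=1)) - pos)

def info_loob(sim, tau=0.1):
    """InfoLOOB ("leave one out bound"): the positive term is
    removed from the denominator, avoiding its saturating effect."""
    logits = sim / tau
    pos = np.diag(logits)
    neg_sum = np.exp(logits).sum(axis=1) - np.exp(pos)
    return np.mean(np.log(neg_sum) - pos)
```

Because the denominator shrinks, the InfoLOOB loss is strictly below InfoNCE for the same similarity matrix.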

Contrastive Adapters for Foundation Model Group Robustness

Contrastive adapting is proposed, which trains adapters with contrastive learning to bring sample embeddings close to both their ground-truth class embeddings and other sample embeddings in the same class, improving group robustness.

Robust Cross-Modal Representation Learning with Progressive Self-Distillation

A novel training framework based on cross-modal contrastive learning that uses progressive self-distillation and soft image-text alignments to more efficiently learn robust representations from noisy data is introduced.

Prototypical Contrastive Language Image Pretraining

ProtoCLIP is trained with an online episodic training strategy, which allows it to be scaled up to unlimited amounts of data, and Prototypical Back Translation (PBT) is proposed to decouple representation grouping from representation alignment, resulting in effective learning of meaningful representations under a large modality gap.

Learning to Prompt for Vision-Language Models

Context Optimization (CoOp) is proposed, a simple approach specifically for adapting CLIP-like vision-language models for downstream image recognition and achieves superb domain generalization performance compared with the zero-shot model using hand-crafted prompts.

Conditional Prompt Learning for Vision-Language Models

Conditional Context Optimization (CoCoOp) extends CoOp by learning a lightweight neural network that generates an input-conditional token (vector) for each image, and yields stronger domain generalization performance as well.

CLAMP: Prompt-based Contrastive Learning for Connecting Language and Animal Pose

A novel prompt-based Contrastive learning scheme for connecting Language and AniMal Pose (CLAMP) is proposed, enabling a new cross-modal animal pose estimation paradigm.

Align-RUDDER: Learning From Few Demonstrations by Reward Redistribution

Align-RUDDER is introduced: RUDDER with two major modifications, replacing RUDDER's LSTM model by a profile model obtained from multiple sequence alignment of demonstrations, which considerably reduces the delay of rewards and thus speeds up learning.

Txt2Img-MHN: Remote Sensing Image Generation from Text Using Modern Hopfield Networks

The proposed Txt2Img-MHN can generate more realistic remote sensing images than existing methods, and overall accuracy in zero-shot classification may serve as a good metric for evaluating the ability to generate an image from text.

ConStruct-VL: Data-Free Continual Structured VL Concepts Learning

This work proposes a data-free method comprising a new Adversarial Pseudo-Replay (APR) approach, which generates adversarial reminders of past tasks from past task models, and a continual parameter-efficient Layered-LoRA (LaLo) neural architecture allowing no-memory-cost access to all past models at train time.

Teaching Structured Vision&Language Concepts to Vision&Language Models

Various techniques based on language structure understanding can be used to manipulate the textual part of off-the-shelf paired VL datasets, making more effective use of existing VL pre-training datasets without requiring any additional data.

Exploring Simple Siamese Representation Learning

  • Xinlei Chen, Kaiming He
  • Computer Science
    2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2021
Surprising empirical results are reported that simple Siamese networks can learn meaningful representations even using none of the following: (i) negative sample pairs, (ii) large batches, (iii) momentum encoders.

Hopular: Modern Hopfield Networks for Tabular Data

Hopular is a novel Deep Learning architecture for medium- and small-sized datasets, where each layer is equipped with continuous modern Hopfield networks; it surpasses Gradient Boosting, Random Forests, SVMs, and in particular several Deep Learning methods on tabular data.
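A continuous modern Hopfield layer of the kind used in Hopular and CLOOB performs a softmax-based retrieval over stored patterns. A minimal sketch, with an illustrative β and patterns stored as matrix rows (both choices are assumptions, not taken from either paper's code):

```python
import numpy as np

def hopfield_retrieve(X, xi, beta=8.0, steps=1):
    """Continuous modern Hopfield retrieval: the state update is
    xi <- X^T softmax(beta * X @ xi), where rows of X are stored
    patterns and beta controls retrieval sharpness."""
    for _ in range(steps):
        a = beta * (X @ xi)
        p = np.exp(a - a.max())   # numerically stable softmax
        xi = X.T @ (p / p.sum())
    return xi
```

With well-separated stored patterns, one update step is typically enough to pull a noisy query onto the nearest pattern.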

How Much Can CLIP Benefit Vision-and-Language Tasks?

It is shown that CLIP significantly outperforms widely-used visual encoders trained with in-domain annotated data, such as BottomUp-TopDown, and also establishes new state-of-the-art results on Visual Question Answering, Visual Entailment, and V&L Navigation tasks.

SimCSE: Simple Contrastive Learning of Sentence Embeddings

SimCSE is presented, a simple contrastive learning framework that greatly advances the state-of-the-art sentence embeddings and regularizes pre-trained embeddings’ anisotropic space to be more uniform, and it better aligns positive pairs when supervised signals are available.

A Simple Framework for Contrastive Learning of Visual Representations

It is shown that composition of data augmentations plays a critical role in defining effective predictive tasks, and introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations, and contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning.

Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere

This work identifies two key properties related to the contrastive loss: alignment (closeness) of features from positive pairs, and uniformity of the induced distribution of the (normalized) features on the hypersphere.
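Both properties can be written down directly as losses on L2-normalized embeddings. The sketch below follows the forms reported by Wang and Isola, with α = 2 for alignment and t = 2 for uniformity (parameter names and defaults here are illustrative):

```python
import numpy as np

def l2_normalize(x):
    """Project row vectors onto the unit hypersphere."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def alignment(x, y, alpha=2):
    """Mean distance between positive pairs (x[i], y[i]); 0 is perfect."""
    return np.mean(np.linalg.norm(x - y, axis=1) ** alpha)

def uniformity(x, t=2):
    """Log of the mean Gaussian potential over all distinct pairs;
    more negative means more uniformly spread on the hypersphere."""
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
    iu = np.triu_indices(len(x), k=1)
    return np.log(np.mean(np.exp(-t * d2[iu])))
```

Identical positive pairs give an alignment loss of exactly zero, while uniformity is always non-positive and decreases as embeddings spread out.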

Learning Transferable Visual Models From Natural Language Supervision

It is demonstrated that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
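The pre-training task described here is commonly implemented as a symmetric cross-entropy over the image-text similarity matrix, with matched pairs on the diagonal. A minimal NumPy sketch (the temperature value is illustrative, and embeddings are assumed L2-normalized):

```python
import numpy as np

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric cross-entropy: each image must pick out its caption
    among the batch, and each caption its image."""
    logits = img_emb @ txt_emb.T / temperature
    n = logits.shape[0]

    def ce(l):
        l = l - l.max(axis=1, keepdims=True)   # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(logp[np.arange(n), np.arange(n)])

    return 0.5 * (ce(logits) + ce(logits.T))
```

Correctly matched batches score a lower loss than the same batch with captions shuffled, which is exactly the signal the pre-training task exploits.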

Understanding Dimensional Collapse in Contrastive Self-supervised Learning

Inspired by the theory, a novel contrastive learning method is proposed, called DirectCLR, which directly optimizes the representation space without relying on an explicit trainable projector, and experiments show that DirectCLR outperforms SimCLR with a trainable linear projector on ImageNet.

Conditional Negative Sampling for Contrastive Learning of Visual Representations

This paper introduces a family of mutual information estimators that sample negatives conditionally -- in a "ring" around each positive -- and proves that these estimators lower-bound mutual information, with higher bias but lower variance than NCE.
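The "ring" construction can be sketched as a percentile band on anchor-negative similarity: candidates that are neither too easy (far away) nor near-duplicates of the positive are kept. The thresholds below are illustrative assumptions, not those of the paper:

```python
import numpy as np

def ring_negatives(sims, lower_pct=50.0, upper_pct=90.0):
    """Indices of candidate negatives whose similarity to the anchor
    falls inside a percentile 'ring' (illustrative thresholds)."""
    lo = np.percentile(sims, lower_pct)
    hi = np.percentile(sims, upper_pct)
    return np.nonzero((sims >= lo) & (sims <= hi))[0]
```

Tightening the band raises the difficulty of the sampled negatives, which matches the paper's bias-variance framing: harder rings bias the bound more but reduce estimator variance.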