SVL-Adapter: Self-Supervised Adapter for Vision-Language Pretrained Models

@article{Pantazis2022SVLAdapterSA,
  title={SVL-Adapter: Self-Supervised Adapter for Vision-Language Pretrained Models},
  author={Omiros Pantazis and Gabriel J. Brostow and Kate Jones and Oisin Mac Aodha},
  journal={ArXiv},
  year={2022},
  volume={abs/2210.03794}
}
Vision-language models such as CLIP are pretrained on large volumes of internet-sourced image and text pairs, and have been shown to sometimes exhibit impressive zero- and low-shot image classification performance. However, due to their size, fine-tuning these models on new datasets can be prohibitively expensive, both in terms of the supervision and compute required. To combat this, a series of light-weight adaptation methods have been proposed to efficiently adapt such models when limited…
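For concreteness, here is a minimal sketch of the zero-shot classification setup the abstract refers to, assuming OpenAI's open-source clip package; the class names, prompt template, and image path are illustrative placeholders, not the paper's protocol.

```python
# Zero-shot classification with a CLIP-like model (illustrative sketch).
import torch
import clip  # OpenAI's CLIP package: https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["cat", "dog", "bird"]  # hypothetical downstream classes
prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(prompts)
    # Cosine similarity between the image and each class prompt.
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    logits = 100.0 * image_feat @ text_feat.T
    pred = logits.argmax(dim=-1)

print(class_names[pred.item()])
```

Light-weight adaptation methods such as the ones cited below start from exactly these frozen image and text features and add a small number of tunable parameters on top.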
1 Citation

SuS-X: Training-Free Name-Only Transfer of Vision-Language Models

A novel method, SuS-X, consisting of two key building blocks, “SuS” and “TIP-X”, is proposed that requires neither intensive fine-tuning nor costly labelled data, and achieves state-of-the-art zero-shot classification results on 19 benchmark datasets.

References

Showing 1–10 of 68 references

CLIP-Adapter: Better Vision-Language Models with Feature Adapters

This paper shows that there is an alternative path to achieving better vision-language models other than prompt tuning, and proposes CLIP-Adapter, which conducts fine-tuning with feature adapters on either the visual or the language branch.
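A rough sketch of a feature adapter in this spirit: a small bottleneck MLP over frozen CLIP features, blended residually with the original feature. The feature dimension, reduction factor, and blending ratio alpha are assumptions for illustration.

```python
# Residual bottleneck adapter over frozen CLIP features (sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAdapter(nn.Module):
    def __init__(self, dim=512, reduction=4, alpha=0.2):
        super().__init__()
        self.alpha = alpha  # residual blending ratio (hyperparameter)
        self.net = nn.Sequential(
            nn.Linear(dim, dim // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim),
            nn.ReLU(inplace=True),
        )

    def forward(self, feat):
        adapted = self.net(feat)
        # Mix adapted and original features so zero-shot knowledge is retained.
        out = self.alpha * adapted + (1 - self.alpha) * feat
        return F.normalize(out, dim=-1)
```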

Learning Transferable Visual Models From Natural Language Supervision

It is demonstrated that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
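A minimal sketch of that pre-training objective, assuming a batch of paired, already-encoded image and text embeddings; the temperature value and tensor shapes are illustrative.

```python
# Symmetric contrastive loss: predict which caption goes with which image.
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    image_emb = F.normalize(image_emb, dim=-1)     # (B, D)
    text_emb = F.normalize(text_emb, dim=-1)       # (B, D)
    logits = image_emb @ text_emb.T / temperature  # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matched pairs lie on the diagonal; score rows and columns as classification.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.T, targets)
    return (loss_i2t + loss_t2i) / 2
```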

Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling

Tip-Adapter does not require any backpropagation to train the adapter; instead, it creates the weights via a key-value cache model constructed from the few-shot training set, acquiring well-performing adapter weights without any training, which is both efficient and effective.
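A sketch of the key-value cache idea, under the assumptions that cache keys are normalized few-shot image features and cache values are one-hot labels; the blending hyperparameters alpha and beta are placeholders.

```python
# Training-free key-value cache adapter (sketch in the spirit of Tip-Adapter).
import torch

def cache_adapter_logits(test_feat, cache_keys, cache_values, clip_logits,
                         alpha=1.0, beta=5.5):
    # test_feat:    (B, D) normalized test image features
    # cache_keys:   (N, D) normalized few-shot training features
    # cache_values: (N, C) one-hot labels of the few-shot set
    # clip_logits:  (B, C) zero-shot logits from the text prompts
    affinity = test_feat @ cache_keys.T          # (B, N) cosine similarity
    weights = torch.exp(-beta * (1 - affinity))  # sharpened similarities
    cache_logits = weights @ cache_values        # (B, C)
    return clip_logits + alpha * cache_logits
```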

Unsupervised Prompt Learning for Vision-Language Models

This paper presents an unsupervised prompt learning (UPL) approach to avoid prompt engineering while simultaneously improving transfer performance of CLIP-like vision-language models.

Learning to Prompt for Vision-Language Models

Context Optimization (CoOp) is proposed, a simple approach specifically for adapting CLIP-like vision-language models for downstream image recognition that achieves superb domain generalization performance compared with the zero-shot model using hand-crafted prompts.
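A simplified sketch of the learnable-context idea: hand-crafted prompt words are replaced by context vectors that are optimized on a few labelled examples while the rest of the model stays frozen. The interface to the frozen text encoder is an assumption.

```python
# Learnable prompt context prepended to class-name token embeddings (sketch).
import torch
import torch.nn as nn

class LearnableContext(nn.Module):
    def __init__(self, n_ctx=16, ctx_dim=512, n_classes=10):
        super().__init__()
        # Shared learnable context tokens, randomly initialized.
        self.ctx = nn.Parameter(torch.randn(n_ctx, ctx_dim) * 0.02)
        self.n_classes = n_classes

    def forward(self, class_token_embeds):
        # class_token_embeds: (n_classes, n_name_tokens, ctx_dim), kept frozen
        ctx = self.ctx.unsqueeze(0).expand(self.n_classes, -1, -1)
        # Learned context followed by the class-name tokens; the result is fed
        # to the frozen text encoder and trained with cross-entropy on few shots.
        return torch.cat([ctx, class_token_embeds], dim=1)
```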

Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision

A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss, and it is shown that the scale of the corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
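A minimal sketch of such a dual-encoder setup, with placeholder backbones and dimensions; it would be trained with the same kind of symmetric contrastive loss sketched earlier.

```python
# Dual encoder projecting images and text into a shared embedding space (sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    def __init__(self, image_backbone, text_backbone, img_dim, txt_dim, emb_dim=512):
        super().__init__()
        self.image_backbone = image_backbone  # any image tower returning (B, img_dim)
        self.text_backbone = text_backbone    # any text tower returning (B, txt_dim)
        self.image_proj = nn.Linear(img_dim, emb_dim)
        self.text_proj = nn.Linear(txt_dim, emb_dim)

    def forward(self, images, texts):
        img_emb = F.normalize(self.image_proj(self.image_backbone(images)), dim=-1)
        txt_emb = F.normalize(self.text_proj(self.text_backbone(texts)), dim=-1)
        return img_emb, txt_emb  # feed into the contrastive loss above
```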

Rethinking Few-Shot Image Classification: a Good Embedding Is All You Need?

It is shown that a simple baseline: learning a supervised or self-supervised representation on the meta-training set, followed by training a linear classifier on top of this representation, outperforms state-of-the-art few-shot learning methods.
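A sketch of that baseline, assuming frozen embeddings have already been extracted for the support and query sets; scikit-learn's LogisticRegression stands in for the linear classifier.

```python
# Frozen-embedding + linear-classifier few-shot baseline (sketch).
from sklearn.linear_model import LogisticRegression

def few_shot_linear_baseline(train_feats, train_labels, test_feats):
    # train_feats: (N, D) frozen embeddings of the few-shot support set
    # test_feats:  (M, D) frozen embeddings of the query set
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_feats, train_labels)
    return clf.predict(test_feats)
```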

Pushing the Limits of Simple Pipelines for Few-Shot Learning: External Data and Fine-Tuning Make a Difference

It is shown that a simple transformer-based pipeline yields surprisingly good performance on standard benchmarks such as Mini-ImageNet, CIFAR-FS, CDFSL and Meta-Dataset.

When Does Contrastive Visual Representation Learning Work?

Recent self-supervised representation learning techniques have largely closed the gap between supervised and unsupervised learning on ImageNet classification. While the particulars of pretraining on…

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
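A small sketch of the patch-embedding step implied by the title: the image is split into 16x16 patches, each projected to a token before being fed to a transformer. The sizes below are illustrative.

```python
# ViT-style patch embedding: one token per 16x16 patch (sketch).
import torch
import torch.nn as nn

patch_embed = nn.Conv2d(3, 768, kernel_size=16, stride=16)
image = torch.randn(1, 3, 224, 224)
tokens = patch_embed(image).flatten(2).transpose(1, 2)  # (1, 196, 768)
```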
...