Masked Unsupervised Self-training for Zero-shot Image Classification

Junnan Li, Silvio Savarese, Steven C. H. Hoi
State-of-the-art computer vision models are mostly trained with supervised learning using human-labeled images, which limits their scalability due to the expensive annotation cost. While self-supervised representation learning has achieved impressive progress, it still requires a second stage of finetuning on labeled data. On the other hand, models pre-trained with large-scale text-image supervision (e.g., CLIP) have enabled zero-shot transfer to downstream image classification tasks. However… 


SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for Few-shot Image Classification

A new framework is proposed, named Semantic-guided Visual Adapting (SgVA), which can effectively extend vision-language pre-trained models to produce discriminative task-specific visual features by comprehensively using a vision-specific contrastive loss, a cross-modal contrastive loss, and implicit knowledge distillation.

Neighborhood-Regularized Self-Training for Learning with Few Labels

Inspired by the fact that samples with similar labels tend to share similar representations, a neighborhood-based sample selection approach is developed to tackle the issue of noisy pseudo labels and stabilize self-training via aggregating the predictions from different rounds during sample selection.

Dataset Summarization by K Principal Concepts

The approach provides a more explicit summary in comparison to selecting K representative images, which are often ambiguous, and the K principal concepts can be used to classify the dataset into K groups.

Improving Zero-Shot Models with Label Distribution Priors

A new approach is proposed, CLIPPR (CLIP with Priors), which adapts zero-shot models for regression and classification on unlabelled datasets and presents an improvement of 28% in mean absolute error on the UTK age regression task.

Learning Transferable Visual Models From Natural Language Supervision

It is demonstrated that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.

Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm

By carefully utilizing the widespread supervision among the image-text pairs, the DeCLIP can learn generic visual features more efficiently and exploit data potential through the use of self-supervision within each modality; multi-view supervision across modalities; and nearest-neighbor supervision from other similar pairs.

MoPro: Webly Supervised Learning with Momentum Prototypes

We propose a webly-supervised representation learning method that does not suffer from the annotation unscalability of supervised learning, nor the computation unscalability of self-supervised learning.

Better Self-training for Image Classification through Self-supervision

Empirical results show that applying self-supervision only in the first iteration of self-training can greatly improve accuracy, for a modest increase in computation time.

Rethinking Pre-training and Self-training

Self-training works well in exactly the setup where pre-training does not (using ImageNet to help COCO), and on the PASCAL segmentation dataset, though pre-training does help significantly, self-training still improves upon the pre-trained model.

LiT: Zero-Shot Transfer with Locked-image text Tuning

In the empirical study, it is found that locked pre-trained image models with unlocked text models work best, and the proposed LiT model achieves 84.5% zero-shot transfer accuracy on the ImageNet test set, and 81.1% on the challenging out-of-distribution ObjectNet test set.

Debiased Learning from Naturally Imbalanced Pseudo-Labels for Zero-Shot and Semi-Supervised Learning

To eliminate the model bias, this work proposes a simple yet effective method, DebiasMatch, comprising an adaptive debiasing module and an adaptive marginal loss, which significantly outperforms previous state-of-the-art methods on zero-shot and semi-supervised learning tasks.

Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision

A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss, and it is shown that the scale of the corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.

CLIP-Adapter: Better Vision-Language Models with Feature Adapters

This paper shows that there is an alternative path to better vision-language models other than prompt tuning, and proposes CLIP-Adapter to conduct fine-tuning with feature adapters on either the visual or the language branch.

A Survey of Zero-Shot Learning

This paper categorizes existing zero-shot learning methods and introduces representative methods under each category, and highlights promising future research directions of zero-shot learning.