Masked Unsupervised Self-training for Zero-shot Image Classification
@article{Li2022MaskedUS,
  title={Masked Unsupervised Self-training for Zero-shot Image Classification},
  author={Junnan Li and Silvio Savarese and Steven C. H. Hoi},
  journal={ArXiv},
  year={2022},
  volume={abs/2206.02967}
}
State-of-the-art computer vision models are mostly trained with supervised learning using human-labeled images, which limits their scalability due to the expensive annotation cost. While self-supervised representation learning has achieved impressive progress, it still requires a second stage of finetuning on labeled data. On the other hand, models pre-trained with large-scale text-image supervision (e.g., CLIP) have enabled zero-shot transfer to downstream image classification tasks. However…
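The zero-shot transfer the abstract refers to works by comparing an image embedding against text embeddings of class-name prompts and picking the most similar one. A minimal NumPy sketch of that scoring step (toy embeddings for illustration, not CLIP's actual encoders):

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, class_names):
    """Pick the class whose prompt embedding is most similar to the image.

    image_emb: (d,) embedding of the input image
    text_embs: (num_classes, d) embeddings of prompts like "a photo of a {class}"
    Both are L2-normalized so the dot product equals cosine similarity.
    """
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = text_embs @ image_emb          # cosine similarity per class
    return class_names[int(np.argmax(sims))]

# Toy example: 3 classes in a 4-dim embedding space.
classes = ["cat", "dog", "car"]
texts = np.array([[1.0, 0.1, 0.0, 0.0],
                  [0.1, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 0.2]])
image = np.array([0.9, 0.2, 0.05, 0.0])  # closest to the "cat" prompt
print(zero_shot_classify(image, texts, classes))  # → cat
```

No labeled training data for the target classes is needed; the class set is defined entirely by the text prompts, which is what makes the transfer "zero-shot".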
4 Citations
SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for Few-shot Image Classification
- Computer Science · ArXiv
- 2022
A new framework is proposed, named Semantic-guided Visual Adapting (SgVA), which can effectively extend vision-language pre-trained models to produce discriminative task-specific visual features by comprehensively using a vision-specific contrastive loss, a cross-modal contrastive loss, and an implicit knowledge distillation.
Neighborhood-Regularized Self-Training for Learning with Few Labels
- Computer Science · ArXiv
- 2023
Inspired by the fact that samples with similar labels tend to share similar representations, a neighborhood-based sample selection approach is developed to tackle the issue of noisy pseudo labels and stabilize self-training via aggregating the predictions from different rounds during sample selection.
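A generic sketch of the idea behind neighborhood-based pseudo-label selection (this is an illustrative filter under the stated assumption, not the paper's exact algorithm): keep a pseudo-labeled sample only when enough of its nearest neighbors in feature space carry the same pseudo-label.

```python
import numpy as np

def select_confident_pseudo_labels(feats, pseudo_labels, k=3, min_agree=0.6):
    """Neighborhood-based filter for noisy pseudo-labels (illustrative).

    Keep sample i only when at least `min_agree` of its k nearest
    neighbors (by cosine similarity in feature space) share its pseudo-label.
    """
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sims = f @ f.T
    np.fill_diagonal(sims, -np.inf)           # exclude each sample from its own neighbors
    keep = []
    for i in range(len(feats)):
        nbrs = np.argsort(sims[i])[-k:]       # indices of the k most similar samples
        agree = np.mean(pseudo_labels[nbrs] == pseudo_labels[i])
        if agree >= min_agree:
            keep.append(i)
    return np.array(keep)

# Two tight clusters; index 2 sits in cluster 0 but carries pseudo-label 1.
feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.95, 0.05],
                  [0.0, 1.0], [0.1, 0.9], [0.05, 0.95]])
labels = np.array([0, 0, 1, 1, 1, 1])
keep = select_confident_pseudo_labels(feats, labels, k=2, min_agree=0.5)
print(keep)  # → [0 1 3 4 5]: the mislabeled sample 2 is filtered out
```

Filtered samples would simply be excluded from the next self-training round rather than relabeled.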
Dataset Summarization by K Principal Concepts
- Computer Science
- 2021
The approach provides a more explicit summary in comparison to selecting K representative images, which are often ambiguous, and the K principal concepts can be used to classify the dataset into K groups.
Improving Zero-Shot Models with Label Distribution Priors
- Computer Science · ArXiv
- 2022
A new approach is proposed, CLIPPR (CLIP with Priors), which adapts zero-shot models for regression and classification on unlabelled datasets and presents an improvement of 28% in mean absolute error on the UTK age regression task.
References
Showing 1-10 of 51 references
Learning Transferable Visual Models From Natural Language Supervision
- Computer Science · ICML
- 2021
It is demonstrated that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm
- Computer Science · ArXiv
- 2021
By carefully utilizing the widespread supervision among image-text pairs, DeCLIP can learn generic visual features more efficiently, exploiting the data's potential through self-supervision within each modality, multi-view supervision across modalities, and nearest-neighbor supervision from other similar pairs.
MoPro: Webly Supervised Learning with Momentum Prototypes
- Computer Science · ICLR
- 2021
We propose a webly-supervised representation learning method that does not suffer from the annotation unscalability of supervised learning, nor the computation unscalability of self-supervised…
Better Self-training for Image Classification through Self-supervision
- Computer Science · AI
- 2022
Empirical results show that applying self-supervision only in the first iteration of self-training can greatly improve accuracy, for a modest increase in computation time.
Rethinking Pre-training and Self-training
- Computer Science · NeurIPS
- 2020
Self-training works well in exactly the setup where pre-training does not (using ImageNet to help COCO), and on the PASCAL segmentation dataset, although pre-training does help significantly, self-training improves further upon the pre-trained model.
LiT: Zero-Shot Transfer with Locked-image text Tuning
- Computer Science · 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2022
In the empirical study, it is found that locked pre-trained image models with unlocked text models work best, and the proposed LiT model achieves 84.5% zero-shot transfer accuracy on the ImageNet test set, and 81.1% on the challenging out-of-distribution ObjectNet test set.
Debiased Learning from Naturally Imbalanced Pseudo-Labels for Zero-Shot and Semi-Supervised Learning
- Computer Science · ArXiv
- 2022
To eliminate the model bias, this work proposes a simple yet effective method, DebiasMatch, comprising an adaptive debiasing module and an adaptive marginal loss, which significantly outperforms previous state-of-the-art methods on zero-shot and semi-supervised learning tasks.
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
- Computer Science · ICML
- 2021
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss, and it is shown that the scale of the corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
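The dual-encoder contrastive objective described here can be sketched as a symmetric InfoNCE loss over a batch of paired image/text embeddings. A toy NumPy version (an illustration of the general scheme, not the paper's implementation):

```python
import numpy as np

def contrastive_loss(img_embs, txt_embs, temperature=0.07):
    """Symmetric InfoNCE loss for a batch of matched (image, text) pairs.

    Row i of img_embs is assumed to pair with row i of txt_embs; all other
    rows in the batch serve as negatives.
    """
    img = img_embs / np.linalg.norm(img_embs, axis=1, keepdims=True)
    txt = txt_embs / np.linalg.norm(txt_embs, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (batch, batch) similarity matrix

    def cross_entropy(l):
        # mean negative log-softmax of the diagonal (the correct pair per row)
        l = l - l.max(axis=1, keepdims=True)    # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.diag(log_probs).mean()

    # average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

# Correctly aligned pairs give a low loss; shuffled pairs give a higher one.
rng = np.random.default_rng(0)
embs = rng.normal(size=(8, 16))
aligned = contrastive_loss(embs, embs)
shuffled = contrastive_loss(embs, embs[::-1])
print(aligned < shuffled)  # → True
```

The key point the summary makes is that this objective is noise-tolerant: wrong captions act like weak negatives, so scaling up the (noisy) corpus still improves the learned alignment.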
CLIP-Adapter: Better Vision-Language Models with Feature Adapters
- Computer Science · ArXiv
- 2021
This paper shows that there is an alternative path to better vision-language models other than prompt tuning, and proposes CLIP-Adapter to conduct fine-tuning with feature adapters on either the visual or the language branch.
A Survey of Zero-Shot Learning
- Computer Science · ACM Trans. Intell. Syst. Technol.
- 2019
This paper categorizes existing zero-shot learning methods and introduces representative methods under each category, and highlights promising future research directions of zero-shot learning.