Corpus ID: 248496406

Combined Scaling for Open-Vocabulary Image Classification

@inproceedings{Pham2021CombinedSF,
  title={Combined Scaling for Open-Vocabulary Image Classification},
  author={Hieu Pham and Zihang Dai and Golnaz Ghiasi and Kenji Kawaguchi and Hanxiao Liu and Adams Wei Yu and Jiahui Yu and Yi-Ting Chen and Minh-Thang Luong and Yonghui Wu and Mingxing Tan and Quoc V. Le},
  year={2021}
}
We present a combined scaling method – named BASIC – that achieves 85.7% top-1 accuracy on the ImageNet ILSVRC-2012 validation set without learning from any labeled ImageNet example. This accuracy surpasses the best published similar models – CLIP and ALIGN – by 9.3%. Our BASIC model also shows significant improvements in robustness benchmarks. For instance, on 5 test sets with natural distribution shifts such as ImageNet-{A,R,V2,Sketch} and ObjectNet, our model achieves 84.3% top-1 average accuracy…
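The abstract describes BASIC as a contrastively trained image-text model evaluated zero-shot on ImageNet. As a rough illustration of how such models classify without labeled ImageNet examples, the sketch below shows a CLIP/ALIGN-style symmetric contrastive loss and cosine-similarity zero-shot inference; the random tensors, dimensions, and temperature are placeholder assumptions, not BASIC's actual configuration.

```python
# Hedged sketch of CLIP/ALIGN/BASIC-style training and zero-shot inference.
# Random tensors stand in for encoder outputs; shapes and the temperature
# value are illustrative assumptions, not BASIC's actual settings.
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))            # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

def zero_shot_predict(image_emb, class_text_emb):
    """Pick the class whose prompt embedding is most similar to the image."""
    image_emb = F.normalize(image_emb, dim=-1)
    class_text_emb = F.normalize(class_text_emb, dim=-1)
    return (image_emb @ class_text_emb.t()).argmax(dim=-1)

if __name__ == "__main__":
    B, D, C = 8, 512, 1000  # batch size, embedding dim, number of class prompts
    loss = contrastive_loss(torch.randn(B, D), torch.randn(B, D))
    preds = zero_shot_predict(torch.randn(B, D), torch.randn(C, D))
    print(loss.item(), preds.shape)
```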
CoCa: Contrastive Captioners are Image-Text Foundation Models
TLDR
A minimalist design to pretrain an image-text encoder-decoder foundation model jointly with contrastive loss and captioning loss, thereby subsuming model capabilities from contrastive approaches like CLIP and generative methods like SimVLM.
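The CoCa TLDR above describes a single model trained jointly with a contrastive loss and a captioning loss. The sketch below shows one way such a combined objective could be wired together; the loss weights, shapes, and padding convention are illustrative assumptions rather than CoCa's published configuration.

```python
# Hedged sketch of a CoCa-style combined objective: contrastive alignment of
# pooled image/text embeddings plus next-token captioning cross-entropy.
# Weights, shapes, and the ignore_index convention are assumptions.
import torch
import torch.nn.functional as F

def joint_contrastive_captioning_loss(image_emb, text_emb,
                                      caption_logits, caption_targets,
                                      w_con=1.0, w_cap=2.0, temperature=0.07):
    # Contrastive term (CLIP-style) on pooled embeddings.
    img = F.normalize(image_emb, dim=-1)
    txt = F.normalize(text_emb, dim=-1)
    sim = img @ txt.t() / temperature
    diag = torch.arange(sim.size(0))
    l_con = 0.5 * (F.cross_entropy(sim, diag) + F.cross_entropy(sim.t(), diag))
    # Captioning term: autoregressive cross-entropy over caption tokens,
    # with -100 marking padded positions to be ignored.
    l_cap = F.cross_entropy(caption_logits.reshape(-1, caption_logits.size(-1)),
                            caption_targets.reshape(-1), ignore_index=-100)
    return w_con * l_con + w_cap * l_cap

# Toy usage with random stand-ins for model outputs.
B, D, T, V = 4, 256, 16, 32000
loss = joint_contrastive_captioning_loss(
    torch.randn(B, D), torch.randn(B, D),
    torch.randn(B, T, V), torch.randint(0, V, (B, T)))
```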

References

Showing 1-10 of 110 references
High-Performance Large-Scale Image Recognition Without Normalization
TLDR
An adaptive gradient clipping technique is developed that overcomes the training instabilities which otherwise arise when batch normalization is removed, and a significantly improved class of Normalizer-Free ResNets is designed that attains significantly better performance when fine-tuning on ImageNet.
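Since this entry is summarized by its adaptive gradient clipping technique, here is a minimal sketch of the idea: rescale a gradient whenever its norm is large relative to the norm of the parameter it updates. The per-tensor clipping and default constants below are simplifying assumptions; the paper applies the rule unit-wise.

```python
# Minimal sketch of adaptive gradient clipping (AGC): clip each gradient to a
# fixed fraction of its parameter's norm. Per-tensor clipping and the default
# constants are simplifications of the unit-wise rule in the paper.
import torch

@torch.no_grad()
def adaptive_grad_clip_(parameters, clip_factor=0.01, eps=1e-3):
    for p in parameters:
        if p.grad is None:
            continue
        param_norm = p.detach().norm().clamp_min(eps)   # floor to avoid clipping to zero
        grad_norm = p.grad.norm()
        max_norm = clip_factor * param_norm
        if grad_norm > max_norm:
            p.grad.mul_(max_norm / grad_norm.clamp_min(1e-6))

# Usage: call between loss.backward() and optimizer.step(), e.g.
#   loss.backward(); adaptive_grad_clip_(model.parameters()); optimizer.step()
```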
Evaluation of output embeddings for fine-grained image classification
TLDR
This project shows that compelling classification performance can be achieved on fine-grained categories even without labeled training data, and establishes a substantially improved state-of-the-art on the Animals with Attributes and Caltech-UCSD Birds datasets.
Do ImageNet Classifiers Generalize to ImageNet?
TLDR
The results suggest that the accuracy drops are not caused by adaptivity, but by the models' inability to generalize to slightly "harder" images than those found in the original test sets.
Exploring the Limits of Weakly Supervised Pretraining
TLDR
This paper presents a unique study of transfer learning with large convolutional networks trained to predict hashtags on billions of social media images and shows improvements on several image classification and object detection tasks, and reports the highest ImageNet-1k single-crop, top-1 accuracy to date.
Revisiting ResNets: Improved Training and Scaling Strategies
TLDR
It is found that training and scaling strategies may matter more than architectural changes, and further, that the resulting ResNets match recent state-of-the-art models.
CoAtNet: Marrying Convolution and Attention for All Data Sizes
TLDR
CoAtNets (pronounced “coat” nets), a family of hybrid models built from two key insights: (1) depthwise convolution and self-attention can be naturally unified via simple relative attention, and (2) vertically stacking convolution layers and attention layers in a principled way is surprisingly effective at improving generalization, capacity, and efficiency.
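The CoAtNet summary hinges on unifying depthwise convolution and self-attention through simple relative attention. Below is a hedged, one-dimensional sketch of that idea: attention logits receive a learned bias indexed by the relative position of query and key, which plays the role of the convolution-like static term. The real model uses 2-D relative attention inside a staged convolution/attention stack; all sizes here are toy values.

```python
# Hedged 1-D sketch of relative self-attention in the spirit of CoAtNet:
# content-based attention logits plus a learned bias indexed by relative
# position (the depthwise-convolution-like static term).
import math
import torch
import torch.nn as nn

class RelativeSelfAttention1D(nn.Module):
    def __init__(self, dim, max_len):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.scale = 1.0 / math.sqrt(dim)
        # One learnable bias per relative offset in [-(max_len-1), max_len-1].
        self.rel_bias = nn.Parameter(torch.zeros(2 * max_len - 1))
        self.max_len = max_len

    def forward(self, x):                                   # x: (batch, length, dim)
        b, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        logits = q @ k.transpose(-2, -1) * self.scale       # (b, n, n) content term
        rel = torch.arange(n)[:, None] - torch.arange(n)[None, :]  # relative offsets
        logits = logits + self.rel_bias[rel + self.max_len - 1]    # static, conv-like term
        return torch.softmax(logits, dim=-1) @ v

# Toy usage.
attn = RelativeSelfAttention1D(dim=64, max_len=16)
out = attn(torch.randn(2, 16, 64))                          # (2, 16, 64)
```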
Large scale image annotation: learning to rank with joint word-image embeddings
TLDR
This work proposes a strongly performing method that scales to image annotation datasets by simultaneously learning to optimize precision at k of the ranked list of annotations for a given image and learning a low-dimensional joint embedding space for both images and annotations.
Fixing the train-test resolution discrepancy
TLDR
It is experimentally validated that, for a target test resolution, using a lower train resolution offers better classification at test time, and a simple yet effective and efficient strategy to optimize the classifier performance when the train and test resolutions differ is proposed.
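Because the train-test resolution finding amounts to a preprocessing recipe, a hedged sketch of the typical setup is given below: random resized crops at a lower resolution for training, and a resize followed by a center crop at the higher test resolution for evaluation, after which the paper's strategy fine-tunes the classifier at the test resolution. The 160/224 resolution pair and the ImageNet normalization statistics are illustrative choices, not the paper's exact values.

```python
# Hedged sketch of train/test pipelines for the resolution-discrepancy recipe:
# train with random resized crops at a lower resolution, evaluate with a
# resize + center crop at a higher one, then fine-tune the classifier at the
# test resolution. Resolutions and statistics below are illustrative.
from torchvision import transforms

TRAIN_RES, TEST_RES = 160, 224
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(TRAIN_RES),    # aggressive crops at the lower train resolution
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    normalize,
])

test_transform = transforms.Compose([
    transforms.Resize(int(TEST_RES * 1.15)),    # resize, then crop at the test resolution
    transforms.CenterCrop(TEST_RES),
    transforms.ToTensor(),
    normalize,
])
```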
DeViSE: A Deep Visual-Semantic Embedding Model
TLDR
This paper presents a new deep visual-semantic embedding model trained to identify visual objects using both labeled image data and semantic information gleaned from unannotated text, and shows that the semantic information can be exploited to make predictions about tens of thousands of image labels not observed during training.
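The DeViSE summary describes mapping image features into a pretrained word-embedding space so that labels unseen during training can still be scored. A hedged sketch of that scheme follows: a learned linear map into the text-embedding space trained with a margin ranking loss, and prediction by nearest label embedding. The dimensions, margin, and the use of all other labels as negatives are illustrative simplifications.

```python
# Hedged sketch of a DeViSE-style visual-semantic embedding: a linear map from
# image features into a fixed word-embedding space, trained with a hinge
# ranking loss. Sizes and the margin are illustrative; label embeddings would
# come from a pretrained text model rather than torch.randn.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualSemanticEmbedding(nn.Module):
    def __init__(self, feat_dim, word_dim):
        super().__init__()
        self.proj = nn.Linear(feat_dim, word_dim, bias=False)

    def scores(self, image_feats, label_embs):
        """Cosine similarity between projected images and all label embeddings."""
        v = F.normalize(self.proj(image_feats), dim=-1)
        t = F.normalize(label_embs, dim=-1)
        return v @ t.t()                                     # (batch, num_labels)

def hinge_rank_loss(scores, targets, margin=0.1):
    """Push the true label's score above every other label's by a margin."""
    pos = scores.gather(1, targets[:, None])                 # (batch, 1) true-label scores
    violations = (margin - pos + scores).clamp_min(0)        # per-label margin violations
    mask = torch.ones_like(scores).scatter(1, targets[:, None], 0.0)  # drop the true label
    return (violations * mask).mean()

# Toy usage: unseen labels can be appended to `label_embs` at test time and
# scored with the same learned projection.
model = VisualSemanticEmbedding(feat_dim=2048, word_dim=300)
scores = model.scores(torch.randn(4, 2048), torch.randn(50, 300))
loss = hinge_rank_loss(scores, torch.randint(0, 50, (4,)))
preds = scores.argmax(dim=-1)
```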
Label-Embedding for Image Classification
TLDR
This work proposes to view attribute-based image classification as a label-embedding problem: each class is embedded in the space of attribute vectors, and introduces a function that measures the compatibility between an image and a label embedding.
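Finally, the label-embedding entry is summarized by a compatibility function between an image and a label embedding. A hedged sketch of the usual bilinear form F(x, y) = theta(x)^T W phi(y), with per-class attribute vectors playing the role of phi(y), is given below; the dimensions and the random attribute matrix are placeholders.

```python
# Hedged sketch of a label-embedding compatibility function in the spirit of
# attribute-based label embedding: F(x, y) = theta(x)^T W phi(y), where phi(y)
# is a per-class attribute vector. Sizes and random attributes are placeholders.
import torch
import torch.nn as nn

class BilinearCompatibility(nn.Module):
    def __init__(self, feat_dim, attr_dim):
        super().__init__()
        self.W = nn.Parameter(torch.randn(feat_dim, attr_dim) * 0.01)

    def forward(self, image_feats, class_attrs):
        # (batch, feat_dim) @ (feat_dim, attr_dim) @ (attr_dim, num_classes)
        return image_feats @ self.W @ class_attrs.t()        # (batch, num_classes)

# Classification = argmax of compatibility over the (possibly unseen) classes.
model = BilinearCompatibility(feat_dim=2048, attr_dim=85)
scores = model(torch.randn(4, 2048), torch.randn(50, 85))    # 50 classes, 85 attributes
preds = scores.argmax(dim=-1)
```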