Florence: A New Foundation Model for Computer Vision
@article{Yuan2021FlorenceAN,
  title={Florence: A New Foundation Model for Computer Vision},
  author={Lu Yuan and Dongdong Chen and Yi-Ling Chen and Noel C. F. Codella and Xiyang Dai and Jianfeng Gao and Houdong Hu and Xuedong Huang and Boxin Li and Chunyuan Li and Ce Liu and Mengchen Liu and Zicheng Liu and Yumao Lu and Yu Shi and Lijuan Wang and Jianfeng Wang and Bin Xiao and Zhen Xiao and Jianwei Yang and Michael Zeng and Luowei Zhou and Pengchuan Zhang},
  journal={ArXiv},
  year={2021},
  volume={abs/2111.11432}
}
Automated visual understanding of our diverse and open world demands that computer vision models generalize well with minimal customization for specific tasks, similar to human vision. Computer vision foundation models, which are trained on diverse, large-scale datasets and can be adapted to a wide range of downstream tasks, are critical to solving real-world computer vision applications. While existing vision foundation models such as CLIP (Radford et al., 2021), ALIGN (Jia et al…
27 Citations
Vision-Language Intelligence: Tasks, Representation Learning, and Large Models
- Computer Science, ArXiv
- 2022
The development in this field is summarized into three time periods, namely task-specific methods, vision-language pre-training (VLP) methods, and larger models empowered by large-scale weakly-labeled data.
K-LITE: Learning Transferable Visual Models with External Knowledge
- Computer Science, ArXiv
- 2022
This paper proposes K-LITE, a simple strategy that leverages external knowledge to build transferable visual systems; the resulting knowledge-augmented models show signs of improved transfer learning performance over existing methods.
Flamingo: a Visual Language Model for Few-Shot Learning
- Computer Science, ArXiv
- 2022
It is demonstrated that a single Flamingo model can achieve a new state of the art for few-shot learning, simply by prompting the model with task-specific examples.
In Defense of Image Pre-Training for Spatiotemporal Recognition
- Computer Science, ArXiv
- 2022
The experiments show that simply replacing 3D convolution with STS notably improves a wide range of 3D CNNs, without increasing parameters or computation, on both Kinetics-400 and Something-Something V2, and that the new training pipeline consistently achieves better video recognition results with a training speedup.
"This is my unicorn, Fluffy": Personalizing frozen vision-language representations
- Computer Science, ArXiv
- 2022
This work proposes an architecture for solving PerVL (personalized vision-and-language tasks) by extending the input vocabulary of a pretrained model with new word embeddings for the personalized concepts, and demonstrates that the approach learns personalized visual concepts from a few examples and can apply them effectively in image retrieval and semantic segmentation with rich textual queries.
ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models
- Computer Science, ArXiv
- 2022
Novel language-aware initialization methods are proposed that significantly improve the adaptation performance of language-augmented visual models, and an automatic hyper-parameter tuning toolkit is developed to ensure fairness in model adaptation.
Masked Feature Prediction for Self-Supervised Visual Pre-Training
- Computer Science, ArXiv
- 2021
This work presents Masked Feature Prediction (MaskFeat), which first randomly masks out a portion of the input sequence and then predicts the feature of the masked regions, and finds Histograms of Oriented Gradients (HOG), a hand-crafted feature descriptor, works particularly well in terms of both performance and efficiency.
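To make the masked feature prediction recipe concrete, here is a toy PyTorch sketch, assuming an encoder that maps patch tokens to predicted target features and precomputed per-patch HOG targets; it illustrates the general idea rather than the MaskFeat implementation, and the mask ratio and zero mask token are placeholder choices.

```python
import torch
import torch.nn as nn

def masked_feature_loss(encoder: nn.Module,
                        patch_tokens: torch.Tensor,   # (B, N, D) patch embeddings
                        target_feats: torch.Tensor,   # (B, N, F) e.g. per-patch HOG
                        mask_ratio: float = 0.4) -> torch.Tensor:
    """Toy masked-feature-prediction loss; `encoder` is assumed to map
    (B, N, D) tokens to (B, N, F) predicted features."""
    B, N, D = patch_tokens.shape
    # Randomly hide a subset of patches in each sample.
    mask = torch.rand(B, N, device=patch_tokens.device) < mask_ratio   # (B, N) bool
    mask_token = torch.zeros(D, device=patch_tokens.device)            # learned in practice
    masked_input = torch.where(mask.unsqueeze(-1), mask_token, patch_tokens)
    pred = encoder(masked_input)                                        # (B, N, F)
    # Regress the hand-crafted target only on the hidden positions.
    return ((pred - target_feats) ** 2)[mask].mean()
```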
CoCa: Contrastive Captioners are Image-Text Foundation Models
- Computer Science, ArXiv
- 2022
A minimalist design to pretrain an image-text encoder-decoder foundation model jointly with contrastive loss and captioning loss, thereby subsuming model capabilities from contrastive approaches like CLIP and generative methods like SimVLM.
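As a hedged illustration of training with both objectives, the sketch below sums a CLIP-style contrastive term and a teacher-forced captioning term; the loss weights and temperature are placeholder values, not the ones CoCa uses.

```python
import torch
import torch.nn.functional as F

def contrastive_plus_captioning_loss(image_emb, text_emb,
                                     caption_logits, caption_tokens,
                                     caption_weight=2.0, temperature=0.07):
    """image_emb/text_emb: (B, D); caption_logits: (B, T, V); caption_tokens: (B, T)."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature            # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    contrastive = (F.cross_entropy(logits, targets) +
                   F.cross_entropy(logits.t(), targets)) / 2
    # Next-token prediction for the caption decoder (teacher forcing).
    captioning = F.cross_entropy(caption_logits.flatten(0, 1),
                                 caption_tokens.flatten())
    return contrastive + caption_weight * captioning
```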
Multiview Transformers for Video Recognition
- Computer Science, ArXiv
- 2022
This work presents Multiview Transformers for Video Recognition (MTV), a model that consists of separate encoders to represent different views of the input video with lateral connections to fuse information across views and achieves state-of-the-art results on six standard datasets.
An Empirical Study of Training End-to-End Vision-and-Language Transformers
- Computer Science, ArXiv
- 2021
This paper systematically investigates how to design and pre-train a fully transformer-based VL model in an end-to-end manner, and provides insights on how to train a performant VL transformer while maintaining fast inference speed.
References
Showing 1-10 of 87 references
VinVL: Revisiting Visual Representations in Vision-Language Models
- Computer Science, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2021
This paper develops an improved object detection model to provide object-centric representations of images, feeds the generated visual features into a Transformer-based VL fusion model, OSCAR, and utilizes an improved approach, OSCAR+, to pre-train the VL model and fine-tune it on a wide range of downstream VL tasks.
Learning Transferable Visual Models From Natural Language Supervision
- Computer Science, ICML
- 2021
It is demonstrated that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
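The pre-training task summarized here, predicting which caption goes with which image within a batch, reduces to a symmetric cross-entropy over an image-text similarity matrix. The snippet below is a minimal sketch of that objective, not the released CLIP code; the temperature is a placeholder.

```python
import torch
import torch.nn.functional as F

def image_text_contrastive_loss(image_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """image_emb, text_emb: (B, D) outputs of separate image/text encoders."""
    # Normalize so the dot product is a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature              # (B, B)
    # The matching caption for image i sits on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)                  # image -> text
    loss_t2i = F.cross_entropy(logits.t(), targets)              # text -> image
    return (loss_i2t + loss_t2i) / 2
```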
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations
- Computer Science, International Journal of Computer Vision
- 2016
The Visual Genome dataset is presented, which contains over 108K images where each image has an average of 35 objects, 26 attributes, and 21 pairwise relationships between objects, and represents the densest and largest dataset of image descriptions, objects, attributes, relationships, and question-answer pairs.
How Much Can CLIP Benefit Vision-and-Language Tasks?
- Computer Science, ArXiv
- 2021
It is shown that CLIP significantly outperforms widely-used visual encoders trained with in-domain annotated data, such as BottomUp-TopDown, and also establishes new state-of-the-art results on Visual Question Answering, Visual Entailment, and V&L Navigation tasks.
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
- Computer Science, NeurIPS
- 2021
This paper introduces a contrastive loss to ALign the image and text representations BEfore Fusing through cross-modal attention, which enables more grounded vision and language representation learning and proposes momentum distillation, a self-training method which learns from pseudo-targets produced by a momentum model.
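A rough sketch of the momentum-distillation idea follows, assuming an online encoder and an EMA ("momentum") copy that both produce paired image/text embeddings; only the image-to-text direction of the distilled contrastive loss is shown, and the momentum coefficient, mixing weight, and temperature are placeholder values, not ALBEF's.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(online: torch.nn.Module, momentum: torch.nn.Module, m: float = 0.995):
    # Momentum-encoder weights drift slowly toward the online encoder.
    for p, p_m in zip(online.parameters(), momentum.parameters()):
        p_m.data.mul_(m).add_(p.data, alpha=1 - m)

def distilled_contrastive(img, txt, img_m, txt_m, alpha=0.4, temperature=0.07):
    """img/txt: online embeddings (B, D); img_m/txt_m: momentum embeddings (B, D)."""
    logits = F.normalize(img, dim=-1) @ F.normalize(txt, dim=-1).t() / temperature
    with torch.no_grad():
        # Soft pseudo-targets from the momentum model, mixed with the one-hot labels.
        soft = (F.normalize(img_m, dim=-1) @ F.normalize(txt_m, dim=-1).t()
                / temperature).softmax(dim=-1)
        hard = torch.eye(logits.size(0), device=logits.device)
        targets = alpha * soft + (1 - alpha) * hard
    return -(targets * logits.log_softmax(dim=-1)).sum(dim=-1).mean()
```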
Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation
- Computer Science, 2014 IEEE Conference on Computer Vision and Pattern Recognition
- 2014
This paper proposes a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012 -- achieving a mAP of 53.3%.
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
- Computer Science, 2021 IEEE/CVF International Conference on Computer Vision (ICCV)
- 2021
A hierarchical Transformer whose representation is computed with shifted windows, which has the flexibility to model at various scales and has linear computational complexity with respect to image size; the hierarchical design and shifted-window approach also prove beneficial for all-MLP architectures.
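To illustrate why window-based attention scales linearly with image size, the helper below (a simplification of mine, not the Swin Transformer code) groups a feature map into fixed-size windows so self-attention runs within each window; shifting the windows between consecutive blocks lets neighbouring windows exchange information.

```python
import torch

def window_partition(x: torch.Tensor, window: int) -> torch.Tensor:
    """x: (B, H, W, C) feature map -> (num_windows * B, window * window, C) tokens.
    Assumes H and W are divisible by `window`."""
    B, H, W, C = x.shape
    x = x.view(B, H // window, window, W // window, window, C)
    # Gather each window's tokens together; attention is then computed per window,
    # so cost grows with the number of windows rather than quadratically in H*W.
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window * window, C)

# In the shifted variant, torch.roll(x, shifts=(-window // 2, -window // 2),
# dims=(1, 2)) offsets the grid before partitioning so that successive blocks
# see different window boundaries.
```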
Scaling Vision Transformers
- Computer Science, ArXiv
- 2021
A ViT model with two billion parameters is successfully trained, which attains a new state-of-the-art on ImageNet of 90.45% top-1 accuracy and performs well on few-shot learning.
Do Better ImageNet Models Transfer Better?
- Computer Science, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2019
It is found that, when networks are used as fixed feature extractors or fine-tuned, there is a strong correlation between ImageNet accuracy and transfer accuracy, and ImageNet features are less general than previously suggested.
Focal Self-attention for Local-Global Interactions in Vision Transformers
- Computer Science, ArXiv
- 2021
A new variant of Vision Transformer models, called Focal Transformer, is proposed, which achieves superior performance over the state-of-the-art (SoTA) vision Transformers on a range of public image classification and object detection benchmarks.