Corpus ID: 244477674

Florence: A New Foundation Model for Computer Vision

@article{Yuan2021FlorenceAN,
  title={Florence: A New Foundation Model for Computer Vision},
  author={Lu Yuan and Dongdong Chen and Yi-Ling Chen and Noel C. F. Codella and Xiyang Dai and Jianfeng Gao and Houdong Hu and Xuedong Huang and Boxin Li and Chunyuan Li and Ce Liu and Mengchen Liu and Zicheng Liu and Yumao Lu and Yu Shi and Lijuan Wang and Jianfeng Wang and Bin Xiao and Zhen Xiao and Jianwei Yang and Michael Zeng and Luowei Zhou and Pengchuan Zhang},
  journal={ArXiv},
  year={2021},
  volume={abs/2111.11432}
}
Automated visual understanding of our diverse and open world demands computer vision models to generalize well with minimal customization for specific tasks, similar to human vision. Computer vision foundation models, which are trained on diverse, large-scale datasets and can be adapted to a wide range of downstream tasks, are critical for this mission to solve real-world computer vision applications. While existing vision foundation models such as CLIP (Radford et al., 2021), ALIGN (Jia et al… 
Vision-Language Intelligence: Tasks, Representation Learning, and Large Models
TLDR
The development in this field is summarized into three time periods, namely task-specific methods, vision-language pre-training (VLP) methods, and larger models empowered by large-scale weakly-labeled data.
K-LITE: Learning Transferable Visual Models with External Knowledge
TLDR
This paper proposes K-LITE, a simple strategy that leverages external knowledge to build transferable visual systems, and shows that the resulting knowledge-augmented models exhibit improved transfer learning performance over existing methods.
Flamingo: a Visual Language Model for Few-Shot Learning
TLDR
It is demonstrated that a single Flamingo model can achieve a new state of the art for few-shot learning, simply by prompting the model with task-specific examples.
In Defense of Image Pre-Training for Spatiotemporal Recognition
TLDR
The experiments show that simply replacing 3D convolution with STS notably improves a wide range of 3D CNNs without increasing parameters and computation on both Kinetics-400 and Something-Something V2, and this new training pipeline consistently achieves better results on video recognition with speedup.
"This is my unicorn, Fluffy": Personalizing frozen vision-language representations
TLDR
This work proposes an architecture for solving PerVL that operates by extending the input vocabulary of a pretrained model with new word embeddings for the new personalized concepts, and demonstrates that this approach learns personalized visual concepts from a few examples and can effectively apply them in image retrieval and semantic segmentation using rich textual queries.
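The vocabulary-extension idea summarized above is straightforward to prototype: keep the pretrained embedding table frozen and learn only the rows for the newly added concept tokens from a handful of examples. The snippet below is a minimal sketch under that reading; the `<fluffy>` token, the table sizes, and the `embed` helper are illustrative stand-ins, not the paper's code.

```python
import torch
import torch.nn as nn

vocab_size, dim = 10_000, 512
pretrained = nn.Embedding(vocab_size, dim)            # stands in for a frozen text encoder's table
pretrained.weight.requires_grad_(False)

new_concepts = ["<fluffy>"]                           # hypothetical personalized concept token
new_rows = nn.Parameter(torch.randn(len(new_concepts), dim) * 0.02)

def embed(token_ids):
    """Look up ids in the concatenated [frozen | learnable] embedding table."""
    table = torch.cat([pretrained.weight, new_rows], dim=0)
    return nn.functional.embedding(token_ids, table)

optimizer = torch.optim.Adam([new_rows], lr=1e-3)     # only the new rows receive gradients
ids = torch.tensor([[vocab_size, 42, 7]])             # "<fluffy>" followed by two known tokens
print(embed(ids).shape)                               # torch.Size([1, 3, 512])
```

Gradients from any downstream retrieval or segmentation loss flow only into `new_rows`, so the pretrained representation stays intact while the personalized concepts are learned.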
ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models
TLDR
Novel language-aware initialization methods are proposed to significantly improve the adaptation performance of language-augmented visual models, and an automatic hyper-parameter tuning toolkit is developed to ensure fairness in model adaptation.
Masked Feature Prediction for Self-Supervised Visual Pre-Training
TLDR
This work presents Masked Feature Prediction (MaskFeat), which first randomly masks out a portion of the input sequence and then predicts the feature of the masked regions, and finds Histograms of Oriented Gradients (HOG), a hand-crafted feature descriptor, works particularly well in terms of both performance and efficiency.
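The recipe in that summary, masking patch tokens and regressing a hand-crafted feature of the hidden regions, can be sketched compactly. The code below is a toy illustration rather than the paper's implementation: it uses a simplified gradient-orientation histogram in place of a full HOG descriptor, and the `MaskFeatToy` module and its sizes are invented for the example.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def grad_orientation_histogram(patches, bins=9):
    """Simplified HOG-like target: per-patch histogram of gradient orientations.
    patches: (B, N, C, P, P) image patches."""
    gy = patches[..., 1:, :] - patches[..., :-1, :]
    gx = patches[..., :, 1:] - patches[..., :, :-1]
    gy, gx = gy[..., :, :-1], gx[..., :-1, :]          # crop to matching shapes
    mag = torch.sqrt(gx ** 2 + gy ** 2 + 1e-8)
    ang = torch.atan2(gy, gx)                          # orientation in [-pi, pi]
    idx = ((ang + math.pi) / (2 * math.pi) * bins).long().clamp(max=bins - 1)
    hist = torch.zeros(*patches.shape[:2], bins, device=patches.device)
    hist.scatter_add_(-1, idx.flatten(2), mag.flatten(2))
    return F.normalize(hist, dim=-1)

class MaskFeatToy(nn.Module):
    def __init__(self, patch_dim, dim=128, bins=9, depth=2):
        super().__init__()
        self.embed = nn.Linear(patch_dim, dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, bins)

    def forward(self, patches, mask_ratio=0.4):
        B, N = patches.shape[:2]
        tokens = self.embed(patches.flatten(2))                    # (B, N, dim)
        mask = torch.rand(B, N, device=patches.device) < mask_ratio
        tokens = torch.where(mask[..., None], self.mask_token.expand(B, N, -1), tokens)
        pred = self.head(self.encoder(tokens))                     # predicted histograms
        target = grad_orientation_histogram(patches)
        return F.mse_loss(pred[mask], target[mask])                # loss only on masked patches

patches = torch.rand(2, 16, 3, 8, 8)                              # (B, N, C, P, P)
print(MaskFeatToy(patch_dim=3 * 8 * 8)(patches))                  # scalar masked-prediction loss
```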
CoCa: Contrastive Captioners are Image-Text Foundation Models
TLDR
A minimalist design to pretrain an image-text encoder-decoder foundation model jointly with contrastive loss and captioning loss, thereby subsuming model capabilities from contrastive approaches like CLIP and generative methods like SimVLM.
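The combined objective described here, an image-text contrastive loss plus an autoregressive captioning loss computed on the same batch, can be written directly. The function below is a sketch over toy tensors; the loss weights, temperature, and argument names are assumptions rather than values from the paper.

```python
import torch
import torch.nn.functional as F

def coca_style_loss(img_emb, txt_emb, caption_logits, caption_tokens,
                    temperature=0.07, contrastive_weight=1.0, caption_weight=2.0):
    """img_emb, txt_emb: (B, D) pooled embeddings; caption_logits: (B, T, V)
    decoder outputs; caption_tokens: (B, T) target token ids."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature             # (B, B) similarity matrix
    labels = torch.arange(img_emb.size(0), device=img_emb.device)
    contrastive = (F.cross_entropy(logits, labels) +
                   F.cross_entropy(logits.t(), labels)) / 2   # symmetric InfoNCE
    captioning = F.cross_entropy(caption_logits.flatten(0, 1), caption_tokens.flatten())
    return contrastive_weight * contrastive + caption_weight * captioning
```

The contrastive term aligns pooled image and text embeddings as in CLIP, while the captioning term keeps a generative decoder, which is what lets a single model subsume both families of capabilities.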
Multiview Transformers for Video Recognition
TLDR
This work presents Multiview Transformers for Video Recognition (MTV), a model that consists of separate encoders to represent different views of the input video with lateral connections to fuse information across views and achieves state-of-the-art results on six standard datasets.
An Empirical Study of Training End-to-End Vision-and-Language Transformers
TLDR
This paper systematically investigates how to design and pre-train a fully transformer-based VL model in an end-to-end manner, and provides insights on how to train a performant VL transformer while maintaining fast inference speed.

References

Showing 1-10 of 87 references
VinVL: Revisiting Visual Representations in Vision-Language Models
TLDR
This paper develops an improved object detection model to provide object-centric representations of images, feeds the generated visual features into the Transformer-based VL fusion model OSCAR, and uses an improved approach, OSCAR+, to pre-train the VL model and fine-tune it on a wide range of downstream VL tasks.
Learning Transferable Visual Models From Natural Language Supervision
TLDR
It is demonstrated that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
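The caption-image matching task described in that summary is also how the released model is queried at inference time. A rough usage sketch, assuming the Hugging Face transformers CLIP wrappers and the `openai/clip-vit-base-patch32` checkpoint (the image URL and candidate captions are arbitrary):

```python
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"   # any RGB image works
image = Image.open(requests.get(url, stream=True).raw)
captions = ["a photo of two cats", "a photo of a dog", "a diagram of a network"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)                  # (1, num_captions)
print(dict(zip(captions, probs[0].tolist())))                     # caption -> matching probability
```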
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations
TLDR
The Visual Genome dataset is presented, which contains over 108K images where each image has an average of 35 objects, 26 attributes, and 21 pairwise relationships between objects, and represents the densest and largest dataset of image descriptions, objects, attributes, relationships, and question answer pairs.
How Much Can CLIP Benefit Vision-and-Language Tasks?
TLDR
It is shown that CLIP significantly outperforms widely-used visual encoders trained with in-domain annotated data, such as BottomUp-TopDown, and also establishes new state-of-the-art results on Visual Question Answering, Visual Entailment, and V&L Navigation tasks.
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
TLDR
This paper introduces a contrastive loss to ALign the image and text representations BEfore Fusing through cross-modal attention, which enables more grounded vision and language representation learning and proposes momentum distillation, a self-training method which learns from pseudo-targets produced by a momentum model.
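Both ingredients named in that summary, contrastive alignment of unimodal embeddings before fusion and momentum distillation from an EMA copy of the encoders, reduce to a short loss function. The snippet below is a simplified sketch over toy embeddings; the mixing weight `alpha`, the temperature, and the function names are illustrative, not the paper's code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(momentum_model, model, m=0.995):
    """Exponential moving average update for the momentum encoders."""
    for p_m, p in zip(momentum_model.parameters(), model.parameters()):
        p_m.data.mul_(m).add_(p.data, alpha=1 - m)

def distilled_contrastive_loss(img, txt, img_m, txt_m, temperature=0.07, alpha=0.4):
    """img, txt: online embeddings (B, D); img_m, txt_m: momentum embeddings (B, D)."""
    img, txt = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
    img_m, txt_m = F.normalize(img_m, dim=-1), F.normalize(txt_m, dim=-1)
    sim_i2t = img @ txt.t() / temperature
    sim_t2i = txt @ img.t() / temperature
    with torch.no_grad():
        # Soft pseudo-targets from the momentum encoders, mixed with one-hot labels.
        targets = torch.eye(img.size(0), device=img.device)
        soft_i2t = alpha * (img_m @ txt_m.t() / temperature).softmax(-1) + (1 - alpha) * targets
        soft_t2i = alpha * (txt_m @ img_m.t() / temperature).softmax(-1) + (1 - alpha) * targets
    loss_i2t = -(F.log_softmax(sim_i2t, dim=-1) * soft_i2t).sum(-1).mean()
    loss_t2i = -(F.log_softmax(sim_t2i, dim=-1) * soft_t2i).sum(-1).mean()
    return (loss_i2t + loss_t2i) / 2

img, txt = torch.randn(8, 256), torch.randn(8, 256)
img_m, txt_m = img.detach().clone(), txt.detach().clone()   # stand-ins for momentum outputs
print(distilled_contrastive_loss(img, txt, img_m, txt_m))
```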
Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation
TLDR
This paper proposes a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012 -- achieving a mAP of 53.3%.
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
  • Ze Liu, Yutong Lin, B. Guo
  • Computer Science
    2021 IEEE/CVF International Conference on Computer Vision (ICCV)
  • 2021
TLDR
A hierarchical Transformer whose representation is computed with shifted windows, which gives the flexibility to model at various scales and has linear computational complexity with respect to image size; the design also proves beneficial for all-MLP architectures.
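The two mechanisms behind that linear complexity, attention restricted to local windows and a cyclic shift that lets neighboring windows exchange information, can be sketched in plain PyTorch. The toy block below is only an illustration: it omits the attention mask for wrapped-around positions and the relative position bias used in the real model, and all sizes are arbitrary.

```python
import torch
import torch.nn as nn

def window_partition(x, ws):
    """x: (B, H, W, C) feature map -> (num_windows*B, ws*ws, C) local windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

def window_reverse(windows, ws, H, W):
    """Inverse of window_partition."""
    B = windows.shape[0] // ((H // ws) * (W // ws))
    x = windows.view(B, H // ws, W // ws, ws, ws, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)

class ToyWindowAttention(nn.Module):
    def __init__(self, dim=96, heads=3, ws=4, shift=0):
        super().__init__()
        self.ws, self.shift = ws, shift
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                            # x: (B, H, W, C)
        if self.shift:                               # cyclic shift so window borders move
            x = torch.roll(x, shifts=(-self.shift, -self.shift), dims=(1, 2))
        B, H, W, C = x.shape
        win = window_partition(x, self.ws)           # attention stays local to each window,
        out, _ = self.attn(win, win, win)            # so cost grows linearly with H*W
        x = window_reverse(out, self.ws, H, W)
        if self.shift:
            x = torch.roll(x, shifts=(self.shift, self.shift), dims=(1, 2))
        return x

feat = torch.randn(2, 8, 8, 96)                      # (B, H, W, C), H and W divisible by ws
block = ToyWindowAttention(ws=4, shift=2)            # shift = ws // 2, as in shifted windows
print(block(feat).shape)                             # torch.Size([2, 8, 8, 96])
```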
Scaling Vision Transformers
TLDR
A ViT model with two billion parameters is successfully trained, which attains a new state-of-the-art on ImageNet of 90.45% top-1 accuracy and performs well on few-shot learning.
Do Better ImageNet Models Transfer Better?
TLDR
It is found that, when networks are used as fixed feature extractors or fine-tuned, there is a strong correlation between ImageNet accuracy and transfer accuracy, and ImageNet features are less general than previously suggested.
Focal Self-attention for Local-Global Interactions in Vision Transformers
TLDR
A new variant of Vision Transformer models, called Focal Transformer, is proposed, which achieves superior performance over the state-of-the-art (SoTA) vision Transformers on a range of public image classification and object detection benchmarks.