Flamingo: a Visual Language Model for Few-Shot Learning

@article{Alayrac2022FlamingoAV,
  title={Flamingo: a Visual Language Model for Few-Shot Learning},
  author={Jean-Baptiste Alayrac and Jeff Donahue and Pauline Luc and Antoine Miech and Iain Barr and Yana Hasson and Karel Lenc and Arthur Mensch and Katie Millican and Malcolm Reynolds and Roman Ring and Eliza Rutherford and Serkan Cabi and Tengda Han and Zhitao Gong and Sina Samangooei and Marianne Monteiro and Jacob Menick and Sebastian Borgeaud and Andy Brock and Aida Nematzadeh and Sahand Sharifzadeh and Mikolaj Binkowski and Ricardo Barreira and Oriol Vinyals and Andrew Zisserman and Karen Simonyan},
  journal={ArXiv},
  year={2022},
  volume={abs/2204.14198}
}
Building models that can be rapidly adapted to numerous tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research. We introduce Flamingo, a family of Visual Language Models (VLM) with this ability. Flamingo models include key architectural innovations to: (i) bridge powerful pretrained vision-only and language-only models, (ii) handle sequences of arbitrarily interleaved visual and textual data, and (iii) seamlessly ingest images or videos as… 
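As a way to picture innovation (i) above, the sketch below shows a tanh-gated cross-attention block of the kind used to bridge a frozen language model and a frozen vision encoder: text tokens attend to visual features, and the gates start at zero so the pretrained language model is unchanged at the beginning of training. This is a minimal illustrative sketch in PyTorch, not the authors' code; the module name, dimensions, and the visual_features interface are assumptions.

import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Minimal sketch of a tanh-gated cross-attention block (hypothetical).

    Text tokens attend to visual features; the tanh gates start at zero so the
    frozen language model's behaviour is initially unchanged.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffw = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.attn_gate = nn.Parameter(torch.zeros(1))  # tanh(0) = 0 -> identity at init
        self.ffw_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_tokens, visual_features):
        # text_tokens: (batch, text_len, dim); visual_features: (batch, n_visual, dim)
        attn_out, _ = self.cross_attn(text_tokens, visual_features, visual_features)
        x = text_tokens + torch.tanh(self.attn_gate) * attn_out
        x = x + torch.tanh(self.ffw_gate) * self.ffw(x)
        return x  # then passed on to the frozen language-model layer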
Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners
TLDR
The goal of this work is to build flexible video-language models that can generalize to various video-to-text tasks from few examples, such as domain-specific captioning, question answering, and future event prediction, and the proposed approach outperforms state-of-the-art supervised models trained on any video dataset.
Multimodal Knowledge Alignment with Reinforcement Learning
TLDR
This work proposes ESPER, a novel approach to reinforcement learning which extends language-only zero-shot models to unseen multimodal tasks, like image and audio captioning, and demonstrates that it outperforms baselines and prior work on a variety of zero-shot tasks.
GIT: A Generative Image-to-text Transformer for Vision and Language
TLDR
This paper designs and trains GIT to unify vision-language tasks such as image/video captioning and question answering, and presents a new scheme of generation-based image classification and scene text recognition, achieving decent performance on standard benchmarks.
Rethinking Evaluation Practices in Visual Question Answering: A Case Study on Out-of-Distribution Generalization
TLDR
This work comprehensively investigates the performance of two pretrained V&L models under different settings by conducting cross-dataset evaluations, and argues that in most cases generative models are less susceptible to shifts in data distribution while frequently performing better on the tested benchmarks.
VL-BEiT: Generative Vision-Language Pretraining
TLDR
VL-BEiT, a vision-language foundation model, is introduced: a bidirectional multimodal Transformer learned by generative pretraining that effectively leverages monomodal data such as images and texts as well as multimodal data such as image-text pairs.
Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone
TLDR
This work presents FIBER (Fusion-In-the-Backbone-based transformER), a new VL model architecture that can seamlessly handle both image-level and region-level vision-language tasks, and provides consistent performance improvements over strong baselines across all tasks, often outperforming methods that use orders of magnitude more data.
Multimodal Masked Autoencoders Learn Transferable Representations
TLDR
This paper proposes a simple and scalable network architecture, the Multimodal Masked Autoencoder (M3AE), which learns a unified encoder for both vision and language data via masked token prediction, and demonstrates the scalability of M3AE with larger model size and training time, and its ability to learn generalizable representations that transfer well to downstream tasks.
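To make the masked-token-prediction idea above concrete, here is a small illustrative sketch under assumptions not taken from the paper (the mask ratio and the shared encoder passed in as a stand-in module): image-patch embeddings and text-token embeddings are concatenated into one sequence, a random subset of positions is dropped, and only the visible tokens are encoded; a decoder would then reconstruct the masked positions.

import torch
import torch.nn as nn

def mask_tokens(tokens: torch.Tensor, mask_ratio: float = 0.75):
    """Randomly drop a fraction of the sequence; return visible tokens and kept indices."""
    batch, seq_len, dim = tokens.shape
    n_keep = int(seq_len * (1 - mask_ratio))
    noise = torch.rand(batch, seq_len, device=tokens.device)
    keep_idx = noise.argsort(dim=1)[:, :n_keep]
    visible = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, dim))
    return visible, keep_idx

def unified_masked_encode(patch_embeds, text_embeds, encoder: nn.Module):
    # One unified sequence of image-patch and text-token embeddings; `encoder` is a
    # hypothetical shared Transformer acting on both modalities at once.
    tokens = torch.cat([patch_embeds, text_embeds], dim=1)
    visible, keep_idx = mask_tokens(tokens)
    return encoder(visible), keep_idx  # a decoder would reconstruct the masked positions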
Visual Clues: Bridging Vision and Language Foundations for Image Paragraph Captioning
TLDR
This work argues that, by using visual clues to bridge large pretrained vision foundation models and language models, image paragraph captioning can be performed without any extra cross-modal training.
Language Models are General-Purpose Interfaces
TLDR
This work proposes using language models as a general-purpose interface to various foundation models, jointly pretraining the interface and the modular encoders so as to subsume the advantages and capabilities of both causal and non-causal modeling.
A Unified Sequence Interface for Vision Tasks
TLDR
This work shows that a diverse set of “core” computer vision tasks can also be unified if formulated in terms of a shared pixel-to-sequence interface, and shows that one can train a neural network with a single model architecture and loss function on all these tasks, with no task-specific customization.
...
...

References

Showing 1-10 of 168 references
Multimodal Few-Shot Learning with Frozen Language Models
TLDR
The resulting system is a multimodal few-shot learner, with the surprising ability to learn a variety of new tasks when conditioned on examples, represented as a sequence of multiple interleaved image and text embeddings.
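The interface this entry describes can be pictured with a short sketch: a trainable vision encoder maps each image to a few embeddings in the language model's input space, and these are prepended to the text embeddings of the frozen language model. The snippet below is an assumed, simplified version of that interface; vision_encoder, the dimensions, and the two-token prefix length are placeholders, not the paper's implementation.

import torch
import torch.nn as nn

class VisualPrefix(nn.Module):
    """Map an image to n_prefix embeddings in the (frozen) language model's token space."""

    def __init__(self, vision_encoder: nn.Module, vis_dim: int, lm_dim: int, n_prefix: int = 2):
        super().__init__()
        self.vision_encoder = vision_encoder  # trainable; the language model stays frozen
        self.project = nn.Linear(vis_dim, lm_dim * n_prefix)
        self.n_prefix = n_prefix
        self.lm_dim = lm_dim

    def forward(self, images):
        feats = self.vision_encoder(images)   # (batch, vis_dim)
        prefix = self.project(feats)          # (batch, lm_dim * n_prefix)
        return prefix.view(-1, self.n_prefix, self.lm_dim)

def build_lm_inputs(prefix_embeds, text_embeds):
    # Interleaving at its simplest: image embeddings first, then text embeddings,
    # fed to the frozen language model as one sequence.
    return torch.cat([prefix_embeds, text_embeds], dim=1)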
Learning to Prompt for Vision-Language Models
TLDR
Context Optimization (CoOp) is proposed, a simple approach specifically for adapting CLIP-like vision-language models for downstream image recognition that requires as few as one or two shots to beat hand-crafted prompts with a decent margin and is able to gain significant improvements when using more shots.
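A minimal way to picture context optimization: the hand-written prompt (e.g. "a photo of a") is replaced by learnable context vectors that are concatenated with each class-name embedding, passed through the frozen text encoder, and compared with image features; only the context vectors are updated. The sketch below is illustrative; the initialisation scale and the precomputed class-name embeddings are assumptions, and the CLIP-like encoders themselves are not shown.

import torch
import torch.nn as nn

class LearnableContext(nn.Module):
    """CoOp-style sketch: learnable context vectors shared across all classes."""

    def __init__(self, n_ctx: int, embed_dim: int, class_name_embeds: torch.Tensor):
        super().__init__()
        # class_name_embeds: (n_classes, name_len, embed_dim), precomputed and kept frozen
        self.ctx = nn.Parameter(torch.randn(n_ctx, embed_dim) * 0.02)
        self.register_buffer("class_name_embeds", class_name_embeds)

    def forward(self):
        n_classes = self.class_name_embeds.shape[0]
        ctx = self.ctx.unsqueeze(0).expand(n_classes, -1, -1)
        # Prompt per class = [learned context tokens][class-name tokens]
        return torch.cat([ctx, self.class_name_embeds], dim=1)

The prompts returned here would be fed through the frozen text encoder, and classification logits computed as similarity against image features, with gradients flowing only into the context vectors.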
Unifying Vision-and-Language Tasks via Text Generation
TLDR
This work proposes a unified framework that learns different tasks in a single architecture with the same language modeling objective, i.e., multimodal conditional text generation, where the models learn to generate labels in text based on the visual and textual inputs.
VideoBERT: A Joint Model for Video and Language Representation Learning
TLDR
This work builds upon the BERT model to learn bidirectional joint distributions over sequences of visual and linguistic tokens, derived from vector quantization of video data and off-the-shelf speech recognition outputs, respectively, which can be applied directly to open-vocabulary classification.
VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning
TLDR
This work proposes VisualGPT, which employs a novel self-resurrecting encoder-decoder attention mechanism to quickly adapt the PLM with a small amount of in-domain image-text data and achieves the state-of-the-art result on IU X-ray, a medical report generation dataset.
Learning Transferable Visual Models From Natural Language Supervision
TLDR
It is demonstrated that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
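The pre-training task described here is a symmetric contrastive objective over a batch of matched (image, text) pairs; a minimal sketch follows, assuming the image and text embeddings have already been produced by the two encoders (the fixed temperature is an illustrative choice, not the learned one).

import torch
import torch.nn.functional as F

def contrastive_loss(image_embeds, text_embeds, temperature: float = 0.07):
    """Symmetric InfoNCE: matched (image, text) pairs sit on the diagonal of the logits."""
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = image_embeds @ text_embeds.t() / temperature   # (batch, batch)
    targets = torch.arange(logits.shape[0], device=logits.device)
    loss_i = F.cross_entropy(logits, targets)        # predict which text matches each image
    loss_t = F.cross_entropy(logits.t(), targets)    # predict which image matches each text
    return (loss_i + loss_t) / 2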
Unified Vision-Language Pre-Training for Image Captioning and VQA
TLDR
VLP is the first reported model that achieves state-of-the-art results on both vision-language generation and understanding tasks, as disparate as image captioning and visual question answering, across three challenging benchmark datasets: COCO Captions, Flickr30k Captions and VQA 2.0.
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
TLDR
This work shows that model diversity is symbiotic, and can be leveraged to build AI systems with structured Socratic dialogue – in which new multimodal tasks are formulated as a guided language-based exchange between different pre-existing foundation models – without requiring finetuning.
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
TLDR
BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones, and demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner.
VL-Adapter: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks
TLDR
This paper introduces adapter-based parameter-efficient transfer learning techniques to V&L models such as VL-BART and VL-T5, and demonstrates that training the adapter with the weight-sharing technique can match the performance of fine-tuning the entire model.
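As a generic illustration of the adapter idea in this entry (not the paper's exact configuration, and without the weight-sharing scheme): a small bottleneck MLP with a residual connection is inserted into each frozen transformer layer, and only these few parameters are trained.

import torch.nn as nn

class Adapter(nn.Module):
    """Generic bottleneck adapter inserted into a frozen layer; only it is trained."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.ReLU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, hidden_states):
        # Residual connection: the adapter only adds a small learned correction
        # to the frozen layer's output.
        return hidden_states + self.up(self.act(self.down(hidden_states)))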
...
...