X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers

@article{Cho2020XLXMERTPC,
  title={X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers},
  author={Jaemin Cho and Jiasen Lu and Dustin Schwenk and Hannaneh Hajishirzi and Aniruddha Kembhavi},
  journal={ArXiv},
  year={2020},
  volume={abs/2009.11278}
}
Mirroring the success of masked language models, vision-and-language counterparts like ViLBERT, LXMERT and UNITER have achieved state-of-the-art performance on a variety of multimodal discriminative tasks like visual question answering and visual grounding. Recent work has also successfully adapted such models towards the generative task of image captioning. This begs the question: Can these models go the other way and generate images from pieces of text? Our analysis of a popular…
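Concretely, one of X-LXMERT's training refinements is to discretize the grid-level visual features into a cluster vocabulary and train the transformer to predict the cluster ids of masked grid cells, by analogy with masked language modeling. The snippet below is a minimal PyTorch sketch of that masked-grid-prediction idea only; the vocabulary size, grid size, model depth and all variable names are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

NUM_CLUSTERS = 1024   # size of the discrete "visual vocabulary" (assumed)
GRID = 8 * 8          # 8x8 grid of visual positions (assumed)
DIM = 256

class MaskedGridPredictor(nn.Module):
    def __init__(self):
        super().__init__()
        self.cluster_emb = nn.Embedding(NUM_CLUSTERS + 1, DIM)   # extra id acts as [MASK]
        layer = nn.TransformerEncoderLayer(d_model=DIM, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(DIM, NUM_CLUSTERS)                 # predict a cluster id per cell

    def forward(self, cluster_ids, mask):
        # cluster_ids: (B, GRID) discrete codes; mask: (B, GRID) bool, True = masked cell
        x = cluster_ids.masked_fill(mask, NUM_CLUSTERS)          # replace masked cells with [MASK]
        h = self.encoder(self.cluster_emb(x))
        return self.head(h)                                      # (B, GRID, NUM_CLUSTERS)

model = MaskedGridPredictor()
ids = torch.randint(0, NUM_CLUSTERS, (2, GRID))
mask = torch.rand(2, GRID) < 0.15
logits = model(ids, mask)
loss = nn.functional.cross_entropy(logits[mask], ids[mask])      # loss only on masked cells
loss.backward()

In the full model, sampled grid features are decoded into pixels by a separate image generator; the sketch covers only the masked-prediction objective.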
Unifying Vision-and-Language Tasks via Text Generation
TLDR
This work proposes a unified framework that learns different tasks in a single architecture with the same language modeling objective, i.e., multimodal conditional text generation, where the model learns to generate labels as text based on the visual and textual inputs.
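The "generate labels as text" framing can be made concrete with a small sketch: project the visual features into the language model's embedding space, prepend them to the question's token embeddings, and train a standard seq2seq model to emit the answer string. The sketch below uses a generic HuggingFace T5 as a stand-in; the feature dimensions, prompt text and projection layer are assumptions, and this is not the paper's actual model or code.

import torch
import torch.nn as nn
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

visual_feats = torch.randn(1, 36, 2048)               # e.g. 36 region features (placeholder)
project = nn.Linear(2048, model.config.d_model)       # map features into T5's embedding space

question = tokenizer("vqa question: what is on the table?", return_tensors="pt")
answer = tokenizer("a red apple", return_tensors="pt")

text_emb = model.get_input_embeddings()(question.input_ids)
inputs_embeds = torch.cat([project(visual_feats), text_emb], dim=1)
attention_mask = torch.cat(
    [torch.ones(1, visual_feats.size(1), dtype=torch.long), question.attention_mask], dim=1
)

# Training step: the model is supervised to generate the answer as plain text,
# so VQA, captioning and grounding can all share one generation objective.
loss = model(inputs_embeds=inputs_embeds,
             attention_mask=attention_mask,
             labels=answer.input_ids).loss
loss.backward()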
DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generative Transformers
TLDR
It is shown that recent text-to-image generative transformer models perform better at recognizing and counting objects than at recognizing colors and understanding spatial relations, while a large gap remains between model performance and upper-bound accuracy on all skills.
DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention
TLDR
The proposed DiM (short for Disentangled Multimodal-Attention) module can be easily incorporated into existing pre-trained V-L models to boost their performance, yielding up to a 5% increase on the representative task, and the effectiveness of the introduced visual concepts is demonstrated.
DU-VLG: Unifying Vision-and-Language Generation via Dual Sequence-to-Sequence Pre-training
TLDR
DU-VLG, a framework which unifies vision-and-language generation as sequence generation problems, is proposed, and a novel commitment loss is designed to bridge the gap between image understanding and generation.
ERNIE-ViLG: Unified Generative Pre-training for Bidirectional Vision-Language Generation
TLDR
This paper proposes ERNIE-ViLG, a unified generative pre-training framework for bidirectional image-text generation with a transformer model, and proposes an end-to-end training method to jointly learn the visual sequence generator and the image reconstructor.
Benchmark for Compositional Text-to-Image Synthesis
TLDR
This work presents the first systematic study of text-to-image generation on zero-shot compositional splits targeting two scenarios, unseen object-color and object-shape phrases, and proposes a new metric based on a powerful vision-and-language CLIP model, which is leveraged to compute R-Precision.
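CLIP-based R-Precision can be sketched in a few lines: embed the generated image and a pool of captions (the ground-truth caption plus distractors) with CLIP, and count how often the true caption is the most similar one. The snippet below is an illustrative sketch using a public HuggingFace CLIP checkpoint; the function name, candidate pool and scoring details are assumptions, not the benchmark's official evaluation code.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def r_precision_at_1(image, true_caption, distractors):
    # Rank the true caption against distractors by CLIP image-text similarity.
    captions = [true_caption] + list(distractors)
    inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        sims = model(**inputs).logits_per_image[0]     # similarity of the image to each caption
    return float(sims.argmax().item() == 0)            # 1.0 if the true caption ranks first

# Averaged over many generated images, this score is the reported R-Precision.
score = r_precision_at_1(Image.new("RGB", (256, 256)), "a red cube on a blue sphere",
                         ["a blue cube on a red sphere", "two green cylinders"])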
Vision-and-Language Pretraining
TLDR
This article categorizes and delineates pretraining approaches, summarizes state-of-the-art vision-and-language pretrained models, and supplies a list of training datasets and downstream tasks to further sharpen the perspective on V&L pretraining.
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
TLDR
The Pathways Autoregressive Text-to-Image (Parti) model is presented, which generates high-fidelity photorealistic images and supports content-rich synthesis involving complex compositions and world knowledge; the work also explores and highlights limitations of such models.
Integrating Visuospatial, Linguistic, and Commonsense Structure into Story Visualization
TLDR
This paper first explores the use of constituency parse trees, encoding the structured input with a Transformer-based recurrent architecture, and shows that off-the-shelf dense-captioning models trained on Visual Genome can improve the spatial structure of images from a different target domain without needing fine-tuning.
WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training
TLDR
This work proposes a two-tower pre-training model called BriVL within the cross-modal contrastive learning framework, and devises a more advanced algorithm by adapting the latest method, MoCo, into the cross-modal scenario.
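The two-tower contrastive setup can be illustrated with a compact sketch: separate image and text encoders produce embeddings, and an InfoNCE-style loss pulls matched pairs together against negatives. For simplicity the sketch below uses in-batch negatives rather than the MoCo-style memory queue the paper adapts; all shapes and the temperature value are assumptions.

import torch
import torch.nn.functional as F

def info_nce(image_emb, text_emb, temperature=0.07):
    # image_emb, text_emb: (B, D) outputs of the two towers for matched pairs.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature    # (B, B) similarity matrix
    targets = torch.arange(image_emb.size(0))          # diagonal entries are the positives
    # Symmetric loss: image-to-text and text-to-image retrieval directions.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

loss = info_nce(torch.randn(8, 256), torch.randn(8, 256))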
…

References

SHOWING 1-10 OF 73 REFERENCES
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a…
Unified Vision-Language Pre-Training for Image Captioning and VQA
TLDR
VLP is the first reported model that achieves state-of-the-art results on both vision-language generation and understanding tasks, as disparate as image captioning and visual question answering, across three challenging benchmark datasets: COCO Captions, Flickr30k Captions and VQA 2.0.
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
TLDR
This work balances the popular VQA dataset by collecting complementary images such that every question in the authors' balanced dataset is associated with not just a single image, but rather a pair of similar images that result in two different answers to the question.
Visual7W: Grounded Question Answering in Images
TLDR
A semantic link between textual descriptions and image regions via object-level grounding enables a new type of QA with visual answers, in addition to the textual answers used in previous work, and a novel LSTM model with spatial attention is proposed to tackle the 7W QA tasks.
UNITER: Learning UNiversal Image-TExt Representations
TLDR
UNITER, a UNiversal Image-TExt Representation, learned through large-scale pre-training over four image-text datasets is introduced, which can power heterogeneous downstream V+L tasks with joint multimodal embeddings.
Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training
TLDR
After pretraining on large-scale image-caption pairs, Unicoder-VL is transferred to caption-based image-text retrieval and visual commonsense reasoning with just one additional output layer, demonstrating the power of cross-modal pre-training.
In Defense of Grid Features for Visual Question Answering
TLDR
This paper revisits grid features for VQA and finds they can work surprisingly well, running more than an order of magnitude faster with the same accuracy (e.g., if pre-trained in a similar fashion).
12-in-1: Multi-Task Vision and Language Representation Learning
TLDR
This work develops a large-scale multi-task model trained on 12 datasets from four broad categories of task, including visual question answering, caption-based image retrieval, grounding referring expressions, and multimodal verification, and shows that fine-tuning task-specific models from this model can lead to further improvements, achieving performance at or above the state of the art.
AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks
TLDR
An Attentional Generative Adversarial Network is proposed that allows attention-driven, multi-stage refinement for fine-grained text-to-image generation, and it is shown for the first time that the layered attentional GAN is able to automatically select the condition at the word level for generating different parts of the image.
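The word-level conditioning can be illustrated with a short sketch: each spatial location of an intermediate image feature map attends over the word embeddings, so different image regions can be refined using different words. The shapes and names below are assumptions for illustration, not AttnGAN's actual architecture or code.

import torch

B, HW, T, D = 2, 16 * 16, 12, 256                 # batch, spatial locations, words, dim (assumed)
img_feats = torch.randn(B, HW, D)                 # flattened image feature map
word_embs = torch.randn(B, T, D)                  # per-word text embeddings

attn = torch.softmax(img_feats @ word_embs.transpose(1, 2), dim=-1)   # (B, HW, T) attention weights
word_context = attn @ word_embs                   # per-location mixture of word features
refined = img_feats + word_context                # condition each region on its attended words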
Text2Scene: Generating Compositional Scenes From Textual Descriptions
TLDR
Text2Scene is a model that generates various forms of compositional scene representations from natural language descriptions; it is not only competitive with state-of-the-art GAN-based methods on automatic metrics and superior under human judgments, but also has the advantage of producing interpretable results.
…