X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers

@inproceedings{Cho2020XLXMERTPC,
  title={X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers},
  author={Jaemin Cho and Jiasen Lu and Dustin Schwenk and Hannaneh Hajishirzi and Aniruddha Kembhavi},
  booktitle={EMNLP},
  year={2020}
}
Mirroring the success of masked language models, vision-and-language counterparts like ViLBERT, LXMERT and UNITER have achieved state-of-the-art performance on a variety of multimodal discriminative tasks, such as visual question answering and visual grounding. Recent work has also successfully adapted such models to the generative task of image captioning. This begs the question: can these models go the other way and generate images from pieces of text? Our analysis of a popular…
7 Citations

  • Unifying Vision-and-Language Tasks via Text Generation
  • WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training
  • Emerging Trends of Multimodal Research in Vision and Language
  • Towards General Purpose Vision Systems
  • Zero-Shot Text-to-Image Generation

References

Showing 1-10 of 73 references
  • ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
  • LXMERT: Learning Cross-Modality Encoder Representations from Transformers
  • Unified Vision-Language Pre-Training for Image Captioning and VQA
  • Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
  • Visual7W: Grounded Question Answering in Images
  • UNITER: Learning UNiversal Image-TExt Representations
  • In Defense of Grid Features for Visual Question Answering
  • 12-in-1: Multi-Task Vision and Language Representation Learning
  • Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training
  • AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks (T. Xu, Pengchuan Zhang, +4 authors X. He; IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018)