X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers

@inproceedings{Cho2020XLXMERTPC,
  title={X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers},
  author={Jaemin Cho and Jiasen Lu and Dustin Schwenk and Hannaneh Hajishirzi and Aniruddha Kembhavi},
  booktitle={EMNLP},
  year={2020}
}
Mirroring the success of masked language models, vision-and-language counterparts like ViLBERT, LXMERT and UNITER have achieved state-of-the-art performance on a variety of multimodal discriminative tasks such as visual question answering and visual grounding. Recent work has also successfully adapted such models to the generative task of image captioning. This raises the question: can these models go the other way and generate images from pieces of text? Our analysis of a popular…
Unifying Vision-and-Language Tasks via Text Generation
Emerging Trends of Multimodal Research in Vision and Language
UFC-BERT: Unifying Multi-Modal Controls for Conditional Image Synthesis
Heterogeneous Attention Network for Effective and Efficient Cross-modal Retrieval (Tan Yu, Yi Yang, Yi Li, Lin Liu, Hongliang Fei, Ping Li; SIGIR 2021)
Towards General Purpose Vision Systems
Txt2Vid: Ultra-Low Bitrate Compression of Talking-Head Videos via Text

References

Showing 1-10 of 73 references
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
LXMERT: Learning Cross-Modality Encoder Representations from Transformers
Unified Vision-Language Pre-Training for Image Captioning and VQA
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
Visual7W: Grounded Question Answering in Images
UNITER: Learning UNiversal Image-TExt Representations
In Defense of Grid Features for Visual Question Answering
12-in-1: Multi-Task Vision and Language Representation Learning
AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks