Corpus ID: 219964325

X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers

  • Jaemin Cho, Jiasen Lu, D. Schwenk, Hannaneh Hajishirzi, Aniruddha Kembhavi
  • Published 2020
  • Computer Science
  • ArXiv
  • Mirroring the success of masked language models, vision-and-language counterparts like ViLBERT, LXMERT and UNITER have achieved state-of-the-art performance on a variety of multimodal discriminative tasks such as visual question answering and visual grounding. Recent work has also successfully adapted such models to the generative task of image captioning. This raises the question: can these models go the other way and generate images from pieces of text? Our analysis of a popular…
    Emerging Trends of Multimodal Research in Vision and Language


    Publications referenced by this paper:
    • LXMERT: Learning Cross-Modality Encoder Representations from Transformers (147, Highly Influential, Open Access)
    • Unified Vision-Language Pre-Training for Image Captioning and VQA (55, Open Access)
    • Visual7W: Grounded Question Answering in Images (417, Open Access)
    • UNITER: Learning UNiversal Image-TExt Representations (83, Highly Influential)
    • In Defense of Grid Features for Visual Question Answering (5, Highly Influential, Open Access)
    • 12-in-1: Multi-Task Vision and Language Representation Learning (20, Open Access)
    • Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training (69, Open Access)
    • AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks (380, Highly Influential, Open Access)
    • BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (9,879, Open Access)