LXMERT: Learning Cross-Modality Encoder Representations from Transformers

@article{Tan2019LXMERTLC,
  title={LXMERT: Learning Cross-Modality Encoder Representations from Transformers},
  author={Hao Hao Tan and Mohit Bansal},
  journal={ArXiv},
  year={2019},
  volume={abs/1908.07490}
}
Vision-and-language reasoning requires an understanding of visual concepts, language semantics, and, most importantly, the alignment and relationships between these two modalities. We therefore propose the LXMERT (Learning Cross-Modality Encoder Representations from Transformers) framework, a large-scale Transformer model consisting of three encoders: an object relationship encoder, a language encoder, and a cross-modality encoder. Next, to endow our model with the capability of connecting vision and language semantics, we pre-train the model with large amounts of image-and-sentence pairs, via five diverse representative pre-training tasks: masked language modeling, masked object prediction (feature regression and label classification), cross-modality matching, and image question answering.
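As a rough illustration of how these five objectives can be combined during pre-training, below is a minimal PyTorch-style sketch. It is not the authors' implementation: the `outputs`/`targets` keys, the ignore-index convention, and the equal loss weights are assumptions made purely for the example.

```python
import torch
import torch.nn.functional as F

def lxmert_style_pretraining_loss(outputs, targets):
    """Sum of the five pre-training objectives (hypothetical tensor layout)."""
    # Masked language modeling: cross-entropy over masked token positions only
    # (-100 marks positions that were not masked).
    mlm = F.cross_entropy(
        outputs["mlm_logits"].flatten(0, 1),      # (B*T, vocab_size)
        targets["masked_token_ids"].flatten(),    # (B*T,)
        ignore_index=-100,
    )

    obj_mask = targets["obj_mask"]                # (B, N) bool: which RoIs were masked

    # Masked object prediction 1: regress the masked RoI features (L2 loss).
    feat_reg = F.mse_loss(
        outputs["obj_feat_pred"][obj_mask],       # (num_masked, feat_dim)
        targets["obj_features"][obj_mask],
    )

    # Masked object prediction 2: classify the detector label of each masked RoI.
    label_cls = F.cross_entropy(
        outputs["obj_label_logits"][obj_mask],    # (num_masked, num_classes)
        targets["obj_labels"][obj_mask],
    )

    # Cross-modality matching: does this sentence describe this image?
    matching = F.binary_cross_entropy_with_logits(
        outputs["match_logit"].squeeze(-1),       # (B,)
        targets["is_matched"].float(),
    )

    # Image question answering: answer classification (skipped for non-QA sentences).
    qa = F.cross_entropy(
        outputs["qa_logits"],                     # (B, num_answers)
        targets["answer_ids"],                    # (B,), -100 when not a question
        ignore_index=-100,
    )

    # Equal weights here purely for illustration.
    return mlm + feat_reg + label_cls + matching + qa
```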


Bridge-Tower: Building Bridges Between Encoders in Vision-Language Representation Learning
TLDR
The proposed BRIDGE-TOWER, pre-trained with only 4M images, achieves state-of-the-art performance on various downstream vision-language tasks and introduces multiple bridge layers that build a connection between the top layers of unimodal encoders and each layer of the cross-modal encoder.
Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training
TLDR
This work proposes a metric named Inter-Modality Flow (IMF) to measure the interaction between the vision and language modalities (i.e., inter-modality) and designs a novel masking optimization mechanism named Masked Feature Regression (MFR) in Transformer to further promote inter-modality learning.
Distilled Dual-Encoder Model for Vision-Language Understanding
TLDR
A cross-modal attention distillation framework is proposed to train a dual-encoder model for vision-language understanding tasks such as visual reasoning and visual question answering; it achieves competitive performance on visual reasoning, visual entailment, and visual question answering while enjoying a much faster inference speed than fusion-encoder models.
VL-BEiT: Generative Vision-Language Pretraining
TLDR
A vision-language foundation model called VL-BEiT, a bidirectional multimodal Transformer learned by generative pretraining, is introduced; it effectively leverages monomodal data such as images and texts as well as multimodal data such as image-text pairs.
Pre-training Model Based on Parallel Cross-Modality Fusion Layer
TLDR
Experimental results on the Visual Question Answering dataset VQA v2.0 show that the pre-trained P-PCFL model performs well after fine-tuning its parameters; ablation experiments and attention visualizations are provided to verify the effectiveness of the P-PCFL model.
UFO: A UniFied TransfOrmer for Vision-Language Representation Learning
TLDR
A single UniFied transfOrmer (UFO), capable of processing either unimodal or multimodal inputs, is proposed for vision-language (VL) representation learning, achieving new state-of-the-art results on visual question answering, COCO image captioning, and nocaps.
Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular Vision-Language Pre-training
TLDR
A pre-trainable Universal Encoder-DEcoder Network (Uni-EDEN) consisting of three modules: object and sentence encoders that separately learn the representations of each modality, and a sentence decoder that enables both multimodal reasoning and sentence generation via inter-modal interaction.
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
TLDR
This paper introduces a contrastive loss to ALign the image and text representations BEfore Fusing through cross-modal attention, which enables more grounded vision and language representation learning and proposes momentum distillation, a self-training method which learns from pseudo-targets produced by a momentum model.
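The "align before fuse" idea above amounts to an image-text contrastive loss computed on the unimodal encoder outputs before cross-modal fusion. A minimal sketch with in-batch negatives follows; the function name, fixed temperature, and symmetric averaging are assumptions for illustration, and the paper's momentum distillation (soft targets from a momentum encoder) is omitted.

```python
import torch
import torch.nn.functional as F

def image_text_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired image/text embeddings of shape (B, D)."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature          # (B, B) cosine similarities
    labels = torch.arange(image_emb.size(0), device=image_emb.device)
    # Each image should match its own caption (row-wise) and vice versa (column-wise).
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
```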
Unifying Vision-and-Language Tasks via Text Generation
TLDR
This work proposes a unified framework that learns different tasks in a single architecture with the same language modeling objective, i.e., multimodal conditional text generation, where the models learn to generate labels in text based on the visual and textual inputs.
Parameter Efficient Multimodal Transformers for Video Representation Learning
TLDR
This work alleviates the high memory requirement of Transformers by sharing the weights of Transformers across layers and modalities; it decomposes the Transformer into modality-specific and modality-shared parts so that the model learns the dynamics of each modality both individually and together, and proposes a novel parameter sharing scheme based on low-rank approximation.
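One compact way to picture such a sharing scheme is a linear map whose weight is the sum of a low-rank factor reused across layers and modalities and a small low-rank per-modality factor. The sketch below is only an illustration under that assumption (class name, rank, and modality keys are invented), not the paper's code.

```python
import torch
import torch.nn as nn

class LowRankSharedLinear(nn.Module):
    """Weight = shared low-rank term (reused by every layer and modality)
    + a per-modality low-rank term, so parameters grow with `rank`, not dim**2."""

    def __init__(self, dim, rank, modalities=("video", "audio")):
        super().__init__()
        self.shared_u = nn.Parameter(torch.randn(dim, rank) * 0.02)
        self.shared_v = nn.Parameter(torch.randn(rank, dim) * 0.02)
        self.mod_u = nn.ParameterDict(
            {m: nn.Parameter(torch.randn(dim, rank) * 0.02) for m in modalities}
        )
        self.mod_v = nn.ParameterDict(
            {m: nn.Parameter(torch.randn(rank, dim) * 0.02) for m in modalities}
        )

    def forward(self, x, modality):
        # Effective weight has rank at most 2*rank; only the low-rank factors are learned.
        w = self.shared_u @ self.shared_v + self.mod_u[modality] @ self.mod_v[modality]
        return x @ w
```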

References

SHOWING 1-10 OF 44 REFERENCES
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a multi-modal two-stream model, processing both visual and textual inputs in separate streams that interact through co-attentional transformer layers.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TLDR
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Multimodal Unified Attention Networks for Vision-and-Language Interactions
TLDR
A general unified attention model is proposed that simultaneously captures the intra- and inter-modal interactions of multimodal features and outputs their corresponding attended representations, achieving top-level performance on both tasks without bells and whistles.
VideoBERT: A Joint Model for Video and Language Representation Learning
TLDR
This work builds upon the BERT model to learn bidirectional joint distributions over sequences of visual and linguistic tokens, derived from vector quantization of video data and off-the-shelf speech recognition outputs, respectively, which can be applied directly to open-vocabulary classification.
Attention is All you Need
TLDR
A new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely is proposed, which generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
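For reference, the core operation of that architecture, scaled dot-product attention, fits in a few lines; the following is a self-contained sketch rather than the reference implementation.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k, v: (..., seq_len, head_dim); mask is 0 where attention is disallowed."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)   # distribution over key positions
    return weights @ v
```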
Visual7W: Grounded Question Answering in Images
TLDR
A semantic link between textual descriptions and image regions is established by object-level grounding, enabling a new type of QA with visual answers in addition to the textual answers used in previous work, and a novel LSTM model with spatial attention is proposed to tackle the 7W QA tasks.
Multi-Modality Latent Interaction Network for Visual Question Answering
TLDR
The proposed Multi-modality Latent Interaction module (MLI) learns the cross-modality relationships between latent visual and language summarizations, which summarize the visual regions and the question into a small number of latent representations to avoid modeling uninformative individual region-word relations.
VisualBERT: A Simple and Performant Baseline for Vision and Language
TLDR
Analysis demonstrates that VisualBERT can ground elements of language to image regions without any explicit supervision and is even sensitive to syntactic relationships, tracking, for example, associations between verbs and image regions corresponding to their arguments.
Improving Language Understanding by Generative Pre-Training
TLDR
The general task-agnostic model outperforms discriminatively trained models that use architectures specifically crafted for each task, improving upon the state of the art in 9 out of the 12 tasks studied.
Dynamic Fusion With Intra- and Inter-Modality Attention Flow for Visual Question Answering
TLDR
A novel method is proposed that dynamically fuses multi-modal features with intra- and inter-modality information flow, alternately passing dynamic information between and across the visual and language modalities, and can robustly capture the high-level interactions between the language and vision domains.