PERCEIVER-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention

  title={PERCEIVER-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention},
  author={Zineng Tang and Jaemin Cho and Jie Lei and Mohit Bansal},
  journal={2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
We present PERCEIVER-VL, a vision-and-language framework that efficiently handles high-dimensional multimodal inputs such as long videos and text. Powered by the iterative latent-cross-attention of Perceiver, our framework scales with linear complexity, in contrast to the quadratic complexity of self-attention used in many state-of-the-art transformer-based models. To further improve the efficiency of our framework, we also study applying LayerDrop on cross-attention layers and introduce a…
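The linear-versus-quadratic contrast in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single attention head with no learned projections, and the names `cross_attend`, `num_latents`, and `num_inputs` are hypothetical. A small, fixed set of latent vectors acts as queries over all input vectors, so the number of query-key scores grows as `num_latents * num_inputs` rather than `num_inputs ** 2`.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(latents, inputs):
    """Each latent (query) attends over every input vector (key and value).

    Simplifying assumption: no learned query/key/value projections.
    Returns the updated latents and the number of query-key scores computed.
    """
    dim = len(latents[0])
    scale = 1.0 / math.sqrt(dim)
    updated = []
    n_scores = 0
    for q in latents:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in inputs]
        n_scores += len(scores)
        weights = softmax(scores)
        updated.append(
            [sum(w * v[d] for w, v in zip(weights, inputs)) for d in range(dim)]
        )
    return updated, n_scores


random.seed(0)
dim, num_latents = 8, 4  # illustrative sizes, not taken from the paper
latents = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_latents)]

for num_inputs in (100, 200, 400):
    inputs = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_inputs)]
    _, n_scores = cross_attend(latents, inputs)
    # Full self-attention over the inputs would need num_inputs ** 2 scores.
    print(num_inputs, n_scores, num_inputs ** 2)
```

Doubling `num_inputs` doubles the cross-attention score count but quadruples the self-attention count, which is the scaling difference the abstract refers to.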

Unifying Vision, Text, and Layout for Universal Document Processing

This work proposes Universal Document Processing, a foundation Document AI model that unifies text, image, and layout modalities with varied task formats, including document understanding and generation. The authors report that it is the first model in the field of document AI to simultaneously achieve high-quality neural document editing and content customization.



ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision

The Vision-and-Language Transformer (ViLT) is a minimal VLP model, monolithic in the sense that visual inputs are processed in the same convolution-free manner as textual inputs. The authors show that ViLT is up to 60 times faster than previous VLP models, with competitive or better downstream task performance.

Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training

After pretraining on large-scale image-caption pairs, Unicoder-VL is transferred to caption-based image-text retrieval and visual commonsense reasoning, with just one additional output layer, and shows the powerful ability of the cross-modal pre-training.

VinVL: Revisiting Visual Representations in Vision-Language Models

This paper develops an improved object detection model to provide object-centric representations of images, feeds the resulting visual features into the Transformer-based VL fusion model OSCAR, and uses an improved approach, OSCAR+, to pre-train the VL model and fine-tune it on a wide range of downstream VL tasks.

Unified Vision-Language Pre-Training for Image Captioning and VQA

VLP is the first reported model that achieves state-of-the-art results on both vision-language generation and understanding tasks, as disparate as image captioning and visual question answering, across three challenging benchmark datasets: COCO Captions, Flickr30k Captions and VQA 2.0.

Hero: Hierarchical Encoder for Video+Language Omni-representation Pre-training

This paper presents HERO, a novel framework for large-scale video+language omni-representation learning that achieves a new state of the art on multiple benchmarks covering Text-based Video/Video-moment Retrieval, Video Question Answering (QA), Video-and-language Inference, and Video Captioning across different domains.

Perceiver: General Perception with Iterative Attention

This paper introduces the Perceiver – a model that builds upon Transformers and hence makes few architectural assumptions about the relationship between its inputs, but that also scales to hundreds of thousands of inputs, like ConvNets.

LXMERT: Learning Cross-Modality Encoder Representations from Transformers

The LXMERT (Learning Cross-Modality Encoder Representations from Transformers) framework, a large-scale Transformer model that consists of three encoders, achieves state-of-the-art results on two visual question answering datasets and shows the generalizability of the pre-trained cross-modality model.

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a

Is Space-Time Attention All You Need for Video Understanding?

This paper presents a convolution-free approach to video classification built exclusively on self-attention over space and time, and suggests that “divided attention,” where temporal attention and spatial attention are separately applied within each block, leads to the best video classification accuracy among the design choices considered.
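A rough way to see what "divided attention" changes is to count how many keys each query token compares against. The sketch below is an illustration under stated assumptions, not code from the paper: the function names are hypothetical, the classification token is ignored, and 196 patches per frame (a 14 x 14 grid) is an assumed example size.

```python
def joint_keys_per_query(frames, patches):
    """Joint space-time attention: each token attends to every patch in every frame."""
    return frames * patches


def divided_keys_per_query(frames, patches):
    """Divided attention: temporal attention over the same patch position in each
    frame, then spatial attention over the patches of the token's own frame."""
    return frames + patches


patches = 196  # assumed example: 14 x 14 patches per frame
for frames in (8, 32, 96):
    joint = joint_keys_per_query(frames, patches)
    divided = divided_keys_per_query(frames, patches)
    print(f"frames={frames}: joint={joint}, divided={divided}")
```

Under these assumptions the per-token cost of divided attention grows with the sum of frames and patches rather than their product, so it stays manageable as clips get longer.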

Less is More: CLIPBERT for Video-and-Language Learning via Sparse Sampling

Experiments on text-to-video retrieval and video question answering on six datasets demonstrate that CLIPBERT outperforms (or is on par with) existing methods that exploit full-length videos, suggesting that end-to-end learning with just a few sparsely sampled clips is often more accurate than using densely extracted offline features from full-length videos, proving the proverbial less-is-more principle.