Align and Prompt: Video-and-Language Pre-training with Entity Prompts

Dongxu Li, Junnan Li, Hongdong Li, Juan Carlos Niebles, Steven C. H. Hoi. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Published 17 December 2021.
Video-and-language pre-training has shown promising improvements on various downstream tasks. Most previous methods capture cross-modal interactions with a standard transformer-based multimodal encoder, not fully addressing the misalignment between unimodal video and text features. Besides, learning fine-grained visual-language alignment usually requires off-the-shelf object detectors to provide object information, which is bottlenecked by the detector's limited vocabulary and expensive…

Spatiotemporally Discriminative Video-Language Pre-Training with Text Grounding

A simple yet effective video-language pre-training framework, G-ViLM, that learns discriminative spatiotemporal features and performs favorably against existing approaches on four representative downstream tasks: text-video retrieval, video question answering, video action recognition, and temporal action localization.

Clover: Towards A Unified Video-Language Alignment and Fusion Model

Clover improves cross-modal feature alignment and fusion via a novel tri-modal alignment pre-training task and establishes new state-of-the-art results on multiple downstream tasks, including three retrieval tasks (in both zero-shot and fine-tuning settings) and eight video question answering tasks.

Temporal Perceiving Video-Language Pre-training

Comprehensive experimental results show that this method significantly improves the state-of-the-art performance on various benchmarks, covering text-to-video retrieval, video question answering, video captioning, temporal action localization and temporal moment retrieval.

LiteVL: Efficient Video-Language Learning with Enhanced Spatial-Temporal Modeling

LiteVL is proposed, which adapts a pre-trained image-language model BLIP into a video-text model directly on downstream tasks, without heavy pre-training, and outperforms previous video-language pre-trained models by a clear margin.

HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training

A Hierarchical Temporal-Aware video-language pre-training framework, HiTeA, is proposed with two novel pre-training tasks for modeling cross-modal alignment between moments and texts as well as the temporal relations of video-text pairs.

SMAUG: Sparse Masked Autoencoder for Efficient Video-Language Pre-training

This paper develops SMAUG, an efficient pre-training framework for video-language models, and introduces a space-time token sparsification module, which leverages context information to further select only "important" spatial regions and temporal frames for pre-training.

mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections

mPLUG is a new vision-language foundation model for both cross-modal understanding and generation that achieves state-of-the-art results on a wide range of vision-language downstream tasks, including image captioning, image-text retrieval, visual grounding and visual question answering.

STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training

STOA-VLP is proposed, a pre-training framework that jointly models object and action information across spatial and temporal dimensions, and regards object trajectories across frames and multiple action features from the video as fine-grained features.

CLOP: Video-and-Language Pre-Training with Knowledge Regularizations

This work proposes a Cross-modaL knOwledge-enhanced Pre-training (CLOP) method with Knowledge Regularizations, demonstrating the value of incorporating knowledge regularizations into video-and-language pre-training.

Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners

The goal of this work is to build flexible video-language models that can generalize to various video-to-text tasks from few examples, such as domain-specific captioning, question answering, and future event prediction, and to demonstrate the power of language models in understanding videos on a wide variety of video-language tasks.

UniViLM: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation

Experimental results demonstrate that the UniVL can learn strong video-text representation and achieves state-of-the-art results on five downstream tasks.

Align before Fuse: Vision and Language Representation Learning with Momentum Distillation

ALBEF introduces a contrastive loss to ALign the image and text representations BEfore Fusing them through cross-modal attention, which enables more grounded vision and language representation learning, and proposes momentum distillation, a self-training method which learns from pseudo-targets produced by a momentum model.
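The "align before fuse" contrastive objective the summary above describes can be sketched as a symmetric InfoNCE loss over in-batch image-text pairs. This is a minimal illustration, not ALBEF's full objective (which also uses momentum-distilled pseudo-targets and a queue of negatives); the function name and shapes are assumptions for the sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def contrastive_alignment_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired image/text embeddings.

    Row i of each matrix is a matching pair; all other in-batch
    combinations act as negatives. A sketch of the alignment idea only.
    """
    # L2-normalize so dot products are cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (B, B) similarity matrix
    n = logits.shape[0]
    idx = np.arange(n)
    # image->text and text->image cross-entropy against the diagonal
    p_i2t = softmax(logits, axis=1)
    p_t2i = softmax(logits.T, axis=1)
    return -(np.log(p_i2t[idx, idx]).mean()
             + np.log(p_t2i[idx, idx]).mean()) / 2
```

With perfectly matched, mutually orthogonal embeddings the loss is near zero; mismatching the pairs drives it up, which is the alignment pressure the encoder is trained under.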

Hero: Hierarchical Encoder for Video+Language Omni-representation Pre-training

HERO, a novel framework for large-scale video+language omni-representation learning, is presented; it achieves new state of the art on multiple benchmarks over text-based video/video-moment retrieval, video question answering (QA), video-and-language inference and video captioning tasks across different domains.

Less is More: CLIPBERT for Video-and-Language Learning via Sparse Sampling

Experiments on text-to-video retrieval and video question answering on six datasets demonstrate that CLIPBERT outperforms (or is on par with) existing methods that exploit full-length videos, suggesting that end-to-end learning with just a few sparsely sampled clips is often more accurate than using densely extracted offline features from full-length videos, proving the proverbial less-is-more principle.
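The sparse-sampling recipe summarized above can be made concrete in a few lines: sample a handful of short clips, score each with the end-to-end model, and mean-pool the per-clip predictions. The function and the `clip_scorer` callback below are illustrative stand-ins, not CLIPBERT's actual API.

```python
import numpy as np

def sparse_clip_predict(video_frames, clip_len, num_clips, clip_scorer, rng=None):
    """Score a video from a few short clips instead of all frames.

    Uniformly samples `num_clips` clips of `clip_len` frames, scores each
    with `clip_scorer` (a stand-in for the clip-level model), and
    mean-pools the per-clip scores -- sparse sampling in schematic form.
    """
    rng = rng or np.random.default_rng(0)
    n = len(video_frames)
    starts = rng.integers(0, max(1, n - clip_len + 1), size=num_clips)
    clip_scores = [clip_scorer(video_frames[s:s + clip_len]) for s in starts]
    return np.mean(clip_scores, axis=0)
```

Because only `num_clips * clip_len` frames are ever touched, the model can be trained end-to-end from raw pixels instead of relying on densely pre-extracted offline features.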

End-to-End Learning of Visual Representations From Uncurated Instructional Videos

This work proposes a new learning approach, MIL-NCE, capable of addressing misalignments inherent in narrated videos and outperforms all published self-supervised approaches for these tasks as well as several fully supervised baselines.

UNITER: UNiversal Image-TExt Representation Learning

UNITER, a UNiversal Image-TExt Representation, learned through large-scale pre-training over four image-text datasets is introduced, which can power heterogeneous downstream V+L tasks with joint multimodal embeddings.

VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding

VideoCLIP is presented, a contrastive approach to pre-train a unified model for zero-shot video and text understanding, without using any labels on downstream tasks, revealing state-of-the-art performance, surpassing prior work, and in some cases even outperforming supervised approaches.

ActBERT: Learning Global-Local Video-Text Representations

Linchao Zhu, Yi Yang. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
This paper introduces ActBERT for self-supervised learning of joint video-text representations from unlabeled data and introduces an ENtangled Transformer block to encode three sources of information, i.e., global actions, local regional objects, and linguistic descriptions.

VideoBERT: A Joint Model for Video and Language Representation Learning

This work builds upon the BERT model to learn bidirectional joint distributions over sequences of visual and linguistic tokens, derived from vector quantization of video data and off-the-shelf speech recognition outputs, respectively, which can be applied directly to open-vocabulary classification.

Learning a Text-Video Embedding from Incomplete and Heterogeneous Data

This work proposes a Mixture-of-Embedding-Experts (MEE) model with the ability to handle missing input modalities during training and demonstrates significant improvements, outperforming previously reported methods on both text-to-video and video-to-text retrieval tasks.
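The mixture-of-experts scoring that lets MEE tolerate missing modalities can be sketched as a weighted sum of per-modality similarities, with the text-predicted weights renormalized over whichever modality streams the video actually provides. Names and shapes here are illustrative assumptions, not the paper's API.

```python
import numpy as np

def mee_similarity(text_embs, video_embs, weights):
    """Mixture-of-Embedding-Experts similarity, in schematic form.

    `text_embs` / `video_embs` map modality name -> embedding vector; a
    video may be missing some modalities. Each available expert contributes
    a dot-product similarity, and the weights are renormalized over the
    modalities actually present -- how missing inputs are handled.
    """
    present = [m for m in weights if m in video_embs]
    w = np.array([weights[m] for m in present])
    w = w / w.sum()                      # renormalize over available experts
    sims = np.array([float(np.dot(text_embs[m], video_embs[m]))
                     for m in present])
    return float(np.dot(w, sims))
```

When a modality (say, audio) is absent, its weight is simply redistributed to the remaining experts, so the same model scores both complete and incomplete videos.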