Align and Prompt: Video-and-Language Pre-training with Entity Prompts
@article{Li2021AlignAP,
  title   = {Align and Prompt: Video-and-Language Pre-training with Entity Prompts},
  author  = {Dongxu Li and Junnan Li and Hongdong Li and Juan Carlos Niebles and Steven C. H. Hoi},
  journal = {2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year    = {2021},
  pages   = {4943-4953}
}
Video-and-language pre-training has shown promising improvements on various downstream tasks. Most previous methods capture cross-modal interactions with a standard transformer-based multimodal encoder, not fully addressing the misalignment between unimodal video and text features. Besides, learning fine-grained visual-language alignment usually requires off-the-shelf object detectors to provide object information, which is bottlenecked by the detector's limited vocabulary and expensive…
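As a rough illustration of the "align before fusing" idea the abstract alludes to, the sketch below shows a generic symmetric video-text contrastive (InfoNCE) loss over pooled unimodal embeddings; the function name, tensor shapes, and temperature are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of a symmetric video-text contrastive (InfoNCE) objective,
# illustrating the general idea of aligning unimodal video and text features
# before cross-modal fusion. Shapes and the temperature are assumptions.
import torch
import torch.nn.functional as F

def video_text_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """video_emb, text_emb: (batch, dim) pooled unimodal features."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.t() / temperature                  # (batch, batch) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    loss_v2t = F.cross_entropy(logits, targets)       # video -> matching text
    loss_t2v = F.cross_entropy(logits.t(), targets)   # text -> matching video
    return 0.5 * (loss_v2t + loss_t2v)
```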
62 Citations
Spatiotemporally Discriminative Video-Language Pre-Training with Text Grounding
- Computer Science, ArXiv
- 2023
G-ViLM, a simple yet effective video-language pre-training framework for learning discriminative spatiotemporal features, performs favorably against existing approaches on four representative downstream tasks: text-video retrieval, video question answering, video action recognition, and temporal action localization.
Clover: Towards A Unified Video-Language Alignment and Fusion Model
- Computer Science, ArXiv
- 2022
Clover improves cross-modal feature alignment and fusion via a novel tri-modal alignment pre-training task, establishing new state-of-the-art results on multiple downstream tasks, including three retrieval tasks in both zero-shot and fine-tuning settings and eight video question answering tasks.
Temporal Perceiving Video-Language Pre-training
- Computer Science, ArXiv
- 2023
Comprehensive experimental results show that this method significantly improves the state-of-the-art performance on various benchmarks, covering text-to-video retrieval, video question answering, video captioning, temporal action localization and temporal moment retrieval.
LiteVL: Efficient Video-Language Learning with Enhanced Spatial-Temporal Modeling
- Computer Science, EMNLP
- 2022
LiteVL adapts a pre-trained image-language model, BLIP, into a video-text model directly on downstream tasks, without heavy pre-training, and outperforms previous video-language pre-trained models by a clear margin.
HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training
- Computer Science, ArXiv
- 2022
HiTeA, a hierarchical temporal-aware video-language pre-training framework, is proposed with two novel pre-training tasks for modeling cross-modal alignment between moments and texts as well as the temporal relations of video-text pairs.
SMAUG: Sparse Masked Autoencoder for Efficient Video-Language Pre-training
- Computer Science, ArXiv
- 2022
This paper develops SMAUG, an efficient pre-training framework for video-language models, and introduces a space-time token sparsification module, which leverages context information to further select only "important" spatial regions and temporal frames for pre-training.
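To make the token-sparsification idea above concrete, here is a minimal, hypothetical sketch of keeping only the top-k most "important" patch tokens given some importance score (e.g. attention weight from a [CLS] token); SMAUG's actual selection mechanism may differ.

```python
# Hypothetical top-k token selection by an importance score; only a sketch of
# the general sparsification pattern, not SMAUG's specific module.
import torch

def keep_topk_tokens(tokens, scores, k):
    """tokens: (B, N, D) patch tokens; scores: (B, N) importance scores.
    Returns the k highest-scoring tokens per example."""
    idx = scores.topk(k, dim=1).indices                       # (B, k)
    idx = idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))   # (B, k, D)
    return tokens.gather(1, idx)                              # (B, k, D)
```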
mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections
- Computer Science, EMNLP
- 2022
mPLUG is a new vision-language foundation model for both cross-modal understanding and generation that achieves state-of-the-art results on a wide range of vision-language downstream tasks, including image captioning, image-text retrieval, visual grounding and visual question answering.
STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training
- Computer Science, ArXiv
- 2023
STOA-VLP, a pre-training framework that jointly models object and action information across spatial and temporal dimensions, is proposed; it regards object trajectories across frames and multiple action features from the video as fine-grained features.
CLOP: Video-and-Language Pre-Training with Knowledge Regularizations
- Computer Science, ACM Multimedia
- 2022
This work proposes a Cross-modaL knOwledge-enhanced Pre-training (CLOP) method with Knowledge Regularizations, demonstrating the value of incorporating knowledge regularizations into video-and-language pre-training.
Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners
- Computer Science, NeurIPS
- 2022
The goal of this work is to build flexible video-language models that can generalize to various video-to-text tasks from few examples, such as domain-specific captioning, question answering, and future event prediction, and to demonstrate the power of language models in understanding videos on a wide variety of video-language tasks.
61 References
UniViLM: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation
- Computer Science, ArXiv
- 2020
Experimental results demonstrate that UniVL learns strong video-text representations and achieves state-of-the-art results on five downstream tasks.
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
- Computer Science, NeurIPS
- 2021
A contrastive loss to ALign the image and text representations BEfore Fusing (ALBEF) them through cross-modal attention, which enables more grounded vision and language representation learning and proposes momentum distillation, a self-training method which learns from pseudo-targets produced by a momentum model.
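A minimal sketch of the momentum-distillation idea described here, assuming a standard EMA ("momentum") copy of the model whose softened predictions are blended with the hard one-hot targets; the mixing weight alpha and the EMA rate are assumptions, not ALBEF's exact values.

```python
# Sketch of momentum distillation: an EMA copy of the model produces soft
# pseudo-targets that regularize the contrastive logits. Hyperparameters below
# are assumptions for illustration only.
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(model, momentum_model, m=0.995):
    # Exponential moving average of the online model's parameters.
    for p, p_m in zip(model.parameters(), momentum_model.parameters()):
        p_m.data.mul_(m).add_(p.data, alpha=1.0 - m)

def distilled_contrastive_loss(logits, momentum_logits, targets, alpha=0.4):
    """Blend hard one-hot targets with soft targets from the momentum model."""
    soft = F.softmax(momentum_logits, dim=-1)
    hard_loss = F.cross_entropy(logits, targets)
    soft_loss = -(soft * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
    return (1 - alpha) * hard_loss + alpha * soft_loss
```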
Hero: Hierarchical Encoder for Video+Language Omni-representation Pre-training
- Computer Science, EMNLP
- 2020
HERO, a novel framework for large-scale video+language omni-representation learning, is presented; it achieves new state of the art on multiple benchmarks covering text-based video/video-moment retrieval, video question answering (QA), video-and-language inference, and video captioning tasks across different domains.
Less is More: CLIPBERT for Video-and-Language Learning via Sparse Sampling
- Computer Science, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2021
Experiments on text-to-video retrieval and video question answering on six datasets demonstrate that CLIPBERT outperforms (or is on par with) existing methods that exploit full-length videos, suggesting that end-to-end learning with just a few sparsely sampled clips is often more accurate than using densely extracted offline features from full-length videos, proving the proverbial less-is-more principle.
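The "sparse sampling" recipe summarized above can be illustrated with a tiny, assumption-laden helper that draws a few short clips at random per training step instead of relying on densely extracted offline features; the clip count and length below are arbitrary choices, not CLIPBERT's settings.

```python
# Toy sketch of sparse clip sampling for a training step.
import random

def sample_sparse_clips(num_frames, num_clips=2, frames_per_clip=2):
    """Return frame indices for a few randomly placed short clips
    (assumes num_frames >= frames_per_clip)."""
    clips = []
    for _ in range(num_clips):
        start = random.randint(0, num_frames - frames_per_clip)
        clips.append(list(range(start, start + frames_per_clip)))
    return clips

# e.g. sample_sparse_clips(300) might return [[41, 42], [187, 188]]
```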
End-to-End Learning of Visual Representations From Uncurated Instructional Videos
- Computer Science, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2020
This work proposes a new learning approach, MIL-NCE, capable of addressing misalignments inherent in narrated videos and outperforms all published self-supervised approaches for these tasks as well as several fully supervised baselines.
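For reference, a compact sketch of a MIL-NCE-style objective as summarized here: each video is paired with a small bag of temporally close narrations (positives), and the loss aggregates over that bag rather than a single positive pair; the mask construction and temperature are assumptions for illustration.

```python
# Sketch of a MIL-NCE-style loss: softly aggregate over a bag of candidate
# positives per video instead of assuming one perfectly aligned caption.
import torch

def mil_nce_loss(video_emb, text_emb, pos_mask, temperature=0.07):
    """video_emb: (B, D); text_emb: (N, D); pos_mask: (B, N) bool,
    True where a narration counts as a positive for that video."""
    sim = video_emb @ text_emb.t() / temperature       # (B, N) similarities
    pos = sim.masked_fill(~pos_mask, float('-inf'))    # keep positives only
    numerator = torch.logsumexp(pos, dim=1)            # log-sum over the positive bag
    denominator = torch.logsumexp(sim, dim=1)          # log-sum over all candidates
    return (denominator - numerator).mean()
```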
UNITER: UNiversal Image-TExt Representation Learning
- Computer Science, ECCV
- 2020
UNITER, a UNiversal Image-TExt Representation learned through large-scale pre-training over four image-text datasets, is introduced; it can power heterogeneous downstream V+L tasks with joint multimodal embeddings.
VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding
- Computer Science, EMNLP
- 2021
VideoCLIP is presented, a contrastive approach to pre-train a unified model for zero-shot video and text understanding, without using any labels on downstream tasks, revealing state-of-the-art performance, surpassing prior work, and in some cases even outperforming supervised approaches.
ActBERT: Learning Global-Local Video-Text Representations
- Computer Science, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2020
This paper introduces ActBERT for self-supervised learning of joint video-text representations from unlabeled data and introduces an ENtangled Transformer block to encode three sources of information, i.e., global actions, local regional objects, and linguistic descriptions.
VideoBERT: A Joint Model for Video and Language Representation Learning
- Computer Science, 2019 IEEE/CVF International Conference on Computer Vision (ICCV)
- 2019
This work builds upon the BERT model to learn bidirectional joint distributions over sequences of visual and linguistic tokens, derived from vector quantization of video data and off-the-shelf speech recognition outputs, respectively, which can be applied directly to open-vocabulary classification.
Learning a Text-Video Embedding from Incomplete and Heterogeneous Data
- Computer Science, ArXiv
- 2018
This work proposes a Mixture-of-Embedding-Experts (MEE) model with ability to handle missing input modalities during training and demonstrates significant improvements and outperforms previously reported methods on both text-to-video and video-to-text retrieval tasks.