CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment

Hongwei Xue, Yuchong Sun, Bei Liu, Jianlong Fu, Rui Song, Houqiang Li, Jiebo Luo
Pre-trained image-text models such as CLIP have demonstrated the strong power of vision-language representations learned from large-scale web-collected image-text data. In light of the well-learned visual features, some existing works transfer image representations to the video domain and achieve good results. However, how to utilize an image-language pre-trained model (e.g., CLIP) for video-language pre-training (post-pretraining) is still underexplored. In this paper, we investigate two…

Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners

This work presents VideoCoCa, an efficient approach to establish a foundational video-text model for tasks including open-vocabulary video classification, text-to-video retrieval, video captioning and video question-answering, and explores lightweight finetuning on top of this model.

X2-VLM: All-In-One Pre-trained Model For Vision-Language Tasks

The proposed multi-grained vision language pretraining approach is advanced by unifying image and video encoding in one model and scaling up the model with large-scale data, resulting in X2-VLM, a pre-trained VLM with a modular architecture for both image-text and video-text tasks.

Learning Video Representations from Large Language Models

This work repurposes pre-trained LLMs to be conditioned on visual input and fine-tunes them to create automatic video narrators, which offer a number of advantages, including dense coverage of long videos, better temporal synchronization of the visual information and text, and much higher diversity of text.

VindLU: A Recipe for Effective Video-and-Language Pretraining

This thorough empirical study demystifies the most important factors in VidL model design and develops a step-by-step recipe, dubbed VindLU, for effective VidL pretraining, which achieves results comparable to or better than the state of the art on several VidL tasks without relying on external CLIP pretraining.

MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering

A new model named Multi-modal Iterative Spatial-temporal Transformer (MIST) is introduced to better adapt pre-trained models for long-form VideoQA; it achieves state-of-the-art performance and is superior in computation and interpretability.

Stare at What You See: Masked Image Modeling without Reconstruction

The experimental results demonstrate that masked modeling does not lose effectiveness even without reconstruction on masked regions, and an efficient MIM paradigm named MaskAlign can achieve state-of-the-art performance with much higher efficiency.

Test of Time: Instilling Video-Language Models with a Sense of Time

This paper proposes a temporal adaptation recipe on top of one video-language model, VideoCLIP, based on post-pretraining on a small amount of video-text data, and conducts a zero-shot evaluation of the adapted models on six datasets for three downstream tasks which require a varying degree of time awareness.



CLIP2Video: Mastering Video-Text Retrieval via Image CLIP

The CLIP2Video network is presented to transfer the image-language pre-training model to video-text retrieval in an end-to-end manner, and achieves state-of-the-art performance on major text-to-video and video-to-text retrieval benchmarks, including new records of retrieval accuracy on MSR-VTT, MSVD and VATEX.

UNITER: UNiversal Image-TExt Representation Learning

UNITER, a UNiversal Image-TExt Representation learned through large-scale pre-training over four image-text datasets, is introduced; it can power heterogeneous downstream V+L tasks with joint multimodal embeddings.

CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval

Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions

A novel High-resolution and Diversified VIdeo-LAnguage pre-training model (HD-VILA) is proposed for many visual tasks; it outperforms SOTA models and achieves new state-of-the-art results on 10 VL understanding tasks and 2 novel text-to-visual generation tasks.

Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts

The results clearly illustrate the benefit of scaling up pre-training data for vision-and-language tasks, as indicated by the new state-of-the-art results on both the nocaps and Conceptual Captions benchmarks.

Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks

This paper proposes a new learning method Oscar (Object-Semantics Aligned Pre-training), which uses object tags detected in images as anchor points to significantly ease the learning of alignments.

Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss

This paper presents a multi-stream Corpus Alignment network with single gate Mixture-of-Experts (CAMoE) and a novel Dual Softmax Loss (DSL) to address two kinds of heterogeneity in the video-text retrieval (VTR) task; each is capable of achieving state-of-the-art (SOTA) results individually on various benchmarks.

ClipCap: CLIP Prefix for Image Captioning

This paper uses the CLIP encoding as a prefix to the caption by employing a simple mapping network, and then fine-tunes a language model to generate the image captions, allowing a lighter architecture with fewer trainable parameters.

Less is More: CLIPBERT for Video-and-Language Learning via Sparse Sampling

Experiments on text-to-video retrieval and video question answering on six datasets demonstrate that CLIPBERT outperforms (or is on par with) existing methods that exploit full-length videos, suggesting that end-to-end learning with just a few sparsely sampled clips is often more accurate than using densely extracted offline features from full-length videos, proving the proverbial less-is-more principle.

UniViLM: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation

Experimental results demonstrate that UniVL can learn strong video-text representations and achieves state-of-the-art results on five downstream tasks.