CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment
@article{Xue2022CLIPViPAP,
  title   = {CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment},
  author  = {Hongwei Xue and Yuchong Sun and Bei Liu and Jianlong Fu and Rui Song and Houqiang Li and Jiebo Luo},
  journal = {ArXiv},
  year    = {2022},
  volume  = {abs/2209.06430}
}
Pre-trained image-text models such as CLIP have demonstrated the strong power of vision-language representations learned from large-scale web-collected image-text data. In light of the well-learned visual features, some existing works transfer image representations to the video domain and achieve good results. However, how to utilize an image-language pre-trained model (e.g., CLIP) for video-language pre-training (post-pretraining) is still underexplored. In this paper, we investigate two…
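The abstract refers to prior works that transfer CLIP's image representation to the video domain. As a rough, hedged illustration of that common baseline (not the CLIP-ViP method itself), the sketch below mean-pools CLIP frame embeddings into a single video embedding and scores it against candidate captions. It assumes OpenAI's public `clip` package; the frame paths and captions are hypothetical.

```python
# Minimal sketch: zero-shot video-text matching by mean-pooling CLIP frame
# features. This is a generic baseline, NOT the CLIP-ViP method.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def encode_video(frame_paths):
    """Encode a video as the L2-normalized mean of its frame embeddings."""
    frames = torch.stack([preprocess(Image.open(p)) for p in frame_paths]).to(device)
    with torch.no_grad():
        feats = model.encode_image(frames)            # (num_frames, dim)
    video = feats.mean(dim=0, keepdim=True)           # temporal mean pooling
    return video / video.norm(dim=-1, keepdim=True)

def encode_texts(captions):
    tokens = clip.tokenize(captions).to(device)
    with torch.no_grad():
        feats = model.encode_text(tokens)
    return feats / feats.norm(dim=-1, keepdim=True)

# Hypothetical frame paths and captions; higher similarity = better match.
video_emb = encode_video(["frame_000.jpg", "frame_008.jpg", "frame_016.jpg"])
text_emb = encode_texts(["a dog catches a frisbee", "a person cooks pasta"])
print(video_emb @ text_emb.T)
```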
7 Citations
Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners
- Computer Science, ArXiv
- 2022
This work presents VideoCoCa, an efficient approach to establish a foundational video-text model for tasks including open-vocabulary video classification, text-to-video retrieval, video captioning and video question-answering, and explores lightweight finetuning on top of this model.
X2-VLM: All-In-One Pre-trained Model For Vision-Language Tasks
- Computer Science, ArXiv
- 2022
The proposed multi-grained vision language pretraining approach is advanced by unifying image and video encoding in one model and scaling up the model with large-scale data, resulting in X2-VLM, a pre-trained VLM with a modular architecture for both image-text and video-text tasks.
Learning Video Representations from Large Language Models
- Computer Science, ArXiv
- 2022
This work repurposes pre-trained LLMs to be conditioned on visual input and finetunes them to create automatic video narrators, which offer a number of advantages, including dense coverage of long videos, better temporal synchronization of the visual information and text, and much higher diversity of text.
VindLU: A Recipe for Effective Video-and-Language Pretraining
- Computer Science, ArXiv
- 2022
A thorough empirical study demystifies the most important factors in VidL model design and yields a step-by-step recipe, dubbed VindLU, for effective VidL pretraining, which achieves results comparable to or better than the state of the art on several VidL tasks without relying on external CLIP pretraining.
MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering
- Computer Science, ArXiv
- 2022
A new model named Multi-modal Iterative Spatial-temporal Transformer (MIST) is introduced to better adapt pre-trained models for long-form VideoQA; it achieves state-of-the-art performance and is superior in terms of computation and interpretability.
Stare at What You See: Masked Image Modeling without Reconstruction
- Computer Science, ArXiv
- 2022
The experimental results demonstrate that masked modeling does not lose effectiveness even without reconstruction on masked regions, and an efficient MIM paradigm named MaskAlign can achieve state-of-the-art performance with much higher efficiency.
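The summary above describes masked image modeling without pixel reconstruction. As a generic, hedged sketch of that idea (not necessarily MaskAlign's exact architecture or loss), a student operating on a masked view can be trained to align its features with those of a frozen teacher that sees the full input:

```python
# Hedged sketch: reconstruction-free masked modeling as feature alignment.
# The loss matches student token features (masked view) to frozen teacher
# features (full view) by cosine distance; dimensions are illustrative.
import torch
import torch.nn.functional as F

def alignment_loss(student_feats, teacher_feats):
    """Both inputs: (B, N, D) token features; the teacher gets no gradient."""
    s = F.normalize(student_feats, dim=-1)
    t = F.normalize(teacher_feats.detach(), dim=-1)
    return (1 - (s * t).sum(dim=-1)).mean()           # mean cosine distance

# Toy usage with random features (B=2, N=49 tokens, D=768 dims).
student = torch.randn(2, 49, 768, requires_grad=True)
teacher = torch.randn(2, 49, 768)
print(alignment_loss(student, teacher))
```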
Test of Time: Instilling Video-Language Models with a Sense of Time
- Computer Science, ArXiv
- 2023
This paper proposes a temporal adaptation recipe on top of one video-language model, VideoCLIP, based on post-pretraining on a small amount of video-text data, and conducts a zero-shot evaluation of the adapted models on six datasets for three downstream tasks which require a varying degree of time awareness.
References
Showing 1-10 of 59 references
CLIP2Video: Mastering Video-Text Retrieval via Image CLIP
- Computer Science, ArXiv
- 2021
The CLIP2Video network is presented to transfer the image-language pre-training model to video-text retrieval in an end-to-end manner, and achieves state-of-the-art performance on major text-to-video and video-to-text retrieval benchmarks, including new records of retrieval accuracy on MSR-VTT, MSVD and VATEX.
UNITER: UNiversal Image-TExt Representation Learning
- Computer Science, ECCV
- 2020
UNITER, a UNiversal Image-TExt Representation learned through large-scale pre-training over four image-text datasets, is introduced; it can power heterogeneous downstream V+L tasks with joint multimodal embeddings.
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval
- Computer Science, Neurocomputing
- 2022
Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions
- Computer Science, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2022
A novel High-resolution and Diversified VIdeo-LAnguage pre-training model (HD-VILA) for many visual tasks is proposed; it outperforms SOTA models and achieves new state-of-the-art results on 10 VL understanding tasks and 2 novel text-to-visual generation tasks.
Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
- Computer Science, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2021
The results clearly illustrate the benefit of scaling up pre-training data for vision-and-language tasks, as indicated by the new state-of-the-art results on both the nocaps and Conceptual Captions benchmarks.
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
- Computer Science, ECCV
- 2020
This paper proposes a new learning method Oscar (Object-Semantics Aligned Pre-training), which uses object tags detected in images as anchor points to significantly ease the learning of alignments.
Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss
- Computer Science, ArXiv
- 2021
A multi-stream Corpus Alignment network with single-gate Mixture-of-Experts (CAMoE) and a novel Dual Softmax Loss (DSL) are proposed to address the heterogeneity of the VTR task, and each of them is capable of achieving state-of-the-art (SOTA) results individually on various benchmarks.
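The Dual Softmax Loss mentioned above reweights the similarity matrix with a softmax taken along the opposite retrieval direction before the usual symmetric cross-entropy. The sketch below follows the formulation commonly seen in open-source retrieval code; the temperature values and the exact placement of the prior are illustrative and may differ from the CAMoE paper.

```python
# Hedged sketch of a dual-softmax-style retrieval loss on a (B, B) similarity
# matrix where row i / column i form the matched video-text pair.
import torch
import torch.nn.functional as F

def dual_softmax_retrieval_loss(sim, logit_scale=100.0, prior_temp=1000.0):
    labels = torch.arange(sim.size(0), device=sim.device)
    # Reweight each logit matrix by a softmax over the *other* direction.
    v2t = sim * F.softmax(sim * prior_temp, dim=0)        # text->video prior
    t2v = sim.t() * F.softmax(sim.t() * prior_temp, dim=0)
    return (F.cross_entropy(v2t * logit_scale, labels) +
            F.cross_entropy(t2v * logit_scale, labels)) / 2

# Toy usage with random, L2-normalized embeddings for 4 video-text pairs.
v = F.normalize(torch.randn(4, 512), dim=-1)
t = F.normalize(torch.randn(4, 512), dim=-1)
print(dual_softmax_retrieval_loss(v @ t.t()))
```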
ClipCap: CLIP Prefix for Image Captioning
- Computer Science, ArXiv
- 2021
This paper uses the CLIP encoding as a prefix to the caption by employing a simple mapping network, and then fine-tunes a language model to generate the image captions, allowing a lighter architecture with fewer trainable parameters.
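The summary above describes mapping a CLIP embedding to a prefix for a language model. The sketch below shows a ClipCap-style mapping network in that spirit; the dimensions, prefix length, and MLP shape are illustrative assumptions, not the paper's exact configuration.

```python
# Hedged sketch: a small MLP maps a CLIP image embedding to `prefix_len`
# pseudo-token embeddings that would be prepended to the caption embeddings
# of a language model (e.g., GPT-2) before fine-tuning it for captioning.
import torch
import torch.nn as nn

class PrefixMapper(nn.Module):
    def __init__(self, clip_dim=512, lm_dim=768, prefix_len=10):
        super().__init__()
        self.prefix_len, self.lm_dim = prefix_len, lm_dim
        hidden = lm_dim * prefix_len // 2
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, lm_dim * prefix_len),
        )

    def forward(self, clip_embedding):                 # (B, clip_dim)
        prefix = self.mlp(clip_embedding)              # (B, lm_dim * prefix_len)
        return prefix.view(-1, self.prefix_len, self.lm_dim)

# Toy usage: 4 CLIP embeddings -> 4 sequences of 10 prefix embeddings.
print(PrefixMapper()(torch.randn(4, 512)).shape)       # torch.Size([4, 10, 768])
```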
Less is More: CLIPBERT for Video-and-Language Learning via Sparse Sampling
- Computer Science, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2021
Experiments on text-to-video retrieval and video question answering on six datasets demonstrate that CLIPBERT outperforms (or is on par with) existing methods that exploit full-length videos, suggesting that end-to-end learning with just a few sparsely sampled clips is often more accurate than using densely extracted offline features from full-length videos, proving the proverbial less-is-more principle.
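The summary above hinges on sparse clip sampling: training on a few short clips instead of densely decoded full videos. A minimal sketch of such a sampler is shown below; the clip count and clip length are illustrative defaults, not the paper's exact settings.

```python
# Hedged sketch: sample a few short clips uniformly spread over a video,
# with a random offset inside each segment (training-time sparse sampling).
import random

def sample_sparse_clips(num_frames, num_clips=2, frames_per_clip=2):
    """Return frame-index lists for `num_clips` short clips from one video."""
    clips, segment = [], num_frames / num_clips
    for i in range(num_clips):
        start = int(i * segment + random.random() * max(segment - frames_per_clip, 0))
        clips.append(list(range(start, min(start + frames_per_clip, num_frames))))
    return clips

print(sample_sparse_clips(300))   # e.g. [[47, 48], [231, 232]]
```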
UniViLM: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation
- Computer Science, ArXiv
- 2020
Experimental results demonstrate that UniVL can learn strong video-text representations and achieves state-of-the-art results on five downstream tasks.