Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners

Shen Yan, Tao Zhu, Zirui Wang, Yuan Cao, Mi Zhang, Soham Ghosh, Yonghui Wu, Jiahui Yu
We explore an efficient approach to establish a foundational video-text model. We present VideoCoCa, which maximally reuses a pretrained image-text contrastive captioner (CoCa) model and adapts it to video-text tasks with minimal extra training. While previous works adapt image-text models with various cross-frame fusion modules, we find that the generative attentional pooling and contrastive attentional pooling layers in CoCa are instantly adaptable to flattened frame embeddings, yielding state-of-the-art …
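The adaptation described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the single-head attention, embedding sizes, and query counts are all illustrative assumptions; the key point is that the same poolers operate on the flattened (frames × tokens) sequence without new cross-frame modules.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attentional_pool(tokens, queries):
    """Single-head attention pooling: learned queries attend over tokens.

    tokens:  (n_tokens, d) token embeddings
    queries: (n_queries, d) learned query vectors
    returns: (n_queries, d) pooled embeddings
    """
    scores = queries @ tokens.T / np.sqrt(tokens.shape[-1])
    return softmax(scores, axis=-1) @ tokens

rng = np.random.default_rng(0)
d, n_frames, tokens_per_frame = 64, 8, 16

# Per-frame token embeddings from a frozen image encoder (hypothetical shapes).
frame_tokens = rng.normal(size=(n_frames, tokens_per_frame, d))

# Flatten across time into one (n_frames * tokens_per_frame, d) sequence.
flat = frame_tokens.reshape(-1, d)

# Reuse the image model's two poolers unchanged on the flattened sequence:
contrastive_query = rng.normal(size=(1, d))     # 1 query -> video embedding
generative_queries = rng.normal(size=(256, d))  # 256 queries -> decoder input

video_embed = attentional_pool(flat, contrastive_query)      # (1, d)
caption_tokens = attentional_pool(flat, generative_queries)  # (256, d)
```

In this sketch the contrastive pooler produces a single video-level embedding for retrieval, while the generative pooler produces a fixed-length token sequence for the caption decoder, exactly as the two poolers do for images.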

mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video

Recent years have witnessed a big convergence of language, vision, and multi-modal pretraining. In this work, we present mPLUG-2, a new unified paradigm with a modularized design for multi-modal pretraining.

VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding

VideoCLIP is presented, a contrastive approach to pre-train a unified model for zero-shot video and text understanding without using any labels on downstream tasks; it achieves state-of-the-art performance, surpassing prior work and in some cases even outperforming supervised approaches.

FitCLIP: Refining Large-Scale Pretrained Image-Text Models for Zero-Shot Video Understanding Tasks

This paper presents a fine-tuning strategy to refine these large-scale pretrained image-text models for zero-shot video understanding tasks, and shows that carefully adapting these models yields considerable improvements on two zero-shot action recognition tasks and three text-to-video retrieval tasks.

Learning Audio-Video Modalities from Image Captions

A new video mining pipeline is proposed that transfers captions from image captioning datasets to video clips with no additional manual effort, and it is shown that training a multimodal transformer-based model on this data achieves competitive performance on video retrieval and video captioning.

A CLIP-Hitchhiker's Guide to Long Video Retrieval

It is found that the simple yet effective baseline of a weighted mean of frame embeddings via query-scoring is a significant improvement over all prior temporal modelling attempts and over mean-pooling.
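The query-scored weighted mean described above can be sketched roughly as below. This is a hedged NumPy sketch: the softmax weighting, temperature value, and L2 normalization are plausible assumptions, not the paper's exact recipe.

```python
import numpy as np

def query_scored_mean(frame_embs, text_emb, temperature=0.1):
    """Pool frame embeddings with weights from text-frame similarity.

    frame_embs: (n_frames, d) L2-normalized per-frame embeddings
    text_emb:   (d,) L2-normalized text query embedding
    returns:    (d,) query-conditioned video embedding
    """
    sims = frame_embs @ text_emb                   # cosine similarity per frame
    w = np.exp((sims - sims.max()) / temperature)  # softmax weights
    w /= w.sum()
    return w @ frame_embs                          # weighted mean of frames

rng = np.random.default_rng(0)
frames = rng.normal(size=(12, 32))
frames /= np.linalg.norm(frames, axis=1, keepdims=True)
text = rng.normal(size=32)
text /= np.linalg.norm(text)

video = query_scored_mean(frames, text)
baseline = frames.mean(axis=0)  # the plain mean-pooling baseline it improves on
```

The contrast with `baseline` shows the design choice: instead of treating all frames equally, frames more similar to the text query contribute more to the pooled video embedding.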

CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment

An Omnisource Cross-modal Learning method equipped with a Video Proxy mechanism on the basis of CLIP, namely CLIP-ViP, which improves the performance of CLIP on video-text retrieval by a large margin and achieves SOTA results on a variety of datasets.

NITS-VC System for VATEX Video Captioning Challenge 2020

An encoder-decoder based approach in which the visual features of the video are encoded using a 3D convolutional neural network (C3D); in the decoding phase, two Long Short-Term Memory (LSTM) recurrent networks fuse visual features and input captions separately, and the final output is generated by an element-wise product between the outputs of the two LSTMs.

HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

It is demonstrated that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask.

CoCa: Contrastive Captioners are Image-Text Foundation Models

A minimalist design to pretrain an image-text encoder-decoder foundation model jointly with contrastive loss and captioning loss, thereby subsuming model capabilities from contrastive approaches like CLIP and generative methods like SimVLM.
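The joint objective that CoCa trains can be sketched as a weighted sum of a symmetric contrastive (InfoNCE-style) loss and a next-token captioning cross-entropy. The sketch below is a minimal NumPy illustration under assumed shapes; the temperature, loss weight `lam`, and the stand-in decoder logits are illustrative, not CoCa's actual values.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def contrastive_loss(img_embs, txt_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired, normalized embeddings."""
    logits = img_embs @ txt_embs.T / temperature  # (B, B) similarity matrix
    idx = np.arange(len(logits))
    p_i2t = softmax(logits, axis=1)  # image -> text direction
    p_t2i = softmax(logits, axis=0)  # text -> image direction
    nll = -np.log(p_i2t[idx, idx]) - np.log(p_t2i[idx, idx])
    return nll.mean() / 2

def captioning_loss(token_logits, token_ids):
    """Per-token cross-entropy for next-token caption prediction."""
    probs = softmax(token_logits, axis=-1)
    return -np.log(probs[np.arange(len(token_ids)), token_ids]).mean()

rng = np.random.default_rng(0)
B, d, T, V = 4, 16, 6, 100  # batch, embed dim, caption length, vocab size

img = rng.normal(size=(B, d)); img /= np.linalg.norm(img, axis=1, keepdims=True)
txt = rng.normal(size=(B, d)); txt /= np.linalg.norm(txt, axis=1, keepdims=True)

# Stand-in for decoder output logits over the vocabulary (hypothetical).
logits = rng.normal(size=(T, V))
ids = rng.integers(0, V, size=T)

lam = 2.0  # captioning-loss weight (illustrative)
total = contrastive_loss(img, txt) + lam * captioning_loss(logits, ids)
```

Training a single model on this summed objective is what lets it subsume both a CLIP-style dual encoder (via the contrastive term) and a SimVLM-style generative model (via the captioning term).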

End-to-end Generative Pretraining for Multimodal Video Captioning

Multimodal Video Generative Pretraining (MV-GPT) is presented, a new pretraining framework for learning from unlabelled videos that can be effectively used for generative tasks such as multimodal video captioning; it achieves state-of-the-art performance on four standard benchmarks.

Prompting Visual-Language Models for Efficient Video Understanding

This paper proposes to optimise a few random vectors, termed "continuous prompt vectors", that convert video-related tasks into the same format as the pre-training objectives, exploiting the pretrained model's powerful ability for resource-hungry video understanding tasks with minimal training.