Transforming CLIP to an Open-vocabulary Video Model via Interpolated Weight Optimization

Zejia Weng, Xitong Yang, Ang Li, Zuxuan Wu, Yu-Gang Jiang
Contrastive Language-Image Pretraining (CLIP) has demonstrated impressive zero-shot learning abilities for image understanding, yet limited effort has been made to investigate CLIP for zero-shot video recognition. We introduce Open-VCLIP, a simple yet effective approach that transforms CLIP into a strong zero-shot video classifier that can recognize unseen actions and events at test time. Our framework extends CLIP with minimal modifications to model spatial-temporal relationships in videos… 


Expanding Language-Image Pretrained Models for General Video Recognition

This work presents a simple yet effective approach that adapts the pretrained language-image models to video recognition directly, instead of pretraining a new model from scratch, and proposes a cross-frame attention mechanism that explicitly exchanges information across frames.

ActionCLIP: A New Paradigm for Video Action Recognition

An instantiation of the new paradigm, ActionCLIP, not only has superior and flexible zero-shot/few-shot transfer ability but also reaches top performance on general action recognition tasks, achieving 83.8% top-1 accuracy on Kinetics-400 with a ViT-B/16 backbone.

Rethinking Zero-Shot Video Classification: End-to-End Training for Realistic Applications

This work proposes the first end-to-end algorithm for zero-shot learning (ZSL) in video classification, which uses a trainable 3D CNN to learn the visual features and outperforms the state of the art by a wide margin.

Patching open-vocabulary models by interpolating weights

PAINT, a patching method that interpolates between a model's weights before fine-tuning and its weights after fine-tuning on the task to be patched, is introduced, demonstrating that the set of tasks on which open-vocabulary models achieve high accuracy can be expanded without re-training them from scratch.
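The weight interpolation described above can be sketched in a few lines. This is a minimal NumPy illustration, not the PAINT implementation: checkpoints are represented as plain dictionaries of arrays, and the function name and toy parameters are hypothetical.

```python
import numpy as np

def interpolate_weights(theta_zeroshot, theta_finetuned, alpha):
    """Linearly interpolate two checkpoints with matching parameter names.

    alpha = 0 recovers the zero-shot model, alpha = 1 the fine-tuned one;
    intermediate values trade task accuracy against retained generality.
    """
    assert theta_zeroshot.keys() == theta_finetuned.keys()
    return {
        name: (1.0 - alpha) * theta_zeroshot[name] + alpha * theta_finetuned[name]
        for name in theta_zeroshot
    }

# Toy example with two "checkpoints" of a single weight matrix.
theta_zs = {"w": np.zeros((2, 2))}   # before fine-tuning
theta_ft = {"w": np.ones((2, 2))}    # after fine-tuning
theta_mix = interpolate_weights(theta_zs, theta_ft, alpha=0.5)
# Every entry of theta_mix["w"] is 0.5.
```

In practice the mixing coefficient alpha is chosen on held-out data to balance accuracy on the patched task against accuracy on the original tasks.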

Revisiting Classifier: Transferring Vision-Language Models for Video Recognition

This paper revisits the role of the linear classifier, replacing it with knowledge from the pre-trained model, and utilizes a well-pretrained language model to generate good semantic targets for efficient transfer learning.

Elaborative Rehearsal for Zero-shot Action Recognition

Shizhe Chen, Dong Huang · 2021 IEEE/CVF International Conference on Computer Vision (ICCV)
This work proposes an ER-enhanced ZSAR model inspired by Elaborative Rehearsal (ER), an effective human memory technique that involves elaborating a new concept and relating it to known concepts, and achieves state-of-the-art results on three existing benchmarks.

Overcoming catastrophic forgetting in neural networks

It is shown that it is possible to overcome this limitation of connectionist models and train networks that maintain expertise on tasks they have not experienced for a long time, by selectively slowing down learning on the weights important for previous tasks.
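The "selective slowing" above is implemented as a quadratic penalty anchoring important weights near their old values. A minimal NumPy sketch of such a penalty follows; the function name, toy values, and diagonal-Fisher importance estimate are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def importance_penalty(theta, theta_old, fisher, lam=1.0):
    """Quadratic penalty that slows learning on weights important to old tasks.

    fisher[i] estimates how important parameter i was for the previous task
    (e.g. the diagonal of the Fisher information); large values anchor
    theta[i] near theta_old[i].
    """
    return 0.5 * lam * np.sum(fisher * (theta - theta_old) ** 2)

theta_old = np.array([1.0, -2.0, 0.5])
fisher = np.array([10.0, 0.1, 1.0])   # first weight is "important"
theta = np.array([1.5, 0.0, 0.5])     # current weights after some new-task updates

loss = importance_penalty(theta, theta_old, fisher)
# Moving the important first weight by 0.5 dominates the penalty:
# 0.5 * (10*0.25 + 0.1*4.0 + 1.0*0.0) = 1.45
```

Adding this term to the new task's loss discourages large moves along directions the old task depends on, while leaving unimportant weights free to change.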

HMDB: A large video database for human motion recognition

This paper uses the largest action video database to date, with 51 action categories comprising around 7,000 manually annotated clips extracted from a variety of sources ranging from digitized movies to YouTube, to evaluate the performance of two representative computer vision systems for action recognition and to explore the robustness of these methods under various conditions.

Crossmodal Representation Learning for Zero-shot Action Recognition

We present a cross-modal Transformer-based framework, which jointly encodes video data and text labels for zero-shot action recognition (ZSAR). Our model employs a conceptually new pipeline by which…

Learning Transferable Visual Models From Natural Language Supervision

It is demonstrated that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
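The pre-training task above, predicting which caption goes with which image, is a symmetric contrastive objective: matched (image, text) pairs in a batch should score higher than all mismatched pairs. A minimal NumPy sketch is given below; it is not CLIP's actual implementation, and the temperature value is illustrative.

```python
import numpy as np

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy: matched (image, text) pairs sit on the diagonal."""
    # L2-normalize so the dot product is cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature  # (N, N) pairwise similarities

    def xent_to_diagonal(l):
        # Cross-entropy where row i's correct class is column i.
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (xent_to_diagonal(logits) + xent_to_diagonal(logits.T))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
loss_matched = contrastive_loss(x, x)                      # perfectly aligned pairs
loss_random = contrastive_loss(x, rng.normal(size=(4, 8))) # unrelated pairs
```

With perfectly aligned embeddings the diagonal dominates each row and the loss is near zero, whereas unrelated pairs give a loss near log N, which is why minimizing this objective pulls matched image and text embeddings together.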