Few-Shot Video Classification via Temporal Alignment

Kaidi Cao, Jingwei Ji, Zhangjie Cao, C. Chang, Juan Carlos Niebles
2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Difficulty in collecting and annotating large-scale video data raises a growing interest in learning models which can recognize novel classes with only a few training examples. In this paper, we propose the Ordered Temporal Alignment Module (OTAM), a novel few-shot learning framework that can learn to classify a previously unseen video. While most previous work neglects long-term temporal ordering information, our proposed model explicitly leverages the temporal ordering information in video… 
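
As a rough illustration of what comparing two videos while respecting temporal order can look like, the sketch below uses classic dynamic time warping (DTW) over per-frame feature vectors. This is a generic technique swapped in for illustration, not the OTAM formulation, which the truncated abstract above does not spell out. The function names and the choice of cosine distance are assumptions.

```python
import math


def cosine_distance(u, v):
    """Cosine distance between two non-zero feature vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def dtw_distance(query, support):
    """Classic DTW cost between two sequences of frame features.

    Frames may be stretched or compressed in time, but the alignment path
    can only move forward, so the frame order is preserved.
    """
    n, m = len(query), len(support)
    inf = float("inf")
    # cost[i][j] = best alignment cost of query[:i] against support[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = cosine_distance(query[i - 1], support[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # advance in query only
                cost[i][j - 1],      # advance in support only
                cost[i - 1][j - 1],  # advance in both
            )
    return cost[n][m]


# Toy "videos": each frame is a 2-D feature vector (hypothetical data).
video_a = [[1.0, 0.0], [0.0, 1.0]]
video_a_slow = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # same action, slower
video_a_reversed = [[0.0, 1.0], [1.0, 0.0]]          # same frames, wrong order
```

With this toy data, the slowed-down copy aligns to the original at zero cost, while the reversed copy does not, even though it contains the same frames. An order-agnostic comparison such as averaging frame features would not separate the two.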

Learning Implicit Temporal Alignment for Few-shot Video Classification

This work introduces an implicit temporal alignment for a video pair, capable of estimating the similarity between them in an accurate and robust manner, and designs an effective context encoding module to incorporate spatial and feature channel context, resulting in better modeling of intra-class variations.

A Closer Look at Few-Shot Video Classification: A New Baseline and Benchmark

This paper proposes a simple classifier-based baseline without any temporal alignment that surprisingly outperforms the state-of-the-art meta-learning based methods and presents a new benchmark with more base data to facilitate future few-shot video classification without pre-training.

Generalized Few-Shot Video Classification With Video Retrieval and Feature Generation

This work argues that previous methods underestimate the importance of video feature learning and proposes to learn spatiotemporal features using a 3D CNN and a two-stage approach that learns video features on base classes followed by fine-tuning the classifiers on novel classes.

Label Independent Memory for Semi-Supervised Few-Shot Video Classification

  • Linchao Zhu, Yi Yang
  • Computer Science
  • IEEE Transactions on Pattern Analysis and Machine Intelligence
  • 2022
This work proposes a label independent memory (LIM) to cache label-related features, which enables a similarity search over a large set of videos and produces a class prototype for few-shot training that is more robust to noisy video features.

TNT: Text-Conditioned Network with Transductive Inference for Few-Shot Video Classification

This paper formulates a text-based task conditioner to adapt video features to the few-shot learning task and follows a transductive setting to improve the task-adaptation ability of the model by using the support textual descriptions and query instances to update a set of class prototypes.

TAEN: Temporal Aware Embedding Network for Few-Shot Action Recognition

This work proposes a Temporal Aware Embedding Network (TAEN) for few-shot action recognition that learns to represent actions in a metric space as a trajectory, conveying both short-term semantics and longer-term connectivity between sub-actions.

Less than Few: Self-Shot Video Instance Segmentation

This work proposes to automatically learn to find appropriate support videos given a query to bypass the need for labelled examples in few-shot video understanding at run time, and outlines a simple self-supervised learning method to generate an embedding space well-suited for unsupervised retrieval of relevant samples.

Temporal Alignment Prediction for Few-Shot Video Classification

Temporal Alignment Prediction (TAP) based on sequence similarity learning for few-shot video classification is proposed and its superiority over state-of-the-art methods is verified.

Few-Shot Learning for Video Object Detection in a Transfer-Learning Scheme

This paper defines the few-shot setting and creates a new benchmark dataset for few-shot video object detection derived from the widely used ImageNet VID dataset, and employs a transfer-learning framework to effectively train the video object detector on a large number of base-class objects and a few video clips of novel-class objects.

Few-Shot Video Object Detection

Extensive experiments demonstrate that the FSVOD method produces significantly better detection results on two few-shot video object detection datasets compared to image-based methods and other naive video-based extensions.



Metric-Based Few-Shot Learning for Video Action Recognition

This work addresses the task of few-shot video action recognition with a set of two-stream models, and finds that prototypical networks and pooled long short-term memory network embeddings give the best performance as the few-shot method and the video encoder, respectively.
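
For readers unfamiliar with prototypical networks, the classification step can be sketched as follows: each class is represented by the mean of its support embeddings, and a query is assigned to the nearest prototype. This is a minimal sketch of the general idea only, assuming the video embeddings have already been computed by some encoder. The function names and toy data are hypothetical and not taken from the paper.

```python
def class_prototypes(support_embeddings, support_labels):
    """Average the support embeddings of each class into one prototype."""
    sums, counts = {}, {}
    for emb, label in zip(support_embeddings, support_labels):
        if label not in sums:
            sums[label] = [0.0] * len(emb)
            counts[label] = 0
        sums[label] = [s + e for s, e in zip(sums[label], emb)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def squared_euclidean(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def classify(query_embedding, prototypes):
    """Return the label whose prototype is closest to the query."""
    return min(
        prototypes,
        key=lambda label: squared_euclidean(query_embedding, prototypes[label]),
    )


# Hypothetical 2-way 2-shot episode with 2-D video embeddings.
support = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [12.0, 10.0]]
labels = ["run", "run", "jump", "jump"]
prototypes = class_prototypes(support, labels)
prediction = classify([1.0, 1.0], prototypes)
```

In an actual few-shot pipeline the embeddings would come from a trained video encoder, and the distances would typically be turned into class probabilities with a softmax during training.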

A Closer Look at Few-shot Classification

The results reveal that reducing intra-class variation is an important factor when the feature backbone is shallow, but not as critical when using deeper backbones, and a baseline method with a standard fine-tuning practice compares favorably against other state-of-the-art few-shot learning algorithms.

TARN: Temporal Attentive Relation Network for Few-Shot and Zero-Shot Action Recognition

The proposed TARN uses attention mechanisms so as to perform temporal alignment, and learns a deep-distance measure on the aligned representations at video segment level to achieve competitive results in zero-shot action recognition.

Learning to Compare: Relation Network for Few-Shot Learning

A conceptually simple, flexible, and general framework for few-shot learning, where a classifier must learn to recognise new classes given only a few examples from each, which is easily extended to zero-shot learning.

Compound Memory Networks for Few-Shot Video Classification

A multi-saliency embedding algorithm which encodes a variable-length video sequence into a fixed-size matrix representation by discovering multiple saliencies of interest is introduced.

ECO: Efficient Convolutional Network for Online Video Understanding

A network architecture that takes long-term content into account while enabling fast per-video processing, achieving competitive performance across all datasets while being 10 to 80 times faster than state-of-the-art methods.

Temporal Segment Networks: Towards Good Practices for Deep Action Recognition

Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, the advantage over traditional methods is not so evident.

Large-Scale Video Classification with Convolutional Neural Networks

This work studies multiple approaches for extending the connectivity of a CNN in time domain to take advantage of local spatio-temporal information and suggests a multiresolution, foveated architecture as a promising way of speeding up the training.

Learning Temporal Action Proposals With Fewer Labels

This work proposes a semi-supervised learning algorithm specifically designed for training temporal action proposal networks and shows that this approach consistently matches or outperforms the fully supervised state-of-the-art approaches.