Corpus ID: 195346008

C3D: Generic Features for Video Analysis

Du Tran, Lubomir D. Bourdev, Rob Fergus, Lorenzo Torresani, Manohar Paluri
Videos have become ubiquitous due to the ease of capturing and sharing them via social platforms like YouTube, Facebook, and Instagram. The computer vision community has tried to tackle various video analysis problems independently. As a consequence, even though some very good hand-crafted features have been proposed, there is a lack of generic features for video analysis. The image domain, on the other hand, has progressed rapidly by using features from deep convolutional networks. These… 
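The generic video features in question come from 3D convolutions, which slide a kernel over time as well as space. A minimal sketch of a single-channel, valid-mode 3D convolution in plain numpy (the clip and kernel sizes here are illustrative, not the paper's actual network configuration):

```python
import numpy as np

def conv3d_single(clip, kernel):
    """Naive valid-mode 3D convolution of a single-channel clip
    (time, height, width) with a (kt, kh, kw) kernel. The kernel
    spans adjacent frames, so the response encodes motion as well
    as appearance."""
    T, H, W = clip.shape
    kt, kh, kw = kernel.shape
    out = np.zeros((T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(out.shape[0]):
        for y in range(out.shape[1]):
            for x in range(out.shape[2]):
                out[t, y, x] = np.sum(clip[t:t + kt, y:y + kh, x:x + kw] * kernel)
    return out

# A 16-frame clip with a 3x3x3 kernel; spatial size kept tiny for the demo.
clip = np.random.rand(16, 8, 8)
kernel = np.random.rand(3, 3, 3)
feat = conv3d_single(clip, kernel)
print(feat.shape)  # (14, 6, 6)
```

Stacking such layers (with pooling in between) yields a feature map whose activations can be read out as a generic video descriptor, analogous to how 2D ConvNet activations are used for images.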

Efficient Large Scale Video Classification

This work proposes two models, one for frame-level and one for video-level classification: the former is a highly efficient mixture of experts, while the latter is based on long short-term memory (LSTM) networks.
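A mixture of experts combines several simple classifiers through a learned gate. The sketch below (a generic illustration with made-up weights and dimensions, not the paper's actual model) shows the inference step: a softmax gate weights each expert's class distribution.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    e = np.exp(z - z.max())
    return e / e.sum()

def moe_predict(x, expert_weights, gate_weights):
    """Mixture-of-experts inference: each expert is a linear softmax
    classifier over the frame feature x; a softmax gate produces a
    convex combination of the experts' class distributions."""
    gates = softmax(gate_weights @ x)                                  # (n_experts,)
    expert_probs = np.array([softmax(W @ x) for W in expert_weights])  # (n_experts, n_classes)
    return gates @ expert_probs                                        # (n_classes,)

rng = np.random.default_rng(0)
x = rng.standard_normal(32)                 # hypothetical frame feature
experts = rng.standard_normal((4, 10, 32))  # 4 experts, 10 classes
gate = rng.standard_normal((4, 32))
p = moe_predict(x, experts, gate)
```

Because the gate output and each expert output are probability vectors, the mixture is itself a valid distribution over classes, and only a handful of linear maps are evaluated per frame, which is what makes this family of models cheap at scale.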

Handcrafted Local Features are Convolutional Neural Networks

This paper proposes a two-stream Convolutional ISA (ConvISA) that adopts the convolution-pooling structure of the state-of-the-art handcrafted video feature with greater modeling capacities and a cost-effective training algorithm.

VidSage: Unsupervised Video Representational Learning with Graph Convolutional Networks

This work proposes "VidSage", a system that transforms an input video into a generic representation in an unsupervised, self-supervised fashion; it obtains 54% and 28% classification accuracy on the Charades and Moments in Time datasets, outperforming previous unsupervised methods by 9% and 17% respectively, and is on par with recent meta-learning-based work by Google.

ECO: Efficient Convolutional Network for Online Video Understanding

A network architecture that takes long-term content into account while enabling fast per-video processing, achieving competitive performance across all datasets while being 10 to 80 times faster than state-of-the-art methods.

Attention Transfer from Web Images for Video Recognition

This work proposes a novel approach to transfer knowledge from image domain to video domain, and designs a novel Siamese EnergyNet structure to learn energy functions on the attention maps by jointly optimizing two loss functions, such that the attention map corresponding to a ground truth concept would have higher energy.

Compact CNN for indexing egocentric videos

A compact 3D Convolutional Neural Network architecture for long-term activity recognition in egocentric videos is proposed, along with a novel visualization of CNN kernels as flow fields to better understand what the network actually learns.

Dynamic scene classification using convolutional neural networks

This paper analyzes the performance of statistical aggregation techniques on various pre-trained convolutional neural network models to address the problem of dynamic scene classification, and shows that the proposed approach performs better than state-of-the-art works on the Maryland and YUPenn datasets.

Real-Time Action Recognition with Enhanced Motion Vector CNNs

This paper accelerates the deep two-stream architecture by replacing optical flow with motion vectors, which can be obtained directly from compressed videos without extra calculation, and introduces three strategies for this: initialization transfer, supervision transfer, and their combination.

(Deep) Learning from Frames

A novel classification method, named CoNNECT, that encapsulates multiple distinct ConvNets to perform genre classification; each ConvNet learns features that capture distinct aspects of the movie frames, and the method significantly outperforms state-of-the-art approaches on this task.

Order-aware Convolutional Pooling for Video Based Action Recognition

Two-Stream Convolutional Networks for Action Recognition in Videos

This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.
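The two-stream design runs one ConvNet on RGB frames (appearance) and another on stacked optical flow (motion), then fuses their class scores. A minimal late-fusion sketch with hypothetical logits (the fusion weights and class counts here are illustrative):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical per-stream class scores for one video (logits over 5 actions).
spatial_logits  = np.array([2.0, 0.5, 0.1, -1.0, 0.3])  # from RGB frames
temporal_logits = np.array([1.5, 0.2, 2.5, -0.5, 0.0])  # from stacked optical flow

# Late fusion: average the per-stream class probabilities.
fused = 0.5 * softmax(spatial_logits) + 0.5 * softmax(temporal_logits)
prediction = int(np.argmax(fused))
```

Averaging probabilities (rather than logits) keeps the fused scores a valid distribution; the original work also reports an SVM trained on the stacked softmax scores as an alternative fusion.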

Large-Scale Video Classification with Convolutional Neural Networks

This work studies multiple approaches for extending the connectivity of a CNN in time domain to take advantage of local spatio-temporal information and suggests a multiresolution, foveated architecture as a promising way of speeding up the training.

Dynamic Scene Classification: Learning Motion Descriptors with Slow Features Analysis

This paper introduces a learned local motion descriptor which represents the principal and more stable motion components of training videos and integrates the authors' local motion feature into a global coding/pooling architecture in order to provide an effective signature for each video sequence.

PANDA: Pose Aligned Networks for Deep Attribute Modeling

A new method which combines part-based models and deep learning by training pose-normalized CNNs for inferring human attributes from images of people under large variation of viewpoint, pose, appearance, articulation and occlusion is proposed.

DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition

DeCAF, an open-source implementation of deep convolutional activation features, along with all associated network parameters, is released to enable vision researchers to experiment with deep representations across a range of visual concept learning paradigms.

Learning Deep Features for Scene Recognition using Places Database

A new scene-centric database called Places with over 7 million labeled pictures of scenes is introduced with new methods to compare the density and diversity of image datasets and it is shown that Places is as dense as other scene datasets and has more diversity.

3D Convolutional Neural Networks for Human Action Recognition

A novel 3D CNN model for action recognition that extracts features from both the spatial and the temporal dimensions by performing 3D convolutions, thereby capturing the motion information encoded in multiple adjacent frames.

Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation

This paper proposes a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012 -- achieving a mAP of 53.3%.

Action recognition by dense trajectories

This work introduces a novel descriptor based on motion boundary histograms, which is robust to camera motion and consistently outperforms other state-of-the-art descriptors, in particular in uncontrolled realistic videos.
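The motion boundary histogram (MBH) descriptor bins the gradients of the optical-flow field rather than the flow itself, so locally constant camera motion largely cancels out. A hedged numpy sketch of this idea for one flow component (bin count and patch size are illustrative choices, not the paper's exact parameters):

```python
import numpy as np

def motion_boundary_histogram(flow_component, n_bins=8):
    """Histogram of oriented gradients of one optical-flow component,
    the core operation behind MBH. Differentiating the flow suppresses
    constant (camera-induced) motion and keeps motion boundaries."""
    gy, gx = np.gradient(flow_component)       # spatial derivatives of the flow
    mag = np.hypot(gx, gy)                     # gradient magnitude
    ang = np.mod(np.arctan2(gy, gx), 2 * np.pi)
    bins = (ang / (2 * np.pi) * n_bins).astype(int) % n_bins
    hist = np.bincount(bins.ravel(), weights=mag.ravel(), minlength=n_bins)
    s = hist.sum()
    return hist / s if s > 0 else hist         # L1-normalized histogram

flow_x = np.random.randn(16, 16)  # hypothetical horizontal flow over a patch
h = motion_boundary_histogram(flow_x)
```

In the full descriptor, such histograms are computed separately for the horizontal and vertical flow components over cells of a spatio-temporal volume along each trajectory, then concatenated.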