C3D: Generic Features for Video Analysis
@article{Tran2014C3DGF,
  title   = {C3D: Generic Features for Video Analysis},
  author  = {Du Tran and Lubomir D. Bourdev and Rob Fergus and Lorenzo Torresani and Manohar Paluri},
  journal = {ArXiv},
  year    = {2014},
  volume  = {abs/1412.0767}
}
Videos have become ubiquitous thanks to the ease of capturing and sharing them on social platforms such as YouTube, Facebook, and Instagram. The computer vision community has largely tackled individual video analysis problems in isolation; as a consequence, although several strong hand-crafted features have been proposed, generic features for video analysis are still lacking. The image domain, by contrast, has progressed rapidly by using features from deep convolutional networks. These…
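The abstract centers on learning generic spatiotemporal features with 3D convolutions, which slide a kernel over time as well as space. As a rough illustration of the core operation (not the paper's actual implementation), a single-channel, valid-mode 3D convolution can be sketched in plain Python:

```python
def conv3d(video, kernel):
    """Valid-mode 3D convolution (no padding, stride 1).

    video:  nested list [T][H][W] of floats (a single-channel clip)
    kernel: nested list [kT][kH][kW] of floats
    Returns the response volume of shape [T-kT+1][H-kH+1][W-kW+1].
    """
    T, H, W = len(video), len(video[0]), len(video[0][0])
    kT, kH, kW = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kT + 1):
        frame = []
        for y in range(H - kH + 1):
            row = []
            for x in range(W - kW + 1):
                s = 0.0
                for dt in range(kT):
                    for dy in range(kH):
                        for dx in range(kW):
                            s += video[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(s)
            frame.append(row)
        out.append(frame)
    return out

# A 2x2x2 clip of ones convolved with an all-ones 2x2x2 kernel
# yields a single response equal to the sum of 8 ones.
clip = [[[1.0, 1.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]]
k = [[[1.0, 1.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]]
response = conv3d(clip, k)  # [[[8.0]]]
```

Unlike a 2D convolution applied frame by frame, the kernel here also spans adjacent frames, so the response encodes motion as well as appearance.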
373 Citations
Efficient Large Scale Video Classification
- Computer Science · ArXiv
- 2015
This work proposes two models for frame-level and video-level classification: the former is a highly efficient mixture of experts, while the latter is based on long short-term memory networks.
Handcrafted Local Features are Convolutional Neural Networks
- Computer Science · ArXiv
- 2015
This paper proposes a two-stream Convolutional ISA (ConvISA) that adopts the convolution-pooling structure of the state-of-the-art handcrafted video feature with greater modeling capacities and a cost-effective training algorithm.
VidSage: Unsupervised Video Representational Learning with Graph Convolutional Networks
- Computer Science
- 2019
This work proposes "VidSage", a system that transforms an input video into a generic representation in an unsupervised, self-supervised fashion. It obtains 54% and 28% classification accuracy on the Charades and Moments in Time datasets, outperforming previous unsupervised methods by 9% and 17% respectively, and is on par with recent meta-learning-based work by Google.
ECO: Efficient Convolutional Network for Online Video Understanding
- Computer Science · ECCV
- 2018
A network architecture that takes long-term content into account while enabling fast per-video processing; it achieves competitive performance across all datasets while being 10 to 80 times faster than state-of-the-art methods.
Attention Transfer from Web Images for Video Recognition
- Computer Science · ACM Multimedia
- 2017
This work proposes a novel approach to transfer knowledge from the image domain to the video domain, and designs a novel Siamese EnergyNet structure that learns energy functions on attention maps by jointly optimizing two loss functions, such that the attention map corresponding to a ground-truth concept has higher energy.
Compact CNN for indexing egocentric videos
- Computer Science · 2016 IEEE Winter Conference on Applications of Computer Vision (WACV)
- 2016
This work proposes a compact 3D Convolutional Neural Network architecture for long-term activity recognition in egocentric videos, along with a novel visualization of CNN kernels as flow fields to better understand what the network actually learns.
Dynamic scene classification using convolutional neural networks
- Computer Science · 2016 IEEE Global Conference on Signal and Information Processing (GlobalSIP)
- 2016
This paper analyzes the performance of statistical aggregation techniques on various pre-trained convolutional neural network models to address dynamic scene classification, and shows that the proposed approach outperforms state-of-the-art works on the Maryland and YUPenn datasets.
Real-Time Action Recognition with Enhanced Motion Vector CNNs
- Computer Science · 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2016
This paper accelerates the deep two-stream architecture by replacing optical flow with motion vectors, which can be obtained directly from compressed videos without extra computation, and introduces three strategies for this: initialization transfer, supervision transfer, and their combination.
(Deep) Learning from Frames
- Computer Science · 2016 5th Brazilian Conference on Intelligent Systems (BRACIS)
- 2016
A novel classification method, CoNNECT, that encapsulates multiple distinct ConvNets to perform movie-genre classification; each ConvNet learns features that capture distinct aspects of the movie frames, and the method significantly outperforms state-of-the-art approaches on this task.
Order-aware Convolutional Pooling for Video Based Action Recognition
- Computer Science · Pattern Recognition
- 2019
References
Showing 1-10 of 56 references
Two-Stream Convolutional Networks for Action Recognition in Videos
- Computer Science · NIPS
- 2014
This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.
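The two-stream design above combines a spatial (RGB) stream and a temporal (optical-flow) stream. One common, simple way to combine such streams is late fusion: averaging the per-class softmax scores of the two networks. The sketch below is a hypothetical illustration of that idea, not the paper's exact fusion scheme:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

def late_fusion(spatial_logits, temporal_logits, w=0.5):
    """Weighted average of the two streams' class probabilities.

    w weights the spatial stream; (1 - w) weights the temporal stream.
    """
    s = softmax(spatial_logits)
    t = softmax(temporal_logits)
    return [w * a + (1 - w) * b for a, b in zip(s, t)]

# Hypothetical logits: the spatial stream prefers class 0, the temporal
# stream prefers class 1; with equal weights, the more confident stream wins.
fused = late_fusion([2.0, 0.5, 0.1], [0.3, 1.0, 0.2], w=0.5)
pred = max(range(len(fused)), key=fused.__getitem__)
```

Because each stream's probabilities sum to 1, the fused scores also sum to 1 for any weight `w` in [0, 1].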
Large-Scale Video Classification with Convolutional Neural Networks
- Computer Science · 2014 IEEE Conference on Computer Vision and Pattern Recognition
- 2014
This work studies multiple approaches for extending the connectivity of a CNN in the time domain to take advantage of local spatio-temporal information, and suggests a multiresolution, foveated architecture as a promising way of speeding up training.
Bag of visual words and fusion methods for action recognition: Comprehensive study and good practice
- Computer Science · Computer Vision and Image Understanding
- 2016
Dynamic Scene Classification: Learning Motion Descriptors with Slow Features Analysis
- Computer Science · 2013 IEEE Conference on Computer Vision and Pattern Recognition
- 2013
This paper introduces a learned local motion descriptor which represents the principal and more stable motion components of training videos and integrates the authors' local motion feature into a global coding/pooling architecture in order to provide an effective signature for each video sequence.
PANDA: Pose Aligned Networks for Deep Attribute Modeling
- Computer Science · 2014 IEEE Conference on Computer Vision and Pattern Recognition
- 2014
This work proposes a new method that combines part-based models and deep learning by training pose-normalized CNNs to infer human attributes from images of people under large variations in viewpoint, pose, appearance, articulation, and occlusion.
DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition
- Computer Science · ICML
- 2014
DeCAF, an open-source implementation of deep convolutional activation features, is released along with all associated network parameters, enabling vision researchers to experiment with deep representations across a range of visual concept learning paradigms.
Learning Deep Features for Scene Recognition using Places Database
- Computer Science · NIPS
- 2014
A new scene-centric database called Places, with over 7 million labeled pictures of scenes, is introduced along with new methods for comparing the density and diversity of image datasets; the authors show that Places is as dense as other scene datasets and has greater diversity.
3D Convolutional Neural Networks for Human Action Recognition
- Computer Science · IEEE Transactions on Pattern Analysis and Machine Intelligence
- 2013
A novel 3D CNN model for action recognition that extracts features from both the spatial and the temporal dimensions by performing 3D convolutions, thereby capturing the motion information encoded in multiple adjacent frames.
Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation
- Computer Science · 2014 IEEE Conference on Computer Vision and Pattern Recognition
- 2014
This paper proposes a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012 -- achieving a mAP of 53.3%.
Action recognition by dense trajectories
- Computer Science · CVPR 2011
- 2011
This work introduces a novel descriptor based on motion boundary histograms, which is robust to camera motion and consistently outperforms other state-of-the-art descriptors, in particular in uncontrolled realistic videos.