We propose a simple, yet effective approach for spatiotemporal feature learning using deep 3-dimensional convolutional networks (3D ConvNets) trained on a large-scale supervised video dataset. Our findings are threefold: 1) 3D ConvNets are more suitable for spatiotemporal feature learning compared to 2D ConvNets; 2) A homogeneous architecture with small 3 …
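To make the contrast with 2D ConvNets concrete, the following is a minimal NumPy sketch (not the authors' implementation; the helper `conv3d` and the motion-difference filter are illustrative assumptions) of a single 3D convolution over a video volume. A small spatiotemporal kernel, such as the 3×3×3 filters this line of work favors, can respond to change across frames, which a per-frame 2D convolution cannot see.

```python
import numpy as np

def conv3d(video, kernel):
    """Valid (no-padding) 3D convolution of a single-channel video volume.

    video:  (T, H, W) array of frames x height x width
    kernel: (kt, kh, kw) array, e.g. a small 3x3x3 spatiotemporal filter
    """
    T, H, W = video.shape
    kt, kh, kw = kernel.shape
    out = np.zeros((T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(out.shape[0]):
        for y in range(out.shape[1]):
            for x in range(out.shape[2]):
                out[t, y, x] = np.sum(video[t:t+kt, y:y+kh, x:x+kw] * kernel)
    return out

# A temporal-difference filter: subtracts the mean of a 3x3 patch in the
# first frame of the window from the mean in the last frame, so it fires
# only where the video changes over time.
video = np.zeros((8, 16, 16))
video[4:, :, :] = 1.0          # scene "switches on" at frame 4
kernel = np.zeros((3, 3, 3))
kernel[0] = -1.0 / 9
kernel[2] = 1.0 / 9
response = conv3d(video, kernel)
# response peaks at the temporal windows spanning the change (t = 2, 3)
# and is zero in static regions.
```

A 2D convolution applied frame by frame would output zero everywhere here except at spatial edges, since every individual frame is uniform; only the third (temporal) kernel dimension exposes the motion.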
Structured output learning has been successfully applied to object localization, where the mapping between an image and an object bounding box can be well captured. Its extension to action localization in videos, however, is much more challenging, because we need to predict the locations of the action patterns both spatially and temporally, i.e., …
Although sliding window-based approaches have been quite successful in detecting objects in images, it is not a trivial problem to extend them to detecting events in videos. We propose to search for spatiotemporal paths for video event detection. This new formulation can accurately detect and locate video events in cluttered and crowded scenes, and is …
  • Du Tran, Junsong Yuan, David Forsyth, +3 others
  • 2016
Over the last few years deep learning methods have emerged as one of the most prominent approaches for video analysis. However, so far their most successful applications have been in the area of video classification and detection, i.e., problems involving the prediction of a single class label or a handful of output variables per video. Furthermore, while …
There is widespread agreement that future technology for organizing, browsing and searching videos hinges on the development of methods for high-level semantic understanding of video. But, so far, the community has not reached a consensus on the best way to train and assess models for this task. Casting video understanding as a form of action or event …
In this paper we present EXMOVES—learned exemplar-based features for efficient recognition and analysis of actions in videos. The entries in our descriptor are produced by evaluating a set of movement classifiers over spatiotemporal volumes of the input video sequences. Each movement classifier is a simple exemplar-SVM trained on low-level features, i.e., …
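The descriptor construction described above can be sketched in a few lines of NumPy. This is a hypothetical illustration, not the EXMOVES code: the function name `exmoves_descriptor`, the max-pooling choice, and the random features are all assumptions; the point is only that each descriptor entry is one linear exemplar classifier's response aggregated over the video's spatiotemporal volumes.

```python
import numpy as np

def exmoves_descriptor(volume_features, exemplars):
    """Sketch of an exemplar-based video descriptor.

    volume_features: (n_volumes, d) low-level features, one row per
                     spatiotemporal volume of the input video
    exemplars:       list of (w, b) parameters of linear exemplar-SVMs
    returns:         (n_exemplars,) descriptor, where entry k is the
                     max response of classifier k over all volumes
    """
    scores = np.stack([volume_features @ w + b for w, b in exemplars])
    return scores.max(axis=1)

rng = np.random.default_rng(0)
feats = rng.normal(size=(10, 5))                       # 10 volumes, 5-D features
exemplars = [(rng.normal(size=5), 0.0) for _ in range(3)]  # 3 movement classifiers
desc = exmoves_descriptor(feats, exemplars)            # 3-entry descriptor
```

Because each classifier is linear, the whole bank reduces to one matrix product followed by a pooling step, which is what makes this kind of mid-level descriptor cheap to evaluate.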