• Publications
  • Influence
Learning Spatiotemporal Features with 3D Convolutional Networks
We propose a simple, yet effective approach for spatiotemporal feature learning using deep 3-dimensional convolutional networks (3D ConvNets) trained on a large scale supervised video dataset. OurExpand
  • 3,438
  • 797
  • PDF
A Closer Look at Spatiotemporal Convolutions for Action Recognition
In this paper we discuss several forms of spatiotemporal convolutions for video analysis and study their effects on action recognition. Our motivation stems from the observation that 2D CNNs appliedExpand
  • 527
  • 98
  • PDF
C3D: Generic Features for Video Analysis
Videos have become ubiquitous due to the ease of capturing and sharing via social platforms like Youtube, Facebook, Instagram, and others. The computer vision community has tried to tackle variousExpand
  • 280
  • 53
  • PDF
Human Activity Recognition with Metric Learning
This paper proposes a metric learning based approach for human activity recognition with two main objectives: (1) reject unfamiliar activities and (2) learn with few examples. We show that ourExpand
  • 349
  • 31
  • PDF
ConvNet Architecture Search for Spatiotemporal Feature Learning
Learning image representations with ConvNets by pre-training on ImageNet has proven useful across many visual understanding tasks including object detection, semantic segmentation, and imageExpand
  • 174
  • 31
  • PDF
Detect-and-Track: Efficient Pose Estimation in Videos
This paper addresses the problem of estimating and tracking human body keypoints in complex, multi-person video. We propose an extremely lightweight yet highly effective approach that builds upon theExpand
  • 108
  • 19
  • PDF
Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization
There is a natural correlation between the visual and auditive elements of a video. In this work we leverage this connection to learn general and effective models for both audio and video analysisExpand
  • 74
  • 12
  • PDF
Max-Margin Structured Output Regression for Spatio-Temporal Action Localization
Structured output learning has been successfully applied to object localization, where the mapping between an image and an object bounding box can be well captured. Its extension to actionExpand
  • 69
  • 12
  • PDF
Large-Scale Weakly-Supervised Pre-Training for Video Action Recognition
Current fully-supervised video datasets consist of only a few hundred thousand videos and fewer than a thousand domain-specific labels. This hinders the progress towards advanced video architectures.Expand
  • 58
  • 9
  • PDF
Optimal spatio-temporal path discovery for video event detection
We propose a novel algorithm for video event detection and localization as the optimal path discovery problem in spatio-temporal video space. By finding the optimal spatio-temporal path, our methodExpand
  • 57
  • 9
  • PDF