Learning Spatiotemporal Features with 3D Convolutional Networks
The learned features, termed C3D (Convolutional 3D), paired with a simple linear classifier outperform state-of-the-art methods on four benchmarks and match the current best methods on the remaining two.
A Closer Look at Spatiotemporal Convolutions for Action Recognition
A new spatiotemporal convolutional block, "R(2+1)D", is designed that produces CNNs achieving results comparable or superior to the state of the art on Sports-1M, Kinetics, UCF101, and HMDB51.
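The (2+1)D block factorizes a full t×d×d 3D convolution into a 2D spatial convolution followed by a 1D temporal convolution, with the intermediate channel count chosen so the factored block uses roughly the same number of parameters as the full 3D convolution. A minimal sketch of that parameter-matching rule (function names are illustrative, not from the paper's code):

```python
def r2plus1d_mid_channels(n_in, n_out, t=3, d=3):
    """Intermediate channel count M that makes a (1 x d x d) spatial conv
    followed by a (t x 1 x 1) temporal conv cost about the same number of
    parameters as one full t x d x d 3D convolution."""
    return (t * d * d * n_in * n_out) // (d * d * n_in + t * n_out)

def param_counts(n_in, n_out, t=3, d=3):
    """Compare parameter counts of the full 3D conv vs. its (2+1)D factorization."""
    full_3d = t * d * d * n_in * n_out               # one t x d x d kernel per in/out pair
    m = r2plus1d_mid_channels(n_in, n_out, t, d)
    factored = d * d * n_in * m + t * m * n_out      # spatial conv + temporal conv
    return full_3d, factored
```

With equal input and output widths (e.g. 64 channels, t=d=3) the two costs match exactly; the factorization adds an extra nonlinearity between the spatial and temporal convolutions at no extra parameter budget.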
C3D: Generic Features for Video Analysis
Convolutional 3D (C3D) features are proposed: generic spatio-temporal features obtained by training a deep 3D convolutional network on a large annotated video dataset spanning objects, scenes, actions, and other frequently occurring concepts. The features encapsulate both appearance and motion cues and perform well across different video classification tasks.
Video Classification With Channel-Separated Convolutional Networks
It is empirically demonstrated that the amount of channel interaction plays an important role in the accuracy of 3D group convolutional networks, leading to an architecture, the Channel-Separated Convolutional Network (CSN), that is simple and efficient, yet accurate.
Human Activity Recognition with Metric Learning
This paper proposes a metric learning based approach to human activity recognition with two main objectives: (1) reject unfamiliar activities and (2) learn with few examples.
ConvNet Architecture Search for Spatiotemporal Feature Learning
This paper presents an empirical ConvNet architecture search for spatiotemporal feature learning, culminating in a deep 3D Residual ConvNet that outperforms C3D by a good margin on Sports-1M, UCF101, HMDB51, THUMOS14, and ASLAN while being two times faster at inference, two times smaller in model size, and more compact in its representation.
Detect-and-Track: Efficient Pose Estimation in Videos
This paper addresses the problem of estimating and tracking human body keypoints in complex, multi-person video, proposing an extremely lightweight yet highly effective approach.
Self-Supervised Learning by Cross-Modal Audio-Video Clustering
Cross-Modal Deep Clustering (XDC) is proposed: a novel self-supervised method that leverages unsupervised clustering in one modality as a supervisory signal for the other. XDC is the first self-supervised learning method to outperform large-scale fully supervised pretraining for action recognition on the same architecture.
Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization
It is demonstrated that a calibrated curriculum learning scheme, a careful choice of negative examples, and the use of a contrastive loss are critical ingredients to obtain powerful multi-sensory representations from models optimized to discern temporal synchronization of audio-video pairs.
SCSampler: Sampling Salient Clips From Video for Efficient Action Recognition
It is demonstrated that the computational cost of action recognition on untrimmed videos can be dramatically reduced by invoking recognition only on a small number of salient clips selected by a lightweight sampler, which also yields significant gains in recognition accuracy compared to analyzing all clips or randomly selected clips.
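The sampler itself is a learned lightweight model and is not reproduced here; the sketch below only illustrates the aggregation step implied by the summary: rank clips by a cheap saliency score, run the expensive recognizer on the top few, and average their class scores. All names and the toy inputs are illustrative assumptions.

```python
def recognize_with_salient_clips(clip_scores, clip_saliency, k=3):
    """Average per-class scores over only the k most salient clips.

    clip_scores   : list of per-clip class-score lists (from the heavy recognizer)
    clip_saliency : one cheap saliency score per clip (assumed precomputed
                    by a lightweight sampler, as in SCSampler)
    """
    # Indices of the k clips with the highest saliency.
    top = sorted(range(len(clip_saliency)),
                 key=lambda i: clip_saliency[i], reverse=True)[:k]
    n_classes = len(clip_scores[0])
    # Mean class score over the selected clips only.
    return [sum(clip_scores[i][c] for i in top) / len(top)
            for c in range(n_classes)]
```

In a real pipeline the heavy recognizer would be invoked only on the selected clips rather than on all of them, which is where the cost reduction comes from.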