Learning Spatiotemporal Features with 3D Convolutional Networks
The learned features, namely C3D (Convolutional 3D), with a simple linear classifier outperform state-of-the-art methods on 4 different benchmarks and are comparable with current best methods on the other 2 benchmarks.
A Closer Look at Spatiotemporal Convolutions for Action Recognition
A new spatiotemporal convolutional block "R(2+1)D" is designed which produces CNNs that achieve results comparable or superior to the state-of-the-art on Sports-1M, Kinetics, UCF101, and HMDB51.
Exploring the Limits of Weakly Supervised Pretraining
This paper presents a unique study of transfer learning with large convolutional networks trained to predict hashtags on billions of social media images. It shows improvements on several image classification and object detection tasks and reports the highest ImageNet-1k single-crop, top-1 accuracy to date.
PANDA: Pose Aligned Networks for Deep Attribute Modeling
A new method is proposed that combines part-based models and deep learning by training pose-normalized CNNs to infer human attributes from images of people under large variations in viewpoint, pose, appearance, articulation, and occlusion.
C3D: Generic Features for Video Analysis
The Convolution 3D feature is proposed: a generic spatio-temporal feature obtained by training a deep 3-dimensional convolutional network on a large annotated video dataset comprising objects, scenes, actions, and other frequently occurring concepts. The feature encapsulates appearance and motion cues and performs well on different video classification tasks.
Beyond frontal faces: Improving Person Recognition using multiple cues
The Pose Invariant PErson Recognition (PIPER) method is proposed, which accumulates the cues of poselet-level person recognizers trained by deep convolutional networks to discount pose variations, and combines them with a face recognizer and a global recognizer.
Training Convolutional Networks with Noisy Labels
An extra noise layer is introduced into the network that adapts the network outputs to match the noisy label distribution. The parameters of this layer can be estimated as part of the training process, and the approach involves simple modifications to current training infrastructures for deep networks.
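To make the idea of a noise layer concrete, the following is a minimal sketch rather than the paper's implementation. It assumes the layer is a row-stochastic matrix whose entry `[i][j]` is the probability that true class `i` is observed as noisy label `j`. The names `apply_noise_layer` and `noise_matrix` are hypothetical, chosen for illustration. In training, the matrix would be learned jointly with the network; here it is fixed by hand.

```python
def apply_noise_layer(probs, noise_matrix):
    """Map clean-class probabilities to noisy-label probabilities.

    probs: list of class probabilities from the base network (sums to 1).
    noise_matrix: assumed row-stochastic; noise_matrix[i][j] is the
        probability that true class i is labeled as class j.
    """
    num_true = len(noise_matrix)
    num_noisy = len(noise_matrix[0])
    # Vector-matrix product: marginalize over the unknown true class.
    return [
        sum(probs[i] * noise_matrix[i][j] for i in range(num_true))
        for j in range(num_noisy)
    ]


# Hand-picked example values (not from the paper).
clean_probs = [0.9, 0.1]
noise = [
    [0.8, 0.2],  # true class 0 is mislabeled as class 1 20% of the time
    [0.3, 0.7],  # true class 1 is mislabeled as class 0 30% of the time
]
noisy_probs = apply_noise_layer(clean_probs, noise)
```

The training loss would then compare `noisy_probs` against the observed noisy label, while the clean predictions in `clean_probs` are what one would use at test time.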
ConvNet Architecture Search for Spatiotemporal Feature Learning
This paper presents an empirical ConvNet architecture search for spatiotemporal feature learning, culminating in a deep 3D Residual ConvNet that outperforms C3D by a good margin on Sports-1M, UCF101, HMDB51, THUMOS14, and ASLAN while being 2 times faster at inference time, 2 times smaller in model size, and having a more compact representation.
Detect-and-Track: Efficient Pose Estimation in Videos
This paper addresses the problem of estimating and tracking human body keypoints in complex, multi-person video. We propose an extremely lightweight yet highly effective approach that builds upon the
Metric Learning with Adaptive Density Discrimination
This work proposes a novel approach explicitly designed to address a number of subtle yet important issues that have stymied earlier distance metric learning (DML) algorithms. It maintains an explicit model of the distributions of the different classes in representation space, employs this knowledge to adaptively assess similarity, and achieves local discrimination by penalizing class distribution overlap.