Action Recognition with Improved Trajectories
  • Heng Wang, C. Schmid
  • Mathematics, Computer Science
    IEEE International Conference on Computer Vision
  • 1 December 2013
Dense trajectories, which were shown to be an efficient video representation for action recognition and achieved state-of-the-art results on a variety of datasets, are improved by taking camera motion into account to correct them.
Action recognition by dense trajectories
This work introduces a novel descriptor based on motion boundary histograms, which is robust to camera motion and consistently outperforms other state-of-the-art descriptors, in particular in uncontrolled realistic videos.
Dense Trajectories and Motion Boundary Descriptors for Action Recognition
The MBH descriptor is shown to consistently outperform other state-of-the-art descriptors, in particular on real-world videos that contain a significant amount of camera motion.
Evaluation of Local Spatio-temporal Features for Action Recognition
It is demonstrated that regular sampling of space-time features consistently outperforms all tested space-time interest point detectors for human actions in realistic settings, and that the ranking of the majority of methods is consistent across different datasets.
A Closer Look at Spatiotemporal Convolutions for Action Recognition
A new spatiotemporal convolutional block, "R(2+1)D", is designed, producing CNNs that achieve results comparable or superior to the state of the art on Sports-1M, Kinetics, UCF101, and HMDB51.
Is Space-Time Attention All You Need for Video Understanding?
This paper presents a convolution-free approach to video classification built exclusively on self-attention over space and time, and suggests that “divided attention,” where temporal attention and spatial attention are separately applied within each block, leads to the best video classification accuracy among the design choices considered.
Video Classification With Channel-Separated Convolutional Networks
It is empirically demonstrated that the amount of channel interactions plays an important role in the accuracy of 3D group convolutional networks, and this leads to an architecture -- Channel-Separated Convolutional Network (CSN) -- which is simple, efficient, yet accurate.
A Robust and Efficient Video Representation for Action Recognition
It is found that the improved trajectory features significantly outperform previous dense trajectories, and that Fisher vectors are superior to BOW encodings for video recognition tasks.
Large-Scale Weakly-Supervised Pre-Training for Video Action Recognition
The primary empirical finding is that pre-training at a very large scale (over 65 million videos), even on noisy social-media videos and hashtags, substantially improves the state of the art on three challenging public action recognition datasets.
Video Modeling With Correlation Networks
This paper proposes an alternative approach based on a learnable correlation operator that can be used to establish frame-to-frame matches over convolutional feature maps in the different layers of the network.