Motion Feature Network: Fixed Motion Filter for Action Recognition

Myunggi Lee, Seungeui Lee, Sung Joon Son, Gyutae Park, Nojun Kwak
Spatio-temporal representations in frame sequences play an important role in the task of action recognition. Previously, using optical flow as temporal information in combination with a set of RGB images that contain spatial information has shown great performance enhancement in action recognition tasks. However, it has an expensive computational cost and requires a two-stream (RGB and optical flow) framework. In this paper, we propose MFNet (Motion Feature Network) containing…
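The "fixed motion filter" in the title refers to non-learned filters that approximate motion by differencing the feature maps of adjacent frames under several spatial shifts. A minimal sketch of that idea in NumPy (the function name and the particular shift set are illustrative, not the paper's exact design):

```python
import numpy as np

def fixed_motion_filter(feat_t, feat_t1,
                        shifts=((0, 0), (0, 1), (1, 0), (0, -1), (-1, 0))):
    """Sketch of a fixed (non-learned) motion filter: for each spatial
    shift, subtract the current-frame feature map from the shifted
    next-frame map, approximating directional motion between frames.

    feat_t, feat_t1: (C, H, W) feature maps of two consecutive frames.
    Returns an array of shape (len(shifts), C, H, W)."""
    out = []
    for dy, dx in shifts:
        # circularly shift the next frame's features by (dy, dx)
        shifted = np.roll(np.roll(feat_t1, dy, axis=1), dx, axis=2)
        out.append(shifted - feat_t)
    return np.stack(out)
```

Because the filters are fixed, this adds almost no parameters; the network only learns how to combine the resulting difference maps with the appearance features.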
A Spatio-Temporal Motion Network for Action Recognition Based on Spatial Attention
A generic and effective module called spatio-temporal motion network (SMNet), which maintains the complexity of 2D and reduces the computational effort of the algorithm while achieving performance comparable to 3D CNNs.
STM: SpatioTemporal and Motion Encoding for Action Recognition
This work proposes an STM block, which contains a Channel-wise SpatioTemporal Module (CSTM) to represent spatiotemporal features and a Channel-wise Motion Module (CMM) to efficiently encode motion features, and replaces the original residual blocks in the ResNet architecture with STM blocks to form a simple yet effective STM network.
Gate-Shift Networks for Video Action Recognition
An extensive evaluation of the proposed Gate-Shift Module is performed to study its effectiveness in video action recognition, achieving state-of-the-art results on the Something-Something-V1 and Diving48 datasets, and obtaining competitive results on EPIC-Kitchens with far less model complexity.
MARS: Motion-Augmented RGB Stream for Action Recognition
This paper introduces two learning approaches to train a standard 3D CNN, operating on RGB frames, that mimics the motion stream, and as a result avoids flow computation at test time, and denotes the stream trained using this combined loss as Motion-Augmented RGB Stream (MARS).
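The MARS objective combines a standard classification loss with a term that makes the RGB stream's features mimic those of a pre-trained flow stream. A hedged sketch of such a combined loss (the weighting `alpha` and the function names are assumptions, not the paper's exact formulation):

```python
import numpy as np

def mars_style_loss(rgb_feat, flow_feat, logits, label, alpha=0.5):
    """Sketch of a MARS-style combined objective: cross-entropy on the
    RGB stream's action logits, plus an L2 term pulling the RGB features
    toward the (frozen) flow stream's features, so motion can be
    mimicked without computing optical flow at test time.
    `alpha` (assumed here) balances the two terms."""
    # numerically stable softmax cross-entropy on the RGB logits
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    ce = -np.log(probs[label] + 1e-12)           # classification loss
    mimic = np.mean((rgb_feat - flow_feat) ** 2)  # feature-mimicking loss
    return ce + alpha * mimic
```

At test time only the RGB stream runs, which is where the flow-computation savings come from.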
Video Modeling With Correlation Networks
This paper proposes an alternative approach based on a learnable correlation operator that can be used to establish frame-to-frame matches over convolutional feature maps in the different layers of the network.
D3D: Distilled 3D Networks for Video Action Recognition
This work investigates whether motion representations are indeed missing in the spatial stream, and shows that there is significant room for improvement, and demonstrates that these motion representations can be improved using distillation, that is, by tuning the spatial streams to mimic the temporal stream, effectively combining both models into a single stream.
PAN: Towards Fast Action Recognition via Learning Persistence of Appearance
This paper designs a novel motion cue called Persistence of Appearance (PA), which is over 1000x faster than conventional optical flow in terms of motion modeling speed, and devise a global temporal fusion strategy called Various-timescale Aggregation Pooling (VAP) that can adaptively model long-range temporal relationships across various timescales.
MotionSqueeze: Neural Motion Feature Learning for Video Understanding
This work proposes a trainable neural module, dubbed MotionSqueeze, for effective motion feature extraction, and demonstrates that the proposed method provides a significant gain on four standard benchmarks for action recognition with only a small amount of additional cost, outperforming the state of the art on Something-Something-V1&V2 datasets.
ACTION-Net: Multipath Excitation for Action Recognition
This work designs a generic and effective module that can be embedded into 2D CNNs to form a simple yet effective ACTION-Net with very limited extra computational cost.
Learning Self-Similarity in Space and Time as Generalized Motion for Video Action Recognition
A rich and robust motion representation based on spatio-temporal self-similarity (STSS), which effectively captures long-term interaction and fast motion in the video, leading to robust action recognition.
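The core of STSS is a self-similarity volume: each position is described not by its feature vector directly but by its similarities to neighbouring positions. A simplified spatial-only sketch (the full method also spans the temporal axis; the function name and zero-padding at borders are assumptions):

```python
import numpy as np

def local_self_similarity(feats, radius=1):
    """Sketch of the spatial slice of a self-similarity (STSS-like)
    descriptor: for every position, cosine similarity between its
    feature vector and those of neighbours within `radius`.

    feats: (H, W, C) feature map.
    Returns an (H, W, (2*radius+1)**2) similarity volume,
    zero-padded at the borders."""
    H, W, C = feats.shape
    norm = feats / (np.linalg.norm(feats, axis=-1, keepdims=True) + 1e-8)
    k = 2 * radius + 1
    out = np.zeros((H, W, k * k))
    padded = np.zeros((H + 2 * radius, W + 2 * radius, C))
    padded[radius:radius + H, radius:radius + W] = norm
    idx = 0
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            nb = padded[radius + dy:radius + dy + H,
                        radius + dx:radius + dx + W]
            out[:, :, idx] = (norm * nb).sum(-1)  # cosine similarity
            idx += 1
    return out
```

Because the descriptor depends only on relative similarities, it is invariant to appearance changes that affect all neighbours alike, which is what makes it a robust motion cue.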


Optical Flow Guided Feature: A Fast and Robust Motion Representation for Video Action Recognition
A novel compact motion representation for video action recognition, named Optical Flow guided Feature (OFF), which enables the network to distill temporal information through a fast and robust approach and is complementary to other motion modalities such as optical flow.
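OFF comes from applying the brightness-constancy reasoning behind optical flow at the feature level: the useful motion terms reduce to the spatial gradients of the current feature map plus the temporal difference to the next one. A minimal single-channel sketch of that decomposition (the function name is an assumption; the paper computes the gradients with fixed convolutions inside the network):

```python
import numpy as np

def optical_flow_guided_feature(f_t, f_t1):
    """Sketch of the OFF idea: brightness constancy at the feature level
    yields three cheap terms per channel -- the spatial gradients
    dF/dx, dF/dy of the current feature map and the temporal
    difference F(t+1) - F(t).

    f_t, f_t1: (H, W) feature maps of consecutive frames.
    Returns (3, H, W): (fx, fy, ft)."""
    fx = np.gradient(f_t, axis=1)  # horizontal spatial gradient
    fy = np.gradient(f_t, axis=0)  # vertical spatial gradient
    ft = f_t1 - f_t                # temporal gradient
    return np.stack([fx, fy, ft])
```

All three terms are simple differences over feature maps the network already computes, which is why OFF is so much cheaper than estimating optical flow itself.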
Hidden Two-Stream Convolutional Networks for Action Recognition
This paper presents a novel CNN architecture that implicitly captures motion information between adjacent frames and directly predicts action classes without explicitly computing optical flow, and significantly outperforms the previous best real-time approaches.
Real-Time Action Recognition with Enhanced Motion Vector CNNs
This paper accelerates the deep two-stream architecture by replacing optical flow with motion vector which can be obtained directly from compressed videos without extra calculation, and introduces three strategies for this, initialization transfer, supervision transfer and their combination.
Temporal Segment Networks: Towards Good Practices for Deep Action Recognition
Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, the advantage over traditional methods is not so evident.
Convolutional Two-Stream Network Fusion for Video Action Recognition
A new ConvNet architecture for spatiotemporal fusion of video snippets is proposed, and its performance on standard benchmarks where this architecture achieves state-of-the-art results is evaluated.
Spatiotemporal Multiplier Networks for Video Action Recognition
A general ConvNet architecture for video action recognition based on multiplicative interactions of spacetime features that combines the appearance and motion pathways of a two-stream architecture by motion gating and is trained end-to-end.
ActionFlowNet: Learning Motion Representation for Action Recognition
This work proposes a multitask learning model ActionFlowNet to train a single stream network directly from raw pixels to jointly estimate optical flow while recognizing actions with convolutional neural networks, capturing both appearance and motion in a single model.
Appearance-and-Relation Networks for Video Classification
  • Limin Wang, Wei Li, Wen Li, Luc Van Gool
  • 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
This paper presents a new architecture, termed as Appearance-and-Relation Network (ARTNet), to learn video representation in an end-to-end manner, constructed by stacking multiple generic building blocks, called SMART, whose goal is to simultaneously model appearance and relation from RGB input in a separate and explicit manner.
ConvNet Architecture Search for Spatiotemporal Feature Learning
This paper presents an empirical ConvNet architecture search for spatiotemporal feature learning, culminating in a deep 3D Residual ConvNet that outperforms C3D by a good margin on Sports-1M, UCF101, HMDB51, THUMOS14, and ASLAN while being 2 times faster at inference time, 2 times smaller in model size, and having a more compact representation.
Learning Spatio-Temporal Features with 3D Residual Networks for Action Recognition
The 3D ResNets trained on the Kinetics did not suffer from overfitting despite the large number of parameters of the model, and achieved better performance than relatively shallow networks, such as C3D.