Corpus ID: 201646783

Cooperative Cross-Stream Network for Discriminative Action Representation

@article{Zhang2019CooperativeCN,
  title={Cooperative Cross-Stream Network for Discriminative Action Representation},
  author={Jingran Zhang and Fumin Shen and Xing Xu and Heng Tao Shen},
  journal={ArXiv},
  year={2019},
  volume={abs/1908.10136}
}
The spatial and temporal two-stream model has achieved great success in video action recognition. Most existing works focus on designing effective feature-fusion methods and train the two streams separately. With such separate training, however, it is hard to ensure discriminability and to exploit the complementary information between the streams. In this work, we propose a novel cooperative cross-stream network that investigates the conjoint information across multiple modalities…
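
The abstract stops short of detailing the architecture, so the following is only a minimal sketch of the general setting it describes: a two-stream network whose RGB and optical-flow branches are optimized jointly through a shared head instead of being trained separately. The ResNet-18 backbones, the 10-frame flow stack, and the concatenation-based joint classifier are illustrative assumptions, not the paper's actual design.

import torch
import torch.nn as nn
import torchvision.models as models

class TwoStreamSketch(nn.Module):
    # Hypothetical two-stream model with a joint head; this is NOT the
    # paper's cooperative cross-stream network, only a generic baseline.
    def __init__(self, num_classes=101, flow_stack=10):
        super().__init__()
        # Spatial stream over RGB frames.
        self.spatial = models.resnet18(weights=None)
        self.spatial.fc = nn.Identity()
        # Temporal stream over 2 * flow_stack stacked optical-flow channels.
        self.temporal = models.resnet18(weights=None)
        self.temporal.conv1 = nn.Conv2d(2 * flow_stack, 64, kernel_size=7,
                                        stride=2, padding=3, bias=False)
        self.temporal.fc = nn.Identity()
        # Per-stream classifiers plus a joint head on concatenated features.
        self.fc_rgb = nn.Linear(512, num_classes)
        self.fc_flow = nn.Linear(512, num_classes)
        self.fc_joint = nn.Linear(1024, num_classes)

    def forward(self, rgb, flow):
        f_rgb = self.spatial(rgb)      # (B, 512)
        f_flow = self.temporal(flow)   # (B, 512)
        joint = torch.cat([f_rgb, f_flow], dim=1)
        return self.fc_rgb(f_rgb), self.fc_flow(f_flow), self.fc_joint(joint)

Supervising the per-stream and joint logits with one loss is one simple way to couple the streams during training, in contrast to the separate training the abstract criticizes.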
1 Citation
Enhanced Action Recognition Using Multiple Stream Deep Learning with Optical Flow and Weighted Sum
TLDR
A novel action recognition method that improves an existing approach using optical flow and a multi-stream structure; it outperforms many state-of-the-art methods without changing the network structure and is expected to be easily applicable to other networks.

References

Showing 1–10 of 49 references
ActionVLAD: Learning Spatio-Temporal Aggregation for Action Classification
TLDR
A new video representation for action classification that aggregates local convolutional features across the entire spatio-temporal extent of the video and outperforms other baselines with comparable base architectures on HMDB51, UCF101, and Charades video classification benchmarks.
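
As a rough illustration of the aggregation described above, here is a hedged sketch of a NetVLAD-style pooling layer, the core of ActionVLAD: local convolutional features from the whole video are softly assigned to learned "action words" and the residuals to each cluster center are summed over the full spatio-temporal extent. The feature dimension and cluster count are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VLADPoolSketch(nn.Module):
    def __init__(self, dim=512, num_clusters=64):
        super().__init__()
        # 1x1x1 convolution scores each local feature against every cluster.
        self.assign = nn.Conv3d(dim, num_clusters, kernel_size=1)
        self.centers = nn.Parameter(torch.randn(num_clusters, dim))

    def forward(self, x):                      # x: (B, D, T, H, W) conv features
        a = F.softmax(self.assign(x), dim=1)   # (B, K, T, H, W) soft assignments
        x = x.flatten(2)                       # (B, D, N) with N = T*H*W
        a = a.flatten(2)                       # (B, K, N)
        # VLAD_k = sum_i a_ik * (x_i - c_k), summed over all of space-time.
        vlad = torch.bmm(a, x.transpose(1, 2))                   # sum_i a_ik x_i
        vlad = vlad - a.sum(dim=2, keepdim=True) * self.centers  # subtract centers
        vlad = F.normalize(vlad, dim=2)        # intra-normalize each cluster
        return F.normalize(vlad.flatten(1), dim=1)   # (B, K*D) video descriptor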
Two-Stream Convolutional Networks for Action Recognition in Videos
TLDR
This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.
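
In this design the two streams are trained independently and combined only at test time. A minimal sketch of that score-level fusion (equal weighting is one of the variants reported; the paper also fuses with a linear SVM on stacked scores):

import torch

def late_fusion(rgb_logits, flow_logits, w=0.5):
    # Average class probabilities from the two independently trained streams.
    return w * torch.softmax(rgb_logits, dim=1) + (1 - w) * torch.softmax(flow_logits, dim=1)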
Convolutional Two-Stream Network Fusion for Video Action Recognition
TLDR
A new ConvNet architecture for spatiotemporal fusion of video snippets is proposed and evaluated on standard benchmarks, where it achieves state-of-the-art results.
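
Rather than averaging final scores, this architecture fuses the streams at a convolutional layer. A sketch of the simplest "conv fusion" variant, channel-wise concatenation followed by a learned 1x1 convolution (the paper additionally studies 3D convolution and 3D pooling across time, omitted here):

import torch
import torch.nn as nn

class ConvFusionSketch(nn.Module):
    def __init__(self, channels=512):
        super().__init__()
        # A learned filter bank mixes corresponding spatial/temporal channels.
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, feat_rgb, feat_flow):    # both (B, C, H, W), same layer
        return self.fuse(torch.cat([feat_rgb, feat_flow], dim=1))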
Action recognition with trajectory-pooled deep-convolutional descriptors
TLDR
This paper presents a new video representation, called trajectory-pooled deep-convolutional descriptor (TDD), which shares the merits of both hand-crafted and deep-learned features and achieves performance superior to the state of the art.
Modeling Spatial-Temporal Clues in a Hybrid Deep Learning Framework for Video Classification
TLDR
This work proposes a hybrid deep learning framework for video classification that models static spatial information, short-term motion, and long-term temporal clues in videos, and achieves very competitive performance on two popular and challenging benchmarks.
Temporal Segment Networks: Towards Good Practices for Deep Action Recognition
Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, their advantage over traditional methods is not so evident.
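
The TSN recipe itself is compact: sample one short snippet from each of K evenly spaced segments of the video, score every snippet with a shared ConvNet, and combine the scores with a segmental consensus (averaging in the simplest case) so that the video-level loss reaches all segments. A sketch with an arbitrary snippet classifier standing in for the backbone:

import torch
import torch.nn as nn

class TSNSketch(nn.Module):
    def __init__(self, backbone: nn.Module, num_segments=3):
        super().__init__()
        self.backbone = backbone        # any per-snippet classifier
        self.num_segments = num_segments

    def forward(self, snippets):        # (B, K, C, H, W): K snippets per video
        B, K = snippets.shape[:2]
        logits = self.backbone(snippets.flatten(0, 1))   # (B*K, num_classes)
        return logits.view(B, K, -1).mean(dim=1)         # averaging consensus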
Spatiotemporal Residual Networks for Video Action Recognition
TLDR
The novel spatiotemporal ResNet is introduced and evaluated on two widely used action recognition benchmarks, where it exceeds the previous state of the art.
Temporal–Spatial Mapping for Action Recognition
TLDR
This work introduces a simple yet effective operation, termed temporal–spatial mapping, for capturing the temporal evolution of the frames by jointly analyzing all the frames of a video, and proposes a temporal attention model within a shallow convolutional neural network to efficiently exploit the temporal–spatial dynamics.
End-to-end Video-level Representation Learning for Action Recognition
TLDR
This paper builds upon two-stream ConvNets and proposes Deep networks with Temporal Pyramid Pooling (DTPP), an end-to-end video-level representation learning approach, to address problems of partial observation training and single temporal scale modeling in action recognition.
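
The pyramid-pooling idea can be sketched compactly: per-frame features are pooled at several temporal scales and concatenated into one fixed-length video-level vector. The pyramid levels and the use of max-pooling below are illustrative assumptions.

import torch

def temporal_pyramid_pool(feats, levels=(1, 2, 4)):
    # feats: (T, D) per-frame features; assumes T >= max(levels).
    pooled = []
    for level in levels:
        # Split the T frames into `level` roughly equal chunks, max-pool each.
        for chunk in torch.chunk(feats, level, dim=0):
            pooled.append(chunk.max(dim=0).values)
    return torch.cat(pooled)   # (sum(levels) * D,) video-level vector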
Video Representation Learning Using Discriminative Pooling
TLDR
This work proposes discriminative pooling, based on the notion that among the deep features generated from all short clips, there is at least one that characterizes the action, and learns a (nonlinear) hyperplane that separates this unknown, yet discriminative, feature from the rest.