Dynamic Graph Modules for Modeling Higher-Order Interactions in Activity Recognition
@article{Huang2018DynamicGM,
  title   = {Dynamic Graph Modules for Modeling Higher-Order Interactions in Activity Recognition},
  author  = {Hao Huang and Luowei Zhou and Wei Zhang and Chenliang Xu},
  journal = {ArXiv},
  year    = {2018},
  volume  = {abs/1812.05637}
}
Video action recognition, a critical problem in video understanding, has attracted increasing attention recently. To identify an action involving higher-order object interactions, we need to consider: 1) spatial relations among objects in a single frame; 2) temporal relations between the same or different objects across multiple frames. However, previous approaches, e.g., 2D ConvNet + LSTM or 3D ConvNet, are either incapable of capturing relations between objects, or unable to handle streaming…
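The abstract centers on treating detected objects as graph nodes and relating them spatially (within a frame) and temporally (across frames). As a rough illustration only, not the paper's actual dynamic graph module, the sketch below flattens per-frame object features into one graph and aggregates them with learned, data-dependent edge weights; the class name, attention-style edge weighting, and feature sizes are all assumptions.

```python
import torch
import torch.nn as nn


class DynamicGraphModule(nn.Module):
    """Minimal sketch: treats per-frame object features as graph nodes and
    aggregates messages with data-dependent (dynamic) edge weights.
    Spatial edges connect objects within a frame; temporal edges connect
    objects across frames (here a single affinity matrix covers both)."""

    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)

    def forward(self, obj_feats):
        # obj_feats: (T, N, D) = frames x objects per frame x feature dim
        T, N, D = obj_feats.shape
        nodes = obj_feats.reshape(T * N, D)            # one node per object per frame
        q, k, v = self.query(nodes), self.key(nodes), self.value(nodes)
        affinity = torch.softmax(q @ k.t() / D ** 0.5, dim=-1)  # dynamic edge weights
        out = affinity @ v                             # message passing / aggregation
        return out.reshape(T, N, D)


# Toy usage: 8 frames, 5 detected objects per frame, 256-d features.
feats = torch.randn(8, 5, 256)
module = DynamicGraphModule(256)
print(module(feats).shape)  # torch.Size([8, 5, 256])
```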
3 Citations
Action Graphs: Weakly-supervised Action Localization with Graph Convolution Networks
- Computer Science · 2020 IEEE Winter Conference on Applications of Computer Vision (WACV)
- 2020
This work presents a novel method that uses graph convolutions to explicitly model similarity between video moments and pushes the state of the art on THUMOS’14, ActivityNet 1.2, and Charades for weakly-supervised action localization.
Neural Message Passing on Hybrid Spatio-Temporal Visual and Symbolic Graphs for Video Understanding
- Computer Science · ArXiv
- 2019
The generality of the approach is demonstrated on a variety of tasks, such as temporal subactivity classification and object affordance classification on the CAD-120 dataset and multi-label temporal action localization on the large-scale Charades dataset, where it outperforms existing deep learning approaches using only raw RGB frames.
Representation Learning on Visual-Symbolic Graphs for Video Understanding
- Computer Science · ECCV
- 2020
A graph neural network refines the representations of actors, objects, and their interactions on the resulting hybrid graph, going beyond current approaches that assume nodes and edges of a single type, operate on graphs with fixed edge weights, and do not use a symbolic graph.
References
SHOWING 1-10 OF 47 REFERENCES
Attend and Interact: Higher-Order Object Interactions for Video Understanding
- Computer Science · 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
- 2018
It is demonstrated that modeling object interactions significantly improves accuracy for both action recognition and video captioning, while requiring more than 3 times less computation than traditional pairwise relationship modeling.
Videos as Space-Time Region Graphs
- Computer Science · ECCV
- 2018
The proposed graph representation achieves state-of-the-art results on the Charades and Something-Something datasets and obtains a huge gain when the model is applied in complex environments.
Appearance-and-Relation Networks for Video Classification
- Computer Science · 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
- 2018
This paper presents a new architecture, termed the Appearance-and-Relation Network (ARTNet), that learns video representations end-to-end by stacking generic building blocks, called SMART blocks, which model appearance and relation from RGB input separately and explicitly.
ECO: Efficient Convolutional Network for Online Video Understanding
- Computer Science · ECCV
- 2018
A network architecture that takes long-term content into account while enabling fast per-video processing, achieving competitive performance across all datasets while being 10 to 80 times faster than state-of-the-art methods.
VideoLSTM convolves, attends and flows for action recognition
- Computer Science · Comput. Vis. Image Underst.
- 2018
Temporal Relational Reasoning in Videos
- Computer Science · ECCV
- 2018
This paper introduces an effective and interpretable network module, the Temporal Relation Network (TRN), designed to learn and reason about temporal dependencies between video frames at multiple time scales.
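To make the "multiple time scales" idea concrete, here is a simplified sketch of TRN-style relational reasoning, not the original implementation: the real TRN subsamples frame tuples rather than enumerating them, and the layer sizes and names below are illustrative assumptions.

```python
import itertools
import torch
import torch.nn as nn


class TemporalRelation(nn.Module):
    """Simplified TRN-style module: for each scale k, embed ordered k-frame
    tuples with a small MLP and sum the relation scores across scales."""

    def __init__(self, feat_dim, num_classes, scales=(2, 3, 4)):
        super().__init__()
        self.scales = scales
        self.mlps = nn.ModuleList([
            nn.Sequential(nn.Linear(k * feat_dim, 256), nn.ReLU(),
                          nn.Linear(256, num_classes))
            for k in scales
        ])

    def forward(self, frame_feats):
        # frame_feats: (B, T, D) per-frame features from a 2D CNN backbone
        B, T, D = frame_feats.shape
        logits = 0
        for mlp, k in zip(self.mlps, self.scales):
            # all ordered k-frame tuples (the real model subsamples these)
            for idx in itertools.combinations(range(T), k):
                tuple_feat = frame_feats[:, list(idx), :].reshape(B, k * D)
                logits = logits + mlp(tuple_feat)
        return logits


# Toy usage: 2 clips, 8 frames each, 512-d per-frame features, 174 classes.
x = torch.randn(2, 8, 512)
trn = TemporalRelation(512, num_classes=174)
print(trn(x).shape)  # torch.Size([2, 174])
```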
Learning Human-Object Interactions by Graph Parsing Neural Networks
- Computer Science · ECCV
- 2018
This paper addresses the task of detecting and recognizing human-object interactions (HOI) in images and videos. We introduce the Graph Parsing Neural Network (GPNN), a framework that incorporates…
Exploring Visual Relationship for Image Captioning
- Computer Science · ECCV
- 2018
This paper introduces a new design that explores the connections between objects for image captioning under the attention-based encoder-decoder framework, integrating both semantic and spatial object relationships into the image encoder.
Two-Stream Convolutional Networks for Action Recognition in Videos
- Computer Science · NIPS
- 2014
This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.
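For reference, a minimal sketch of the two-stream idea described above, using tiny placeholder CNNs rather than the paper's actual backbones (all layer sizes and names are assumptions): a single RGB frame feeds the spatial stream, a stack of optical-flow fields feeds the temporal stream, and class scores are fused by averaging.

```python
import torch
import torch.nn as nn


class TwoStreamNet(nn.Module):
    """Toy two-stream network: spatial stream on one RGB frame, temporal
    stream on stacked optical flow, late fusion by score averaging."""

    def __init__(self, num_classes, flow_stack=10):
        super().__init__()

        def small_cnn(in_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, 32, 7, stride=2, padding=3), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(32, num_classes))

        self.spatial = small_cnn(3)                 # RGB frame: 3 channels
        self.temporal = small_cnn(2 * flow_stack)   # x/y flow for each of L frames

    def forward(self, rgb, flow):
        return (self.spatial(rgb) + self.temporal(flow)) / 2  # late fusion


net = TwoStreamNet(num_classes=101)
rgb = torch.randn(2, 3, 224, 224)
flow = torch.randn(2, 20, 224, 224)
print(net(rgb, flow).shape)  # torch.Size([2, 101])
```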
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
- Computer Science · 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2017
Introduces a new Two-Stream Inflated 3D ConvNet (I3D) based on 2D ConvNet inflation; I3D models considerably improve upon the state of the art in action classification, reaching 80.2% on HMDB-51 and 97.9% on UCF-101 after pre-training on Kinetics.
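The "inflation" trick mentioned above can be summarized with a short sketch (a simplified illustration, not the official I3D code; the helper name and the rescaling choice are assumptions): a pretrained 2D kernel is repeated along the time axis and divided by the temporal extent, so a static video initially produces roughly the same activations as the original 2D network.

```python
import torch
import torch.nn as nn


def inflate_conv2d(conv2d: nn.Conv2d, time_dim: int = 3) -> nn.Conv3d:
    """Turn a 2D convolution into a 3D one by repeating its kernel along
    time and rescaling by 1/time_dim to preserve activation magnitudes."""
    conv3d = nn.Conv3d(conv2d.in_channels, conv2d.out_channels,
                       kernel_size=(time_dim, *conv2d.kernel_size),
                       stride=(1, *conv2d.stride),
                       padding=(time_dim // 2, *conv2d.padding),
                       bias=conv2d.bias is not None)
    with torch.no_grad():
        w2d = conv2d.weight                                   # (out, in, kH, kW)
        w3d = w2d.unsqueeze(2).repeat(1, 1, time_dim, 1, 1) / time_dim
        conv3d.weight.copy_(w3d)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d


# Toy usage: inflate a 3x3 2D conv and run it on a 16-frame clip.
c2d = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1)
c3d = inflate_conv2d(c2d)
clip = torch.randn(1, 3, 16, 112, 112)   # (B, C, T, H, W)
print(c3d(clip).shape)                   # torch.Size([1, 64, 16, 112, 112])
```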