Corpus ID: 55701726

Dynamic Graph Modules for Modeling Higher-Order Interactions in Activity Recognition

@article{Huang2018DynamicGM,
  title={Dynamic Graph Modules for Modeling Higher-Order Interactions in Activity Recognition},
  author={Hao Huang and Luowei Zhou and Wei Zhang and Chenliang Xu},
  journal={ArXiv},
  year={2018},
  volume={abs/1812.05637}
}
Video action recognition, a critical problem in video understanding, has attracted increasing attention recently. To identify an action involving higher-order object interactions, we need to consider: 1) spatial relations among objects in a single frame; and 2) temporal relations between the same or different objects across multiple frames. However, previous approaches, e.g., 2D ConvNet + LSTM or 3D ConvNet, are either incapable of capturing relations between objects, or unable to handle streaming…
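As a rough illustration of the kind of graph reasoning the abstract describes (this is not the authors' dynamic graph module; the class name, shapes, and attention-style update are assumptions made for the sketch), one round of message passing over per-frame object features could look like this in PyTorch:

import torch
import torch.nn as nn

class ToyGraphModule(nn.Module):
    # Illustrative only: treat every detected object in every frame as a graph
    # node, so a single attention-style update covers both within-frame
    # (spatial) and cross-frame (temporal) relations. Not the paper's module.
    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)

    def forward(self, x):            # x: (T, N, D) = frames, objects, features
        t, n, d = x.shape
        nodes = x.reshape(t * n, d)  # flatten all objects across time into one graph
        adj = (self.query(nodes) @ self.key(nodes).T / d ** 0.5).softmax(dim=-1)
        return (adj @ self.value(nodes)).reshape(t, n, d)

feats = torch.randn(8, 5, 256)         # 8 frames, 5 objects, 256-d features
updated = ToyGraphModule(256)(feats)
video_repr = updated.mean(dim=(0, 1))  # pool to a clip-level representation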

Citations

Action Graphs: Weakly-supervised Action Localization with Graph Convolution Networks

This work presents a novel method that uses graph convolutions to explicitly model similarity between video moments, and pushes the state of the art on THUMOS’14, ActivityNet 1.2, and Charades for weakly-supervised action localization.

Neural Message Passing on Hybrid Spatio-Temporal Visual and Symbolic Graphs for Video Understanding

The generality of the approach is demonstrated on a variety of tasks, such as temporal subactivity classification and object affordance classification on the CAD-120 dataset and multi-label temporal action localization on the large-scale Charades dataset, where it outperforms existing deep learning approaches using only raw RGB frames.

Representation Learning on Visual-Symbolic Graphs for Video Understanding

A graph neural network refines the representations of actors, objects, and their interactions on the resulting hybrid graph, going beyond current approaches that assume nodes and edges of a single type, operate on graphs with fixed edge weights, and do not use a symbolic graph.

References

Showing 10 of 47 references.

Attend and Interact: Higher-Order Object Interactions for Video Understanding

It is demonstrated that modeling object interactions significantly improves accuracy for both action recognition and video captioning, while requiring less than a third of the computation of traditional pairwise relationships.

Videos as Space-Time Region Graphs

The proposed graph representation achieves state-of-the-art results on the Charades and Something-Something datasets, with especially large gains when the model is applied in complex environments.

Appearance-and-Relation Networks for Video Classification

  • Limin Wang, Wei Li, Wen Li, Luc Van Gool
  • 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018
This paper presents a new architecture, termed the Appearance-and-Relation Network (ARTNet), that learns video representations in an end-to-end manner. ARTNet is constructed by stacking multiple generic building blocks, called SMART blocks, which model appearance and relation from RGB input in a separate and explicit manner.

ECO: Efficient Convolutional Network for Online Video Understanding

A network architecture that takes long-term content into account while enabling fast per-video processing, achieving competitive performance across all datasets while being 10 to 80 times faster than state-of-the-art methods.

Temporal Relational Reasoning in Videos

This paper introduces an effective and interpretable network module, the Temporal Relation Network (TRN), designed to learn and reason about temporal dependencies between video frames at multiple time scales.
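As a concrete (and deliberately simplified) sketch of the idea: a 2-frame temporal relation sums a small MLP g over time-ordered pairs of frame features and classifies the pooled result with h; the real TRN combines such relations at several time scales. The layer sizes below are arbitrary assumptions:

import torch
import torch.nn as nn
from itertools import combinations

class PairwiseTemporalRelation(nn.Module):
    # Single-scale sketch in the spirit of TRN: g aggregates ordered
    # frame-feature pairs, h maps the pooled relation to class logits.
    def __init__(self, dim, num_classes):
        super().__init__()
        self.g = nn.Sequential(nn.Linear(2 * dim, 256), nn.ReLU())
        self.h = nn.Linear(256, num_classes)

    def forward(self, frames):  # frames: (T, D) time-ordered per-frame features
        pair_sum = sum(self.g(torch.cat([frames[i], frames[j]]))
                       for i, j in combinations(range(len(frames)), 2))
        return self.h(pair_sum)

logits = PairwiseTemporalRelation(dim=512, num_classes=174)(torch.randn(8, 512))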

Learning Human-Object Interactions by Graph Parsing Neural Networks

This paper addresses the task of detecting and recognizing human-object interactions (HOI) in images and videos. We introduce the Graph Parsing Neural Network (GPNN), a framework that incorporates structural knowledge while being differentiable end-to-end.

Exploring Visual Relationship for Image Captioning

This paper introduces a new design that explores the connections between objects for image captioning under the umbrella of an attention-based encoder-decoder framework, integrating both semantic and spatial object relationships into the image encoder.

Two-Stream Convolutional Networks for Action Recognition in Videos

This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.
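A minimal sketch of that two-stream layout, with toy ConvNets standing in for the paper's much deeper networks (the layer sizes and the 10-frame flow stack are illustrative assumptions; averaging class scores follows the paper's late-fusion baseline):

import torch
import torch.nn as nn

def small_convnet(in_channels, num_classes):
    # Toy stand-in for either stream's ConvNet.
    return nn.Sequential(
        nn.Conv2d(in_channels, 32, kernel_size=7, stride=2, padding=3), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(32, num_classes),
    )

spatial_net = small_convnet(3, num_classes=101)        # appearance: one RGB frame
temporal_net = small_convnet(2 * 10, num_classes=101)  # motion: 10 stacked flow fields (x, y)

rgb = torch.randn(1, 3, 224, 224)
flow = torch.randn(1, 20, 224, 224)
scores = (spatial_net(rgb).softmax(-1) + temporal_net(flow).softmax(-1)) / 2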

Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

A new Two-Stream Inflated 3D ConvNet (I3D), based on inflating 2D ConvNets, is introduced; after pre-training on Kinetics, I3D models considerably improve on the state of the art in action classification, reaching 80.2% on HMDB-51 and 97.9% on UCF-101.
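The inflation trick itself is simple to sketch: a trained 2D kernel is repeated along a new temporal axis and rescaled by its length, so that a video of identical frames reproduces the image model's activations. A hedged PyTorch sketch (the helper name and temporal kernel size are our own choices, not the paper's code):

import torch
import torch.nn as nn

def inflate_conv2d(conv2d, time_k=3):
    # Inflate a 2D conv into 3D by repeating its kernel time_k times along
    # the temporal axis and dividing by time_k (the I3D bootstrapping idea).
    kh, kw = conv2d.kernel_size
    conv3d = nn.Conv3d(conv2d.in_channels, conv2d.out_channels,
                       kernel_size=(time_k, kh, kw),
                       stride=(1, *conv2d.stride),
                       padding=(time_k // 2, *conv2d.padding),
                       bias=conv2d.bias is not None)
    with torch.no_grad():
        conv3d.weight.copy_(conv2d.weight.unsqueeze(2).repeat(1, 1, time_k, 1, 1) / time_k)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d

conv3d = inflate_conv2d(nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3))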