Structural-RNN: Deep Learning on Spatio-Temporal Graphs

Ashesh Jain, Amir Roshan Zamir, Silvio Savarese, Ashutosh Saxena
2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
Deep Recurrent Neural Network architectures, though remarkably capable at modeling sequences, lack an intuitive high-level spatio-temporal structure. We develop a scalable method for casting an arbitrary spatio-temporal graph as a rich RNN mixture that is feedforward, fully differentiable, and jointly trainable. The proposed method is generic and principled, as it can transform any spatio-temporal graph through a well-defined set of steps. The evaluations of the…
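As a concrete (and heavily simplified) illustration of that factorization, the sketch below wires up toy edgeRNNs and nodeRNNs in NumPy: each edgeRNN consumes the features of its endpoint nodes, and each nodeRNN consumes its own features plus the summed outputs of its incident edgeRNNs. The node names, dimensions, and plain tanh cell are illustrative assumptions; the paper uses LSTMs and shares RNNs across semantically similar nodes and edges.

```python
import numpy as np

rng = np.random.default_rng(0)

class RNNCell:
    """Minimal tanh RNN cell (an illustrative stand-in for the paper's LSTMs)."""
    def __init__(self, in_dim, hid_dim):
        self.W = rng.standard_normal((hid_dim, in_dim)) * 0.1
        self.U = rng.standard_normal((hid_dim, hid_dim)) * 0.1
        self.h = np.zeros(hid_dim)

    def step(self, x):
        self.h = np.tanh(self.W @ x + self.U @ self.h)
        return self.h

feat_dim, hid = 4, 8
# Toy st-graph: two node types, one spatial edge and one temporal self-edge.
nodes = ["spine", "arm"]
edges = [("spine", "arm"), ("spine", "spine")]

edge_rnn = {e: RNNCell(2 * feat_dim, hid) for e in edges}
node_rnn = {v: RNNCell(feat_dim + hid, hid) for v in nodes}

for t in range(3):
    x = {v: rng.standard_normal(feat_dim) for v in nodes}  # features at time t
    # 1) each edgeRNN consumes the concatenated features of its endpoints
    e_out = {e: edge_rnn[e].step(np.concatenate([x[e[0]], x[e[1]]]))
             for e in edges}
    # 2) each nodeRNN consumes [own features, sum of incident edge outputs]
    for v in nodes:
        ctx = sum(e_out[e] for e in edges if v in e)
        h = node_rnn[v].step(np.concatenate([x[v], ctx]))

print(h.shape)  # → (8,)
```

Because the edge-to-node wiring is fixed up front, the whole mixture unrolls into one feedforward, jointly trainable network, which is the sense in which the construction is "fully differentiable".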


ST-UNet: A Spatio-Temporal U-Network for Graph-structured Time Series Modeling

A novel multi-scale architecture, Spatio-Temporal U-Net (ST-UNet), for graph-structured time series modeling, which effectively captures comprehensive features in multiple scales and achieves substantial improvements over mainstream methods on several real-world datasets.

Graph WaveNet for Deep Spatial-Temporal Graph Modeling

This paper proposes a novel graph neural network architecture, Graph WaveNet, for spatial-temporal graph modeling by developing a novel adaptive dependency matrix and learning it through node embeddings, which can precisely capture the hidden spatial dependency in the data.
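An adaptive dependency matrix of this kind is commonly built as a row-softmax over the rectified product of two node-embedding matrices. The NumPy sketch below shows that construction; the embeddings are randomly initialized here rather than learned, and all sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_nodes, emb_dim = 5, 3

# Two node-embedding matrices (learnable in the real model; random here).
E1 = rng.standard_normal((n_nodes, emb_dim))
E2 = rng.standard_normal((n_nodes, emb_dim))

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Adaptive adjacency: A = softmax(relu(E1 @ E2.T)); each row is a
# normalized distribution over possible neighbors, learned end to end.
A = softmax(np.maximum(E1 @ E2.T, 0.0), axis=1)

print(A.shape)  # → (5, 5)
```

Because A is produced from embeddings rather than supplied as input, gradients from the forecasting loss can discover spatial dependencies that no predefined graph encodes.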

Recurrent Space-time Graph Neural Networks

This work proposes a neural graph model, recurrent in space and time, suitable for capturing both the local appearance and the complex higher-level interactions of different entities and objects within the changing world scene and obtains state-of-the-art performance on the challenging Something-Something human-object interaction dataset.

FedRel: An Adaptive Federated Relevance Framework for Spatial Temporal Graph Learning

An adaptive federated relevance framework, namely FedRel, for spatial-temporal graph learning is proposed and a relevance-driven federated learning module in the framework is designed to leverage diverse data distributions from different participants with attentive aggregations of their models.

Structured Sequence Modeling with Graph Convolutional Recurrent Networks

The proposed model combines convolutional neural networks on graphs to identify spatial structures and RNN to find dynamic patterns in data structured by an arbitrary graph.
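A minimal sketch of this combination, assuming a plain tanh recurrence in which a normalized adjacency matrix replaces the dense matmuls of a vanilla RNN cell (the actual model uses spectral graph convolutions inside an LSTM; the graph and sizes below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, f, h = 4, 3, 6  # nodes, feature dim, hidden dim

# Toy 4-node ring graph; add self-loops and row-normalize the adjacency.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], float) + np.eye(n)
A_hat = A / A.sum(axis=1, keepdims=True)

W = rng.standard_normal((f, h)) * 0.1   # input weights
U = rng.standard_normal((h, h)) * 0.1   # recurrent weights

H = np.zeros((n, h))
for t in range(5):
    X = rng.standard_normal((n, f))     # node features at time t
    # Graph convolution (A_hat @ ...) mixes each node's state with its
    # neighbors' at every recurrent step: spatial structure inside the RNN.
    H = np.tanh(A_hat @ X @ W + A_hat @ H @ U)

print(H.shape)  # → (4, 6)
```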

STTG-net: a Spatio-temporal network for human motion prediction based on transformer and graph convolution network

A novel spatio-temporal network based on a transformer and a graph convolutional network (GCN), STTG-Net, is proposed to overcome the problems of error accumulation and discontinuity in motion prediction.

Understanding Dynamic Scenes using Graph Convolution Networks

A novel Multi-Relational Graph Convolutional Network (MRGCN) based framework to model on-road vehicle behaviors from a sequence of temporally ordered frames as grabbed by a moving monocular camera achieves significant performance gain over prior methods on vehicle-behavior classification tasks on four datasets.

Stacked Spatio-Temporal Graph Convolutional Networks for Action Segmentation

The Stacked-STGCN achieves improved performance over the state of the art on both CAD120 and Charades, and can be applied to a wider range of applications that require structured inference over long sequences with heterogeneous data types and varied temporal extent.

Space-Time-Separable Graph Convolutional Network for Pose Forecasting

For the first time, STS-GCN models the human pose dynamics only with a graph convolutional network (GCN), including the temporal evolution and the spatial joint interaction within a single-graph framework, which allows the cross-talk of motion and spatial correlations.

References
Long-term recurrent convolutional networks for visual recognition and description

A novel recurrent convolutional architecture suitable for large-scale visual learning which is end-to-end trainable, and shows such models have distinct advantages over state-of-the-art models for recognition or generation which are separately defined and/or optimized.

Learning Spatio-Temporal Structure from RGB-D Videos for Human Activity Detection and Anticipation

This paper proposes a graph structure that improves the state-of-the-art significantly for detecting past activities as well as for anticipating future activities, on a dataset of 120 activity videos collected from four subjects.

Conditional Random Fields as Recurrent Neural Networks

A new form of convolutional neural network that combines the strengths of Convolutional Neural Networks (CNNs) and Conditional Random Fields (CRFs)-based probabilistic graphical modelling is introduced, and top results are obtained on the challenging Pascal VOC 2012 segmentation benchmark.

Learning spatiotemporal graphs of human activities

The model is used for parsing new videos in terms of detecting and localizing relevant activity parts, and outperforms the state of the art on the benchmark Olympic and UT human-interaction datasets, under a favorable complexity-vs-accuracy trade-off.

Learning Deep Structured Models

This paper proposes a training algorithm that is able to learn structured models jointly with deep features that form the MRF potentials and demonstrates the effectiveness of this algorithm in the tasks of predicting words from noisy images, as well as tagging of Flickr photographs.

Visualizing and Understanding Recurrent Networks

This work uses character-level language models as an interpretable testbed to provide an analysis of LSTM representations, predictions and error types, and reveals the existence of interpretable cells that keep track of long-range dependencies such as line lengths, quotes and brackets.

Recurrent Network Models for Human Dynamics

The Encoder-Recurrent-Decoder (ERD) model is a recurrent neural network that incorporates nonlinear encoder and decoder networks before and after recurrent layers that extends previous Long Short Term Memory models in the literature to jointly learn representations and their dynamics.

Unsupervised Learning of Video Representations using LSTMs

This work uses Long Short Term Memory networks to learn representations of video sequences and evaluates the representations by finetuning them for a supervised learning problem - human action recognition on the UCF-101 and HMDB-51 datasets.

Translating Videos to Natural Language Using Deep Recurrent Neural Networks

This paper proposes to translate videos directly to sentences using a unified deep neural network with both convolutional and recurrent structure, to create sentence descriptions of open-domain videos with large vocabularies.

Efficient Piecewise Training of Deep Structured Models for Semantic Segmentation

This work shows how to improve semantic segmentation through the use of contextual information, specifically 'patch-patch' context between image regions and 'patch-background' context, and formulates Conditional Random Fields with CNN-based pairwise potential functions to capture semantic correlations between neighboring patches.