Temporal Relational Reasoning in Videos

@article{Zhou2018TemporalRR,
  title={Temporal Relational Reasoning in Videos},
  author={Bolei Zhou and A. Andonian and A. Torralba},
  journal={ArXiv},
  year={2018},
  volume={abs/1711.08496}
}
Temporal relational reasoning, the ability to link meaningful transformations of objects or entities over time, is a fundamental property of intelligent species. [...] Further analyses show that the models learn intuitive and interpretable visual common sense knowledge in videos. (Code and models are available at http://relation.csail.mit.edu/.)
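The multi-scale temporal relation module the abstract describes can be illustrated with a minimal sketch: the 2-frame relation applies a shared function g to ordered pairs of sampled frame features and a readout h to their sum. The single-layer g and h below, and all dimensions, are toy stand-ins for the paper's MLPs, not its actual implementation:

```python
import numpy as np

def pairwise_temporal_relation(frames, g, h):
    """2-frame temporal relation: h(sum over ordered pairs (i < j) of g(f_i, f_j))."""
    pair_feats = [g(np.concatenate([frames[i], frames[j]]))
                  for i in range(len(frames))
                  for j in range(i + 1, len(frames))]
    return h(np.sum(pair_feats, axis=0))

# Toy g and h: one ReLU layer and one linear layer (illustrative stand-ins).
rng = np.random.default_rng(0)
D, H, C = 16, 8, 4                      # frame-feature, hidden, class dims
Wg = rng.standard_normal((2 * D, H))
Wh = rng.standard_normal((H, C))
g = lambda x: np.maximum(x @ Wg, 0.0)   # pair embedding
h = lambda x: x @ Wh                    # class scores

frames = rng.standard_normal((5, D))    # 5 sampled frame features
scores = pairwise_temporal_relation(frames, g, h)
print(scores.shape)  # (4,)
```

Higher-order (3-frame, 4-frame, ...) relations follow the same pattern over larger ordered tuples, and the multi-scale network sums the readouts across scales.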

Paper Mentions

Unified Graph Structured Models for Video Understanding
TLDR
This paper proposes a message passing graph neural network that explicitly models spatio-temporal relations and can use explicit representations of objects, when supervision is available, and implicit representations otherwise, and generalises previous structured models for video understanding.
Spatial-Temporal Relation Reasoning for Action Prediction in Videos
TLDR
This paper focuses on reasoning about the spatial–temporal relations between persons and contextual objects to interpret the observed video part for predicting action categories, and proposes an improved gated graph neural network to perform spatial relation reasoning between the visual objects in video frames.
Video Relation Detection with Spatio-Temporal Graph
TLDR
By combining the model (VRD-GCN) and the proposed association method, the framework for video relation detection achieves the best performance in the latest benchmarks and a series of ablation studies demonstrate the method's effectiveness.
Visual Relationship Forecasting in Videos
TLDR
A novel Graph Convolutional Transformer (GCT) framework, which captures both object-level and frame-level dependencies by spatio-temporally localized visual relation annotations in a video, and demonstrates that GCT outperforms the state-of-the-art sequence modelling methods on visual relationship forecasting.
Videos as Space-Time Region Graphs
TLDR
The proposed graph representation achieves state-of-the-art results on the Charades and Something-Something datasets and obtains a huge gain when the model is applied in complex environments.
SSAN: Separable Self-Attention Network for Video Representation Learning
TLDR
A separable self-attention (SSA) module is proposed, which models spatial and temporal correlations sequentially, so that spatial contexts can be efficiently used in temporal modeling.
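The sequential factorization this TLDR describes — spatial attention within each frame, then temporal attention at each spatial position — can be sketched as follows. The plain dot-product `attend` and all tensor shapes are illustrative assumptions, not the SSAN module itself:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(x):
    """Plain dot-product self-attention over the first axis of x: (N, D) -> (N, D)."""
    scores = softmax(x @ x.T / np.sqrt(x.shape[-1]))
    return scores @ x

def separable_attention(video):
    """video: (T, N, D) — T frames, N spatial positions, D channels.
    First attend spatially within each frame, then temporally at each position."""
    spatial = np.stack([attend(video[t]) for t in range(video.shape[0])])
    temporal = np.stack([attend(spatial[:, n]) for n in range(video.shape[1])],
                        axis=1)
    return temporal

rng = np.random.default_rng(1)
out = separable_attention(rng.standard_normal((4, 6, 8)))
print(out.shape)  # (4, 6, 8)
```

Factoring the T*N joint attention into a spatial pass plus a temporal pass reduces the attention cost from O((T*N)^2) to O(T*N^2 + N*T^2), which is the efficiency argument behind separable designs.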
Spatio-temporal Relational Reasoning for Video Question Answering
TLDR
This work presents a novel spatio-temporal reasoning neural module which enables modeling complex multi-entity relationships in space and long-term ordered dependencies in time and achieves state-of-the-art performance on two benchmark datasets: TGIF-QA and SVQA.
Temporal Reasoning Graph for Activity Recognition
TLDR
This paper proposes an efficient temporal reasoning graph (TRG) to simultaneously capture the appearance features and temporal relation between video sequences at multiple time scales, and constructs learnable temporal relation graphs to explore temporal relation on the multi-scale range.
Temporal Pyramid Pooling Based Relation Network for Action Recognition
TLDR
A novel temporal pyramid pooling based relation network (TPPRN) is proposed to learn spatiotemporal representations in an end-to-end fashion and achieves comparable performance with the state-of-the-art.
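The temporal pyramid pooling ingredient named in this TLDR can be sketched on its own: split the frame-feature sequence into progressively finer temporal chunks, pool each chunk, and concatenate. The `levels` and the choice of average pooling are illustrative assumptions, not TPPRN's exact configuration:

```python
import numpy as np

def temporal_pyramid_pool(features, levels=(1, 2, 4)):
    """features: (T, D). For each level L, split the T frames into L chunks,
    average-pool each chunk over time, and concatenate all pooled vectors."""
    pooled = []
    for L in levels:
        for chunk in np.array_split(features, L):  # handles uneven splits
            pooled.append(chunk.mean(axis=0))
    return np.concatenate(pooled)  # shape: (sum(levels) * D,)

out = temporal_pyramid_pool(np.ones((10, 3)))
print(out.shape)  # (21,)
```

The fixed-size output regardless of clip length T is what makes pyramid pooling convenient as input to a downstream relation head.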
We Have So Much In Common: Modeling Semantic Relational Set Abstractions in Videos
TLDR
This work combines visual features with natural language supervision to generate high-level representations of similarities across a set of videos, which allows the model to perform cognitive tasks such as set abstraction, set completion, and odd one out detection.

References

Showing 1–10 of 39 references
A simple neural network module for relational reasoning
TLDR
This work shows how a deep learning architecture equipped with an RN module can implicitly discover and learn to reason about entities and their relations.
The “Something Something” Video Database for Learning and Evaluating Visual Common Sense
TLDR
This work describes the ongoing collection of the “something-something” database of video prediction tasks whose solutions require a common sense understanding of the depicted situation, and describes the challenges in crowd-sourcing this data at scale.
Asynchronous Temporal Fields for Action Recognition
TLDR
This work proposes a fully-connected temporal CRF model for reasoning over various aspects of activities that includes objects, actions, and intentions, where the potentials are predicted by a deep network.
Time-Contrastive Networks: Self-Supervised Learning from Multi-view Observation
TLDR
The first self-supervised results for end-to-end imitation learning of human motions with a real robot are shown, and the contrastive signal encourages the model to discover meaningful dimensions and attributes that can explain the changing state of objects and the world from visually similar frames.
Long-term recurrent convolutional networks for visual recognition and description
TLDR
A novel recurrent convolutional architecture suitable for large-scale visual learning which is end-to-end trainable, and shows such models have distinct advantages over state-of-the-art models for recognition or generation which are separately defined and/or optimized.
Temporal Segment Networks: Towards Good Practices for Deep Action Recognition
Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, the advantage over traditional methods is not so evident.
What Actions are Needed for Understanding Human Actions in Videos?
TLDR
This work analyzes datasets, evaluation metrics, algorithms, and potential future directions to discover that fine-grained understanding of objects and pose when combined with temporal reasoning is likely to yield substantial improvements in algorithmic accuracy.
Time-Contrastive Networks: Self-Supervised Learning from Video
TLDR
A self-supervised approach for learning representations and robotic behaviors entirely from unlabeled videos recorded from multiple viewpoints is proposed, and it is demonstrated that this representation can be used by a robot to directly mimic human poses without an explicit correspondence, and that it can be used as a reward function within a reinforcement learning algorithm.
Parsing Videos of Actions with Segmental Grammars
TLDR
This work describes simple grammars that capture hierarchical temporal structure while admitting inference with a finite-state-machine, which makes parsing linear-time, constant-storage, and naturally online.
Activity representation with motion hierarchies
TLDR
This paper introduces a spectral divisive clustering algorithm to efficiently extract a hierarchy over a large number of tracklets and provides an efficient positive definite kernel that computes the structural and visual similarity of two hierarchical decompositions by relying on models of their parent–child relations.