Corpus ID: 236428544

Spatial-Temporal Transformer for Dynamic Scene Graph Generation

@article{Cong2021SpatialTemporalTF,
  title={Spatial-Temporal Transformer for Dynamic Scene Graph Generation},
  author={Yuren Cong and Wentong Liao and Hanno Ackermann and Michael Ying Yang and Bodo Rosenhahn},
  journal={ArXiv},
  year={2021},
  volume={abs/2107.12309}
}
Dynamic scene graph generation aims at generating a scene graph of the given video. Compared to the task of scene graph generation from images, it is more challenging because of the dynamic relationships between objects and the temporal dependencies between frames allowing for a richer semantic interpretation. In this paper, we propose Spatial-temporal Transformer (STTran), a neural network that consists of two core modules: (1) a spatial encoder that takes an input frame to extract spatial…

References

SHOWING 1-10 OF 70 REFERENCES
Scene Graph Generation from Objects, Phrases and Region Captions
TLDR
This work proposes a novel neural network model, termed Multi-level Scene Description Network (MSDN), to solve the three vision tasks jointly in an end-to-end manner, and shows that joint learning across the three tasks with the proposed method can bring mutual improvements over previous models.
Factorizable Net: An Efficient Subgraph-based Framework for Scene Graph Generation
TLDR
A subgraph-based connection graph is proposed to concisely represent the scene graph during inference, improving the efficiency of scene graph generation; the method outperforms the state-of-the-art method in both accuracy and speed.
On Support Relations and Semantic Scene Graphs
TLDR
This paper proposes a novel framework for the automatic generation of semantic scene graphs that interpret indoor environments: a Convolutional Neural Network detects objects of interest, and a semantic scene graph describing the contextual relations within a cluttered indoor scene is constructed.
Graph R-CNN for Scene Graph Generation
TLDR
A novel scene graph generation model called Graph R-CNN, which is both effective and efficient at detecting objects and their relations in images, is proposed, and a new evaluation metric is introduced that is more holistic and realistic than existing metrics.
Learning to Compose Dynamic Tree Structures for Visual Contexts
TLDR
A hybrid learning procedure is developed which integrates end-task supervised learning and the tree structure reinforcement learning, where the former's evaluation result serves as a self-critic for the latter's structure exploration.
Image Generation from Scene Graphs
TLDR
This work proposes a method for generating images from scene graphs, enabling explicit reasoning about objects and their relationships, and validates this approach on Visual Genome and COCO-Stuff.
Image Captioning with Scene-graph Based Semantic Concepts
TLDR
This paper explores the co-occurrence dependency of high-level semantic concepts and proposes a novel method with a scene-graph-based semantic representation for image captioning, using a CNN-RNN-SVM framework to generate the scene-graph-based sequence.
Scene Graph Generation by Iterative Message Passing
TLDR
This work explicitly models the objects and their relationships using scene graphs, a visually-grounded graphical structure of an image, and proposes a novel end-to-end model that generates such a structured scene representation from an input image.
GPS-Net: Graph Property Sensing Network for Scene Graph Generation
TLDR
A novel message passing module is proposed that augments the node feature with node-specific contextual information and encodes the edge direction information via a tri-linear model, achieving state-of-the-art performance on three popular databases (VG, OI, and VRD) with significant gains under various settings and metrics.
Exploring Context and Visual Pattern of Relationship for Scene Graph Generation
TLDR
To discover effective patterns for relationships, traditional relationship feature extraction methods, such as using the union region or a combination of subject-object feature pairs, are replaced with the proposed intersection region, which focuses on the more essential parts.