
Compositional Video Synthesis with Action Graphs

Amir Bar, Roei Herzig, Xiaolong Wang, Gal Chechik, Trevor Darrell, Amir Globerson
Videos of actions are complex spatio-temporal signals with rich compositional structure. Current generative models are limited in their ability to generate object configurations outside the range they were trained on. To this end, we introduce a generative model (AG2Vid) based on Action Graphs, a natural and convenient structure that represents the dynamics of actions between objects over time. Our AG2Vid model disentangles appearance and position features, allowing for…
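The abstract describes an Action Graph as a structure whose nodes are objects and whose edges are actions between objects over time. A minimal sketch of such a structure is below; the class and field names are illustrative assumptions, not the paper's actual data structures.

```python
from dataclasses import dataclass, field

@dataclass
class ActionEdge:
    """A timed action edge: `subject` performs `action` on `target`
    over the frame interval [start, end]. Field names are hypothetical."""
    subject: str
    action: str
    target: str
    start: int
    end: int

@dataclass
class ActionGraph:
    objects: list = field(default_factory=list)  # node set
    edges: list = field(default_factory=list)    # timed action edges

    def active_actions(self, frame):
        """Actions in progress at a given frame."""
        return [e for e in self.edges if e.start <= frame <= e.end]

# Example: a hand pushing a cube during frames 0..10
g = ActionGraph(objects=["hand", "cube"],
                edges=[ActionEdge("hand", "push", "cube", 0, 10)])
```

Querying `g.active_actions(5)` returns the push edge, while `g.active_actions(20)` is empty — the timed edges are what let a generator condition each frame on the actions currently underway.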
Action Concept Grounding Network for Semantically-Consistent Video Generation
Introduces the task of semantic action-conditional video prediction, which can be regarded as an inverse problem of action recognition, and proposes a novel video prediction model, the Action Concept Grounding Network (AGCN).
Compositional Transformers for Scene Generation
We introduce the GANformer2 model, an iterative object-oriented transformer explored for the task of generative modeling. The network incorporates strong and explicit structural priors to reflect…
Conditional Object-Centric Learning from Video
Object-centric representations are a promising path toward more systematic generalization, providing flexible abstractions upon which compositional world models can be built. Recent work on simple…
Learning to Compose Visual Relations
The visual world around us can be described as a structured set of objects and their associated relations. An image of a room may be conjured given only the description of the underlying objects and…
Object-Region Video Transformers
Presents Object-Region Video Transformers, an object-centric approach that extends video transformer layers with a block that directly incorporates object representations throughout multiple layers, demonstrating the value of building object representations into a transformer architecture.
Research of Neural Networks Efficiency in Video Generation Problem
This paper describes research on video generation using neural networks. The main ideas of several existing approaches based on generative adversarial networks are presented, and their…
Learning Object Detection from Captions via Textual Scene Attributes
This work argues that captions contain much richer information about the image, including attributes of objects and their relations, and presents a method that uses the attributes in this "textual scene graph" to train object detectors.
Modular Action Concept Grounding in Semantic Video Prediction
This work introduces the task of semantic action-conditional video prediction, which uses semantic action labels to describe object interactions and can be regarded as an inverse problem of action recognition.


Compositional Video Prediction
Presents an approach for pixel-level future prediction given an input image of a scene, observing that a scene comprises distinct entities that undergo motion, and empirically validates the approach against alternative representations and ways of incorporating multi-modality.
CATER: A diagnostic dataset for Compositional Actions and TEmporal Reasoning
This work builds a video dataset with fully observable and controllable object and scene bias that truly requires spatiotemporal understanding to be solved, and provides insights into some of the most recent state-of-the-art deep video architectures.
Generating Videos of Zero-Shot Compositions of Actions and Objects
Introduces the task of generating human-object interaction videos in a zero-shot compositional setting, i.e., generating videos for action-object compositions that are unseen during training, having seen the target action and target object separately.
High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs
A new method for synthesizing high-resolution photo-realistic images from semantic label maps using conditional generative adversarial networks (conditional GANs) is presented, which significantly outperforms existing methods, advancing both the quality and the resolution of deep image synthesis and editing.
Image Generation from Scene Graphs
This work proposes a method for generating images from scene graphs, enabling explicit reasoning about objects and their relationships, and validates the approach on Visual Genome and COCO-Stuff.
MoCoGAN: Decomposing Motion and Content for Video Generation
This work introduces a novel adversarial learning scheme utilizing both image and video discriminators and shows that MoCoGAN can generate videos with the same content but different motion, as well as videos with different content and the same motion.
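The decomposition MoCoGAN describes — one content code held fixed across a clip, one motion code varying per frame — can be sketched in a few lines. This is a simplified stand-in (the actual model feeds these latents to a generator network and produces motion codes with a recurrent network); dimensions and names here are illustrative.

```python
import numpy as np

def sample_video_latents(n_frames, dim_content=8, dim_motion=4, rng=None):
    """Sample one content code shared by all frames and an independent
    motion code per frame, concatenated as [content | motion_t]."""
    rng = rng or np.random.default_rng()
    z_content = rng.normal(size=dim_content)            # fixed for the clip
    z_motion = rng.normal(size=(n_frames, dim_motion))  # varies per frame
    content_rows = np.tile(z_content, (n_frames, 1))    # repeat across frames
    return np.hstack([content_rows, z_motion])

z = sample_video_latents(16)  # shape (16, 12): 16 frames, 8+4 latent dims
```

Holding `z_content` fixed while resampling `z_motion` yields the "same content, different motion" videos the abstract mentions, and vice versa.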
Video-to-Video Synthesis
This paper proposes a novel video-to-video synthesis approach under the generative adversarial learning framework, capable of synthesizing 2K-resolution videos of street scenes up to 30 seconds long, which significantly advances the state of the art in video synthesis.
Deep Residual Learning for Image Recognition
This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence that these residual networks are easier to optimize and can gain accuracy from considerably increased depth.
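The core idea of residual learning is that each block computes y = relu(x + F(x)): the layers learn a residual F on top of an identity shortcut, so a block can default to (near-)identity by driving its weights toward zero. A minimal numpy sketch, with fully-connected layers standing in for the paper's convolutions:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W1, W2):
    """y = relu(x + F(x)), where F is a two-layer transform.
    The skip connection carries x around F unchanged."""
    out = relu(x @ W1)   # first layer of the residual branch
    out = out @ W2       # second layer (no activation before the add)
    return relu(x + out)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
# With zero weights the residual branch vanishes and the block
# reduces to relu(x) -- the identity on nonnegative inputs.
y = residual_block(x, np.zeros((8, 8)), np.zeros((8, 8)))
```

That near-identity default is why stacking many such blocks does not make optimization harder, which is the paper's central empirical claim.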
Adam: A Method for Stochastic Optimization
This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
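The "adaptive estimates of lower-order moments" are exponential moving averages of the gradient (first moment) and its square (second moment), each bias-corrected, with the update scaled by their ratio. A compact sketch of one update step, using the paper's standard default hyperparameters:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (t is the 1-based step count)."""
    m = beta1 * m + (1 - beta1) * grad       # biased first-moment estimate
    v = beta2 * v + (1 - beta2) * grad**2    # biased second-moment estimate
    m_hat = m / (1 - beta1**t)               # bias correction for m
    v_hat = v / (1 - beta2**t)               # bias correction for v
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize f(x) = x^2 starting from x = 3; converges toward the minimum at 0.
theta, m, v = np.array([3.0]), np.zeros(1), np.zeros(1)
for t in range(1, 2001):
    grad = 2 * theta
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.01)
```

Because the update divides by the second-moment estimate, the effective per-parameter step size is roughly bounded by `lr`, which is what makes Adam robust to gradient scale.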
Action Genome: Actions As Compositions of Spatio-Temporal Scene Graphs
This work introduces Action Genome, a representation that decomposes actions into spatio-temporal scene graphs, and demonstrates the utility of hierarchical event decomposition by enabling few-shot action recognition, achieving 42.7% mAP using as few as 10 examples.