Corpus ID: 220968856

Learning Long-term Visual Dynamics with Region Proposal Interaction Networks

Haozhi Qi, Xiaolong Wang, Deepak Pathak, Yi Ma, Jitendra Malik
Learning long-term dynamics models is the key to understanding physical common sense. Most existing approaches on learning dynamics from visual input sidestep long-term predictions by resorting to rapid re-planning with short-term models. This not only requires such models to be super accurate but also limits them only to tasks where an agent can continuously obtain feedback and take action at each step until completion. In this paper, we aim to leverage the ideas from success stories in visual…
Interactive Fusion of Multi-level Features for Compositional Activity Recognition
This paper presents a novel framework that accomplishes interactive fusion by projecting features across different spaces and guiding the fusion with an auxiliary prediction task, achieving consistent accuracy gains over off-the-shelf action recognition algorithms.
Dynamic Visual Reasoning by Learning Differentiable Physics Models from Video and Language
This work proposes a unified framework that jointly learns visual concepts and infers physics models of objects and their interactions from videos and language by integrating three components: a visual perception module, a concept learner, and a differentiable physics engine.
Physion: Evaluating Physical Prediction from Vision in Humans and Machines
The authors demonstrate how this benchmark can identify areas for improvement and measure progress on this key aspect of physical understanding, and show that among models that receive only visual input, those with object-centric representations or pretraining do best but fall far short of human accuracy.
Neural Production Systems
This work takes inspiration from cognitive science and resurrects a classic approach, production systems, which consist of a set of rule templates that are applied by binding placeholder variables in the rules to specific entities. This achieves a flexible, dynamic flow of control and serves to factorize entity-specific and rule-based information.
Object-centric Video Prediction without Annotation
This work presents Object-centric Prediction without Annotation, an object-centric video prediction method that takes advantage of priors from powerful computer vision models, and shows how to adapt a perception model to an environment through end-to-end video prediction training.
Online Learning of Unknown Dynamics for Model-Based Controllers in Legged Locomotion
This work proposes to learn a time-varying, locally linear residual model along the robot's current trajectory to compensate for the prediction errors of the controller's model.
Physical Reasoning Using Dynamics-Aware Models
This study defines a distance measure between the trajectories of two target objects, uses it to characterize the similarity of two environment rollouts, and trains the model to correctly rank rollouts according to this measure in addition to predicting the correct reward.
Robot Learning from Observations
  • 2021
In this article we present a view that robot learning should make use of observational data in addition to learning through interaction with the world. We hypothesize that the acquisition of…
Forward Prediction for Physical Reasoning
Forward-prediction models are found to improve the performance of physical-reasoning agents, particularly on complex tasks that involve many objects. However, these improvements are contingent on the training tasks being similar to the test tasks, and generalization to different tasks is more challenging.


Entity Abstraction in Visual Model-Based Reinforcement Learning
This work presents object-centric perception, prediction, and planning (OP3), the first fully probabilistic entity-centric dynamic latent variable framework for model-based reinforcement learning that acquires entity representations from raw visual observations without supervision and uses them to predict and plan.
Visual Interaction Networks: Learning a Physics Simulator from Video
This work introduces the Visual Interaction Network, a general-purpose model for learning the dynamics of a physical system from raw visual observations, consisting of a perceptual front-end based on convolutional neural networks and a dynamics predictor based on interaction networks.
Reasoning About Physical Interactions with Object-Oriented Prediction and Planning
This work presents a paradigm for learning object-centric representations for physical scene understanding without direct supervision of object properties; the resulting model can use its learned representations to build block towers more complicated than those observed during training.
Structured Object-Aware Physics Prediction for Video Modeling and Planning
This work presents STOVE, a novel state-space model for videos that explicitly reasons about objects and their positions, velocities, and interactions. It outperforms previous unsupervised models and even approaches the performance of supervised baselines.
Unsupervised Learning for Physical Interaction through Video Prediction
This work develops an action-conditioned video prediction model that explicitly models pixel motion by predicting a distribution over pixel motion from previous frames. The model is partially invariant to object appearance, enabling it to generalize to previously unseen objects.
Deep visual foresight for planning robot motion
This work develops a method for combining deep action-conditioned video prediction models with model-predictive control. The method uses entirely unlabeled training data, enables a real robot to perform nonprehensile manipulation (pushing objects), and can handle novel objects not seen during training.
Visual Dynamics: Probabilistic Future Frame Synthesis via Cross Convolutional Networks
This work proposes a novel approach that models future frames probabilistically, using a Cross Convolutional Network to synthesize them; the network encodes image and motion information as feature maps and convolutional kernels, respectively.
Learning Visual Predictive Models of Physics for Playing Billiards
This paper explores how an agent can be equipped with an internal model of the dynamics of the external world, and how it can use this model to plan novel actions by running multiple internal simulations ("visual imagination").
From Pixels to Torques: Policy Learning with Deep Dynamical Models
This paper introduces a data-efficient, model-based reinforcement learning algorithm that learns a closed-loop control policy from pixel information only, and facilitates fully autonomous learning from pixels to torques.
Learning to Poke by Poking: Experiential Learning of Intuitive Physics
This work proposes a novel approach based on deep neural networks for modeling the dynamics of a robot's interactions directly from images by jointly estimating forward and inverse models of dynamics.