Visuomotor Control in Multi-Object Scenes Using Object-Aware Representations

Authors: Negin Heravi, Ayzaan Wahid, Corey Lynch, Peter R. Florence, Travis Armstrong, Jonathan Tompson, Pierre Sermanet, Jeannette Bohg, Debidatta Dwibedi

Abstract. Perceptual understanding of the scene and the relationship between its different components is important for successful completion of robotic tasks. Representation learning has been shown to be a powerful technique for this, but most of the current methodologies learn task-specific representations that do not necessarily transfer well to other tasks. Furthermore, representations learned by supervised methods require large labeled datasets for each task that are expensive to collect…

Planning with Spatial-Temporal Abstraction from Point Clouds for Deformable Object Manipulation

This paper proposes PlAnning with Spatial and Temporal Abstraction (PASTA), which incorporates both spatial abstraction (reasoning about objects and their relations to each other) and temporal abstraction (reasoning over skills instead of low-level actions).



Deep Object-Centric Representations for Generalizable Robot Learning

This paper proposes an object-centric prior and a semantic feature space for the perception system of a learned policy. Together they can be used to determine the relevant objects from a few trajectories or demonstrations and then immediately incorporate those objects into the learned policy.

Grasp2Vec: Learning Object Representations from Self-Supervised Grasping

This paper studies how to acquire effective object-centric representations for robotic manipulation tasks without human labeling, through self-supervised, autonomous robot interaction with the environment.

Dense Object Nets: Learning Dense Visual Object Descriptors By and For Robotic Manipulation

Dense Object Nets, which build on recent developments in self-supervised dense descriptor learning, are presented as a consistent object representation for visual understanding and manipulation, and it is demonstrated that they can be trained quickly for a wide variety of previously unseen and potentially non-rigid objects.

Learning Actionable Representations from Visual Observations

This work shows that the representations learned by agents observing themselves take random actions, or observing other agents perform tasks successfully, can enable the learning of continuous control policies with algorithms such as Proximal Policy Optimization, using only the learned embeddings as input.

SORNet: Spatial Object-Centric Representations for Sequential Manipulation

This work proposes SORNet, a framework for learning object-centric representations from RGB images conditioned on a set of object queries, represented as image patches called canonical object views, and evaluates it on various spatial reasoning tasks such as spatial relation classification and relative direction regression.

Time-Contrastive Networks: Self-Supervised Learning from Video

A self-supervised approach for learning representations and robotic behaviors entirely from unlabeled videos recorded from multiple viewpoints is proposed, and it is demonstrated that this representation can be used by a robot to directly mimic human poses without an explicit correspondence, and that it can be used as a reward function within a reinforcement learning algorithm.

Unsupervised Learning of Object Keypoints for Perception and Control

Transporter is introduced, a neural network architecture for discovering concise geometric object representations in terms of keypoints or image-space coordinates that helps track objects and object parts across long time-horizons more accurately than recent similar methods.

Deep spatial autoencoders for visuomotor learning

This work presents an approach that automates state-space construction by learning a state representation directly from camera images, using a deep spatial autoencoder to acquire a set of feature points that describe the environment for the current task, such as the positions of objects.
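
The summary above does not say how feature points are read out of an image. As a minimal sketch, assuming the feature points are computed with a spatial soft-argmax (the softmax-weighted mean pixel position of one feature-map channel), the idea can be illustrated as follows. The function name `spatial_soft_argmax`, the `temperature` parameter, and the normalization of coordinates to [-1, 1] are illustrative assumptions, not details taken from this page.

```python
import math


def spatial_soft_argmax(feature_map, temperature=1.0):
    """Return the expected (x, y) position of one channel's activations.

    feature_map: non-empty 2D list of floats, indexed [row][column].
    Coordinates are normalized to [-1, 1] (an assumed convention).
    """
    height = len(feature_map)
    width = len(feature_map[0])

    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(max(row) for row in feature_map)
    weights = [
        [math.exp((value - peak) / temperature) for value in row]
        for row in feature_map
    ]
    total = sum(sum(row) for row in weights)

    def coord(index, size):
        # Map a pixel index to [-1, 1]; a single-pixel axis maps to 0.
        return 0.0 if size == 1 else 2.0 * index / (size - 1) - 1.0

    # Softmax-weighted average of column (x) and row (y) coordinates.
    x = sum(w * coord(j, width) for row in weights for j, w in enumerate(row)) / total
    y = sum(w * coord(i, height) for i, row in enumerate(weights) for w in row) / total
    return x, y


# Hypothetical 3x3 channel with one strong activation in the bottom-right corner.
corner_map = [
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 50.0],
]
point = spatial_soft_argmax(corner_map)
```

Applied to each channel of a convolutional feature map, a readout like this would give one 2D point per channel, which is one way a low-dimensional set of feature points such as object positions could be obtained.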

Keypoints into the Future: Self-Supervised Correspondence in Model-Based Reinforcement Learning

This work combines model-based prediction with self-supervised visual correspondence learning, and shows that this combination is not only possible but also yields compelling performance improvements over alternative methods for vision-based RL with autoencoder-type vision training.

Object-Centric Learning with Slot Attention

This work presents an architectural component that interfaces with perceptual representations, such as the output of a convolutional neural network, and produces a set of task-dependent abstract representations. These representations are exchangeable and can bind to any object in the input by specializing through a competitive procedure over multiple rounds of attention.