Learning Intuitive Physics with Multimodal Generative Models

S. Rezaei-Shoshtari, Francois Robert Hogan, Michael R. M. Jenkin, David Meger, Gregory Dudek
Predicting the future interaction of objects when they come into contact with their environment is key for autonomous agents to take intelligent and anticipatory actions. This paper presents a perception framework that fuses visual and tactile feedback to make predictions about the expected motion of objects in dynamic scenes. Visual information captures object properties such as 3D shape and location, while tactile information provides critical cues about interaction forces and resulting… 
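The fusion idea described above can be sketched in a few lines. This is a minimal illustration, not the paper's actual architecture: the encoders, feature dimensions, and the `predict_motion` helper are all hypothetical stand-ins for the learned visual and tactile networks, and the weights here are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W, b):
    """Tiny linear encoder with a ReLU, standing in for a learned network."""
    return np.maximum(W @ x + b, 0.0)

# Hypothetical dimensions: 16-d visual features and 8-d tactile features,
# each mapped to a 12-d code, fused by concatenation, then decoded into a
# 3-d predicted object motion (e.g. dx, dy, dtheta).
W_v, b_v = rng.standard_normal((12, 16)), np.zeros(12)
W_t, b_t = rng.standard_normal((12, 8)), np.zeros(12)
W_out, b_out = rng.standard_normal((3, 24)), np.zeros(3)

def predict_motion(visual, tactile):
    z = np.concatenate([encode(visual, W_v, b_v), encode(tactile, W_t, b_t)])
    return W_out @ z + b_out

motion = predict_motion(rng.standard_normal(16), rng.standard_normal(8))
print(motion.shape)  # (3,)
```

The design choice illustrated is late fusion: each modality is encoded separately, so visual cues (shape, location) and tactile cues (contact forces) contribute complementary features before a shared head predicts the resulting motion.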

Figures and Tables from this paper

Learning Sequential Latent Variable Models from Multimodal Time Series Data

This work presents a self-supervised generative modelling framework to jointly learn a probabilistic latent state representation of multimodal data and the respective dynamics, and demonstrates that this method is nearly as effective as an existing supervised approach that relies on ground truth labels.

Geometric multimodal representation learning

This work surveys 140 studies in graph-centric AI and puts forward an algorithmic blueprint for multimodal graph learning based on this categorization, which serves as a way to group state-of-the-art architectures that treat multimodal data by appropriately choosing four different components.



Galileo: Perceiving Physical Object Properties by Integrating a Physics Engine with Deep Learning

This study points towards an account of human vision with generative physical knowledge at its core, and various recognition models as helpers leading to efficient inference.

Connecting Touch and Vision via Cross-Modal Prediction

This work investigates the cross-modal connection between vision and touch with a new conditional adversarial model that incorporates the scale and location information of the touch and demonstrates that the model can produce realistic visual images from tactile data and vice versa.

Making Sense of Vision and Touch: Self-Supervised Learning of Multimodal Representations for Contact-Rich Tasks

This work uses self-supervision to learn a compact and multimodal representation of sensory inputs, which can then be used to improve the sample efficiency of the policy learning of deep reinforcement learning algorithms.

Multi-Modal Geometric Learning for Grasping and Manipulation

This work provides an architecture that incorporates depth and tactile information to create rich and accurate 3D models useful for robotic manipulation tasks through the use of a 3D convolutional neural network (CNN).

Multimodal dynamics modeling for off-road autonomous vehicles

This study designs a model capable of long-horizon motion predictions, leveraging vision, lidar and proprioception, which is robust to arbitrarily missing modalities at test time, and demonstrates the importance of leveraging multiple sensors when doing dynamics modeling in outdoor conditions.

More Than a Feeling: Learning to Grasp and Regrasp Using Vision and Touch

An end-to-end action-conditional model that learns regrasping policies from raw visuo-tactile data and outperforms a variety of baselines at estimating grasp adjustment outcomes, selecting efficient grasp adjustments for quick grasping, and reducing the amount of force applied at the fingers, while maintaining competitive performance.

Seeing Through your Skin: Recognizing Objects with a Novel Visuotactile Sensor

The ability of the See-Through-your-Skin sensor to classify household objects, recognize fine textures, and infer their physical properties is validated both through numerical simulations and through experiments with a smart countertop prototype.

“Touching to See” and “Seeing to Feel”: Robotic Cross-modal Sensory Data Generation for Visual-Tactile Perception

A novel framework for the cross-modal sensory data generation for visual and tactile perception by applying conditional generative adversarial networks to generate pseudo visual images or tactile outputs from data of the other modality is proposed.

Connecting Look and Feel: Associating the Visual and Tactile Properties of Physical Materials

This work captures color and depth images of draped fabrics along with tactile data from a high-resolution touch sensor and seeks to associate the information from vision and touch by jointly training CNNs across the three modalities.

3D Shape Perception from Monocular Vision, Touch, and Shape Priors

This paper uses vision first, applying neural networks with learned shape priors to predict an object's 3D shape from a single-view color image, and then uses tactile sensing to refine the shape; the robot actively touches the object regions where the visual prediction has high uncertainty.