Connecting Touch and Vision via Cross-Modal Prediction

@inproceedings{Li2019ConnectingTA,
  title={Connecting Touch and Vision via Cross-Modal Prediction},
  author={Yunzhu Li and Jun-Yan Zhu and Russ Tedrake and Antonio Torralba},
  booktitle={2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2019}
}
  • Yunzhu Li, Jun-Yan Zhu, A. Torralba
  • Published 1 June 2019
  • Computer Science
  • 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Humans perceive the world using multi-modal sensory inputs such as vision, audition, and touch. In this work, we investigate the cross-modal connection between vision and touch. The main challenge in this cross-domain modeling task lies in the significant scale discrepancy between the two: while our eyes perceive an entire visual scene at once, humans can only feel a small region of an object at any given moment. To connect vision and touch, we introduce new tasks of synthesizing plausible… 
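The scale discrepancy the abstract highlights — a full visual scene versus a small contact region — is typically bridged by conditioning the tactile prediction on a local crop around the touch location. A minimal sketch of that cropping step; the window size and function name are illustrative assumptions, not the paper's exact pipeline:

```python
import numpy as np

def crop_touch_region(scene, center, size):
    # Extract the local visual patch around the touch location: the small
    # conditioning window that bridges full-scene vision and local touch.
    r, c = center
    h = size // 2
    return scene[max(r - h, 0):r + h + 1, max(c - h, 0):c + h + 1]
```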


Visual-Tactile Cross-Modal Data Generation Using Residue-Fusion GAN With Feature-Matching and Perceptual Losses

This letter proposes a deep-learning-based approach to cross-modal visual-tactile data generation: a conditional GAN (cGAN) augmented with a residue-fusion (RF) module and trained with additional feature-matching (FM) and perceptual losses.
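A training objective of this shape — an adversarial term plus feature-matching and perceptual terms — can be sketched as a weighted sum. The weights, the non-saturating adversarial form, and the function names below are illustrative assumptions, not the letter's exact formulation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def feature_matching_loss(real_feats, fake_feats):
    # Mean L1 distance between paired feature maps, averaged over layers.
    return float(np.mean([np.mean(np.abs(r - f))
                          for r, f in zip(real_feats, fake_feats)]))

def generator_loss(fake_logits, real_d_feats, fake_d_feats,
                   real_vgg_feats, fake_vgg_feats,
                   lam_fm=10.0, lam_perc=10.0):
    # Non-saturating adversarial term: push D's logits on fakes toward "real".
    adv = -float(np.mean(np.log(sigmoid(fake_logits) + 1e-8)))
    fm = feature_matching_loss(real_d_feats, fake_d_feats)        # FM term on D features
    perc = feature_matching_loss(real_vgg_feats, fake_vgg_feats)  # perceptual term, same form on VGG features
    return adv + lam_fm * fm + lam_perc * perc
```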

Play it by Ear: Learning Skills amidst Occlusion through Audio-Visual Imitation Learning

A system that can complete a set of challenging, partially-observed tasks on a Franka Emika Panda robot, like extracting keys from a bag, with a 70% success rate, 50% higher than a policy that does not use audio.

Cross-Modal Environment Self-Adaptation During Object Recognition in Artificial Cognitive Systems

A cross-modal learning-transfer mechanism is created that can detect both sudden and permanent anomalies in the visual channel while still maintaining visual object recognition performance, by retraining the visual mode for a few minutes using haptic information.

Multimodal perception for dexterous manipulation

A spatio-temporal attention model for tactile texture recognition is proposed that takes both spatial features and the time dimension into consideration: it attends to the salient parts of each spatial feature map and also models the temporal correlations across time steps.

Generative Partial Visual-Tactile Fused Object Clustering

A conditional cross-modal clustering generative adversarial network is developed to synthesize one modality conditioned on the other, which can compensate for missing samples and naturally align the visual and tactile modalities through adversarial learning.

Learning Intuitive Physics with Multimodal Generative Models

This paper presents a perception framework that fuses visual and tactile feedback to make predictions about the expected motion of objects in dynamic scenes, using a novel See-Through-your-Skin sensor that provides high resolution multimodal sensing of contact surfaces.

Toward Image-to-Tactile Cross-Modal Perception for Visually Impaired People

A new generative adversarial network (GAN) model is developed to transform ground images into tactile signals that can be displayed by an off-the-shelf vibration device, providing visually impaired people with effective perception of their surroundings.

Teaching Cameras to Feel: Estimating Tactile Physical Properties of Surfaces From Images

This work introduces the challenging task of estimating a set of tactile physical properties from visual information and develops a cross-modal framework comprising an adversarial objective and a novel visuo-tactile joint classification loss.

Multi-modal self-adaptation during object recognition in an artificial cognitive system

This work creates a multimodal learning transfer mechanism capable of both detecting sudden and permanent anomalies in the visual channel and maintaining visual object recognition performance by retraining the visual mode for a few minutes using haptic information.

Learning the signatures of the human grasp using a scalable tactile glove

Tactile patterns obtained from a scalable sensor-embedded glove and deep convolutional neural networks help to explain how the human hand can identify and grasp individual objects and estimate their weights.

Connecting Look and Feel: Associating the Visual and Tactile Properties of Physical Materials

This work captures color and depth images of draped fabrics along with tactile data from a high-resolution touch sensor and seeks to associate the information from vision and touch by jointly training CNNs across the three modalities.
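Jointly training CNNs across modalities typically means embedding each modality into a shared space and matching by similarity. A toy sketch, assuming each per-modality CNN already produces a d-dimensional vector (the CNNs themselves are omitted):

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def cross_modal_match(visual_emb, tactile_emb):
    # Cosine-similarity matrix between the two embedding sets; the argmax per
    # row picks, for each visual sample, its best-matching tactile sample.
    sim = l2_normalize(visual_emb) @ l2_normalize(tactile_emb).T
    return sim.argmax(axis=1)
```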

Making Sense of Vision and Touch: Self-Supervised Learning of Multimodal Representations for Contact-Rich Tasks

This work uses self-supervision to learn a compact multimodal representation of sensory inputs, which can then be used to improve the sample efficiency of policy learning in deep reinforcement learning.

Scribbler: Controlling Deep Image Synthesis with Sketch and Color

A deep adversarial image-synthesis architecture conditioned on sketched boundaries and sparse color strokes is proposed that generates realistic cars, bedrooms, and faces, and a sketch-based image synthesis system is demonstrated that lets users scribble over the sketch to indicate preferred colors for objects.

Cross-Modal Scene Networks

Experiments suggest that the scene representation can help transfer representations across modalities for retrieval, and visualizations show that units emerge in the shared representation that tend to activate on consistent concepts independently of modality.

The Feeling of Success: Does Touch Sensing Help Predict Grasp Outcomes?

This work investigates whether touch sensing aids in predicting grasp outcomes within a multimodal sensing framework that combines vision and touch, evaluating visuo-tactile deep neural network models that predict grasp outcomes directly from each modality individually and from both modalities together.

High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs

A new method for synthesizing high-resolution photo-realistic images from semantic label maps using conditional generative adversarial networks (conditional GANs) is presented, which significantly outperforms existing methods, advancing both the quality and the resolution of deep image synthesis and editing.

Multimodal Deep Learning

This work presents a series of tasks for multimodal learning and shows how to train deep networks that learn features to address these tasks, and demonstrates cross modality feature learning, where better features for one modality can be learned if multiple modalities are present at feature learning time.

Shape-independent hardness estimation using deep learning and a GelSight tactile sensor

This work introduces a novel method for hardness estimation based on the GelSight tactile sensor and shows that the neural network model can estimate the hardness of objects with different shapes, ranging from 8 to 87 on the Shore 00 scale.
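Mapping a press sequence of tactile frames to a bounded hardness value can be sketched with a simple stand-in for the learned model; the pooling, linear head, and squashing into the Shore 00 range [0, 100] are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

def predict_hardness(frames, w, b):
    # Toy stand-in for the learned model: pool each tactile frame spatially,
    # average over the press sequence, apply a linear head, and squash the
    # result into the Shore 00 range [0, 100].
    feats = np.stack([f.mean(axis=0) for f in frames])
    pooled = feats.mean(axis=0)
    raw = float(pooled @ w + b)
    return 100.0 / (1.0 + np.exp(-raw))
```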

Colorful Image Colorization

This paper proposes a fully automatic approach to colorization that produces vibrant and realistic colorizations and shows that colorization can be a powerful pretext task for self-supervised feature learning, acting as a cross-channel encoder.
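Acting as a cross-channel encoder amounts to splitting the input into a predicted-from channel (lightness L) and target channels (ab), and quantizing the target so prediction becomes per-pixel classification. This sketch uses a uniform ab grid for simplicity rather than the paper's in-gamut bin set:

```python
import numpy as np

def cross_channel_split(lab_image):
    # Cross-channel encoder setup: lightness L is the network input and the
    # ab channels are the prediction target.
    return lab_image[..., :1], lab_image[..., 1:]

def quantize_ab(ab, grid=10):
    # Bin ab values (assumed to lie in [-110, 110]) onto a uniform grid so
    # colorization becomes classification over discrete color bins.
    n = 220 // grid
    bins = np.clip(((ab + 110) // grid).astype(int), 0, n - 1)
    return bins[..., 0] * n + bins[..., 1]
```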