Corpus ID: 235765661

3D Neural Scene Representations for Visuomotor Control

@article{Li20213DNS,
  title={3D Neural Scene Representations for Visuomotor Control},
  author={Yunzhu Li and Shuang Li and Vincent Sitzmann and Pulkit Agrawal and Antonio Torralba},
  journal={ArXiv},
  year={2021},
  volume={abs/2107.04004}
}
Humans have a strong intuitive understanding of the 3D environment around us. The mental model of the physics in our brain applies to objects of different materials and enables us to perform a wide range of manipulation tasks that are far beyond the reach of current robots. In this work, we desire to learn models for dynamic 3D scenes purely from 2D visual observations. Our model combines Neural Radiance Fields (NeRF) and time contrastive learning with an autoencoding framework, which learns… 
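
The abstract names three ingredients: a NeRF-style volumetric renderer, a time contrastive objective, and an autoencoding framework trained purely from 2D observations. The sketch below illustrates one way such losses could be combined; the module names (encoder, nerf_decoder) and batch keys are hypothetical placeholders, not the authors' implementation.

```python
# Illustrative sketch only: combines a NeRF-style photometric loss with a
# time contrastive loss on latent scene codes. Module names and batch keys
# are hypothetical placeholders, not the authors' implementation.
import torch.nn.functional as F

def time_contrastive_loss(anchor, positive, negative, margin=1.0):
    # Triplet-style objective: embeddings of simultaneous views from
    # different cameras are pulled together, embeddings from other
    # time steps are pushed apart.
    d_pos = (anchor - positive).pow(2).sum(dim=-1)
    d_neg = (anchor - negative).pow(2).sum(dim=-1)
    return F.relu(d_pos - d_neg + margin).mean()

def training_step(encoder, nerf_decoder, batch):
    # Encode 2D observations into latent scene representations.
    z_view1 = encoder(batch["rgb_t_view1"])      # time t, camera 1
    z_view2 = encoder(batch["rgb_t_view2"])      # time t, camera 2
    z_other = encoder(batch["rgb_other_time"])   # a different time step

    # Autoencoding / rendering term: decode the latent with a NeRF-style
    # volumetric renderer at a query camera and compare to ground truth.
    rendered = nerf_decoder(z_view1, batch["query_rays"])
    loss_render = F.mse_loss(rendered, batch["query_rgb"])

    # Time contrastive term encourages viewpoint-invariant but
    # time-discriminative latents.
    loss_tc = time_contrastive_loss(z_view1, z_view2, z_other)

    return loss_render + loss_tc
```

A latent dynamics model for control would operate on these scene codes, but since the truncated abstract does not spell out its form, it is omitted from the sketch.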

Dynamical Scene Representation and Control with Keypoint-Conditioned Neural Radiance Field

TLDR
A method that learns to model dynamic, arbitrary 3D scenes purely from 2D visual observations using a keypoint-conditioned Neural Radiance Field (KP-NeRF), with the overarching goal of supporting image-based robot manipulation.

Learning Multi-Object Dynamics with Compositional Neural Radiance Fields

TLDR
A key feature of this approach is that the 3D scene information learned through the NeRF model makes it possible to incorporate structural priors into the dynamics models, making long-term predictions more stable.

K-VIL: Keypoints-based Visual Imitation Learning

TLDR
An approach that automatically extracts sparse, object-centric, and embodiment-independent task representations from a small number of human demonstration videos, and introduces a novel keypoint-based admittance controller that reproduces manipulation skills in novel scenes using the learned set of prioritized geometric constraints.

Neural Descriptor Fields: SE(3)-Equivariant Object Representations for Manipulation

We present Neural Descriptor Fields (NDFs), an object representation that encodes both points and relative poses between an object and a target (such as a robot gripper or a rack used for hanging)

Neural Fields in Visual Computing and Beyond

TLDR
A review of the literature on neural fields shows the breadth of topics already covered in visual computing, both historically and in current incarnations, and highlights the improved quality, flexibility, and capability brought by neural field methods.

DiffSkill: Skill Abstraction from Differentiable Physics for Deformable Object Manipulations with Tools

TLDR
This work proposes a novel framework, DiffSkill, that uses a differentiable physics simulator for skill abstraction to solve long-horizon deformable object manipulation tasks from sensory observations, and shows the advantages of the method over previous reinforcement learning algorithms and a trajectory optimizer on a new set of sequential deformable object manipulation tasks.

ProcTHOR: Large-Scale Embodied AI Using Procedural Generation

TLDR
The proposed framework, ProcTHOR, procedurally generates Embodied AI environments, enabling the sampling of arbitrarily large datasets of diverse, interactive, customizable, and performant virtual environments to train and evaluate embodied agents across navigation, interaction, and manipulation tasks.

Learning Continuous Grasping Function with a Dexterous Hand from Human Demonstrations

TLDR
Compared to previous planning algorithms, the proposed Continuous Grasping Function is more efficient and achieves a significantly higher success rate when transferred to grasping with the real Allegro Hand.

References

SHOWING 1-10 OF 63 REFERENCES

3D-OES: Viewpoint-Invariant Object-Factorized Environment Simulators

TLDR
An action-conditioned dynamics model that predicts scene changes caused by object and agent interactions in a viewpoint-invariant 3D neural scene representation space, inferred from RGB-D videos, outperforming existing 2D and 3D dynamics models.

Visual Grounding of Learned Physical Models

TLDR
A neural model is presented that simultaneously reasons about physics and makes future predictions based on visual and dynamics priors; it can infer physical properties from a few observations, which allows it to quickly adapt to unseen scenarios and make accurate predictions into the future.

Learning to Poke by Poking: Experiential Learning of Intuitive Physics

TLDR
A novel approach based on deep neural networks is proposed for modeling the dynamics of a robot's interactions directly from images, by jointly estimating forward and inverse models of dynamics.
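
As an illustration of the joint forward/inverse formulation mentioned above, here is a minimal sketch that trains both models on latent state features; the state/action dimensions and MSE losses are simplifying assumptions, not the paper's actual image-based design.

```python
# Minimal sketch of jointly training forward and inverse dynamics models
# on latent features. Dimensions and losses are illustrative assumptions.
import torch
import torch.nn as nn

class JointDynamics(nn.Module):
    def __init__(self, state_dim=128, action_dim=4, hidden=256):
        super().__init__()
        # Forward model: (state_t, action_t) -> predicted state_{t+1}
        self.forward_model = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim))
        # Inverse model: (state_t, state_{t+1}) -> predicted action_t
        self.inverse_model = nn.Sequential(
            nn.Linear(2 * state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim))

    def loss(self, s_t, a_t, s_next):
        # Train both models jointly by summing their prediction errors.
        s_pred = self.forward_model(torch.cat([s_t, a_t], dim=-1))
        a_pred = self.inverse_model(torch.cat([s_t, s_next], dim=-1))
        return (nn.functional.mse_loss(s_pred, s_next)
                + nn.functional.mse_loss(a_pred, a_t))
```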

Keypoints into the Future: Self-Supervised Correspondence in Model-Based Reinforcement Learning

TLDR
This work introduces model-based prediction with self-supervised visual correspondence learning and shows not only that this is possible, but that such predictive models deliver compelling performance improvements over alternative methods for vision-based RL that rely on autoencoder-style vision training.

GRF: Learning a General Radiance Field for 3D Scene Representation and Rendering

TLDR
The key to the approach is to explicitly integrate the principles of multi-view geometry to obtain internal representations from observed 2D views, guaranteeing that the learned implicit representations are meaningful and multi-view consistent.

Neural Radiance Flow for 4D View Synthesis and Video Processing

TLDR
This work uses a neural implicit representation that learns to capture the 3D occupancy, radiance, and dynamics of the scene, and demonstrates that the learned representation can serve as an implicit scene prior, enabling video processing tasks such as image super-resolution and de-noising without any additional supervision.

Visual Foresight: Model-Based Deep Reinforcement Learning for Vision-Based Robotic Control

TLDR
It is demonstrated that visual MPC can generalize to never-before-seen objects, both rigid and deformable, and solve a range of user-defined object manipulation tasks using the same model.

Self-Supervised Visual Planning with Temporal Skip Connections

TLDR
This work introduces a video prediction model that can keep track of objects through occlusion by incorporating temporal skip-connections and demonstrates that this model substantially outperforms prior work on video prediction-based control.

The Surprising Effectiveness of Linear Models for Visual Foresight in Object Pile Manipulation

TLDR
A linear model works surprisingly well for pushing piles of small objects into a desired target set using visual feedback, achieving lower prediction error, better downstream task performance, and better generalization to new environments than deep models trained on the same amount of data.

Neural scene representation and rendering

TLDR
The Generative Query Network (GQN) is introduced, a framework within which machines learn to represent scenes using only their own sensors, demonstrating representation learning without human labels or domain knowledge.
...