ProcTHOR: Large-Scale Embodied AI Using Procedural Generation
@article{Deitke2022ProcTHORLE,
  title   = {ProcTHOR: Large-Scale Embodied AI Using Procedural Generation},
  author  = {Matt Deitke and Eli VanderBilt and Alvaro Herrasti and Luca Weihs and Jordi Salvador and Kiana Ehsani and Winson Han and Eric Kolve and Ali Farhadi and Aniruddha Kembhavi and Roozbeh Mottaghi},
  journal = {ArXiv},
  year    = {2022},
  volume  = {abs/2206.06994}
}
Massive datasets and high-capacity models have driven many recent advancements in computer vision and natural language understanding. This work presents a platform to enable similar success stories in Embodied AI. We propose ProcTHOR, a framework for procedural generation of Embodied AI environments. ProcTHOR enables us to sample arbitrarily large datasets of diverse, interactive, customizable, and performant virtual environments to train and evaluate embodied agents across navigation…
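For a concrete sense of the workflow, here is a minimal sketch of sampling a generated house and stepping an agent inside it. It assumes the publicly released `prior` and `ai2thor` Python packages; exact APIs may differ across versions.

```python
import prior
from ai2thor.controller import Controller

# Download the ProcTHOR-10K dataset of procedurally generated houses.
dataset = prior.load_dataset("procthor-10k")
house = dataset["train"][0]  # each entry fully specifies one house

# Load the sampled house into AI2-THOR and take a navigation step.
controller = Controller(scene=house)
event = controller.step(action="MoveAhead")
print(event.metadata["agent"]["position"])
```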
Figures and Tables from this paper
The paper includes 39 figures and 11 tables.
37 Citations
Scene Augmentation Methods for Interactive Embodied AI Tasks
- Computer Science · IEEE Transactions on Instrumentation and Measurement
- 2023
A scene augmentation strategy is presented that scales up scene diversity for interactive tasks and makes interactions more like the real world, together with a systematic generalization analysis that uses the proposed methods to explicitly estimate agents' ability to generalize to new layouts, new objects, and new object states.
GenAug: Retargeting behaviors to unseen situations via Generative Augmentation
- Computer Science · ArXiv
- 2023
This work shows how pre-trained generative models can serve as effective tools for semantically meaningful data augmentation, and proposes GenAug, a system that generates appropriate "semantic" augmentations and significantly improves policy generalization.
VIMA: General Robot Manipulation with Multimodal Prompts
- Computer Science · ArXiv
- 2022
It is shown that a wide spectrum of robot manipulation tasks can be expressed with multimodal prompts that interleave textual and visual tokens, and a transformer-based robot agent, VIMA, is designed to process these prompts and output motor actions autoregressively.
Where are we in the search for an Artificial Visual Cortex for Embodied Intelligence?
- Computer Science · ArXiv
- 2023
The largest and most comprehensive empirical study of pre-trained visual representations (PVRs) or visual 'foundation models' for Embodied AI is presented, and it is found that scaling dataset size and diversity does not improve performance universally (but does so on average).
A General Purpose Supervisory Signal for Embodied Agents
- Computer Science · ArXiv
- 2022
The Scene Graph Contrastive (SGC) loss is proposed, which uses scene graphs as general-purpose, training-only, supervisory signals, and uses contrastive learning to align an agent's representation with a rich graphical encoding of its environment.
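As a rough illustration of the underlying mechanism (not the paper's exact formulation), a contrastive alignment objective of this kind can be sketched as a symmetric InfoNCE loss between agent-state embeddings and scene-graph embeddings; all names and shapes below are hypothetical.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(agent_emb, graph_emb, temperature=0.07):
    """agent_emb, graph_emb: (batch, dim); row i of each comes from the
    same environment, so matching pairs lie on the diagonal."""
    a = F.normalize(agent_emb, dim=-1)
    g = F.normalize(graph_emb, dim=-1)
    logits = a @ g.t() / temperature  # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    # Symmetric InfoNCE: each agent state should pick out its own scene
    # graph, and each scene graph its own agent state.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```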
Phone2Proc: Bringing Robust Robots Into Our Chaotic World
- Computer Science · ArXiv
- 2022
Phone2Proc, a method that uses a 10-minute phone scan and conditional procedural generation to create a distribution of training scenes that are semantically similar to the target environment, makes agents remarkably robust to changes in the real world, such as human movement, object rearrangement, lighting changes, or clutter.
CoWs on Pasture: Baselines and Benchmarks for Language-Driven Zero-Shot Object Navigation
- Computer Science
- 2022
This work investigates a straightforward framework, CLIP on Wheels (CoW), to adapt open-vocabulary models to this task without fine-tuning, and introduces the PASTURE benchmark, which considers uncommon objects, objects described by spatial and appearance attributes, and hidden objects described relative to visible objects.
Human-Timescale Adaptation in an Open-Ended Task Space
- Computer Science, Psychology · ArXiv
- 2023
It is demonstrated that training an RL agent at scale leads to a general in-context learning algorithm that can adapt to open-ended novel embodied 3D problems as quickly as humans.
Towards A Foundation Model for Generalist Robots: Diverse Skill Learning at Scale via Automated Task and Scene Generation
- Computer Science
- 2023
This document presents a concrete proposal for mining knowledge from the latest large-scale foundation models for robotics research, advocating their use to generate diversified tasks and scenes at scale, thereby scaling up low-level skill learning and ultimately leading to a foundation model for robotics that empowers generalist robots.
Objaverse: A Universe of Annotated 3D Objects
- Computer Science · ArXiv
- 2022
The large potential of Objaverse is demonstrated via four diverse applications: training generative 3D models, improving tail category segmentation on the LVIS benchmark, training open-vocabulary object-navigation models for Embodied AI, and creating a new benchmark for robustness analysis of vision models.
129 References
AllenAct: A Framework for Embodied AI Research
- Computer Science · ArXiv
- 2020
AllenAct is introduced: a modular and flexible learning framework designed around the unique requirements of Embodied AI research, providing first-class support for a growing collection of embodied environments, tasks, and algorithms.
RoboTHOR: An Open Simulation-to-Real Embodied AI Platform
- Computer Science · 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2020
RoboTHOR offers a framework of simulated environments paired with physical counterparts to systematically explore and overcome the challenges of simulation-to-real transfer, and a platform where researchers across the globe can remotely test their embodied models in the physical world.
Simple but Effective: CLIP Embeddings for Embodied AI
- Computer Science · 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2022
One of the baselines is extended to produce an agent capable of zero-shot object navigation, i.e., navigating to objects that were not used as targets during training; it beats the winners of the 2021 Habitat ObjectNav Challenge, which employ auxiliary tasks, depth maps, and human demonstrations, as well as those of the 2019 Habitat PointNav Challenge.
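The core recipe, a frozen CLIP visual encoder feeding a small trainable policy head, can be sketched as follows. This assumes OpenAI's `clip` package, and the policy head is an illustrative placeholder rather than the paper's architecture.

```python
import clip
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50", device=device)
for p in model.parameters():        # keep the CLIP backbone frozen
    p.requires_grad = False

policy_head = nn.Sequential(        # hypothetical lightweight policy head
    nn.Linear(1024, 512),           # CLIP RN50 image embeddings are 1024-d
    nn.ReLU(),
    nn.Linear(512, 6),              # e.g. 6 discrete navigation actions
).to(device)

def act(image):
    """image: a PIL image from the agent's camera."""
    with torch.no_grad():
        feats = model.encode_image(preprocess(image).unsqueeze(0).to(device))
    return policy_head(feats.float()).argmax(dim=-1)
```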
Habitat: A Platform for Embodied AI Research
- Computer Science · 2019 IEEE/CVF International Conference on Computer Vision (ICCV)
- 2019
The comparison between learning and SLAM approaches from two recent works is revisited, and evidence is found that learning outperforms SLAM when scaled to an order of magnitude more experience than in previous investigations; the first cross-dataset generalization experiments are also conducted.
Flamingo: a Visual Language Model for Few-Shot Learning
- Computer Science · NeurIPS
- 2022
This work introduces Flamingo, a family of Visual Language Models (VLMs) with the ability to bridge powerful pretrained vision-only and language-only models, handle sequences of arbitrarily interleaved visual and textual data, and seamlessly ingest images or videos as inputs.
ManipulaTHOR: A Framework for Visual Object Manipulation
- Computer Science · 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2021
This work proposes a framework for object manipulation built upon the physics-enabled, visually rich AI2-THOR framework and presents a new challenge to the Embodied AI community known as ArmPointNav, which extends the popular point navigation task to object manipulation and offers new challenges including 3D obstacle avoidance.
Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents
- Computer Science · ICML
- 2022
This paper investigates the possibility of grounding high-level tasks, expressed in natural language, to a chosen set of actionable steps and proposes a procedure that conditions on existing demonstrations and semantically translates the plans to admissible actions.
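The semantic-translation step can be approximated by mapping each free-form plan step to the most similar admissible action in embedding space; the sketch below assumes the `sentence-transformers` package and a hypothetical action list.

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")
admissible_actions = ["walk to the kitchen", "open the fridge",
                      "grab the milk", "close the fridge"]
action_embs = encoder.encode(admissible_actions, convert_to_tensor=True)

def translate(plan_step: str) -> str:
    """Map a free-form LM plan step to the closest admissible action."""
    step_emb = encoder.encode(plan_step, convert_to_tensor=True)
    scores = util.cos_sim(step_emb, action_embs)[0]
    return admissible_actions[int(scores.argmax())]

# Maps a free-form step to one of the admissible actions above.
print(translate("go get some milk from the refrigerator"))
```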
Visual Room Rearrangement
- Computer Science · 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2021
The experiments show that solving this challenging interactive task, which involves both navigation and object interaction, is beyond the capabilities of current state-of-the-art techniques for embodied tasks, and that perfect performance on tasks of this type remains far out of reach.
3D Neural Scene Representations for Visuomotor Control
- Computer Science · ArXiv
- 2021
This work shows that a dynamics model, constructed over the learned representation space, enables visuomotor control for challenging manipulation tasks involving both rigid bodies and fluids, where the target is specified in a viewpoint different from what the robot operates on.
RLBench: The Robot Learning Benchmark & Learning Environment
- Computer Science · IEEE Robotics and Automation Letters
- 2020
This large-scale benchmark aims to accelerate progress in a number of vision-guided manipulation research areas, including: reinforcement learning, imitation learning, multi-task learning, geometric computer vision, and in particular, few-shot learning.