ProcTHOR: Large-Scale Embodied AI Using Procedural Generation

Matt Deitke, Eli VanderBilt, Alvaro Herrasti, Luca Weihs, Jordi Salvador, Kiana Ehsani, Winson Han, Eric Kolve, Ali Farhadi, Aniruddha Kembhavi, Roozbeh Mottaghi

Massive datasets and high-capacity models have driven many recent advances in computer vision and natural language understanding. This work presents a platform to enable similar success stories in Embodied AI. We propose ProcTHOR, a framework for procedural generation of Embodied AI environments. ProcTHOR enables us to sample arbitrarily large datasets of diverse, interactive, customizable, and performant virtual environments to train and evaluate embodied agents across navigation…

Scene Augmentation Methods for Interactive Embodied AI Tasks

A scene augmentation strategy is proposed to scale up scene diversity for interactive tasks and to make interactions more like the real world, together with a systematic generalization analysis that uses the proposed methods to explicitly estimate agents' ability to generalize to new layouts, new objects, and new object states.

GenAug: Retargeting behaviors to unseen situations via Generative Augmentation

This work shows how pre-trained generative models can serve as effective tools for semantically meaningful data augmentation, and proposes GenAug, a system that generates appropriate "semantic" data augmentations and significantly improves policy generalization.

VIMA: General Robot Manipulation with Multimodal Prompts

It is shown that a wide spectrum of robot manipulation tasks can be expressed with multimodal prompts, interleaving textual and visual tokens, and designed a transformer-based robot agent, VIMA, that processes these prompts and outputs motor actions autoregressively.

Where are we in the search for an Artificial Visual Cortex for Embodied Intelligence?

The largest and most comprehensive empirical study of pre-trained visual representations (PVRs) or visual 'foundation models' for Embodied AI is presented, and it is found that scaling dataset size and diversity does not improve performance universally (but does so on average).

A General Purpose Supervisory Signal for Embodied Agents

The Scene Graph Contrastive (SGC) loss is proposed, which uses scene graphs as general-purpose, training-only, supervisory signals, and uses contrastive learning to align an agent's representation with a rich graphical encoding of its environment.
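The entry above only names the SGC loss; the alignment it describes is a standard contrastive (InfoNCE-style) objective between an agent's representation and an encoding of its environment. A minimal NumPy sketch follows; the function name, dimensions, and toy data are illustrative, not the paper's implementation:

```python
import numpy as np

def contrastive_alignment_loss(agent_emb, graph_emb, temperature=0.1):
    """InfoNCE-style loss: each agent embedding should be most similar to
    the scene-graph embedding from the same environment (the positive);
    the other graph embeddings in the batch act as negatives."""
    # L2-normalize so dot products are cosine similarities.
    a = agent_emb / np.linalg.norm(agent_emb, axis=1, keepdims=True)
    g = graph_emb / np.linalg.norm(graph_emb, axis=1, keepdims=True)
    logits = a @ g.T / temperature  # (batch, batch) similarity matrix
    # Numerically stable softmax cross-entropy; diagonal = positive pairs.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(0)
agent = rng.normal(size=(4, 16))
# Graph embeddings that nearly match their paired agent embeddings
# should yield a small loss.
graph = agent + 0.01 * rng.normal(size=(4, 16))
loss = contrastive_alignment_loss(agent, graph)
```

In training this signal is supervisory only: the scene-graph encoder is used at training time and discarded at test time.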

Phone2Proc: Bringing Robust Robots Into Our Chaotic World

Phone2Proc, a method that uses a 10-minute phone scan and conditional procedural generation to create a distribution of training scenes that are semantically similar to the target environment, makes agents remarkably robust to changes in the real world, such as human movement, object rearrangement, lighting changes, or clutter.

CoWs on Pasture: Baselines and Benchmarks for Language-Driven Zero-Shot Object Navigation

This work investigates a straightforward framework, CLIP on Wheels (CoW), to adapt open-vocabulary models to this task without fine-tuning, and introduces the PASTURE benchmark, which considers uncommon objects, objects described by spatial and appearance attributes, and hidden objects described relative to visible objects.

Human-Timescale Adaptation in an Open-Ended Task Space

It is demonstrated that training an RL agent at scale leads to a general in-context learning algorithm that can adapt to open-ended novel embodied 3D problems as quickly as humans.

Towards A Foundation Model for Generalist Robots: Diverse Skill Learning at Scale via Automated Task and Scene Generation

This document presents a specific idea for mining knowledge in the latest large-scale foundation models for robotics research, and advocates for using them to generate diversified tasks and scenes at scale, thereby scaling up low-level skill learning and ultimately leading to a foundation model for robotics that empowers generalist robots.

Objaverse: A Universe of Annotated 3D Objects

The large potential of Objaverse is demonstrated via four diverse applications: training generative 3D models, improving tail category segmentation on the LVIS benchmark, training open-vocabulary object-navigation models for Embodied AI, and creating a new benchmark for robustness analysis of vision models.

AllenAct: A Framework for Embodied AI Research

AllenAct is introduced, a modular and flexible learning framework designed with a focus on the unique requirements of Embodied AI research that provides first-class support for a growing collection of embodied environments, tasks and algorithms.

RoboTHOR: An Open Simulation-to-Real Embodied AI Platform

RoboTHOR offers a framework of simulated environments paired with physical counterparts to systematically explore and overcome the challenges of simulation-to-real transfer, and a platform where researchers across the globe can remotely test their embodied models in the physical world.

Simple but Effective: CLIP Embeddings for Embodied AI

One of the baselines is extended to produce an agent capable of zero-shot object navigation: it can navigate to objects that were not used as targets during training, and it beats the winners of the 2021 Habitat ObjectNav Challenge (which employ auxiliary tasks, depth maps, and human demonstrations) as well as those of the 2019 Habitat PointNav Challenge.
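The core idea behind CLIP-style zero-shot object navigation is that a frozen image encoder and text encoder place matching images and category names near each other, so a goal category can be scored against observations by cosine similarity. The sketch below uses random stand-in vectors rather than a real CLIP encoder; the function name and dimensions are illustrative:

```python
import numpy as np

def best_goal_match(image_embs, text_emb):
    """Return the index of the observation whose (precomputed) image
    embedding is most similar to the embedding of the goal category.
    A real agent would obtain both from a frozen encoder such as CLIP."""
    imgs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb)
    return int(np.argmax(imgs @ txt))

rng = np.random.default_rng(1)
views = rng.normal(size=(5, 512))          # 5 candidate observations
goal = views[3] + 0.05 * rng.normal(size=512)  # goal embedding near view 3
idx = best_goal_match(views, goal)
```

Because the encoders are frozen, no navigation-specific fine-tuning of the visual backbone is needed; only the similarity scores drive the goal decision.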

Habitat: A Platform for Embodied AI Research

The comparison between learning and SLAM approaches from two recent works is revisited, and evidence is found that learning outperforms SLAM when scaled to an order of magnitude more experience than in previous investigations; the first cross-dataset generalization experiments are also conducted.

Flamingo: a Visual Language Model for Few-Shot Learning

This work introduces Flamingo, a family of Visual Language Models (VLMs) that bridge powerful pretrained vision-only and language-only models, handle sequences of arbitrarily interleaved visual and textual data, and seamlessly ingest images or videos as inputs.

ManipulaTHOR: A Framework for Visual Object Manipulation

This work proposes a framework for object manipulation built upon the physics-enabled, visually rich AI2-THOR framework and presents a new challenge to the Embodied AI community known as ArmPointNav, which extends the popular point navigation task to object manipulation and offers new challenges including 3D obstacle avoidance.

Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents

This paper investigates the possibility of grounding high-level tasks, expressed in natural language, to a chosen set of actionable steps and proposes a procedure that conditions on existing demonstrations and semantically translates the plans to admissible actions.
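The grounding step described above, mapping a free-form model-generated plan step to the closest admissible action, can be illustrated with a tiny matcher. The sketch below uses string similarity as a crude stand-in for the paper's embedding-based semantic matching; the action list and names are hypothetical:

```python
import difflib

# Hypothetical set of actions the environment actually accepts.
ADMISSIBLE = ["walk to kitchen", "open fridge", "grab milk", "close fridge"]

def translate(step):
    """Map a free-form plan step to the most similar admissible action.
    difflib string similarity stands in for semantic (embedding) similarity."""
    return max(
        ADMISSIBLE,
        key=lambda a: difflib.SequenceMatcher(None, step.lower(), a).ratio(),
    )

action = translate("Go over to the kitchen")
```

An embedding-based matcher would replace the `ratio()` score with cosine similarity between sentence embeddings, but the translate-to-admissible-action structure is the same.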

Visual Room Rearrangement

The experiments show that solving this challenging interactive task, which involves both navigation and object interaction, is beyond the capabilities of current state-of-the-art techniques for embodied tasks, and that we remain very far from perfect performance on tasks of this kind.

3D Neural Scene Representations for Visuomotor Control

This work shows that a dynamics model, constructed over the learned representation space, enables visuomotor control for challenging manipulation tasks involving both rigid bodies and fluids, where the target is specified in a viewpoint different from what the robot operates on.

RLBench: The Robot Learning Benchmark & Learning Environment

This large-scale benchmark aims to accelerate progress in a number of vision-guided manipulation research areas, including: reinforcement learning, imitation learning, multi-task learning, geometric computer vision, and in particular, few-shot learning.