Retrospectives on the Embodied AI Workshop

Matt Deitke, Dhruv Batra, Yonatan Bisk, Tommaso Campari, Angel X. Chang, Devendra Singh Chaplot, Changan Chen, Claudia Pérez D'Arpino, Kiana Ehsani, Ali Farhadi, Li Fei-Fei, Anthony Francis, Chuang Gan, Kristen Grauman, David Hall, Winson Han, Unnat Jain, Aniruddha Kembhavi, Jacob Krantz, Stefan Lee, Chengshu Li, Sagnik Majumder, Oleksandr Maksymets, Roberto Martín-Martín, Roozbeh Mottaghi, Sonia Raychaudhuri, Mike Roberts, Silvio Savarese, Manolis Savva, Mohit Shridhar, Niko Sünderhauf, Andrew Szot, Ben Talbot, Joshua B. Tenenbaum, Jesse Thomason, Alexander Toshev, Joanne Truong, Luca Weihs, and Jiajun Wu
We present a retrospective on the state of Embodied AI research. Our analysis focuses on 13 challenges presented at the Embodied AI Workshop at CVPR. These challenges are grouped into three themes: (1) visual navigation, (2) rearrangement, and (3) embodied vision-and-language. We discuss the dominant datasets within each theme, evaluation metrics for the challenges, and the performance of state-of-the-art models. We highlight commonalities between top approaches to the challenges and identify… 

Exploiting Socially-Aware Tasks for Embodied Social Navigation

This work proposes an end-to-end architecture that exploits Socially-Aware Tasks to give a reinforcement learning navigation policy the ability to infer common-sense social behaviors, along with an evaluation protocol designed for the Social Navigation Task in simulated environments.



Habitat-Web: Learning Embodied Object-Search Strategies from Human Demonstrations at Scale

A large-scale study of imitating human demonstrations on two tasks that require a virtual robot to search for objects in new environments, ObjectGoal Navigation and Pick&Place, finds that the agent trained with imitation learning (IL) learns efficient object-search behavior from humans.

The Robotic Vision Scene Understanding Challenge

This work introduces a new robotic vision scene understanding challenge that uses simulation to enable repeatable experiments with active robot agency. It aims to drive state-of-the-art research in scene understanding by enabling the evaluation and comparison of active robotic vision systems.

SoundSpaces: Audio-Visual Navigation in 3D Environments

This work proposes a multi-modal deep reinforcement learning approach to train navigation policies end-to-end from a stream of egocentric audio-visual observations, allowing the agent to discover elements of the geometry of the physical space indicated by the reverberating audio and detect and follow sound-emitting targets.

RoboTHOR: An Open Simulation-to-Real Embodied AI Platform

RoboTHOR offers a framework of simulated environments paired with physical counterparts to systematically explore and overcome the challenges of simulation-to-real transfer, and a platform where researchers across the globe can remotely test their embodied models in the physical world.

Room-Across-Room: Multilingual Vision-and-Language Navigation with Dense Spatiotemporal Grounding

The size, scope, and detail of Room-Across-Room (RxR) dramatically expand the frontier for research on embodied language agents in simulated, photo-realistic environments.

BenchBot environments for active robotics (BEAR): Simulated data for active scene understanding research

This work presents a platform to foster research in active scene understanding, consisting of high-fidelity simulated environments and a simple yet powerful API that controls a mobile robot in simulation and reality, and provides three levels of robot agency.

Simple but Effective: CLIP Embeddings for Embodied AI

One of the baselines is extended to produce an agent capable of zero-shot object navigation, that is, navigating to objects that were not used as targets during training. It beats the winners of the 2021 Habitat ObjectNav Challenge, which employ auxiliary tasks, depth maps, and human demonstrations, as well as those of the 2019 Habitat PointNav Challenge.

TEACh: Task-driven Embodied Agents that Chat

TEACh, a dataset of over 3,000 human-human, interactive dialogues to complete household tasks in simulation, is introduced and initial models' abilities in dialogue understanding, language grounding, and task execution are evaluated.

MultiON: Benchmarking Semantic Map Memory using Multi-Object Navigation

This work proposes the multiON task, which requires navigation to an episode-specific sequence of objects in a realistic environment. The task generalizes ObjectGoal navigation and explicitly tests the ability of navigation agents to locate previously observed goal objects.

Interactive Gibson Benchmark: A Benchmark for Interactive Navigation in Cluttered Environments

This work presents the first comprehensive benchmark for training and evaluating Interactive Navigation solutions. It evaluates multiple learning-based baselines in the Interactive Gibson Benchmark and provides insights into navigation regimes with different trade-offs between path efficiency and disturbance of surrounding objects.