Visual Room Rearrangement

@article{Weihs2021VisualRR,
  title={Visual Room Rearrangement},
  author={Luca Weihs and Matt Deitke and Aniruddha Kembhavi and Roozbeh Mottaghi},
  journal={2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2021}
}
There has been significant recent progress in the field of Embodied AI, with researchers developing models and algorithms that enable embodied agents to navigate and interact within completely unseen environments. In this paper, we propose a new dataset and baseline models for the task of Rearrangement. We focus in particular on Room Rearrangement: an agent begins by exploring a room and recording objects' initial configurations. We then remove the agent and change the poses and states…
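The abstract describes a task where success depends on how many objects end up back in their initial configurations. As a rough illustration only (the paper defines its own metrics; the function name and the distance threshold below are hypothetical), scoring could look like:

```python
import math

def success_fraction(initial_poses, final_poses, pos_threshold=0.05):
    """Fraction of objects restored to (near) their initial positions.

    initial_poses / final_poses: dicts mapping object id -> (x, y, z).
    pos_threshold: hypothetical distance tolerance in meters.
    """
    restored = 0
    for obj_id, init_pos in initial_poses.items():
        # An object counts as restored if it is within the tolerance
        # of its recorded initial position.
        if math.dist(init_pos, final_poses[obj_id]) <= pos_threshold:
            restored += 1
    return restored / len(initial_poses)

# Example: the mug and book are back within tolerance, the plant is not.
initial = {"mug": (1.0, 0.9, 0.0), "book": (0.0, 0.0, 0.3), "plant": (2.0, 0.5, 0.0)}
final = {"mug": (1.0, 0.9, 0.0), "book": (0.0, 0.01, 0.3), "plant": (0.5, 0.5, 0.0)}
print(success_fraction(initial, final))  # → 0.666...
```

Real evaluations would also account for object states (e.g., open/closed) and orientation, not just position.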


Learning to Explore, Navigate and Interact for Visual Room Rearrangement

A three-phased modular architecture (TMA) for visual room rearrangement that maximizes performance by combining learned modules with hand-crafted feature-engineering modules, retaining the advantage of learning while reducing its cost.

A Simple Approach for Visual Rearrangement: 3D Mapping and Semantic Search

This work proposes a simple yet effective method to search for and map which objects need to be rearranged, and rearrange each object until the task is complete, which improves on current state-of-the-art end-to-end reinforcement learning-based methods that learn visual rearrangement policies.

Semantically Grounded Object Matching for Robust Robotic Scene Rearrangement

This work presents a novel approach to object matching that uses a large pre-trained vision-language model to match objects in a cross-instance setting by leveraging semantics together with visual features as a more robust, and much more general, measure of similarity.

Task and Motion Planning with Large Language Models for Object Rearrangement

LLM-GROP is proposed, which uses prompting to extract commonsense knowledge about semantically valid object configurations from an LLM and instantiates them with a task and motion planner in order to generalize to varying scene geometry.

Scene Augmentation Methods for Interactive Embodied AI Tasks

A scene augmentation strategy to scale up scene diversity for interactive tasks and make interactions more realistic, together with a systematic generalization analysis that uses the proposed methods to explicitly estimate agents' ability to generalize to new layouts, new objects, and new object states.

Object Manipulation via Visual Target Localization

This work proposes Manipulation via Visual Object Location Estimation (m-VOLE), an approach that explores the environment in search of target objects, computes their 3D coordinates once they are located, and then continues to estimate their 3D locations even when the objects are not visible, robustly aiding the task of manipulating these objects throughout the episode.

Learning by Asking for Embodied Visual Navigation and Task Completion

An Embodied Learning-By-Asking (ELBA) model that learns when and what questions to ask to dynamically acquire additional information for completing the task is proposed.

Continuous Scene Representations for Embodied AI

Using CSR, state-of-the-art approaches for the challenging downstream task of visual room rearrangement are outperformed, without any task specific training and the learned embeddings capture salient spatial details of the scene and show applicability to real world data.

Effective Baselines for Multiple Object Rearrangement Planning in Partially Observable Mapped Environments

It is shown that greedy modular agents are empirically optimal when the objects that need to be rearranged are uniformly distributed in the environment – thereby contributing baselines with strong performance for future work on multi-object rearrangement planning in partially observable settings.
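The greedy strategy referenced above can be made concrete with a toy sketch: from its current position, the agent repeatedly visits the nearest remaining misplaced object. This is an illustration of the greedy principle only, not the paper's actual planner; positions are assumed known and 2D for simplicity.

```python
import math

def greedy_rearrangement_order(agent_pos, object_positions):
    """Order misplaced objects by repeatedly picking the nearest one.

    agent_pos: (x, y) starting position of the agent.
    object_positions: dict mapping object id -> (x, y).
    Returns the visit order.
    """
    remaining = dict(object_positions)
    pos = agent_pos
    order = []
    while remaining:
        # Choose the closest remaining object from the current position,
        # then move there and repeat.
        nearest = min(remaining, key=lambda o: math.dist(pos, remaining[o]))
        order.append(nearest)
        pos = remaining.pop(nearest)
    return order

print(greedy_rearrangement_order((0, 0), {"a": (5, 0), "b": (1, 1), "c": (2, 0)}))
# → ['b', 'c', 'a']
```

Under partial observability, a real agent would interleave this ordering with exploration, since object positions are only discovered as the map is built.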

Planning Large-scale Object Rearrangement Using Deep Reinforcement Learning

The proposed deep-RL-based task planning method is the first to demonstrate rearrangement across different scenarios, from 2D surfaces such as tabletops to 3D rooms, with a large number of objects and without any explicit need for buffer space.

AllenAct: A Framework for Embodied AI Research

AllenAct is introduced, a modular and flexible learning framework designed with a focus on the unique requirements of Embodied AI research that provides first-class support for a growing collection of embodied environments, tasks and algorithms.

Rearrangement: A Challenge for Embodied AI

A framework for research and evaluation in Embodied AI is described, based on a canonical task: Rearrangement, that can focus the development of new techniques and serve as a source of trained models that can be transferred to other settings.

Cognitive Mapping and Planning for Visual Navigation

The Cognitive Mapping and Planning (CMP) approach is based on a unified joint architecture for mapping and planning, such that the mapping is driven by the needs of the task, and on a spatial memory with the ability to plan given an incomplete set of observations about the world.

ObjectNav Revisited: On Evaluation of Embodied Agents Navigating to Objects

This document summarizes the consensus recommendations of this working group on ObjectNav and makes recommendations on subtle but important details of evaluation criteria, the agent's embodiment parameters, and the characteristics of the environments within which the task is carried out.

Monte-Carlo Tree Search for Efficient Visually Guided Rearrangement Planning

This work introduces an efficient and scalable rearrangement planning method, based on a Monte-Carlo Tree Search exploration strategy, and develops an integrated approach for robust multi-object workspace state estimation from a single uncalibrated RGB camera using a deep neural network trained only with synthetic data.

IQA: Visual Question Answering in Interactive Environments

The Hierarchical Interactive Memory Network (HIMN), consisting of a factorized set of controllers that allow the system to operate at multiple levels of temporal abstraction, is proposed and outperforms popular single-controller methods on IQUAD V1.

Occupancy Anticipation for Efficient Exploration and Navigation

This work proposes occupancy anticipation, where the agent uses its egocentric RGB-D observations to infer the occupancy state beyond the visible regions, which facilitates efficient exploration and navigation in 3D environments.

Pick and Place Without Geometric Object Models

This approach can solve a challenging class of pick-place and regrasping problems where the exact geometry of the objects to be handled is unknown and shows a major improvement relative to a shape primitives baseline.

Two Body Problem: Collaborative Visual Task Completion

This paper studies the problem of learning to collaborate directly from pixels in AI2-THOR and demonstrates the benefits of explicit and implicit modes of communication to perform visual tasks.

Object Goal Navigation using Goal-Oriented Semantic Exploration

A modular system called 'Goal-Oriented Semantic Exploration', which builds an episodic semantic map and uses it to explore the environment efficiently based on the goal object category, outperforming a wide range of baselines including end-to-end learning-based methods as well as modular map-based methods.