VirtualHome: Simulating Household Activities Via Programs

  @inproceedings{puig2018virtualhome,
    title={VirtualHome: Simulating Household Activities Via Programs},
    author={Xavier Puig and Kevin Kyunghwan Ra and Marko Boben and Jiaman Li and Tingwu Wang and Sanja Fidler and Antonio Torralba},
    booktitle={2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition},
    year={2018}
  }
  • Published 1 June 2018
  • Computer Science
In this paper, we are interested in modeling complex activities that occur in a typical household. We propose to use programs, i.e., sequences of atomic actions and interactions, as a high-level representation of complex tasks. Programs are interesting because they provide an unambiguous representation of a task and allow agents to execute them. However, no existing database provides this type of information. Towards this goal, we first crowd-source programs for a variety of… 
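The abstract's notion of a program as a sequence of atomic actions can be sketched as below. This is an illustrative encoding only: the `Step` class, the action names, and the exact script syntax are assumptions for the sketch, not the paper's precise format (VirtualHome scripts additionally carry object instance identifiers).

```python
from dataclasses import dataclass

@dataclass
class Step:
    """One atomic action in a household program (hypothetical encoding)."""
    action: str          # atomic action name, e.g. "Walk", "Grab"
    objects: tuple = ()  # names of the objects the action touches

def render(program):
    """Render a program in a VirtualHome-style bracketed script syntax."""
    lines = []
    for step in program:
        objs = " ".join(f"<{o}>" for o in step.objects)
        lines.append(f"[{step.action.upper()}] {objs}".strip())
    return "\n".join(lines)

# A toy "watch TV" activity expressed as a sequence of atomic steps.
watch_tv = [
    Step("Walk", ("living_room",)),
    Step("Walk", ("television",)),
    Step("SwitchOn", ("television",)),
    Step("Sit", ("sofa",)),
    Step("Watch", ("television",)),
]
print(render(watch_tv))
```

Because each step names its action and arguments explicitly, such a program is unambiguous in the sense the abstract describes: an agent (or simulator) can execute it step by step without inferring intent from free-form text.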

BEHAVIOR: Benchmark for Everyday Household Activities in Virtual, Interactive, and Ecological Environments

BEHAVIOR is introduced, a benchmark for embodied AI with 100 activities in simulation, spanning a range of everyday household chores such as cleaning, maintenance, and food preparation, and a predicate logic-based description language is proposed, enabling generation of diverse instances for any activity.

Synthesizing Environment-Aware Activities via Activity Sketches

This work builds upon VirtualHome to create a new dataset, VirtualHome-Env, collecting program sketches that represent activities and matching programs with environments that can afford them; it proposes RNN-ResActGraph, a network that generates a program from a given sketch and an environment graph and tracks the changes in the environment induced by the program.

RFUniverse: A Physics-based Action-centric Interactive Environment for Everyday Household Tasks

A novel physics-based action-centric environment, RFUniverse, is proposed for robot learning of everyday household tasks, which supports interactions among 87 atomic actions and 8 basic object types in a visually and physically plausible way.

Let’s Play for Action: Recognizing Activities of Daily Living by Learning from Life Simulation Video Games

This work explores the concept of constructing training examples for ADL recognition by playing life simulation video games and introduces the SIMS4ACTION dataset, which is accompanied with a GAMING→REAL benchmark, where the models are evaluated on real videos derived from an existing ADL dataset.

iGibson 2.0: Object-Centric Simulation for Robot Learning of Everyday Household Tasks

The new capabilities of iGibson 2.0 are evaluated to enable robot learning of novel tasks, in the hope of demonstrating the potential of this new simulator to support new research in embodied AI.

Shaping embodied agent behavior with activity-context priors from egocentric video

This work introduces an approach to discover activity-context priors from in-the-wild egocentric video captured with human-worn cameras, encoding the video-based prior as an auxiliary reward function that encourages an agent to bring compatible objects together before attempting an interaction.

ToolTango: Common sense Generalization in Predicting Sequential Tool Interactions for Robot Plan Synthesis

This work takes a step toward enabling robots to rapidly synthesize robust plans for complex tasks, particularly in novel settings, by augmenting the representation of the environment with pre-trained embeddings derived from a knowledge base, which generalize effectively to novel environments.

Learning Program Representations for Food Images and Cooking Recipes

This paper builds a model that learns a joint embedding between recipes and food images via self-supervision and jointly generates a program from this embedding as a sequence; experiments on crowd-sourced programs for cooking recipes show that projecting the image-recipe embeddings into programs leads to better cross-modal retrieval results.

Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents

This paper investigates the possibility of grounding high-level tasks, expressed in natural language, to a chosen set of actionable steps and proposes a procedure that conditions on existing demonstrations and semantically translates the plans to admissible actions.

Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding

This work proposes a novel Hollywood in Homes approach to collect data, collecting a new dataset, Charades, with hundreds of people recording videos in their own homes, acting out casual everyday activities, and evaluates and provides baseline results for several tasks including action recognition and automatic description generation.

Unsupervised Learning from Narrated Instruction Videos

A new unsupervised learning approach is proposed that takes advantage of the complementary nature of the input video and the associated narration to solve two clustering problems, one in text and one in video, automatically discovering the main steps needed to achieve the task and locating those steps in the input videos.

HoME: a Household Multimodal Environment

HoME is an open-source, OpenAI Gym-compatible platform extensible to tasks in reinforcement learning, language grounding, sound-based navigation, robotics, multi-agent learning, and more that better enables artificial agents to learn as humans do: in an interactive, multimodal, and richly contextualized setting.

Listen, Attend, and Walk: Neural Mapping of Navigational Instructions to Action Sequences

This work introduces a multi-level aligner that empowers the alignment-based encoder-decoder model with long short-term memory recurrent neural networks (LSTM-RNN) to translate natural language instructions to action sequences based upon a representation of the observable world state.

Building Generalizable Agents with a Realistic and Rich 3D Environment

House3D is built, a rich, extensible and efficient environment that contains 45,622 human-designed 3D scenes of houses, equipped with a diverse set of fully labeled 3D objects, textures and scene layouts, based on the SUNCG dataset and an emphasis on semantic-level generalization.

Everything robots always wanted to know about housework (but were afraid to ask)

  • D. Nyga, M. Beetz
  • Computer Science
  • 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems
  • 2012
This paper introduces the concept of Probabilistic Robot Action Cores (PRAC), which are well-suited for encoding such knowledge in a probabilistic first-order knowledge base, and shows how such a knowledge base can be acquired from natural language.

Walk the Talk: Connecting Language, Knowledge, and Action in Route Instructions

MARCO, an agent that follows free-form, natural language route instructions by representing and executing a sequence of compound action specifications that model which actions to take under which conditions, is presented.

Understanding Natural Language Commands for Robotic Navigation and Mobile Manipulation

A new model is presented for understanding natural language commands given to autonomous systems that perform navigation and mobile manipulation in semi-structured environments; it dynamically instantiates a probabilistic graphical model for a particular natural language command according to the command's hierarchical and compositional semantic structure.

Target-driven visual navigation in indoor scenes using deep reinforcement learning

This paper proposes an actor-critic model whose policy is a function of the goal as well as the current state, which allows better generalization and proposes the AI2-THOR framework, which provides an environment with high-quality 3D scenes and a physics engine.

Manipulation action tree bank: A knowledge resource for humanoids

It is argued that tree banks are an effective and practical way to organize the semantic structures of manipulation actions for humanoid applications, and that they could serve as a basis for automatic manipulation action understanding, execution, reasoning, and prediction during both observation and execution.