ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks
TLDR: It is shown that a baseline model based on recent embodied vision-and-language tasks performs poorly on ALFRED, suggesting that there is significant room for developing innovative grounded visual language understanding models with this benchmark.
Vision-and-Dialog Navigation
TLDR: This work introduces Cooperative Vision-and-Dialog Navigation, a dataset of over 2k embodied, human-human dialogs situated in simulated, photorealistic home environments, and establishes an initial, multi-modal sequence-to-sequence model.
Integrating Language and Vision to Generate Natural Language Descriptions of Videos in the Wild
TLDR: This paper proposes a strategy for generating textual descriptions of videos by using a factor graph to combine visual detections with language statistics, and uses state-of-the-art visual recognition systems to obtain confidences on entities, activities, and scenes present in the video.
Experience Grounds Language
TLDR: It is posited that the present success of representation learning approaches trained on large text corpora can be deeply enriched by the parallel tradition of research on the contextual and social nature of language.
Learning to Interpret Natural Language Commands through Human-Robot Dialog
TLDR: This work introduces a dialog agent for mobile robots that understands human instructions through semantic parsing, actively resolves ambiguities using a dialog manager, and incrementally learns from human-robot conversations by inducing training data from user paraphrases.
Shifting the Baseline: Single Modality Performance on Visual Navigation & QA
TLDR: It is argued that unimodal approaches better capture and reflect dataset biases and therefore provide an important comparison when assessing the performance of multimodal techniques.
RMM: A Recursive Mental Model for Dialog Navigation
TLDR: This paper introduces a two-agent task in which one agent navigates and asks questions that a second, guiding agent answers, and proposes the Recursive Mental Model (RMM), a model that enables better generalization to novel environments.
Learning Multi-Modal Grounded Linguistic Semantics by Playing "I Spy"
TLDR: This paper builds perceptual models that use haptic, auditory, and proprioceptive data acquired through robot exploratory behaviors to go beyond vision, grounding natural language words that describe objects using supervision from an interactive human-robot "I Spy" game.
The RobotSlang Benchmark: Dialog-guided Robot Localization and Navigation
TLDR: Robot Simultaneous Localization and Mapping with Natural Language (RobotSlang), a benchmark of 169 natural language dialogs between a human Driver controlling a robot and a human Commander providing guidance towards navigation goals, is introduced.
Improving Grounded Natural Language Understanding through Human-Robot Dialog
TLDR: This work presents an end-to-end pipeline for translating natural language commands to discrete robot actions, and uses clarification dialogs to jointly improve language parsing and concept grounding.