Corpus ID: 12534863

Learning Interpretable Spatial Operations in a Rich 3D Blocks World

@inproceedings{Bisk2018LearningIS,
  title={Learning Interpretable Spatial Operations in a Rich 3D Blocks World},
  author={Yonatan Bisk and Kevin J. Shih and Yejin Choi and Daniel Marcu},
  booktitle={AAAI},
  year={2018}
}
In this paper, we study the problem of mapping natural language instructions to complex spatial actions in a 3D blocks world. We first introduce a new dataset that pairs complex 3D spatial operations to rich natural language descriptions that require complex spatial and pragmatic interpretations such as "mirroring", "twisting", and "balancing". This dataset, built on the simulation environment of Bisk, Yuret, and Marcu (2016), attains language that is significantly richer and more complex…
Photo-Realistic Blocksworld Dataset
TLDR
An artificial dataset generator for the photo-realistic Blocksworld domain, one of the oldest high-level task-planning domains, which is well defined but contains sufficient complexity, e.g., conflicting subgoals and decomposability into subproblems.
Points, Paths, and Playscapes: Large-scale Spatial Language Understanding Tasks Set in the Real World
TLDR
It is argued that the next big advances in spatial language understanding can best be supported by creating large-scale datasets that focus on points and paths based in the real world, and then extending these to create online, persistent playscapes that mix human and bot players.
Stay on the Path: Instruction Fidelity in Vision-and-Language Navigation
TLDR
This work highlights shortcomings of current metrics for the Room-to-Room dataset, proposes a new metric, Coverage weighted by Length Score (CLS), and shows that agents that receive rewards for instruction fidelity outperform agents that focus on goal completion.
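As a purely illustrative sketch of the coverage-times-length idea behind CLS (the exact definition, threshold, and distance units are given in the cited paper; the sigma value and the example paths below are assumptions), such a path score might look like:

```python
import math

def path_coverage(pred, ref, sigma=3.0):
    """Soft coverage of each reference point by its closest predicted point."""
    def closest(r):
        return min(math.dist(r, p) for p in pred)
    return sum(math.exp(-closest(r) / sigma) for r in ref) / len(ref)

def path_length(path):
    return sum(math.dist(a, b) for a, b in zip(path, path[1:]))

def cls_like_score(pred, ref):
    """Coverage weighted by a length score: high only when the predicted path
    both covers the reference and has a comparable length."""
    pc = path_coverage(pred, ref)
    expected = pc * path_length(ref)  # expected path length under this coverage
    ls = expected / (expected + abs(expected - path_length(pred))) if expected > 0 else 0.0
    return pc * ls

# A prediction that retraces the reference exactly scores close to 1.
print(cls_like_score(pred=[(0, 0), (1, 0), (2, 0)], ref=[(0, 0), (1, 0), (2, 0)]))
```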
Guiding Multi-Step Rearrangement Tasks with Natural Language Instructions
  • 2021
Enabling human operators to interact with robotic agents using natural language would allow non-experts to intuitively instruct these agents. Towards this goal, we propose a novel…
Spatial Language Understanding for Object Search in Partially Observed City-scale Environments
TLDR
A convolutional neural network model is proposed that learns to predict the language provider's relative frame of reference (FoR) given environment context and achieves faster search and higher success rate compared to a keyword-based baseline without spatial preposition understanding.
Draw Me a Flower: Grounding Formal Abstract Structures Stated in Informal Natural Language
TLDR
Results of the baseline models on an instruction-to-execution task derived from the HEXAGONS dataset confirm that higher-level abstractions in NL are indeed more challenging for current systems to process.
Generalization in Instruction Following Systems
TLDR
This paper focuses on instruction understanding in the blocks world domain and investigates the language understanding abilities of two top-performing systems for the task, finding that state-of-the-art models fall short of expectations and are extremely brittle.
Learning to Read Maps: Understanding Natural Language Instructions from Unseen Maps
Robust situated dialog requires the ability to process instructions based on spatial information, which may or may not be available. We propose a model, based on LXMERT, that can extract spatial…
Computational Models for Spatial Prepositions
TLDR
This paper treats the modeling task as calling for assignment of probabilities to prepositional relations as a function of multiple factors, where such probabilities can be viewed as estimates of whether humans would judge the relations to hold in given circumstances.
Prospection: Interpretable plans from language by predicting the future
TLDR
A framework for learning representations that convert a natural-language command into a sequence of intermediate goals for execution on a robot; a key feature of this framework is prospection: training an agent not just to correctly execute the prescribed command, but to predict a horizon of consequences of an action before taking it.
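A minimal sketch of the prospection idea described above (the module names, dimensions, horizon, and loss weights are assumptions, not the paper's implementation): alongside the action head, a forward model predicts a short horizon of future states, and its prediction error can be added to the task loss.

```python
import torch
import torch.nn as nn

class ProspectiveAgent(nn.Module):
    """Sketch: an action head plus a forward model that imagines H future states."""
    def __init__(self, state_dim=64, cmd_dim=128, num_actions=8, horizon=3):
        super().__init__()
        self.horizon = horizon
        self.policy = nn.Linear(state_dim + cmd_dim, num_actions)
        self.forward_model = nn.GRUCell(num_actions, state_dim)  # predicts the next state

    def forward(self, state, cmd):
        # state: (B, state_dim) world encoding; cmd: (B, cmd_dim) encoded command
        logits = self.policy(torch.cat([state, cmd], dim=-1))
        action = torch.softmax(logits, dim=-1)
        preds, s = [], state
        for _ in range(self.horizon):          # prospection: roll the action forward
            s = self.forward_model(action, s)  # predicted consequence at the next step
            preds.append(s)
        return logits, torch.stack(preds, dim=1)

# Training would combine the task loss with a future-state prediction loss, e.g.
# loss = F.cross_entropy(logits, gold_action) + 0.5 * F.mse_loss(preds, observed_future_states)
```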

References

SHOWING 1-10 OF 39 REFERENCES
Source-Target Inference Models for Spatial Instruction Understanding
TLDR
Novel models for the subtasks of source block classification and target position regression are presented, based on joint-loss language and spatial-world representation learning, as well as CNN-based and dual attention models to compute the alignment between the world blocks and the instruction phrases.
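As a rough sketch of the two subtasks named in this summary (the block features, dimensions, and plain concatenation fusion below are assumptions; the paper's CNN-based and dual-attention alignment is not reproduced here), a joint source-classification and target-regression model could look like:

```python
import torch
import torch.nn as nn

class SourceTargetSketch(nn.Module):
    """Score each block as the source (classification) and regress a target (x, y, z)."""
    def __init__(self, word_dim=300, instr_dim=256, block_dim=32, hidden=128, num_blocks=20):
        super().__init__()
        self.instr_enc = nn.LSTM(word_dim, instr_dim, batch_first=True)
        self.src_scorer = nn.Sequential(
            nn.Linear(instr_dim + block_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.tgt_head = nn.Sequential(
            nn.Linear(instr_dim + num_blocks * block_dim, hidden), nn.ReLU(), nn.Linear(hidden, 3))

    def forward(self, words, blocks):
        # words: (B, T, word_dim) instruction embeddings; blocks: (B, N, block_dim) world features
        _, (h, _) = self.instr_enc(words)
        instr = h[-1]                                                  # (B, instr_dim)
        per_block = instr.unsqueeze(1).expand(-1, blocks.size(1), -1)  # broadcast to each block
        src_logits = self.src_scorer(torch.cat([per_block, blocks], -1)).squeeze(-1)
        tgt_xyz = self.tgt_head(torch.cat([instr, blocks.flatten(1)], -1))
        return src_logits, tgt_xyz

# A joint loss in this spirit: cross-entropy over source blocks plus MSE on the target position,
# e.g. loss = F.cross_entropy(src_logits, src_gold) + F.mse_loss(tgt_xyz, tgt_gold)
```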
Generation and Comprehension of Unambiguous Object Descriptions
TLDR
This work proposes a method that can generate an unambiguous description of a specific object or region in an image and can also comprehend or interpret such an expression to infer which object is being described, and shows that it outperforms previous methods that generate object descriptions without taking into account other potentially ambiguous objects in the scene.
Toward Interactive Grounded Language Acquisition
TLDR
This paper extends Logical Semantics with Perception to incorporate determiners (e.g., “the”) into its training procedure, enabling the model to generate acceptable relational language 20% more often than the unaugmented model.
Natural Language Communication with Robots
TLDR
It is shown how one can collect meaningful training data, and three neural architectures are proposed for interpreting contextually grounded natural language commands, allowing the system to correctly understand and ground the blocks that the robot should move when instructed by a human who uses unrestricted language.
ReferItGame: Referring to Objects in Photographs of Natural Scenes
TLDR
A new two-player game to crowd-source natural language referring expressions, which can both collect and verify referring expressions directly within the game, along with an in-depth analysis of the resulting dataset.
Mapping Instructions and Visual Observations to Actions with Reinforcement Learning
TLDR
This work learns a single model that jointly reasons about linguistic and visual input, trained as a neural network agent in a contextual bandit setting, and shows significant improvements over supervised learning and common reinforcement learning variants.
Weakly Supervised Learning of Semantic Parsers for Mapping Instructions to Actions
TLDR
This paper presents a grounded CCG semantic parsing approach that learns a joint model of meaning and context for interpreting and executing natural language instructions, using various types of weak supervision.
Learning visually grounded words and syntax for a scene description task
  • D. Roy
  • Computer Science
  • Comput. Speech Lang.
  • 2002
TLDR
A spoken language generation system that learns to describe objects in computer-generated visual scenes; it generates syntactically well-formed compound adjective-noun phrases as well as relative spatial clauses, and its output was comparable to human-generated descriptions.
Learning Multi-Modal Grounded Linguistic Semantics by Playing "I Spy"
TLDR
This paper builds perceptual models that use haptic, auditory, and proprioceptive data acquired through robot exploratory behaviors to go beyond vision to ground natural language words describing objects, using supervision from an interactive human-robot "I Spy" game.
Grounding spatial relations for human-robot interaction
TLDR
A system for human-robot interaction that learns both models for spatial prepositions and for object recognition, and grounds the meaning of an input sentence in terms of visual percepts coming from the robot's sensors to send an appropriate command to the PR2 or respond to spatial queries.