Corpus ID: 236447346

Language Grounding with 3D Objects

Jesse Thomason, Mohit Shridhar, Yonatan Bisk, Chris Paxton, Luke Zettlemoyer
Seemingly simple natural language requests to a robot are generally underspecified; for example, "Can you bring me the wireless mouse?" Flat images of candidate mice may not provide the discriminative information needed for "wireless". The world, and the objects in it, are not flat images but complex 3D shapes. If a human requests an object based on any of its basic properties, such as color, shape, or texture, robots should perform the necessary exploration to accomplish the task. In particular, while…
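The multi-view grounding idea in the abstract can be sketched minimally: aggregate similarity between a language embedding and per-view object embeddings, so that a single informative viewpoint is enough to match a property like "wireless". The function and embeddings below are illustrative assumptions, not the paper's actual model.

```python
import numpy as np

def ground_object(language_vec, object_views):
    """Pick the candidate object whose multi-view features best match
    the language embedding.

    language_vec : (d,) unit-normalized embedding of the request.
    object_views : list of (n_views, d) arrays, one per candidate,
                   each row a unit-normalized view embedding.
    Returns the index of the best-scoring candidate.
    """
    # Max-pool cosine similarity over views: one discriminative
    # viewpoint is enough for the object to win.
    scores = [float(np.max(views @ language_vec)) for views in object_views]
    return int(np.argmax(scores))
```

In practice the embeddings would come from trained vision and language encoders; the sketch only shows the view-aggregation step.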


Grounding Language Attributes to Objects using Bayesian Eigenobjects
A system to disambiguate object instances within the same class based on simple physical descriptions, designed to learn from only a small amount of human-labeled language data and generalize to viewpoints not represented in the language-annotated depth image training set.
Shapeglot: Learning Language for Shape Differentiation
A practical approach to language grounding is illustrated, along with a novel case study of the relationship between object shape and linguistic structure in object differentiation.
INGRESS: Interactive visual grounding of referring expressions
INGRESS, a robot system that follows natural language instructions to pick and place everyday objects, is presented; its two-stage neural-network grounding model outperforms a state-of-the-art method on the RefCOCO dataset and in robot experiments with humans.
Improving Robot Success Detection using Static Object Data
It is shown that adding static data about the objects themselves improves the performance of an end-to-end pipeline for classifying action outcomes, and achieves up to a 57% absolute gain over the task baseline on pairs of previously unseen objects.
Generalized Grounding Graphs: A Probabilistic Framework for Understanding Grounded Commands
The framework, called Generalized Grounding Graphs (G³), defines a probabilistic graphical model dynamically according to the linguistic parse structure of a natural language command, enabling robots to learn word meanings and use those learned meanings to robustly follow natural language commands produced by untrained users.
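The parse-structured factorization described above can be sketched as scoring a joint grounding with one factor per linguistic constituent. The factor functions and groundings below are hypothetical placeholders, not the paper's learned models.

```python
import math

def g3_log_prob(factors, groundings):
    """Score a joint grounding under a G3-style factorization.

    factors    : dict mapping constituent phrase -> scoring function
                 f(grounding) -> probability in (0, 1].
    groundings : dict mapping constituent phrase -> a candidate
                 grounding (an object, place, or path).
    Returns the log-probability of the joint grounding, summing one
    log-factor per constituent of the parsed command.
    """
    return sum(math.log(factors[phrase](groundings[phrase]))
               for phrase in factors)
```

The key property mirrored here is that the set of factors is instantiated from the parse of each command rather than being fixed in advance.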
Sim-to-Real Transfer for Vision-and-Language Navigation
To bridge the gap between the high-level discrete action space learned by the VLN agent and the robot's low-level continuous action space, a subgoal model is proposed to identify nearby waypoints, and domain randomization is used to mitigate visual domain differences.
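The discrete-to-continuous bridge can be sketched in its simplest form: given candidate waypoints, choose the one the robot should navigate toward. A nearest-waypoint rule stands in for the learned subgoal model; it is an illustrative assumption only.

```python
import math

def nearest_waypoint(robot_xy, waypoints):
    """Pick the closest candidate waypoint as the continuous subgoal
    for executing a discrete navigation action (illustrative stand-in
    for a learned subgoal model)."""
    return min(waypoints, key=lambda w: math.dist(robot_xy, w))
```

A learned model would instead score waypoints using observations and the agent's intended high-level action, but the interface is the same: discrete decision in, continuous navigation target out.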
ShapeNet: An Information-Rich 3D Model Repository
ShapeNet is a collection of datasets containing 3D models from a multitude of semantic categories, organized under the WordNet taxonomy, with many semantic annotations for each model such as consistent rigid alignments, parts and bilateral symmetry planes, physical sizes, and keywords, as well as other planned annotations.
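Organizing models under the WordNet taxonomy means a category query should also return models filed under its hyponyms. A minimal sketch, with a tiny made-up index (the synset and model IDs are invented for illustration, not real ShapeNet identifiers):

```python
def models_under(synset, taxonomy):
    """Collect model IDs for a synset and all of its hyponyms."""
    entry = taxonomy[synset]
    out = list(entry["models"])
    for child in entry["children"]:
        out.extend(models_under(child, taxonomy))
    return out

# Made-up stand-in for ShapeNet's WordNet-organized index.
toy_taxonomy = {
    "chair.n.01":    {"children": ["armchair.n.01"], "models": ["m01"]},
    "armchair.n.01": {"children": [], "models": ["m02", "m03"]},
}
```

Querying "chair.n.01" returns the chair model plus both armchair models, since armchair is a hyponym of chair.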
Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments
This work provides the first benchmark dataset for visually-grounded natural language navigation in real buildings, the Room-to-Room (R2R) dataset, and presents the Matterport3D Simulator, a large-scale reinforcement learning environment based on real imagery.
Grounding Language in Play
A simple and scalable way to condition policies on human language is presented, and a simple technique that transfers knowledge from large unlabeled text corpora to robotic learning is introduced, significantly improving downstream robotic manipulation.
Jointly Improving Parsing and Perception for Natural Language Commands through Human-Robot Dialog
Methods for using human-robot dialog to improve language understanding are presented for a mobile robot agent that parses natural language to underlying semantic meanings and uses robotic sensors to create multi-modal models of perceptual concepts like "red" and "heavy".