Corpus ID: 236447346

Language Grounding with 3D Objects

Jesse Thomason, Mohit Shridhar, Yonatan Bisk, Chris Paxton, Luke Zettlemoyer
Seemingly simple natural language requests to a robot are generally underspecified, for example "Can you bring me the wireless mouse?" Flat images of candidate mice may not provide the discriminative information needed for "wireless". The world, and objects in it, are not flat images but complex 3D shapes. If a human requests an object based on any of its basic properties, such as color, shape, or texture, robots should perform the necessary exploration to accomplish the task. In particular, while…

PartGlot: Learning Shape Part Segmentation from Language Reference Games

PartGlot, a neural framework and associated architectures for learning semantic part segmentation of 3D shape geometry based solely on part-referential language, is introduced, opening the possibility of learning 3D shape parts from language alone, without the need for large-scale part geometry annotations, and thus easing annotation acquisition.

Correcting Robot Plans with Natural Language Feedback

This paper describes how to map from natural language sentences to transformations of cost functions and shows that these transformations enable users to correct goals, update robot motions to accommodate additional user preferences, and recover from planning errors.
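
The mapping described above, from a language correction to a transformation of the planner's cost function, can be illustrated with a small sketch. This is not the paper's method: the additive penalty form, the circular avoid region, and the names `base_cost`, `avoid_region_penalty`, and `corrected_cost` are all assumptions made for the example.

```python
import numpy as np

# Hypothetical sketch: a correction such as "stay away from the vase" is
# mapped to an extra penalty term added to the planner's trajectory cost.

def base_cost(traj, goal):
    """Sum of step lengths plus distance from the final waypoint to the goal."""
    steps = np.diff(traj, axis=0)
    return np.sum(np.linalg.norm(steps, axis=1)) + np.linalg.norm(traj[-1] - goal)

def avoid_region_penalty(center, radius, weight=10.0):
    """Cost term penalizing waypoints that enter a circular region."""
    def penalty(traj):
        d = np.linalg.norm(traj - center, axis=1)
        return weight * np.sum(np.maximum(0.0, radius - d))
    return penalty

def corrected_cost(traj, goal, corrections):
    """Original objective plus all penalties derived from language feedback."""
    return base_cost(traj, goal) + sum(p(traj) for p in corrections)

traj = np.array([[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]])
goal = np.array([1.0, 1.0])
corrections = [avoid_region_penalty(np.array([0.5, 0.5]), radius=0.2)]
print(corrected_cost(traj, goal, corrections))
```

Because corrections compose additively, a user can stack several pieces of feedback, and the planner simply re-optimizes the trajectory under the updated objective.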

Robots Enact Malignant Stereotypes

This paper finds that robots powered by large datasets and Dissolution Models that contain humans risk physically amplifying malignant stereotypes in general, and recommends that robot learning methods that physically manifest stereotypes or other harmful outcomes be paused, reworked, or even wound down when appropriate, until outcomes can be proven safe, effective, and just.

D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding

This work presents D3Net, an end-to-end neural speaker-listener architecture that can detect, describe, and discriminate, and outperforms SOTA methods in both tasks on the ScanRefer dataset, surpassing the SOTA 3D dense captioning method by a significant margin.

D3Net: A Speaker-Listener Architecture for Semi-supervised Dense Captioning and Visual Grounding in RGB-D Scans

D3Net is presented, an end-to-end neural speaker-listener architecture that can detect, describe, and discriminate; it introduces discriminability during object caption generation and enables semi-supervised training on ScanNet data with partially annotated descriptions.

Voxel-informed Language Grounding

The Voxel-informed Language Grounder is presented, a language grounding model that leverages 3D geometric information in the form of voxel maps derived from the visual input with a volumetric reconstruction model, significantly improving grounding accuracy on SNARE, an object reference game task.

TriCoLo: Trimodal Contrastive Loss for Fine-grained Text to Shape Retrieval

This work shows that with large-batch contrastive learning the authors achieve SoTA on text-to-shape retrieval without complex attention mechanisms or losses, and proposes a trimodal learning scheme to achieve even higher performance and better representations for all modalities.

TriCoLo: Trimodal Contrastive Loss for Text to Shape Retrieval

It is shown that with large-batch contrastive learning the authors achieve SoTA on text-to-shape retrieval without complex attention mechanisms or losses, and that a trimodal learning scheme can lead to even higher performance and better representations for all modalities.
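
The trimodal scheme described above can be sketched as a symmetric contrastive (InfoNCE-style) objective applied to every pair of modalities. This is an illustrative NumPy sketch, not TriCoLo's implementation: the embedding sizes, the temperature of 0.07, and the function names `info_nce` and `trimodal_loss` are all assumptions for the example.

```python
import numpy as np

# Illustrative sketch of a trimodal contrastive objective: embeddings of the
# text, image, and voxel views of the same shape are aligned with a symmetric
# InfoNCE loss over each pair of modalities. Dimensions and the temperature
# are assumptions, not the paper's values.

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches of embeddings (rows correspond)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature                 # (batch, batch) similarities
    idx = np.arange(len(a))
    # Cross-entropy with the matching row as the positive, in both directions.
    log_sm_ab = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_sm_ba = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    return -(log_sm_ab[idx, idx].mean() + log_sm_ba[idx, idx].mean()) / 2

def trimodal_loss(text, image, voxel):
    """Average pairwise contrastive loss over the three modality pairs."""
    return (info_nce(text, image)
            + info_nce(text, voxel)
            + info_nce(image, voxel)) / 3

rng = np.random.default_rng(0)
text, image, voxel = (rng.normal(size=(4, 8)) for _ in range(3))
print(trimodal_loss(text, image, voxel))
```

In this framing, "trimodal" simply means the pairwise loss is summed over text-image, text-voxel, and image-voxel, so all three encoders are pulled into a shared embedding space.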

Affection: Learning Affective Explanations for Real-World Visual Data

In this work, we explore the emotional reactions that real-world images tend to induce by using natural language as the medium to express the rationale behind an affective response to a given visual…

Music-to-Text Synaesthesia: Generating Descriptive Text from Music Recordings

In this paper, we consider a novel research problem, music-to-text synaesthesia. Different from the classical music tagging problem that classifies a music recording into pre-defined categories, the…

Grounding Language Attributes to Objects using Bayesian Eigenobjects

A system to disambiguate object instances within the same class based on simple physical descriptions, designed to learn from only a small amount of human-labeled language data and generalize to viewpoints not represented in the language-annotated depth image training set.

ShapeGlot: Learning Language for Shape Differentiation

A practical approach to language grounding is illustrated, and a novel case study in the relationship between object shape and linguistic structure when it comes to object differentiation is provided.

INGRESS: Interactive visual grounding of referring expressions

INGRESS, a robot system that follows human natural language instructions to pick and place everyday objects, is presented; its proposed two-stage neural-network grounding model outperformed a state-of-the-art method on the RefCOCO dataset and in robot experiments with humans.

Improving Robot Success Detection using Static Object Data

It is shown that adding static data about the objects themselves improves the performance of an end-to-end pipeline for classifying action outcomes, and achieves up to a 57% absolute gain over the task baseline on pairs of previously unseen objects.

Grounding Language in Play

A simple and scalable way to condition policies on human language instead of language pairing is presented, and a simple technique that transfers knowledge from large unlabeled text corpora to robotic learning is introduced, significantly improving downstream robotic manipulation.

VisualBERT: A Simple and Performant Baseline for Vision and Language

Analysis demonstrates that VisualBERT can ground elements of language to image regions without any explicit supervision and is even sensitive to syntactic relationships, tracking, for example, associations between verbs and image regions corresponding to their arguments.

Sim-to-Real Transfer for Vision-and-Language Navigation

To bridge the gap between the high-level discrete action space learned by the VLN agent, and the robot's low-level continuous action space, a subgoal model is proposed to identify nearby waypoints, and domain randomization is used to mitigate visual domain differences.

Generalized Grounding Graphs: A Probabilistic Framework for Understanding Grounded Commands

The framework, called Generalized Grounding Graphs (G3), addresses these issues by defining a probabilistic graphical model dynamically according to the linguistic parse structure of a natural language command, enabling robots to learn word meanings and use those learned meanings to robustly follow natural language commands produced by untrained users.
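
The core idea, a grounding score that factorizes over the parse structure, can be sketched in a few lines. This is a toy illustration, not the paper's model: the hand-written `LEXICON` of phrase-grounding probabilities stands in for the learned word meanings, and the names `factor` and `command_score` are invented for the example.

```python
import math

# Toy sketch of the G3 factorization: each constituent of the parsed command
# contributes a factor linking a phrase to a candidate grounding (an object,
# place, or path), and a full assignment is scored by the product of its
# factors. The probabilities below are hand-picked assumptions, not learned.

LEXICON = {
    ("the red block", "block_1"): 0.9,
    ("the red block", "block_2"): 0.1,
    ("on the table", "table_1"): 0.8,
    ("on the table", "floor_1"): 0.2,
}

def factor(phrase, grounding):
    """Correspondence probability for one phrase-grounding pair."""
    return LEXICON.get((phrase, grounding), 1e-6)

def command_score(assignments):
    """Product of per-constituent factors, accumulated in log space."""
    return math.exp(sum(math.log(factor(p, g)) for p, g in assignments))

best = command_score([("the red block", "block_1"), ("on the table", "table_1")])
worse = command_score([("the red block", "block_2"), ("on the table", "floor_1")])
print(best, worse)  # the correct assignment scores higher
```

Inference in the real framework searches over groundings to maximize this kind of factorized score, with the set of factors determined dynamically by the command's parse.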

Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments

This work provides the first benchmark dataset for visually-grounded natural language navigation in real buildings, the Room-to-Room (R2R) dataset, and presents the Matterport3D Simulator, a large-scale reinforcement learning environment based on real imagery.

Improving Vision-and-Language Navigation with Image-Text Pairs from the Web

VLN-BERT, a visiolinguistic transformer-based model for scoring the compatibility between an instruction ('...stop at the brown sofa') and a sequence of panoramic RGB images captured by the agent, is developed.