PIGLeT: Language Grounding Through Neuro-Symbolic Interaction in a 3D World

Rowan Zellers, Ari Holtzman, Matthew E. Peters, Roozbeh Mottaghi, Aniruddha Kembhavi, Ali Farhadi, Yejin Choi
We propose PIGLeT: a model that learns physical commonsense knowledge through interaction, and then uses this knowledge to ground language. We factorize PIGLeT into a physical dynamics model, and a separate language model. Our dynamics model learns not just what objects are but also what they do: glass cups break when thrown, plastic ones don’t. We then use it as the interface to our language model, giving us a unified model of linguistic form and grounded meaning. PIGLeT can read a sentence… 
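The factorization described in the abstract can be illustrated with a minimal sketch. Everything below is a toy, rule-based stand-in with hypothetical class and method names, not the paper's actual neural models or API: a dynamics model predicts how a symbolic object state changes under an action, and a separate language interface turns the predicted state change into text.

```python
from dataclasses import dataclass, replace

# Hypothetical sketch of a PIGLeT-style factorization: a symbolic object
# state, a dynamics model predicting post-action states, and a language
# interface. Names and rules here are illustrative assumptions only.

@dataclass(frozen=True)
class ObjectState:
    name: str
    material: str      # e.g. "glass", "plastic"
    is_broken: bool = False

class DynamicsModel:
    """Toy rule-based stand-in for the learned physical dynamics model."""
    def step(self, obj: ObjectState, action: str) -> ObjectState:
        if action == "throw" and obj.material == "glass":
            return replace(obj, is_broken=True)  # glass cups break when thrown
        return obj                               # plastic ones don't

class LanguageInterface:
    """Maps a predicted symbolic state change to a natural-language sentence."""
    def describe(self, before: ObjectState, after: ObjectState) -> str:
        if after.is_broken and not before.is_broken:
            return f"The {before.material} {before.name} breaks."
        return f"The {before.material} {before.name} is unchanged."

dynamics = DynamicsModel()
language = LanguageInterface()

cup = ObjectState(name="cup", material="glass")
after = dynamics.step(cup, "throw")
print(language.describe(cup, after))  # -> The glass cup breaks.
```

The point of the factorization is that the language component never needs to model physics itself; it only reads and verbalizes the symbolic states the dynamics component produces.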


Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

It is shown how low-level skills can be combined with large language models so that the language model provides high-level knowledge about the procedures for performing complex and temporally extended instructions, while value functions associated with these skills provide the grounding necessary to connect this knowledge to a particular physical environment.

Pretraining on Interactions for Learning Grounded Affordance Representations

A neural network is trained to predict objects' trajectories in a simulated interaction and it is shown that the network's latent representations differentiate between both observed and unobserved affordances.

Enriching Language Models with Visually-grounded Word Vectors and the Lancaster Sensorimotor Norms

It is found that enriching language models with the Lancaster norms and image vectors improves results in both tasks, with some implications for robust language models that capture holistic linguistic meaning in a language learning context.

Transferring Knowledge from Vision to Language: How to Achieve it and how to Measure it?

The use of visual data to complement the knowledge of large language models is investigated by proposing a method for evaluating visual knowledge transfer to text for uni- or multimodal language models and a model architecture that involves a visual imagination step is introduced.

Skill Induction and Planning with Latent Language

A framework for learning hierarchical policies from demonstrations, using sparse natural language annotations to guide the discovery of reusable skills for autonomous decision-making, achieves performance comparable to state-of-the-art models on ALFRED success rate and outperforms several recent methods that have access to ground-truth plans.

Contextualized Sensorimotor Norms: multi-dimensional measures of sensorimotor strength for ambiguous English words, in context

Most large language models are trained on linguistic input alone, yet humans appear to ground their understanding of words in sensorimotor experience. A natural solution is to…

Visual Commonsense in Pretrained Unimodal and Multimodal Models

The Visual Commonsense Tests (ViComTe) dataset is created and results indicate that multimodal models better reconstruct attribute distributions, but are still subject to reporting bias, and increasing model size does not enhance performance, suggesting that the key to visual commonsense lies in the data.

Distributional Semantics Still Can’t Account for Affordances

Can we know a word by the company it keeps? Aspects of meaning that concern physical interactions might be particularly difficult to learn from language alone. Glenberg and Robertson (2000) found…

A Theory of Natural Intelligence

It is proposed that the structural regularity of the brain takes the form of net fragments (self-organized network patterns) and that these serve as the powerful inductive bias that enables the brain to learn quickly, generalize from few examples and bridge the gap between abstractly defined general goals and concrete situations.

Word Acquisition in Neural Language Models

It is found that the effects of concreteness, word length, and lexical class are pointedly different in children and language models, reinforcing the importance of interaction and sensorimotor experience in child language acquisition.



A Benchmark for Systematic Generalization in Grounded Language Understanding

A new benchmark, gSCAN, is introduced for evaluating compositional generalization in models of situated language understanding, taking inspiration from standard models of meaning composition in formal linguistics and defining a language grounded in the states of a grid world.

Jointly Learning to Parse and Perceive: Connecting Natural Language to the Physical World

This paper introduces Logical Semantics with Perception (LSP), a model for grounded language acquisition that learns to map natural language statements to their referents in a physical environment and finds that LSP outperforms existing, less expressive models that cannot represent relational language.

ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks

It is shown that a baseline model based on recent embodied vision-and-language tasks performs poorly on ALFRED, suggesting that there is significant room for developing innovative grounded visual language understanding models with this benchmark.

A Joint Model of Language and Perception for Grounded Attribute Learning

This work presents an approach for joint learning of language and perception models for grounded attribute induction, which includes a language model based on a probabilistic categorial grammar that enables the construction of compositional meaning representations.

Generalization without Systematicity: On the Compositional Skills of Sequence-to-Sequence Recurrent Networks

This paper introduces the SCAN domain, consisting of a set of simple compositional navigation commands paired with the corresponding action sequences, and tests the zero-shot generalization capabilities of a variety of recurrent neural networks trained on SCAN with sequence-to-sequence methods.

Extending Machine Language Models toward Human-Level Language Understanding

This work describes existing machine models linking language to concrete situations, and points toward extensions to address more abstract cases.

From Recognition to Cognition: Visual Commonsense Reasoning

To move towards cognition-level understanding, a new reasoning engine is presented, Recognition to Cognition Networks (R2C), that models the necessary layered inferences for grounding, contextualization, and reasoning.

Embodied Question Answering

A new AI task is proposed in which an agent is spawned at a random location in a 3D environment and asked a question ('What color is the car?'); the agent must first intelligently navigate to explore the environment, gather the necessary visual information through first-person (egocentric) vision, and then answer the question.

Actions ~ Transformations

A novel representation for actions is proposed by modeling an action as a transformation which changes the state of the environment before the action happens (precondition) to the state after the action (effect).

IQA: Visual Question Answering in Interactive Environments

The Hierarchical Interactive Memory Network (HIMN), consisting of a factorized set of controllers, allowing the system to operate at multiple levels of temporal abstraction, is proposed, and outperforms popular single controller based methods on IQUAD V1.