PIGLeT: Language Grounding Through Neuro-Symbolic Interaction in a 3D World

@inproceedings{Zellers2021PIGLeTLG,
  title={PIGLeT: Language Grounding Through Neuro-Symbolic Interaction in a 3D World},
  author={Rowan Zellers and Ari Holtzman and Matthew E. Peters and Roozbeh Mottaghi and Aniruddha Kembhavi and Ali Farhadi and Yejin Choi},
  booktitle={ACL},
  year={2021}
}
We propose PIGLeT: a model that learns physical commonsense knowledge through interaction, and then uses this knowledge to ground language. We factorize PIGLeT into a physical dynamics model, and a separate language model. Our dynamics model learns not just what objects are but also what they do: glass cups break when thrown, plastic ones don’t. We then use it as the interface to our language model, giving us a unified model of linguistic form and grounded meaning. PIGLeT can read a sentence… 
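
The factorization described above can be pictured as two small modules: a dynamics model that maps an object's pre-action state and an action to its post-action state, and a thin language interface that turns a sentence embedding into the action driving that transition. The sketch below is only an illustration of this idea under assumed interfaces (the class names, dimensionalities, and the linear sentence-to-action projection are hypothetical choices, not the authors' released implementation).

# Minimal, illustrative sketch of the PIGLeT-style factorization described in the
# abstract. All names and sizes here are assumptions for illustration only.
import torch
import torch.nn as nn


class DynamicsModel(nn.Module):
    """Predicts an object's post-action state from its pre-action state and an action."""

    def __init__(self, state_dim: int = 256, action_dim: int = 64, hidden_dim: int = 512):
        super().__init__()
        self.transition = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, state_dim),
        )

    def forward(self, object_state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        # e.g. "throw" applied to a glass cup should land near a "broken" state,
        # while the same action applied to a plastic cup should not.
        return self.transition(torch.cat([object_state, action], dim=-1))


class GroundedLanguageInterface(nn.Module):
    """Couples a sentence encoder's output to the dynamics model, so language drives state predictions."""

    def __init__(self, text_dim: int = 768, state_dim: int = 256, action_dim: int = 64):
        super().__init__()
        self.to_action = nn.Linear(text_dim, action_dim)  # sentence embedding -> action vector
        self.dynamics = DynamicsModel(state_dim, action_dim)

    def forward(self, sentence_embedding: torch.Tensor, object_state: torch.Tensor) -> torch.Tensor:
        action = self.to_action(sentence_embedding)   # "read a sentence" describing an action
        return self.dynamics(object_state, action)    # predict how the object's state changes


# Usage with placeholder tensors standing in for a real sentence encoder and world state.
model = GroundedLanguageInterface()
sentence = torch.randn(1, 768)   # embedding of, say, "The robot throws the glass cup."
state = torch.randn(1, 256)      # symbolic state of the cup before the action
next_state = model(sentence, state)
print(next_state.shape)          # torch.Size([1, 256])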

Citations

Enriching Language Models with Visually-grounded Word Vectors and the Lancaster Sensorimotor Norms
TLDR
It is found that enriching language models with the Lancaster norms and image vectors improves results in both tasks, with some implications for robust language models that capture holistic linguistic meaning in a language learning context.
Transferring Knowledge from Vision to Language: How to Achieve it and how to Measure it?
TLDR
The use of visual data to complement the knowledge of large language models is investigated by proposing a method for evaluating visual knowledge transfer to text for uni- or multimodal language models and a model architecture that involves a visual imagination step is introduced.
Skill Induction and Planning with Latent Language
We present a framework for learning hierarchical policies from demonstrations, using sparse natural language annotations to guide the discovery of reusable skills for autonomous decision-making…
Contextualized Sensorimotor Norms: multi-dimensional measures of sensorimotor strength for ambiguous English words, in context
Most large language models are trained on linguistic input alone, yet humans appear to ground their understanding of words in sensorimotor experience. A natural solution is to augment LM…
Learning Bidirectional Translation between Descriptions and Actions with Small Paired Data
TLDR
A two-stage training method for bidirectional translation between descriptions and actions using small paired data and the results showed that the method performed well, even when the amount of paired data to train was small.
A Theory of Natural Intelligence
TLDR
It is proposed that the structural regularity of the brain takes the form of net fragments (self-organized network patterns) and that these serve as the powerful inductive bias that enables the brain to learn quickly, generalize from few examples and bridge the gap between abstractly defined general goals and concrete situations.
Word Acquisition in Neural Language Models
TLDR
It is found that the effects of concreteness, word length, and lexical class are pointedly different in children and language models, reinforcing the importance of interaction and sensorimotor experience in child language acquisition.
Visual Commonsense in Pretrained Unimodal and Multimodal Models
TLDR
The Visual Commonsense Tests (ViComTe) dataset is created and it is shown that grounded color data correlates much better than ungrounded text-only data with crowdsourced color judgments provided by Paik et al. (2021).
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
TLDR
These experiments validate that SayCan can execute temporally extended, complex, and abstract instructions, and that grounding the LLM in the real world via affordances nearly doubles the performance over the non-grounded baselines.
Commonsense Knowledge Reasoning and Generation with Pre-trained Language Models: A Survey
TLDR
A survey is presented of commonsense knowledge acquisition and reasoning tasks, the strengths and weaknesses of state-of-the-art pre-trained models for commonsense reasoning and generation as revealed by these tasks, and reflections on future research directions.

References

Showing 1-10 of 43 references
Jointly Learning to Parse and Perceive: Connecting Natural Language to the Physical World
TLDR
This paper introduces Logical Semantics with Perception (LSP), a model for grounded language acquisition that learns to map natural language statements to their referents in a physical environment and finds that LSP outperforms existing, less expressive models that cannot represent relational language.
ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks
TLDR
It is shown that a baseline model based on recent embodied vision-and-language tasks performs poorly on ALFRED, suggesting that there is significant room for developing innovative grounded visual language understanding models with this benchmark.
A Joint Model of Language and Perception for Grounded Attribute Learning
TLDR
This work presents an approach for joint learning of language and perception models for grounded attribute induction, which includes a language model based on a probabilistic categorial grammar that enables the construction of compositional meaning representations.
Generalization without Systematicity: On the Compositional Skills of Sequence-to-Sequence Recurrent Networks
TLDR
This paper introduces the SCAN domain, consisting of a set of simple compositional navigation commands paired with the corresponding action sequences, and tests the zero-shot generalization capabilities of a variety of recurrent neural networks trained on SCAN with sequence-to-sequence methods.
Embodied attention and word learning by toddlers
From Recognition to Cognition: Visual Commonsense Reasoning
TLDR
To move towards cognition-level understanding, a new reasoning engine is presented, Recognition to Cognition Networks (R2C), that models the necessary layered inferences for grounding, contextualization, and reasoning.
Embodied Question Answering
TLDR
A new AI task where an agent is spawned at a random location in a 3D environment and asked a question ('What color is the car?'), and the agent must first intelligently navigate to explore the environment, gather necessary visual information through first-person (egocentric) vision, and then answer the question.
The Fast and the Flexible: Training Neural Networks to Learn to Follow Instructions from Small Data
TLDR
Control experiments show that when the network is exposed to familiar instructions that contain novel words, the model adapts very efficiently to the new vocabulary; even for human speakers whose language usage departs significantly from the authors' artificial training language, the network can use its automatically acquired inductive bias to learn to follow instructions more effectively.
Actions ~ Transformations
TLDR
A novel representation for actions is proposed by modeling an action as a transformation which changes the state of the environment before the action happens (precondition) to the state after the action (effect).
Language Models are Unsupervised Multitask Learners
TLDR
It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.