Corpus ID: 237581628

Does Vision-and-Language Pretraining Improve Lexical Grounding?

Tian Yun, Chen Sun, Ellie Pavlick
Linguistic representations derived from text alone have been criticized for their lack of grounding, i.e., connecting words to their meanings in the physical world. Vision-and-Language (VL) models, trained jointly on text and image or video data, have been offered as a response to such criticisms. However, while VL pretraining has shown success on multimodal tasks such as visual question answering, it is not yet known how the internal linguistic representations themselves compare to their text…

Figures and Tables from this paper


Vokenization: Improving Language Understanding via Contextualized, Visually-Grounded Supervision
A technique named "vokenization" is developed that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images (which the authors call "vokens").
Probing Pretrained Language Models for Lexical Semantics
A systematic empirical analysis across six typologically diverse languages and five different lexical tasks indicates patterns and best practices that hold universally, but also points to prominent variations across languages and tasks.
Combining Language and Vision with a Multimodal Skip-gram Model
Since they propagate visual information to all words, the MMSKIP-GRAM models discover intriguing visual properties of abstract words, paving the way to realistic implementations of embodied theories of meaning.
Language Models are Unsupervised Multitask Learners
It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
What Does BERT with Vision Look At?
It is demonstrated that certain attention heads of a visually grounded language model actively ground elements of language to image regions, performing the task known as entity grounding.
Improving Vision-and-Language Navigation with Image-Text Pairs from the Web
VLN-BERT, a visiolinguistic transformer-based model for scoring the compatibility between an instruction ('...stop at the brown sofa') and a sequence of panoramic RGB images captured by the agent, is developed.
Do Neural Language Representations Learn Physical Commonsense?
While recent advancements of neural language models have demonstrated strong performance on various types of natural language inference tasks, this study based on a dataset of over 200k newly collected annotations suggests that neural language representations still only learn associations that are explicitly written down.
What do you learn from context? Probing for sentence structure in contextualized word representations
A novel edge probing task design is introduced and a broad suite of sub-sentence tasks derived from the traditional structured NLP pipeline are constructed to investigate how sentence structure is encoded across a range of syntactic, semantic, local, and long-range phenomena.
Learning Transferable Visual Models From Natural Language Supervision
It is demonstrated that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
What BERT Is Not: Lessons from a New Suite of Psycholinguistic Diagnostics for Language Models
A suite of diagnostics drawn from human language experiments is introduced, which allows us to ask targeted questions about information used by language models for generating predictions in context, and the popular BERT model is applied.