Vokenization: Improving Language Understanding via Contextualized, Visually-Grounded Supervision

Hao Tan and Mohit Bansal
Conference on Empirical Methods in Natural Language Processing

Humans learn language by listening, speaking, writing, reading, and also, via interaction with the multimodal real world. Existing language pre-training frameworks show the effectiveness of text-only self-supervision while we explore the idea of a visually-supervised language model in this paper. We find that the main reason hindering this exploration is the large divergence in magnitude and distributions between the visually-grounded language datasets and pure-language corpora. Therefore, we… 

Word Representation Learning in Multimodal Pre-Trained Transformers: An Intrinsic Evaluation

A generalized advantage of multimodal representations over language-only ones is observed on concrete word pairs but not on abstract ones. This confirms the effectiveness of these models at aligning language and vision, which results in better semantic representations for concepts that are grounded in images.

Does Vision-and-Language Pretraining Improve Lexical Grounding?

It is found that the multimodal models fail to significantly outperform the text-only variants, suggesting that future work is required if multimodal pretraining is to be pursued as a means of improving NLP in general.

VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer

This work presents VidLanKD, a video-language knowledge distillation method for improving language understanding, which achieves consistent improvements over text-only language models and vokenization models on several downstream language understanding tasks, including GLUE, SQuAD, and SWAG.

UNIMO-2: End-to-End Unified Vision-Language Grounded Learning

This paper proposes an end-to-end unified-modal pre-training framework, namely UNIMO-2, for joint learning on both aligned image-caption data and unaligned image-only and text-only corpus, and builds a unified Transformer model to jointly learn visual representations, textual representations and semantic alignment between images and texts.

Visual Grounding Strategies for Text-Only Natural Language Processing

This work proposes two strategies for applying multimodal models to text-only tasks: using a placeholder to replace the image input, and using image retrieval to match texts with related images during both pretraining and text-only downstream tasks.

How to Adapt Pre-trained Vision-and-Language Models to a Text-only Input?

The evaluations on both GLUE and Visual Property Norms show that care should be taken when adapting VL models to zero-shot text-only tasks, while the models are less sensitive to how the authors adapt them to non-zero-shot tasks, indicating that current VL models do not necessarily gain better language understanding from their multimodal training.

Visually-augmented pretrained language models for NLP tasks without images

A novel visually-augmented fine-tuning approach, VAWI, is proposed; it can be applied generally to various PLMs and NLP tasks without using any retrieved or generated images.

Language with Vision: a Study on Grounded Word and Sentence Embeddings

A series of evaluations on word similarity benchmarks shows that visual grounding is beneficial not only for concrete words, but also for abstract words, as well as for contextualized embeddings trained on corpora of relatively modest size.

Unsupervised Vision-and-Language Pre-training Without Parallel Images and Captions

This work proposes “mask-and-predict” pre-training on text-only and image-only corpora, introducing object tags detected by an object recognition model as anchor points to bridge the two modalities. It finds that this simple approach achieves performance close to that of a model pre-trained with aligned data on four English V&L benchmarks.

MCSE: Multimodal Contrastive Learning of Sentence Embeddings

This work proposes a sentence embedding learning approach that exploits both visual and textual information via a multimodal contrastive objective and shows that this model excels in aligning semantically similar sentences, providing an explanation for its improved performance.



Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.

Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks

This paper proposes a new learning method Oscar (Object-Semantics Aligned Pre-training), which uses object tags detected in images as anchor points to significantly ease the learning of alignments.

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a

Cross-lingual Language Model Pretraining

This work proposes two methods to learn cross-lingual language models (XLMs): an unsupervised one that relies only on monolingual data, and a supervised one that leverages parallel data with a new cross-lingual language model objective.

Visually Grounded Neural Syntax Acquisition

We present the Visually Grounded Neural Syntax Learner (VG-NSL), an approach for learning syntactic representations and structures without any explicit supervision. The model learns by looking at

Unified Vision-Language Pre-Training for Image Captioning and VQA

VLP is the first reported model that achieves state-of-the-art results on both vision-language generation and understanding tasks, as disparate as image captioning and visual question answering, across three challenging benchmark datasets: COCO Captions, Flickr30k Captions and VQA 2.0.

Combining Language and Vision with a Multimodal Skip-gram Model

Since they propagate visual information to all words, the MMSKIP-GRAM models discover intriguing visual properties of abstract words, paving the way to realistic implementations of embodied theories of meaning.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

VL-BERT: Pre-training of Generic Visual-Linguistic Representations

A new pre-trainable generic representation for visual-linguistic tasks, called Visual-Linguistic BERT (VL-BERT), which adopts the simple yet powerful Transformer model as the backbone, and extends it to take both visual and linguistic embedded features as input.

Supervised Learning of Universal Sentence Representations from Natural Language Inference Data

It is shown how universal sentence representations trained using the supervised data of the Stanford Natural Language Inference datasets can consistently outperform unsupervised methods like SkipThought vectors on a wide range of transfer tasks.