Language with Vision: a Study on Grounded Word and Sentence Embeddings

@article{Shahmohammadi2022LanguageWV,
  title={Language with Vision: a Study on Grounded Word and Sentence Embeddings},
  author={Hassan Shahmohammadi and Maria Heitmeier and Elnaz Shafaei-Bajestan and Hendrik P. A. Lensch and Harald Baayen},
  journal={ArXiv},
  year={2022},
  volume={abs/2206.08823}
}
Language grounding to vision is an active field of research aiming to enrich text-based representations of word meanings by leveraging perceptual knowledge from vision. Despite many attempts at language grounding, it is still unclear how to effectively inject visual knowledge into the word embeddings of a language in such a way that a proper balance of textual and visual knowledge is maintained. Some common concerns are the following. Is visual grounding beneficial for abstract words or is its… 
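The abstract above concerns injecting visual knowledge into pretrained textual word embeddings while keeping textual and visual information in balance. As an illustration only, and not the specific model proposed in this paper, the sketch below shows a common alignment-based grounding recipe: a small mapping network projects word embeddings towards paired image features under a max-margin loss, and the intermediate "grounded" vectors serve as visually enriched embeddings. All dimensions, layer choices, and names are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroundingMapper(nn.Module):
    """Maps pretrained textual word embeddings into an image-feature space
    via an intermediate 'grounded' space (illustrative sketch only)."""
    def __init__(self, text_dim=300, img_dim=2048, grounded_dim=1024):
        super().__init__()
        self.to_grounded = nn.Linear(text_dim, grounded_dim)  # shared grounded space
        self.to_image = nn.Linear(grounded_dim, img_dim)      # alignment head

    def forward(self, text_emb):
        grounded = torch.tanh(self.to_grounded(text_emb))     # visually grounded embedding
        return grounded, self.to_image(grounded)              # plus predicted image features

def max_margin_loss(pred_img, true_img, margin=0.2):
    """Contrastive loss: each predicted vector should be closer to its own
    image feature than to the image features of other words in the batch."""
    sims = F.normalize(pred_img, dim=-1) @ F.normalize(true_img, dim=-1).t()
    pos = sims.diag().unsqueeze(1)                      # similarity to the matching image
    margins = (margin - pos + sims).clamp(min=0)        # margin violations, shape (B, B)
    off_diag = 1.0 - torch.eye(sims.size(0), device=sims.device)
    return (margins * off_diag).sum() / off_diag.sum()  # ignore the positive pairs

# Illustrative usage with random stand-ins for word and image embeddings.
mapper = GroundingMapper()
words = torch.randn(32, 300)    # e.g., fastText vectors for 32 words
images = torch.randn(32, 2048)  # e.g., CNN features of their associated images
grounded, pred = mapper(words)
loss = max_margin_loss(pred, images)
```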


Visual grounding of abstract and concrete words: A response to Günther et al. (2020)

The proposed model aligns word embeddings with their corresponding visual representations without deteriorating the knowledge captured by textual distributional information; the model is then applied to a behavioral experiment that addresses the plausibility of having visual mental representations for abstract words.

Visual Grounding of Inter-lingual Word-Embeddings

This study investigates the inter-lingual visual grounding of word embeddings in English, Arabic, and German, and proposes an implicit alignment technique between the two spaces of vision and language in which inter-lingual textual information interacts in order to enrich pre-trained textual word embeddings.

References

SHOWING 1-10 OF 112 REFERENCES

Visual grounding of abstract and concrete words: A response to Günther et al. (2020)

The proposed model aligns word embeddings with their corresponding visual representations without deteriorating the knowledge captured by textual distributional information; the model is then applied to a behavioral experiment that addresses the plausibility of having visual mental representations for abstract words.

Learning Zero-Shot Multifaceted Visually Grounded Word Embeddings via Multi-Task Training

This paper argues that, since concrete and abstract words are processed differently in the brain, such approaches sacrifice the abstract knowledge obtained from textual statistics while acquiring perceptual information, and that implicit grounding of the word embeddings is needed instead.

Does Vision-and-Language Pretraining Improve Lexical Grounding?

It is found that the multimodal models fail to significantly outperform the text-only variants, suggesting that further work is required if multimodal pretraining is to be pursued as a means of improving NLP in general.

Vokenization: Improving Language Understanding via Contextualized, Visually-Grounded Supervision

A technique named "vokenization" is developed that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images (which the authors call "vokens").
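The summary above gives the gist of vokenization: each token in a plain-text corpus is contextually matched to a related image, and these retrieved "vokens" supply an extra supervision signal during language-model pretraining. The sketch below illustrates that general idea in a simplified form; the shared embedding space for tokens and images, the image bank, and the classifier head are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def assign_vokens(token_embs, image_bank):
    """For each contextualized token embedding, retrieve the index of the most
    similar image embedding (its 'voken') from a fixed image bank.
    Assumes tokens and images already live in a shared embedding space."""
    t = F.normalize(token_embs, dim=-1)   # (seq_len, d)
    v = F.normalize(image_bank, dim=-1)   # (num_images, d)
    return (t @ v.t()).argmax(dim=-1)     # (seq_len,) voken ids

class VokenHead(nn.Module):
    """Auxiliary classifier that predicts the retrieved voken id for each token,
    trained alongside the usual language-modelling objective."""
    def __init__(self, hidden_dim, num_vokens):
        super().__init__()
        self.classifier = nn.Linear(hidden_dim, num_vokens)

    def forward(self, hidden_states, voken_ids):
        logits = self.classifier(hidden_states)   # (seq_len, num_vokens)
        return F.cross_entropy(logits, voken_ids)
```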

Visual Grounding Strategies for Text-Only Natural Language Processing

This work proposes two strategies for applying multimodal models to text-only tasks: using a placeholder to replace the image input, and harnessing image retrieval to match texts with related images during both pretraining and text-only downstream tasks.

Illustrative Language Understanding: Large-Scale Visual Grounding with Image Search

Picturebook, a large-scale lookup operation that grounds language via ‘snapshots’ of the physical world accessed through image search, is introduced, and gate activations corresponding to Picturebook embeddings are shown to be highly correlated with human judgments of concreteness.

Incorporating Visual Semantics into Sentence Representations within a Grounded Space

A model is proposed that transfers visual information to textual representations by learning an intermediate representation space, the grounded space, and this model is shown to outperform the previous state of the art on classification and semantic relatedness tasks.

Multimodal Distributional Semantics

This work proposes a flexible architecture for integrating text- and image-based distributional information, and shows in a set of empirical tests that the integrated model is superior to the purely text-based approach while providing somewhat complementary semantic information.

Imagined Visual Representations as Multimodal Embeddings

This paper presents a simple and effective method that learns a language-to-vision mapping and uses its output visual predictions to build multimodal representations, providing a cognitively plausible way of building representations that is consistent with the inherently reconstructive and associative nature of human memory.
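The method summarized above "imagines" visual representations for words by learning a text-to-vision mapping and then combining the predicted visual vectors with the original text vectors. Below is a minimal sketch of that idea using ridge regression on toy data; the dimensions, the regressor, and the concatenation scheme are illustrative assumptions rather than the paper's exact setup.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Toy stand-ins: 1000 words with 300-d text vectors; the first 200 of them
# also have 2048-d image features (e.g., averaged CNN features).
text_vecs = np.random.randn(1000, 300)
img_vecs = np.random.randn(200, 2048)
has_image = np.arange(200)                 # indices of words that have images

# 1. Learn a text-to-vision mapping on the words that do have images.
mapping = Ridge(alpha=1.0).fit(text_vecs[has_image], img_vecs)

# 2. "Imagine" visual vectors for every word, including unseen ones.
imagined = mapping.predict(text_vecs)      # (1000, 2048)

# 3. Build multimodal embeddings by concatenating L2-normalised text and
#    imagined visual vectors.
def l2norm(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

multimodal = np.concatenate([l2norm(text_vecs), l2norm(imagined)], axis=1)
```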

Learning Visually Grounded Sentence Representations

In this work, grounded sentence representations are investigated: a sentence encoder is trained to predict the image features of a given caption, and the resultant features are used as sentence representations.
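The setup described above trains a sentence encoder to predict the image features of its caption and then reuses the encoder's output as a general-purpose sentence representation. The sketch below shows one way such a predictive objective could look; the GRU encoder, feature sizes, and cosine loss are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroundedSentenceEncoder(nn.Module):
    """Encodes a caption with a GRU and is trained to predict the CNN features
    of the image the caption describes; the GRU state is then reused as a
    sentence representation for downstream tasks."""
    def __init__(self, vocab_size, emb_dim=300, hidden_dim=1024, img_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.project = nn.Linear(hidden_dim, img_dim)

    def forward(self, token_ids):
        _, h = self.encoder(self.embed(token_ids))  # h: (1, batch, hidden_dim)
        sent_repr = h.squeeze(0)                    # sentence representation
        return sent_repr, self.project(sent_repr)   # plus predicted image features

def grounding_loss(pred_img, true_img):
    """Maximise cosine similarity between predicted and true image features."""
    return 1 - F.cosine_similarity(pred_img, true_img, dim=-1).mean()
```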
...