Corpus ID: 233407647

Visually grounded models of spoken language: A survey of datasets, architectures and evaluation techniques

Grzegorz Chrupała
This survey provides an overview of the evolution of visually grounded models of spoken language over the last 20 years. Such models are inspired by the observation that when children pick up a language, they rely on a wide range of indirect and noisy clues, crucially including signals from the visual modality co-occurring with spoken utterances. Several fields have made important contributions to this approach to modeling or mimicking the process of learning language: Machine Learning, Natural…

Discrete representations in neural models of spoken language
A systematic analysis of the impact of architectural choices, the learning objective and training dataset, and the evaluation metric on the merits of four commonly used metrics in the context of weakly supervised models of spoken language finds that the different evaluation metrics can give inconsistent results.
Fast-Slow Transformer for Visually Grounding Speech
We present Fast-Slow Transformer for Visually Grounding Speech, or FaST-VGS. FaST-VGS is a Transformer-based model for learning the associations between raw speech waveforms and visual images. The…
ZR-2021VG: Zero-Resource Speech Challenge, Visually-Grounded Language Modelling track, 2021 edition (Version 2.0, final for NeurIPS)
Learning to comprehend and produce spoken languages is one of the hallmarks of human cognition, and the importance of speech communication also makes speech-based capabilities central to AI…
AVLnet: Learning Audio-Visual Language Representations from Instructional Videos
This work introduces Audio-Video Language Network (AVLnet), a self-supervised network that learns a shared audio-visual embedding space directly from raw video inputs, and performs analysis of AVLnet's learned representations, showing the model has learned to relate visual objects with salient words and natural sounds.


Representations of language in a model of visually grounded speech signal
An in-depth analysis of the representations used by different components of the trained model shows that encoding of semantic aspects tends to become richer as the authors go up the hierarchy of layers, whereas encoding of form-related aspects of the language input tends to initially increase and then plateau or decrease.
Symbolic Inductive Bias for Visually Grounded Learning of Spoken Language
This work proposes to use multitask learning to exploit existing transcribed speech within the end-to-end setting, and describes a three-task architecture which combines the objectives of matching spoken captions with corresponding images, speech with text, and text with images.
Unsupervised Learning of Spoken Language with Visual Context
A deep neural network model capable of rudimentary spoken language acquisition using untranscribed audio training data, whose only supervision comes in the form of contextually relevant visual images, is presented.
Visually Grounded Learning of Keyword Prediction from Untranscribed Speech
This work uses an image-to-words multi-label visual classifier to tag images with soft textual labels, and then trains a neural network to map from the speech to these soft targets, and shows that the resulting speech system is able to predict which words occur in an utterance without seeing any parallel speech and text.
Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input
In this paper, we explore neural network models that learn to associate segments of spoken audio captions with the semantically relevant portions of natural images that they refer to. We demonstrate…
Textual supervision for visually grounded spoken language understanding
Comparing different strategies, it is found that the pipeline approach works better when enough text is available, and that translations can be effectively used in place of transcriptions, but more data is needed to obtain similar results.
Vision as an Interlingua: Learning Multilingual Semantic Embeddings of Untranscribed Speech
This work uses spoken captions collected in English and Hindi to show that the same model architecture can be successfully applied to both languages, and shows that these models are capable of performing semantic cross-lingual speech-to-speech retrieval.
On the Contributions of Visual and Textual Supervision in Low-resource Semantic Speech Retrieval
A multitask learning approach to leverage both visual and textual modalities, with visual supervision in the form of keyword probabilities from an external tagger found to be helpful even in the presence of textual supervision.
Language learning using Speech to Image retrieval
This work improves on existing neural network approaches to create visually grounded embeddings for spoken utterances and shows that the visually grounded sentence encoder learns to recognise words from the input even though it is not explicitly trained for word recognition.
Towards situated speech understanding: visual context priming of language models
The underlying principles of this model may be applied to a wide range of speech understanding problems including mobile and assistive technologies in which contextual information can be sensed and semantically interpreted to bias processing.