Visually grounded models of spoken language: A survey of datasets, architectures and evaluation techniques

Grzegorz Chrupała. Journal of Artificial Intelligence Research.
This survey provides an overview of the evolution of visually grounded models of spoken language over the last 20 years. Such models are inspired by the observation that when children acquire a language, they rely on a wide range of indirect and noisy clues, crucially including signals from the visual modality co-occurring with spoken utterances. Several fields have made important contributions to this approach to modeling or mimicking the process of learning language: Machine Learning, Natural Language and Speech Processing, Computer Vision, and Cognitive Science.
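The family of models the survey covers typically embeds spoken utterances and images into a shared space and trains with a contrastive objective so that matching audio–image pairs score higher than mismatched ones. The sketch below illustrates the widely used batch hinge (triplet) loss; it is a minimal illustration under assumed names and hyperparameters, not the implementation of any specific paper listed here.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Project embeddings onto the unit sphere so dot products are cosines.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def contrastive_margin_loss(audio_emb, image_emb, margin=0.2):
    """Batch hinge loss common in visually grounded speech models:
    each spoken caption should be closer to its own image than to
    the other images in the batch (and vice versa) by a margin."""
    a = l2_normalize(audio_emb)
    v = l2_normalize(image_emb)
    sim = a @ v.T                      # (B, B) cosine similarities
    pos = np.diag(sim)                 # matching audio-image pairs
    # Hinge costs for both retrieval directions (speech->image, image->speech).
    cost_audio = np.maximum(0.0, margin + sim - pos[:, None])
    cost_image = np.maximum(0.0, margin + sim - pos[None, :])
    B = sim.shape[0]
    mask = 1.0 - np.eye(B)             # exclude the positive pairs themselves
    return float(((cost_audio + cost_image) * mask).sum() / B)
```

When the two encoders produce identical embeddings for matched pairs, every negative falls outside the margin and the loss is zero; misaligned pairings incur a positive cost, which is what drives the cross-modal retrieval results reported throughout the papers below.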


Cascaded Multilingual Audio-Visual Learning from Videos

A cascaded approach is proposed that leverages a model trained on English videos and applies it to audio-visual data in other languages, such as Japanese videos, showing nearly a 10x improvement in retrieval performance compared to training on the Japanese videos alone.

Video-Guided Curriculum Learning for Spoken Video Grounding

It is shown that, in the case of noisy sound, the proposed video-guided curriculum learning facilitates the pre-training of a mutual audio encoder, significantly improving performance on spoken video grounding tasks.

Deep Learning Scoring Model in the Evaluation of Oral English Teaching

This study aims to improve the accuracy of oral English recognition and to propose evaluation measures with better performance, building on related theories such as deep learning.

ConceptBeam: Concept Driven Target Speech Extraction

ConceptBeam is compared with two baseline methods, one based on keywords obtained from recognition systems and another based on sound source separation; it clearly outperforms both and effectively extracts speech based on the semantic representation.

Wnet: Audio-Guided Video Object Segmentation via Wavelet-Based Cross-Modal Denoising Networks

  • Wenwen Pan, Haonan Shi, Qi Tian
  • Computer Science
    2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2022
This paper proposes a wavelet-based encoder network to learn cross-modal representations of video contents with audio-form queries, and adopts multi-head cross-modal attention layers to explore the potential relations of video and query contents.

Self-Supervised Speech Representation Learning: A Review

This review presents approaches for self-supervised speech representation learning and their connection to other research areas, and reviews recent efforts on benchmarking learned representations to extend the application beyond speech recognition.

Word Discovery in Visually Grounded, Self-Supervised Speech Models

It is shown that powerful word segmentation and clustering capability emerges within the model’s self-attention heads, suggesting that the visual grounding task is a crucial component of the word discovery capability the authors observe.

Learning English with Peppa Pig

A simple bi-modal architecture, trained on the portion of the data consisting of dialog between characters and evaluated on segments containing descriptive narrations, succeeds at learning aspects of the visual semantics of spoken language.

Self-Supervised Representation Learning for Speech Using Visual Grounding and Masked Language Modeling

This paper describes submissions to the ZeroSpeech 2021 Challenge and the SUPERB benchmark, and introduces FaST-VGS+, a novel extension of this model that is learned in a multi-task fashion with a masked language modeling objective in addition to the visual grounding objective.

Keyword localisation in untranscribed speech using visually grounded speech models

This work investigates to what extent keyword localisation is possible using a visually grounded speech (VGS) model, and considers four ways to equip VGS models with localisation capabilities.

Representations of language in a model of visually grounded speech signal

An in-depth analysis of the representations used by different components of the trained model shows that encoding of semantic aspects tends to become richer as the authors go up the hierarchy of layers, whereas encoding of form-related aspects of the language input tends to initially increase and then plateau or decrease.

Symbolic Inductive Bias for Visually Grounded Learning of Spoken Language

This work proposes to use multitask learning to exploit existing transcribed speech within the end-to-end setting, and describes a three-task architecture which combines the objectives of matching spoken captions with corresponding images, speech with text, and text with images.

Unsupervised Learning of Spoken Language with Visual Context

A deep neural network model capable of rudimentary spoken language acquisition using untranscribed audio training data, whose only supervision comes in the form of contextually relevant visual images, is presented.

Visually Grounded Learning of Keyword Prediction from Untranscribed Speech

This work uses an image-to-words multi-label visual classifier to tag images with soft textual labels, then trains a neural network to map speech to these soft targets; the resulting speech system is able to predict which words occur in an utterance without seeing any parallel speech and text.
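The soft-label setup described above can be sketched as a multi-label objective: the speech network's per-word logits are trained against the visual tagger's word probabilities rather than hard transcriptions. The snippet below is an illustrative sketch under assumed shapes (utterances × vocabulary), with independent sigmoid outputs per word as in multi-label tagging; it is not the paper's actual code.

```python
import numpy as np

def soft_label_loss(logits, soft_targets, eps=1e-8):
    """Binary cross-entropy of the speech network's per-word logits
    against soft word probabilities produced by a visual tagger.
    Each vocabulary entry is treated as an independent sigmoid output."""
    probs = 1.0 / (1.0 + np.exp(-logits))   # sigmoid per word
    return float(-np.mean(soft_targets * np.log(probs + eps)
                          + (1 - soft_targets) * np.log(1 - probs + eps)))
```

Predictions that agree with the tagger's soft probabilities incur a low loss, while confident predictions of the wrong words are penalised heavily, so the speech network learns which words are likely present in an utterance without ever seeing a transcription.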

Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input

In this paper, neural network models are explored that learn to associate segments of spoken audio captions with the semantically relevant portions of the natural images they refer to.

Textual supervision for visually grounded spoken language understanding

Comparing different strategies, it is found that the pipeline approach works better when enough text is available, and that translations can be effectively used in place of transcriptions, though more data is needed to obtain similar results.

Vision as an Interlingua: Learning Multilingual Semantic Embeddings of Untranscribed Speech

This work uses spoken captions collected in English and Hindi to show that the same model architecture can be successfully applied to both languages, and shows that these models are capable of performing semantic cross-lingual speech-to-speech retrieval.

On the Contributions of Visual and Textual Supervision in Low-resource Semantic Speech Retrieval

A multitask learning approach is presented that leverages both visual and textual modalities, with visual supervision in the form of keyword probabilities from an external tagger found to be helpful even in the presence of textual supervision.

Language learning using Speech to Image retrieval

This work improves on existing neural network approaches to create visually grounded embeddings for spoken utterances and shows that the visually grounded sentence encoder learns to recognise words from the input even though it is not explicitly trained for word recognition.

Towards situated speech understanding: visual context priming of language models