Representations of language in a model of visually grounded speech signal

@inproceedings{Chrupala2017RepresentationsOL,
  title={Representations of language in a model of visually grounded speech signal},
  author={Grzegorz Chrupała and Lieke Gelderloos and Afra Alishahi},
  booktitle={ACL},
  year={2017}
}
We present a visually grounded model of speech perception which projects spoken utterances and images to a joint semantic space. We use a multi-layer recurrent highway network to model the temporal nature of speech, and show that it learns to extract both form- and meaning-based linguistic knowledge from the input signal. We carry out an in-depth analysis of the representations used by different components of the trained model and show that encoding of semantic aspects tends to become…
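As a rough illustration of this class of architecture, the sketch below pairs a recurrent speech encoder with a linear image projection and trains both against a margin-based ranking loss over a joint embedding space. It is a minimal sketch, not the authors' implementation: a stacked GRU stands in for the paper's multi-layer recurrent highway network, MFCC frames and pretrained CNN features are assumed as inputs, and all layer sizes are illustrative.

```python
# Minimal sketch (not the authors' implementation) of a visually grounded
# speech model: a recurrent speech encoder and a linear image projection
# mapped into a shared semantic space with a margin-based ranking loss.
# A stacked GRU is substituted for the paper's recurrent highway network.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechEncoder(nn.Module):
    def __init__(self, n_mfcc=13, hidden=512, layers=3, embed=512):
        super().__init__()
        # Stacked recurrent layers over acoustic frames (e.g. MFCC vectors).
        self.rnn = nn.GRU(n_mfcc, hidden, num_layers=layers, batch_first=True)
        self.proj = nn.Linear(hidden, embed)

    def forward(self, frames):            # frames: (batch, time, n_mfcc)
        states, _ = self.rnn(frames)
        utt = states[:, -1, :]            # final state summarizes the utterance
        return F.normalize(self.proj(utt), dim=-1)

class ImageEncoder(nn.Module):
    def __init__(self, feat_dim=4096, embed=512):
        super().__init__()
        # Linear map from pretrained CNN features to the joint space.
        self.proj = nn.Linear(feat_dim, embed)

    def forward(self, feats):             # feats: (batch, feat_dim)
        return F.normalize(self.proj(feats), dim=-1)

def contrastive_loss(speech_emb, image_emb, margin=0.2):
    """Hinge ranking loss: matched utterance/image pairs should score
    higher (by `margin`) than mismatched pairs within the batch."""
    scores = speech_emb @ image_emb.t()            # cosine similarities
    pos = scores.diag().unsqueeze(1)
    cost_im = F.relu(margin + scores - pos)        # wrong image for an utterance
    cost_sp = F.relu(margin + scores - pos.t())    # wrong utterance for an image
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    # Mean over all pairs, with the matched (diagonal) entries zeroed out.
    return (cost_im.masked_fill(mask, 0).mean()
            + cost_sp.masked_fill(mask, 0).mean())

# Toy forward/backward pass with random data.
speech, images = SpeechEncoder(), ImageEncoder()
loss = contrastive_loss(speech(torch.randn(8, 200, 13)),
                        images(torch.randn(8, 4096)))
loss.backward()
```

In a setup like this, speech-to-image retrieval (ranking images by similarity to an encoded utterance) provides both the training signal and the evaluation, and the per-layer activations of the speech encoder are the kind of representations the paper's analyses probe for form- and meaning-related information.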

Citations

Word Recognition, Competition, and Activation in a Model of Visually Grounded Speech
TLDR
It is found that not all speech frames play an equal role in the final encoded representation of a given word: some frames have a crucial effect on it. It is further suggested that word representations could be activated through a process of lexical competition.
A Spoken Language Dataset of Descriptions for Speech- and Percept-Based Learning
TLDR
A multimodal dataset of RGB+depth objects with spoken as well as textual descriptions is presented to enable researchers studying the intersection of robotics, NLP, and HCI to better investigate how the multiple modalities of image, depth, text, speech, and transcription interact.
Encoding of phonology in a recurrent neural model of grounded speech
TLDR
It is found that phoneme representations are most salient in the lower layers of the model, where low-level signals are processed at a fine-grained level, although a large amount of phonological information is retained at the top recurrent layer.
Symbolic Inductive Bias for Visually Grounded Learning of Spoken Language
TLDR
This work proposes to use multitask learning to exploit existing transcribed speech within the end-to-end setting, and describes a three-task architecture which combines the objectives of matching spoken captions with corresponding images, speech with text, and text with images.
Visually grounded models of spoken language: A survey of datasets, architectures and evaluation techniques
TLDR
An overview of the evolution of visually grounded models of spoken language over the last 20 years is provided, which discusses the central research questions addressed, the timeline of developments, and the datasets which enabled much of this work.
Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input
In this paper, we explore neural network models that learn to associate segments of spoken audio captions with the semantically relevant portions of natural images that they refer to. We demonstrate…
Learning Visually Grounded and Multilingual Representations
TLDR
A novel computational model of cross-situational word learning is proposed that takes images of natural scenes paired with their descriptions as input and incrementally learns probabilistic associations between words and image features.
Visually Grounded Learning of Keyword Prediction from Untranscribed Speech
TLDR
This work uses an image-to-words multi-label visual classifier to tag images with soft textual labels, and then trains a neural network to map from the speech to these soft targets, and shows that the resulting speech system is able to predict which words occur in an utterance without seeing any parallel speech and text.
Learning English with Peppa Pig
TLDR
A simple bi-modal architecture is trained on the portion of the data consisting of dialog between characters, evaluated on segments containing descriptive narrations of Peppa Pig, and shown to succeed at learning aspects of the visual semantics of spoken language.
Language learning using Speech to Image retrieval
TLDR
This work improves on existing neural network approaches to create visually grounded embeddings for spoken utterances and shows that the visually grounded sentence encoder learns to recognise words from the input even though it is not explicitly trained for word recognition.
...

References

Showing 1-10 of 34 references
From phonemes to images: levels of representation in a recurrent neural model of visually-grounded language learning
TLDR
This work presents a model of visually-grounded language learning based on stacked gated recurrent neural networks which learns to predict visual features given an image description in the form of a sequence of phonemes, and shows that it represents linguistic information in a hierarchy of levels.
Learning language through pictures
TLDR
The model consists of two Gated Recurrent Unit networks with shared word embeddings, and uses a multi-task objective by receiving a textual description of a scene and trying to concurrently predict its visual representation and the next word in the sentence.
Learning Word-Like Units from Joint Audio-Visual Analysis
TLDR
This model effectively implements a form of spoken language acquisition, in which the computer learns not only to recognize word categories by sound, but also to enrich the words it learns with semantics by grounding them in images.
Unsupervised Learning of Spoken Language with Visual Context
TLDR
A deep neural network model capable of rudimentary spoken language acquisition using untranscribed audio training data, whose only supervision comes in the form of contextually relevant visual images, is presented.
Deep multimodal semantic embeddings for speech and images
TLDR
A model which takes as input a corpus of images with relevant spoken captions and finds a correspondence between the two modalities and ties the networks together with an embedding and alignment model which learns a joint semantic space over both modalities.
Representation of Linguistic Form and Function in Recurrent Neural Networks
TLDR
A method is proposed for estimating the contribution of individual input tokens to the networks' final prediction; it shows that the Visual pathway pays selective attention to lexical categories and grammatical functions that carry semantic information, and learns to treat word types differently depending on their grammatical function and their position in the sequential structure of the sentence.
Learning Words from Images and Speech
TLDR
This work explores the possibility of learning both an acoustic model and a word/image association from multi-modal co-occurrences between speech and pictures alone (a task known as cross-situational learning), inspired by the observation that infants spontaneously achieve this kind of correspondence during their first year of life.
A multimodal learning interface for grounding spoken language in sensory perceptions
TLDR
A multimodal interface that learns to associate spoken language with perceptual features by being situated in users' everyday environments and sharing user-centric multisensory information.
Multimodal Semantic Learning from Child-Directed Input
TLDR
This work presents a distributed word learning model that operates on child-directed speech paired with realistic visual scenes that integrates linguistic and extra-linguistic information, handles referential uncertainty, and correctly learns to associate words with objects, even in cases of limited linguistic exposure.
...