Representations of language in a model of visually grounded speech signal

@article{Chrupala2017RepresentationsOL,
  title={Representations of language in a model of visually grounded speech signal},
  author={Grzegorz Chrupała and Lieke Gelderloos and Afra Alishahi},
  journal={ArXiv},
  year={2017},
  volume={abs/1702.01991}
}
We present a visually grounded model of speech perception which projects spoken utterances and images to a joint semantic space. We use a multi-layer recurrent highway network to model the temporal nature of speech, and show that it learns to extract both form and meaning-based linguistic knowledge from the input signal. We carry out an in-depth analysis of the representations used by different components of the trained model and show that encoding of semantic aspects tends to become…
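To make the setup above concrete, below is a minimal PyTorch sketch of a joint speech-image embedding trained with a margin-based ranking loss. It is an illustration rather than the authors' implementation: the stacked GRU stands in for the recurrent highway network, and the feature dimensions, layer counts, and margin value are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechEncoder(nn.Module):
    """Maps a sequence of acoustic feature frames to the joint embedding space."""
    def __init__(self, feat_dim=13, hidden=512, layers=4, embed_dim=1024):
        super().__init__()
        # Stacked GRU used here as a stand-in for the recurrent highway network.
        self.rnn = nn.GRU(feat_dim, hidden, num_layers=layers, batch_first=True)
        self.proj = nn.Linear(hidden, embed_dim)

    def forward(self, frames):                   # frames: (batch, time, feat_dim)
        states, _ = self.rnn(frames)
        utterance = states[:, -1, :]             # last hidden state summarises the utterance
        return F.normalize(self.proj(utterance), dim=-1)

class ImageEncoder(nn.Module):
    """Projects precomputed CNN image features into the same joint space."""
    def __init__(self, img_dim=4096, embed_dim=1024):
        super().__init__()
        self.proj = nn.Linear(img_dim, embed_dim)

    def forward(self, img_feats):
        return F.normalize(self.proj(img_feats), dim=-1)

def ranking_loss(speech_emb, image_emb, margin=0.2):
    """Symmetric hinge loss pushing matching pairs above mismatched ones."""
    scores = speech_emb @ image_emb.t()          # cosine similarities, (batch, batch)
    positives = scores.diag().unsqueeze(1)
    cost_img = (margin + scores - positives).clamp(min=0)      # contrast over images
    cost_utt = (margin + scores - positives.t()).clamp(min=0)  # contrast over utterances
    cost_img.fill_diagonal_(0)
    cost_utt.fill_diagonal_(0)
    return cost_img.mean() + cost_utt.mean()

At retrieval time, utterances and images are scored by cosine similarity in the shared space, so an utterance can be used to rank images and vice versa.

Citations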
Word Recognition, Competition, and Activation in a Model of Visually Grounded Speech
TLDR: It is found that not all speech frames play an equal role in the final encoded representation of a given word; some frames have a crucial effect on it. The results also suggest that word representations could be activated through a process of lexical competition.
A Spoken Language Dataset of Descriptions for Speech- and Percept-Based Learning
Grounded language acquisition is a major area of research combining aspects of natural language processing, computer vision, and signal processing, compounded by domain issues requiring sample…
Encoding of phonology in a recurrent neural model of grounded speech
TLDR: It is found that phoneme representations are most salient in the lower layers of the model, where low-level signals are processed at a fine-grained level, although a large amount of phonological information is retained at the top recurrent layer.
Symbolic Inductive Bias for Visually Grounded Learning of Spoken Language
TLDR: This work proposes to use multitask learning to exploit existing transcribed speech within the end-to-end setting, and describes a three-task architecture which combines the objectives of matching spoken captions with corresponding images, speech with text, and text with images.
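As a rough illustration of how three such matching objectives can be combined into one multitask loss (an assumption-laden sketch, not the paper's code: the contrastive loss form and the equal task weights are placeholders, and the speech, text, and image encoders are assumed to already produce L2-normalised embeddings):

import torch

def contrastive(a, b, margin=0.2):
    # Symmetric hinge ranking loss between two batches of L2-normalised embeddings.
    scores = a @ b.t()
    positives = scores.diag().unsqueeze(1)
    cost_a = (margin + scores - positives).clamp(min=0)
    cost_b = (margin + scores - positives.t()).clamp(min=0)
    cost_a.fill_diagonal_(0)
    cost_b.fill_diagonal_(0)
    return cost_a.mean() + cost_b.mean()

def three_task_loss(speech_emb, image_emb, text_emb, weights=(1.0, 1.0, 1.0)):
    # Jointly optimise speech/image, speech/text, and text/image matching.
    w_si, w_st, w_ti = weights  # equal weighting is an assumption
    return (w_si * contrastive(speech_emb, image_emb)
            + w_st * contrastive(speech_emb, text_emb)
            + w_ti * contrastive(text_emb, image_emb))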
Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input
In this paper, we explore neural network models that learn to associate segments of spoken audio captions with the semantically relevant portions of natural images that they refer to. We demonstrate…
Learning Visually Grounded and Multilingual Representations
Children early on face the challenge of learning the meaning of words from noisy and ambiguous contexts. Utterances that guide their learning are emitted in complex scenes, rendering the mapping…
Visually Grounded Learning of Keyword Prediction from Untranscribed Speech
TLDR: This work uses an image-to-words multi-label visual classifier to tag images with soft textual labels, and then trains a neural network to map from speech to these soft targets. The resulting speech system is able to predict which words occur in an utterance without seeing any parallel speech and text.
Language learning using Speech to Image retrieval
TLDR: This work improves on existing neural network approaches to create visually grounded embeddings for spoken utterances, and shows that the visually grounded sentence encoder learns to recognise words from the input even though it is not explicitly trained for word recognition.
Towards Visually Grounded Sub-word Speech Unit Discovery
  • David F. Harwath, James R. Glass
  • ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
TLDR: It is shown how diphone boundaries can be superficially extracted from the activation patterns of intermediate layers of the model, suggesting that the model may be leveraging these events for the purpose of word recognition.

References

SHOWING 1-10 OF 34 REFERENCES
From phonemes to images: levels of representation in a recurrent neural model of visually-grounded language learning
TLDR: This work presents a model of visually-grounded language learning based on stacked gated recurrent neural networks which learns to predict visual features given an image description in the form of a sequence of phonemes, and shows that it represents linguistic information in a hierarchy of levels.
Learning language through pictures
TLDR: The model consists of two Gated Recurrent Unit networks with shared word embeddings, and uses a multi-task objective by receiving a textual description of a scene and trying to concurrently predict its visual representation and the next word in the sentence.
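A minimal sketch of such a two-pathway model with shared word embeddings follows. It is again an illustration rather than the paper's implementation: the layer sizes, the mean-squared-error visual objective, and the equal task weighting are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedEmbeddingMultiTask(nn.Module):
    """Two GRU pathways over shared word embeddings: one predicts the image
    feature vector of the described scene, the other predicts the next word."""
    def __init__(self, vocab_size, embed_dim=256, hidden=512, img_dim=4096):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)     # shared by both pathways
        self.visual_rnn = nn.GRU(embed_dim, hidden, batch_first=True)
        self.textual_rnn = nn.GRU(embed_dim, hidden, batch_first=True)
        self.to_image = nn.Linear(hidden, img_dim)           # visual prediction head
        self.to_vocab = nn.Linear(hidden, vocab_size)        # next-word prediction head

    def forward(self, words):                                # words: (batch, seq_len)
        emb = self.embed(words)
        vis_states, _ = self.visual_rnn(emb)
        txt_states, _ = self.textual_rnn(emb)
        image_pred = self.to_image(vis_states[:, -1, :])     # predict scene features
        next_word_logits = self.to_vocab(txt_states)         # logits for word t+1 at step t
        return image_pred, next_word_logits

def multitask_loss(image_pred, next_word_logits, image_feats, words):
    # Equal weighting of the two objectives is an assumption.
    visual = F.mse_loss(image_pred, image_feats)
    textual = F.cross_entropy(
        next_word_logits[:, :-1, :].reshape(-1, next_word_logits.size(-1)),
        words[:, 1:].reshape(-1))                            # shift targets by one position
    return visual + textual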
Learning Word-Like Units from Joint Audio-Visual Analysis
TLDR: This model effectively implements a form of spoken language acquisition, in which the computer learns not only to recognize word categories by sound, but also to enrich the words it learns with semantics by grounding them in images.
Unsupervised Learning of Spoken Language with Visual Context
TLDR: A deep neural network model capable of rudimentary spoken language acquisition using untranscribed audio training data, whose only supervision comes in the form of contextually relevant visual images, is presented.
Deep multimodal semantic embeddings for speech and images
TLDR: A model is presented which takes as input a corpus of images with relevant spoken captions, finds a correspondence between the two modalities, and ties the networks together with an embedding and alignment model which learns a joint semantic space over both modalities.
Representation of Linguistic Form and Function in Recurrent Neural Networks
TLDR: A method is proposed for estimating the contribution of individual input tokens to the final prediction of the networks; the analysis shows that the visual pathway pays selective attention to lexical categories and grammatical functions that carry semantic information, and learns to treat word types differently depending on their grammatical function and their position in the sequential structure of the sentence.
Learning Words from Images and Speech
TLDR: This work explores the possibility of learning both an acoustic model and a word/image association from multi-modal co-occurrences between speech and pictures alone (a task known as cross-situational learning), inspired by the observation that infants spontaneously achieve this kind of correspondence during their first year of life.
A multimodal learning interface for grounding spoken language in sensory perceptions
TLDR: A multimodal interface is presented that learns to associate spoken language with perceptual features by being situated in users' everyday environments and sharing user-centric multisensory information.
Multimodal Semantic Learning from Child-Directed Input
TLDR: This work presents a distributed word learning model that operates on child-directed speech paired with realistic visual scenes. The model integrates linguistic and extra-linguistic information, handles referential uncertainty, and correctly learns to associate words with objects, even in cases of limited linguistic exposure.
Learning words from sights and sounds: a computational model
TLDR: The model successfully performed speech segmentation, word discovery and visual categorization from spontaneous infant-directed speech paired with video images of single objects, demonstrating the possibility of using state-of-the-art techniques from sensory pattern recognition and machine learning to implement cognitive models which can process raw sensor data without the need for human transcription or labeling.