Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input

@article{Harwath2018JointlyDV,
  title={Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input},
  author={David F. Harwath and Adri{\`a} Recasens and D{\'i}dac Sur{\'i}s and Galen Chuang and Antonio Torralba and James R. Glass},
  journal={ArXiv},
  year={2018},
  volume={abs/1804.01452}
}
In this paper, we explore neural network models that learn to associate segments of spoken audio captions with the semantically relevant portions of natural images that they refer to. We demonstrate that these audio-visual associative localizations emerge from network-internal representations learned as a by-product of training to perform an image-audio retrieval task. Our models operate directly on the image pixels and speech waveform, and do not rely on any conventional supervision in the…
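As a rough illustration of the training objective the abstract alludes to, the sketch below (PyTorch-style, with illustrative names; not the authors' released code) pools a "matchmap" of local image-region/audio-frame similarities into a scalar score and trains with a margin ranking loss over mismatched image-caption pairs. The audio-visual localizations discussed in the paper are read off from such matchmaps rather than from any explicit supervision.

import torch
import torch.nn.functional as F

def matchmap_similarity(image_feats, audio_feats):
    # image_feats: (C, H, W) spatial grid from an image encoder run on raw pixels
    # audio_feats: (C, T) frame-level features from a speech encoder run on the raw waveform
    # The "matchmap" is the (H, W, T) tensor of local dot products; here it is
    # max-pooled over image locations and averaged over audio frames to give one score.
    matchmap = torch.einsum('chw,ct->hwt', image_feats, audio_feats)
    return matchmap.flatten(0, 1).max(dim=0).values.mean()

def margin_ranking_loss(image_batch, audio_batch, margin=1.0):
    # image_batch: (B, C, H, W), audio_batch: (B, C, T); element i of each batch is a matching pair.
    # Each positive pair is contrasted against one impostor caption and one impostor image.
    B = image_batch.shape[0]
    loss = image_batch.new_zeros(())
    for i in range(B):
        j = (i + 1) % B  # simple impostor choice; random sampling within the batch is also common
        s_pos = matchmap_similarity(image_batch[i], audio_batch[i])
        s_imp_audio = matchmap_similarity(image_batch[i], audio_batch[j])
        s_imp_image = matchmap_similarity(image_batch[j], audio_batch[i])
        loss = loss + F.relu(margin - s_pos + s_imp_audio) + F.relu(margin - s_pos + s_imp_image)
    return loss / B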

Citations

Self-supervised Audio-visual Co-segmentation
This paper develops a neural network model for visual object segmentation and sound source separation that learns from natural videos through self-supervision, and introduces a learning approach to disentangle concepts in the neural networks.
Grounding Spoken Words in Unlabeled Video
Deep learning models that learn joint multi-modal embeddings in videos where the audio and visual streams are loosely synchronized are explored, and with weak supervision the authors see significant amounts of cross-modal learning.
Large-scale representation learning from visually grounded untranscribed speech
A scalable method to automatically generate diverse audio for image captioning datasets via a dual encoder that learns to align latent representations from both modalities is described, and it is shown that a masked margin softmax loss for such models is superior to the standard triplet loss.
Language learning using Speech to Image retrieval
This work improves on existing neural network approaches to create visually grounded embeddings for spoken utterances and shows that the visually grounded sentence encoder learns to recognise words from the input even though it is not explicitly trained for word recognition.
Evaluation of Audio-Visual Alignments in Visually Grounded Speech Models
This work formalizes the alignment problem in terms of an audiovisual alignment tensor based on earlier VGS work, introduces systematic metrics for evaluating model performance in aligning visual objects and spoken words, and proposes a new VGS model variant utilizing a cross-modal attention layer for the alignment task.
Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and Videos
The proposed triad of speech, image, and words allows for a better estimate of the joint embedding and shows improved performance on tasks like image and speech retrieval, even when the third modality, text, is not present in the task.
Word Recognition, Competition, and Activation in a Model of Visually Grounded Speech
It is found that not all speech frames play an equal role in the final encoded representation of a given word, but that some frames have a crucial effect on it, and it is suggested that word representations could be activated through a process of lexical competition.
Towards Visually Grounded Sub-word Speech Unit Discovery
  • David F. Harwath, James R. Glass
  • Computer Science, Engineering
  • ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2019
It is shown how diphone boundaries can be superficially extracted from the activation patterns of intermediate layers of the model, suggesting that the model may be leveraging these events for the purpose of word recognition.
Semantic Speech Retrieval With a Visually Grounded Model of Untranscribed Speech
This work studies how a visually grounded speech model, trained on images of scenes paired with spoken captions, captures aspects of semantics, and introduces a newly collected data set of human semantic relevance judgements and an associated task, semantic speech retrieval, where the goal is to search for spoken utterances that are semantically relevant to a given text query.

References

Showing 1-10 of 62 references
Learning Word-Like Units from Joint Audio-Visual Analysis
This model effectively implements a form of spoken language acquisition, in which the computer learns not only to recognize word categories by sound, but also to enrich the words it learns with semantics by grounding them in images.
Unsupervised Learning of Spoken Language with Visual Context
A deep neural network model capable of rudimentary spoken language acquisition using untranscribed audio training data, whose only supervision comes in the form of contextually relevant visual images, is presented.
Visually Grounded Learning of Keyword Prediction from Untranscribed Speech
This work uses an image-to-words multi-label visual classifier to tag images with soft textual labels, and then trains a neural network to map from the speech to these soft targets, and shows that the resulting speech system is able to predict which words occur in an utterance without seeing any parallel speech and text.
Unsupervised Visual Representation Learning by Context Prediction
It is demonstrated that the feature representation learned using this within-image context indeed captures visual similarity across images and allows us to perform unsupervised visual discovery of objects like cats, people, and even birds from the Pascal VOC 2011 detection dataset.
Representations of language in a model of visually grounded speech signal
An in-depth analysis of the representations used by different components of the trained model shows that encoding of semantic aspects tends to become richer as the authors go up the hierarchy of layers, whereas encoding of form-related aspects of the language input tends to initially increase and then plateau or decrease.
Look, Listen and Learn
There is a valuable, but so far untapped, source of information contained in the video itself: the correspondence between the visual and the audio streams; a novel "Audio-Visual Correspondence" learning task is introduced to make use of this.
From phonemes to images: levels of representation in a recurrent neural model of visually-grounded language learning
This work presents a model of visually-grounded language learning based on stacked gated recurrent neural networks which learns to predict visual features given an image description in the form of a sequence of phonemes, and shows that it represents linguistic information in a hierarchy of levels.
Learning words from sights and sounds: a computational model
The model successfully performed speech segmentation, word discovery and visual categorization from spontaneous infant-directed speech paired with video images of single objects, demonstrating the possibility of using state-of-the-art techniques from sensory pattern recognition and machine learning to implement cognitive models which can process raw sensor data without the need for human transcription or labeling.
From captions to visual concepts and back
This paper uses multiple instance learning to train visual detectors for words that commonly occur in captions, including many different parts of speech such as nouns, verbs, and adjectives, and develops a maximum-entropy language model.
Analysis of Audio-Visual Features for Unsupervised Speech Recognition
This work uses a dataset of paired images and audio captions to supervise learning of low-level speech features that can be used for further "unsupervised" processing of any speech data, and shows that visual grounding can improve speech representations for a variety of zero-resource tasks.