Unsupervised Learning of Spoken Language with Visual Context
A deep neural network model capable of rudimentary spoken language acquisition using untranscribed audio training data, whose only supervision comes in the form of contextually relevant visual images, is presented.
Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input
In this paper, we explore neural network models that learn to associate segments of spoken audio captions with the semantically relevant portions of the natural images that they refer to.
Deep multimodal semantic embeddings for speech and images
A model is presented that takes as input a corpus of images with relevant spoken captions, finds a correspondence between the two modalities, and ties the networks together with an embedding and alignment model that learns a joint semantic space over both modalities.
Learning Word-Like Units from Joint Audio-Visual Analysis
This model effectively implements a form of spoken language acquisition, in which the computer learns not only to recognize word categories by sound, but also to enrich the words it learns with semantics by grounding them in images.
Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech
In this paper, we present a method for learning discrete linguistic units by incorporating vector quantization layers into neural models of visually grounded speech.
Vision as an Interlingua: Learning Multilingual Semantic Embeddings of Untranscribed Speech
This work uses spoken captions collected in English and Hindi to show that the same model architecture can be successfully applied to both languages, and shows that these models are capable of performing semantic cross-lingual speech-to-speech retrieval.
A summary of the 2012 JHU CLSP workshop on zero resource speech technologies and models of early language acquisition
Centered around the tasks of phonetic and lexical discovery, unified evaluation metrics are considered, two new approaches for improving speaker independence in the absence of supervision are presented, and the application of Bayesian word segmentation algorithms to automatic subword unit tokenizations is evaluated.
Grounding Spoken Words in Unlabeled Video
Deep learning models that learn joint multi-modal embeddings in videos where the audio and visual streams are loosely synchronized are explored, and with weak supervision the authors see significant amounts of cross-modal learning.
Towards Visually Grounded Sub-word Speech Unit Discovery
It is shown how diphone boundaries can be superficially extracted from the activation patterns of intermediate layers of the model, suggesting that the model may be leveraging these events for the purpose of word recognition.
Text-Free Image-to-Speech Synthesis Using Learned Segmental Units
This paper connects the image captioning module and the speech synthesis module with a set of discrete, sub-word speech units that are discovered with a self-supervised visual grounding task, and finds that the representation must satisfy several important properties to serve as a drop-in replacement for text.