Language learning using Speech to Image retrieval

@article{Merkx2019LanguageLU,
  title={Language learning using Speech to Image retrieval},
  author={Danny Merkx and S. Frank and M. Ernestus},
  journal={ArXiv},
  year={2019},
  volume={abs/1909.03795}
}
Humans learn language by interaction with their environment and listening to other humans. [...] Key Result: This shows that our visually grounded sentence encoder learns to recognise words from the input even though it is not explicitly trained for word recognition.
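To make the retrieval setup concrete, below is a minimal PyTorch-style sketch of the kind of batch-wise bidirectional hinge loss commonly used to train speech-to-image retrieval models on paired embeddings. The margin value and the assumption of L2-normalised encoder outputs are illustrative; this is not the exact loss or architecture of the paper.

```python
import torch

def retrieval_hinge_loss(speech_emb: torch.Tensor,
                         image_emb: torch.Tensor,
                         margin: float = 0.2) -> torch.Tensor:
    """Batch-wise bidirectional hinge loss over cosine similarities.

    speech_emb, image_emb: (B, D) L2-normalised embeddings of paired
    spoken captions and images; row i of each tensor belongs together.
    """
    sims = speech_emb @ image_emb.t()            # (B, B) similarity matrix
    pos = sims.diag().unsqueeze(1)               # similarities of true pairs
    # cost of ranking a mismatched item above the true pair, per direction
    cost_s2i = (margin + sims - pos).clamp(min=0)      # speech -> image
    cost_i2s = (margin + sims - pos.t()).clamp(min=0)  # image -> speech
    eye = torch.eye(sims.size(0), dtype=torch.bool, device=sims.device)
    cost_s2i = cost_s2i.masked_fill(eye, 0)      # true pairs incur no cost
    cost_i2s = cost_i2s.masked_fill(eye, 0)
    return cost_s2i.mean() + cost_i2s.mean()
```

In this kind of setup the speech embedding typically comes from a recurrent or self-attentive sentence encoder over acoustic features and the image embedding from a pretrained image network, both projected into the same space and L2-normalised before computing the loss.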
Learning to Recognise Words Using Visually Grounded Speech
TLDR
The experiments show that the model is able to recognise words, and the gating paradigm reveals that words can be recognised from partial input as well, and that recognition is negatively influenced by word competition from the word-initial cohort.
Word Recognition, Competition, and Activation in a Model of Visually Grounded Speech
TLDR
It is found that not all speech frames play an equal role in the final encoded representation of a given word, but that some frames have a crucial effect on it; it is further suggested that word representations could be activated through a process of lexical competition.
Text-Free Image-to-Speech Synthesis Using Learned Segmental Units
TLDR
This paper connects the image captioning module and the speech synthesis module with a set of discrete, sub-word speech units that are discovered with a self-supervised visual grounding task, and finds that the representation must satisfy several important properties to serve as a drop-in replacement for text.
Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech
In this paper, we present a method for learning discrete linguistic units by incorporating vector quantization layers into neural models of visually grounded speech. We show that our method is [...]
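As a rough illustration of what a vector quantization layer of this kind looks like, the following is a minimal VQ-VAE-style quantizer with a straight-through gradient; the codebook size, feature dimension, and commitment weight are illustrative assumptions rather than the configuration used in the paper.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with a straight-through gradient."""

    def __init__(self, num_codes: int = 256, dim: int = 64, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # weight of the commitment term

    def forward(self, z: torch.Tensor):
        # z: (B, T, dim) frame-level speech features from the encoder
        flat = z.reshape(-1, z.size(-1))
        dists = torch.cdist(flat, self.codebook.weight)   # (B*T, num_codes)
        idx = dists.argmin(dim=1)                         # discrete unit ids
        q = self.codebook(idx).view_as(z)                 # quantised features
        # codebook and commitment losses (standard VQ-VAE objective)
        vq_loss = ((q - z.detach()) ** 2).mean() \
            + self.beta * ((q.detach() - z) ** 2).mean()
        q = z + (q - z).detach()  # straight-through estimator for gradients
        return q, idx.view(z.shape[:-1]), vq_loss
```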
Generating Images From Spoken Descriptions
TLDR
A new speech technology task is proposed: a speech-to-image generation (S2IG) framework which translates speech descriptions to photo-realistic images without using any text information, thus allowing unwritten languages to potentially benefit from this technology.
Textual supervision for visually grounded spoken language understanding
TLDR
Comparing different strategies, it is found that the pipeline approach works better when enough text is available, and that translations can be effectively used in place of transcriptions, though more data is needed to obtain similar results.
Evaluation of Audio-Visual Alignments in Visually Grounded Speech Models
TLDR
This work formalizes the alignment problem in terms of an audiovisual alignment tensor based on earlier VGS work, introduces systematic metrics for evaluating model performance in aligning visual objects and spoken words, and proposes a new VGS model variant for the alignment task that utilizes a cross-modal attention layer.
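A generic cross-modal attention layer of this kind can be sketched as speech frames attending over image regions; the dimensions and the use of nn.MultiheadAttention below are assumptions for illustration, not the specific variant proposed in the paper.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Speech frames attend over image regions; the attention weights can be
    read as a frame-to-region alignment map."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, speech: torch.Tensor, image_regions: torch.Tensor):
        # speech: (B, T, dim) frame embeddings; image_regions: (B, R, dim)
        attended, weights = self.attn(speech, image_regions, image_regions)
        return attended, weights   # weights: (B, T, R) alignment scores
```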
Speech-Image Semantic Alignment Does Not Depend on Any Prior Classification Tasks
TLDR
Semantically aligned (speech, image) datasets can be used to explore “visually grounded speech” and to choose appropriate neural architectures for the encoders in the speech and image branches; using large datasets, one can obtain competitive recall rates without relying on any pretrained initialization or feature extraction.
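The recall rates mentioned here and elsewhere on this page are typically computed from a similarity matrix between held-out speech and image embeddings; below is a small, generic sketch of speech-to-image recall@k, with no particular encoder assumed.

```python
import torch

def recall_at_k(speech_emb: torch.Tensor,
                image_emb: torch.Tensor,
                ks=(1, 5, 10)) -> dict:
    """Fraction of spoken captions whose paired image (same row index)
    appears among the top-k retrieved images."""
    sims = speech_emb @ image_emb.t()               # (N, N) similarities
    ranks = sims.argsort(dim=1, descending=True)    # retrieved image indices
    targets = torch.arange(sims.size(0)).unsqueeze(1)
    hits = ranks == targets                         # (N, N) boolean matches
    return {k: hits[:, :k].any(dim=1).float().mean().item() for k in ks}
```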
Direct multimodal few-shot learning of speech and images
TLDR
In a speech-to-image digit matching task, direct models outperform indirect models, with the MTriplet achieving the best multimodal five-shot accuracy; the improvements are due to the combination of unsupervised and transfer learning in the direct models and the absence of two-step compounding errors.
Align or attend? Toward More Efficient and Accurate Spoken Word Discovery Using Speech-to-Image Retrieval
TLDR
The theoretical analysis shows that some kind of alignment/attention mechanism is crucial for an MWD system to learn meaningful word-level representations, and it is empirically demonstrated that both neural MT with self-attention and statistical MT achieve word discovery scores that are superior to those of a state-of-the-art neural retrieval system.

References

Showing 1-10 of 30 references
Visually Grounded Learning of Keyword Prediction from Untranscribed Speech
TLDR
This work uses an image-to-words multi-label visual classifier to tag images with soft textual labels, then trains a neural network to map from speech to these soft targets, and shows that the resulting speech system is able to predict which words occur in an utterance without seeing any parallel speech and text.
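A minimal sketch of this training signal, under the assumption of a frozen multi-label image tagger and a simple speech network (both hypothetical stand-ins for the models used in the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_SIZE = 1000  # size of the keyword vocabulary (illustrative)

# Hypothetical stand-ins: a frozen image tagger over pooled CNN features and
# a speech network over flattened acoustic features.
image_tagger = nn.Linear(2048, VOCAB_SIZE)
speech_net = nn.Linear(40 * 100, VOCAB_SIZE)

def keyword_loss(image_feats: torch.Tensor,
                 speech_feats: torch.Tensor) -> torch.Tensor:
    """image_feats: (B, 2048) image features; speech_feats: (B, 100, 40) frames."""
    with torch.no_grad():
        # soft textual labels: word probabilities predicted from the image
        soft_targets = torch.sigmoid(image_tagger(image_feats))
    logits = speech_net(speech_feats.flatten(1))
    # train the speech network to reproduce the visual tagger's word probabilities
    return F.binary_cross_entropy_with_logits(logits, soft_targets)
```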
Unsupervised Learning of Spoken Language with Visual Context
TLDR
A deep neural network model capable of rudimentary spoken language acquisition using untranscribed audio training data, whose only supervision comes in the form of contextually relevant visual images, is presented.
Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input
In this paper, we explore neural network models that learn to associate segments of spoken audio captions with the semantically relevant portions of natural images that they refer to. We demonstrate [...]
Analysis of Audio-Visual Features for Unsupervised Speech Recognition
TLDR
This work uses a dataset of paired images and audio captions to supervise learning of low-level speech features that can be used for further “unsupervised” processing of any speech data, and shows that visual grounding can improve speech representations for a variety of zero-resource tasks.
Deep multimodal semantic embeddings for speech and images
TLDR
A model is presented which takes as input a corpus of images with relevant spoken captions, finds a correspondence between the two modalities, and ties the networks together with an embedding and alignment model that learns a joint semantic space over both modalities.
Representations of language in a model of visually grounded speech signal
TLDR
An in-depth analysis of the representations used by different components of the trained model shows that encoding of semantic aspects tends to become richer as the authors go up the hierarchy of layers, whereas encoding of form-related aspects of the language input tends to initially increase and then plateau or decrease.
Supervised Learning of Universal Sentence Representations from Natural Language Inference Data
TLDR
It is shown how universal sentence representations trained using the supervised data of the Stanford Natural Language Inference datasets can consistently outperform unsupervised methods like SkipThought vectors on a wide range of transfer tasks.
Skip-Thought Vectors
We describe an approach for unsupervised learning of a generic, distributed sentence encoder. Using the continuity of text from books, we train an encoder-decoder model that tries to reconstruct the [...]
Learning semantic sentence representations from visually grounded language without lexical knowledge
TLDR
The system achieves state-of-the-art results on several of these benchmarks, which shows that a system trained solely on multimodal data, without assuming any word representations, is able to capture sentence level semantics.
Multilingually trained bottleneck features in spoken language recognition
TLDR
By comparing properties of mono- and multilingual features, the suitability of multilingual training for SLR is shown, and the state-of-the-art performance of these features is demonstrated on the NIST LRE09 database.