Corpus ID: 208202182

Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech

@article{Harwath2020LearningHD,
  title={Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech},
  author={David F. Harwath and Wei-Ning Hsu and James R. Glass},
  journal={ArXiv},
  year={2020},
  volume={abs/1911.09602}
}
In this paper, we present a method for learning discrete linguistic units by incorporating vector quantization layers into neural models of visually grounded speech. We show that our method is capable of capturing both word-level and sub-word units, depending on how it is configured. What differentiates this paper from prior work on speech unit learning is the choice of training objective. Rather than using a reconstruction-based loss, we use a discriminative, multimodal grounding objective…
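As a concrete illustration of the approach described in the abstract, the sketch below shows a vector quantization layer with a straight-through gradient and a margin-based multimodal grounding loss over in-batch impostors. This is a minimal PyTorch sketch under assumed sizes and hyperparameters (codebook size, embedding dimension, margin); it is not the paper's implementation, and the exact objective used in the paper may differ in detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Nearest-neighbour vector quantization with a straight-through gradient.

    Codebook size, embedding dimension and commitment weight are illustrative
    assumptions, not values taken from the paper.
    """
    def __init__(self, num_codes=1024, dim=256, commitment_cost=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.commitment_cost = commitment_cost

    def forward(self, z):                               # z: (batch, time, dim)
        flat = z.reshape(-1, z.size(-1))                # (batch*time, dim)
        dists = torch.cdist(flat, self.codebook.weight) # distance to every code
        codes = dists.argmin(dim=-1)                    # one discrete unit per frame
        q = self.codebook(codes).view_as(z)             # quantized frame vectors
        # VQ-VAE-style codebook and commitment terms keep encoder outputs and codes close.
        vq_loss = F.mse_loss(q, z.detach()) + self.commitment_cost * F.mse_loss(z, q.detach())
        # Straight-through estimator: the forward pass uses q, gradients flow back to z.
        q = z + (q - z).detach()
        return q, codes, vq_loss


def grounding_loss(audio_emb, image_emb, margin=1.0):
    """Discriminative multimodal objective: matched (audio, image) pairs should
    score higher than mismatched in-batch pairs by a margin."""
    sims = audio_emb @ image_emb.t()                    # (batch, batch) similarities
    pos = sims.diag().unsqueeze(1)                      # matched-pair similarities
    # Hinge against both "wrong image" and "wrong audio" impostors.
    loss = F.relu(margin + sims - pos) + F.relu(margin + sims - pos.t())
    mask = torch.eye(sims.size(0), dtype=torch.bool, device=sims.device)
    return loss.masked_fill(mask, 0.0).mean()
```

In the paper's setting, layers like this would sit inside the audio encoder, and the discrete codes they emit are the learned linguistic units.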
Text-Free Image-to-Speech Synthesis Using Learned Segmental Units
TLDR: This paper connects the image captioning module and the speech synthesis module with a set of discrete, sub-word speech units that are discovered with a self-supervised visual grounding task, and finds that the representations must satisfy several important properties to serve as drop-in replacements for text.
A Hierarchical Subspace Model for Language-Attuned Acoustic Unit Discovery
TLDR: This work frames the task as one of learning embeddings on a low-dimensional phonetic subspace, and simultaneously specifies the subspace itself as an embedding on a hyper-subspace.
Attention-Based Keyword Localisation in Speech using Visual Grounding
TLDR: This work investigates whether visually grounded speech models can also do keyword localisation: predicting where, within an utterance, a given textual keyword occurs, without any explicit text-based or alignment supervision.
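For a rough sense of what localisation means here: once a grounded speech model produces per-frame relevance scores for a query keyword (for example, attention weights), localisation can be as simple as picking the highest-scoring frame. The function below is a hypothetical sketch; the frame rate and the plain argmax are assumptions, not the paper's method.

```python
import torch

def localise_keyword(frame_scores, frames_per_second=100):
    """frame_scores: (time,) per-frame relevance of a keyword (e.g. attention weights).
    Returns the predicted time of the keyword in seconds, assuming a hypothetical
    frame rate of 100 frames per second."""
    best_frame = torch.argmax(frame_scores)
    return best_frame.item() / frames_per_second
```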
Cross-Modal Discrete Representation Learning
TLDR: This work presents a self-supervised learning framework that is able to learn a representation that captures finer levels of granularity across different modalities, such as concepts or events represented by visual objects or spoken words.
DeCoAR 2.0: Deep Contextualized Acoustic Representations with Vector Quantization
TLDR: This work proposes DeCoAR 2.0, a Deep Contextualized Acoustic Representation with vector quantization, which uses Transformers in the encoding module instead of LSTMs and proposes an objective that combines the reconstruction loss with a vector quantization diversity loss to train speech representations.
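The combination described above can be sketched as a frame reconstruction term plus a codebook-diversity term. The snippet below follows the common "maximise the entropy of the average code distribution" recipe; the weighting and the exact formulation in DeCoAR 2.0 may differ, so treat this as an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def diversity_loss(code_logits):
    """Encourage uniform use of the codebook.

    code_logits: (batch, time, num_codes) pre-softmax scores over codebook entries.
    Minimising the negative entropy of the average code distribution pushes the
    model to use all codes rather than collapsing onto a few.
    """
    probs = F.softmax(code_logits, dim=-1)
    avg_usage = probs.reshape(-1, probs.size(-1)).mean(dim=0)
    entropy = -(avg_usage * torch.log(avg_usage + 1e-8)).sum()
    return -entropy

def combined_loss(reconstructed, target, code_logits, alpha=0.1):
    """Reconstruction loss plus the diversity term; alpha is an assumed weight."""
    return F.l1_loss(reconstructed, target) + alpha * diversity_loss(code_logits)
```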
Learning to Recognise Words Using Visually Grounded Speech
TLDR: The experiments show that the model is able to recognise words, and the gating paradigm reveals that words can be recognised from partial input as well, and that recognition is negatively influenced by word competition from the word-initial cohort.
Direct multimodal few-shot learning of speech and images
TLDR: In a speech-to-image digit matching task, direct models outperform indirect models, with the MTriplet achieving the best multimodal five-shot accuracy; the improvements are due to the combination of unsupervised and transfer learning in the direct models, and the absence of two-step compounding errors.
Towards Semi-Supervised Semantics Understanding from Speech
TLDR: Experiments show that the proposed SLU framework with speech as input can perform on par with one using oracle text as input for semantics understanding, even when environmental noise is present and only a limited amount of labeled semantics data is available.
Speech-Image Semantic Alignment Does Not Depend on Any Prior Classification Tasks
TLDR: Semantically-aligned (speech, image) datasets can be used to explore “visually-grounded speech” and to choose appropriate neural architectures for the encoders in the speech and image branches; using large datasets, one can obtain competitive recall rates without relying on any pretrained initialization or feature extraction.
CLSRIL-23: Cross Lingual Speech Representations for Indic Languages
TLDR: This work presents a self-supervised audio pre-trained model which learns cross-lingual speech representations from raw audio across 23 Indic languages, and shows that multilingual pretraining outperforms monolingual training, both in terms of learning speech representations that encode the phonetic similarity of languages and on downstream tasks.

References

Showing 1-10 of 76 references
Word Recognition, Competition, and Activation in a Model of Visually Grounded Speech
TLDR: It is found that not all speech frames play an equal role in the final encoded representation of a given word, but that some frames have a crucial effect on it; it is also suggested that word representations could be activated through a process of lexical competition.
Visually Grounded Learning of Keyword Prediction from Untranscribed Speech
TLDR: This work uses an image-to-words multi-label visual classifier to tag images with soft textual labels, then trains a neural network to map from speech to these soft targets, and shows that the resulting speech system is able to predict which words occur in an utterance without seeing any parallel speech and text.
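The recipe above — tag images with soft word labels, then train the speech branch against them — can be sketched as a multi-label loss against soft targets. This is a hypothetical sketch; the paper's exact objective and vocabulary handling may differ.

```python
import torch.nn.functional as F

def soft_label_loss(speech_logits, image_word_probs):
    """speech_logits:    (batch, vocab) scores from the speech network.
    image_word_probs: (batch, vocab) soft word probabilities from the visual tagger.
    Per-word binary cross-entropy against the soft visual targets."""
    return F.binary_cross_entropy_with_logits(speech_logits, image_word_probs)
```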
Language learning using Speech to Image retrieval
TLDR: This work improves on existing neural network approaches to create visually grounded embeddings for spoken utterances, and shows that the visually grounded sentence encoder learns to recognise words from the input even though it is not explicitly trained for word recognition.
Large-scale representation learning from visually grounded untranscribed speech
TLDR: A scalable method is described for automatically generating diverse audio for image captioning datasets via a dual encoder that learns to align latent representations from both modalities, and it is shown that a masked margin softmax loss for such models is superior to the standard triplet loss.
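For intuition, the masked margin softmax loss can be read as an in-batch softmax retrieval loss with a margin on the matched pair. The sketch below is simplified: the published loss also masks accidental positives among the sampled negatives and schedules the margin during training, both of which are omitted here.

```python
import torch
import torch.nn.functional as F

def margin_softmax_loss(audio_emb, image_emb, margin=0.001):
    """Simplified in-batch margin-softmax loss for a dual encoder."""
    sims = audio_emb @ image_emb.t()                     # (batch, batch) similarities
    # Subtract the margin from the matched-pair similarities on the diagonal.
    sims = sims - margin * torch.eye(sims.size(0), device=sims.device)
    targets = torch.arange(sims.size(0), device=sims.device)
    # Cross-entropy in both retrieval directions (audio->image and image->audio).
    return F.cross_entropy(sims, targets) + F.cross_entropy(sims.t(), targets)
```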
Learning modality-invariant representations for speech and images
TLDR: This paper focuses on the task of learning a semantic vector space for both spoken and handwritten digits using the TIDIGITS and MNIST datasets, and includes a regularization term borrowed from variational autoencoders which drives the posterior distributions over embeddings to be unit Gaussian.
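The regularization term mentioned above is the standard VAE KL divergence between a diagonal-Gaussian posterior and a unit Gaussian prior; a short sketch, with batch averaging as an assumption:

```python
import torch

def unit_gaussian_kl(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ) per example, averaged over the batch."""
    kl = -0.5 * torch.sum(1.0 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
    return kl.mean()
```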
Unsupervised End-to-End Learning of Discrete Linguistic Units for Voice Conversion
TLDR: An unsupervised end-to-end training scheme is proposed in which discrete subword units are discovered from speech without using any labels; the approach offers strong voice conversion (VC) results, as it eliminates speaker identity while preserving the content of the speech.
Learning Word-Like Units from Joint Audio-Visual Analysis
TLDR: This model effectively implements a form of spoken language acquisition, in which the computer learns not only to recognize word categories by sound, but also to enrich the words it learns with semantics by grounding them in images.
Unsupervised Word Segmentation and Lexicon Discovery Using Acoustic Word Embeddings
TLDR: A novel unsupervised Bayesian model is presented that segments unlabeled speech and clusters the segments into hypothesized word groupings, resulting in a complete unsupervised tokenization of the input speech in terms of discovered word types.
Transfer Learning from Audio-Visual Grounding to Speech Recognition
TLDR: This paper proposes a novel transfer learning scenario, which distills robust phonetic features from grounding models that are trained to tell whether a pair of image and speech are semantically correlated, without using any textual transcripts.
A segmental framework for fully-unsupervised large-vocabulary speech recognition
TLDR: This article presents the first attempt to apply a Bayesian modelling framework with segmental word representations to large-vocabulary multi-speaker data, and shows that by imposing a consistent top-down segmentation while also using bottom-up knowledge from detected syllable boundaries, both single-speaker and multi-speaker versions of the system outperform a purely bottom-up, single-speaker, syllable-based approach.