Sound-Word2Vec: Learning Word Representations Grounded in Sounds

@inproceedings{Vijayakumar2017SoundWord2VecLW,
  title={Sound-Word2Vec: Learning Word Representations Grounded in Sounds},
  author={Ashwin K. Vijayakumar and Ramakrishna Vedantam and Devi Parikh},
  booktitle={EMNLP},
  year={2017}
}
To be able to interact better with humans, it is crucial for machines to understand sound – a primary modality of human perception. Previous works have used sound to learn embeddings for improved generic semantic similarity assessment. In this work, we treat sound as a first-class citizen, studying downstream textual tasks which require aural grounding. To this end, we propose sound-word2vec – a new embedding scheme that learns specialized word embeddings grounded in sounds. For example, we…
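
The abstract above is truncated before any training details, but the general idea of grounding word embeddings in sound can be illustrated with a minimal sketch: word embeddings are fine-tuned so that the tags attached to an audio clip predict a pre-computed cluster of similar-sounding clips, pulling words whose referents sound alike closer together. The array names, cluster assignments, and hyperparameters below are illustrative assumptions, not code from the paper.

import numpy as np

# Toy setup: word embeddings (e.g. a word2vec initialization) and a linear
# classifier over sound clusters obtained by clustering audio features.
rng = np.random.default_rng(0)
vocab_size, embed_dim, num_sound_clusters = 5000, 100, 50
W = rng.normal(scale=0.1, size=(vocab_size, embed_dim))          # word embeddings to be grounded
C = rng.normal(scale=0.1, size=(embed_dim, num_sound_clusters))  # sound-cluster classifier weights
lr = 0.05

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def train_step(tag_ids, cluster_id):
    """One audio clip: the word ids of its tags and its assigned sound-cluster id."""
    h = W[tag_ids].mean(axis=0)          # average embedding of the clip's tags
    p = softmax(h @ C)                   # predicted distribution over sound clusters
    grad_logits = p.copy()
    grad_logits[cluster_id] -= 1.0       # cross-entropy gradient at the output
    grad_h = C @ grad_logits
    C[...] -= lr * np.outer(h, grad_logits)
    W[tag_ids] -= lr * grad_h / len(tag_ids)   # nudge the tag embeddings toward the sound cluster
    return -np.log(p[cluster_id] + 1e-12)      # cross-entropy loss, for monitoring only

# Hypothetical usage: a clip tagged with word ids [12, 87], assigned to sound cluster 7.
loss = train_step(np.array([12, 87]), cluster_id=7)

Repeating such updates over many tagged clips moves words that label similar-sounding audio toward each other, which is the kind of aural grounding the abstract describes.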

Citations

Learning Neural Audio Embeddings for Grounding Semantics in Auditory Perception
TLDR
This paper examines grounding semantic representations in raw auditory data, using standard evaluations for multi-modal semantics, and shows how they can be applied to tasks where auditory perception is relevant, including two unsupervised categorization experiments.
Word2vec to behavior: morphology facilitates the grounding of language in machines.
TLDR
It is shown that inducing such an alignment between motoric and linguistic similarities can be facilitated or hindered by the mechanical structure of the robot, pointing to future, large-scale methods that find and exploit relationships between action, language, and robot structure.
Visually Grounded Learning of Keyword Prediction from Untranscribed Speech
TLDR
This work uses an image-to-words multi-label visual classifier to tag images with soft textual labels, and then trains a neural network to map from the speech to these soft targets, and shows that the resulting speech system is able to predict which words occur in an utterance without seeing any parallel speech and text.
Learning Multimodal Word Representations by Explicitly Embedding Syntactic and Phonetic Information
TLDR
This article proposes an effective multimodal word representation model that uses two gate mechanisms to explicitly embed syntactic and phonetic information into multi-modal representations and uses supervised learning to train the model.
Effect of Text Color on Word Embeddings
TLDR
This paper quantifies the color distribution of words from the book cover images and analyzes the correlation between the color and meaning of the word to verify the usefulness of text color in understanding the meanings of words, especially in identifying synonyms and antonyms.
Using Pause Information for More Accurate Entity Recognition
TLDR
It is demonstrated that the linguistic observation on pauses can be used to improve accuracy in machine-learnt language understanding tasks, and that the proposed novel embeddings improve the relative error rate by up to 8% consistently across three domains for French, without any added annotation or alignment costs to the parser.
A Synchronized Word Representation Method With Dual Perceptual Information
TLDR
A language model is proposed that trains dual perceptual information in a synchronized way to enhance word representations, adopting an attention model to utilize both text and phonetic perceptual information in unsupervised learning tasks.
Improving Visually Grounded Sentence Representations with Self-Attention
TLDR
The results on transfer tasks show that self-attentive encoders are better for visual grounding, as they exploit specific words with strong visual associations.
Grounding ‘Grounding’ in NLP
TLDR
This work investigates the gap between definitions of “grounding” in NLP and Cognitive Science, and presents ways to create new tasks or repurpose existing ones to make advancements towards achieving a more complete sense of grounding.
Enhanced Double-Carrier Word Embedding via Phonetics and Writing
TLDR
This work proposes double-carrier word embedding (DCWE), which can be applied to most languages; Chinese, English, and Spanish are selected as examples, and the models are evaluated through word similarity and text classification experiments.

References

SHOWING 1-10 OF 34 REFERENCES
VisualWord2Vec (Vis-W2V): Learning Visually Grounded Word Embeddings Using Abstract Scenes
TLDR
A model is proposed to learn visually grounded word embeddings (vis-w2v) that capture visual notions of semantic relatedness, showing improvements over text-only word embeddings (word2vec) on three tasks: common-sense assertion classification, visual paraphrasing and text-based image retrieval.
Discriminative acoustic word embeddings: Recurrent neural network-based approaches
TLDR
This paper presents new discriminative embedding models based on recurrent neural networks (RNNs) and considers training losses that have been successful in prior work, in particular a cross entropy loss for word classification and a contrastive loss that explicitly aims to separate same-word and different-word pairs in a “Siamese network” training setting.
Sound-based distributional models
TLDR
The first results of the efforts to build a perceptually grounded semantic model based on sound data collected from freesound.org show that the models are able to capture semantic relatedness, with the tag-based model scoring higher than the sound-based model and the combined model.
Multi-view Recurrent Neural Acoustic Word Embeddings
TLDR
This work takes a multi-view approach to learning acoustic word embeddings, jointly embedding acoustic sequences and their corresponding character sequences, and studies the effect of different loss variants, including fixed-margin and cost-sensitive losses.
Combining Language and Vision with a Multimodal Skip-gram Model
TLDR
Since they propagate visual information to all words, the MMSKIP-GRAM models discover intriguing visual properties of abstract words, paving the way to realistic implementations of embodied theories of meaning.
Multimodal Distributional Semantics
TLDR
This work proposes a flexible architecture to integrate text- and image-based distributional information, and shows in a set of empirical tests that the integrated model is superior to the purely text-based approach, and it provides somewhat complementary semantic information with respect to the latter.
Audio Word2Vec: Unsupervised Learning of Audio Segment Representations Using Sequence-to-Sequence Autoencoder
TLDR
This paper proposes unsupervised learning of Audio Word2Vec from audio data without human annotation using a Sequence-to-sequence Autoencoder (SA), which significantly outperformed the conventional Dynamic Time Warping (DTW) based approaches at significantly lower computation requirements.
The Role of Context Types and Dimensionality in Learning Word Embeddings
We provide the first extensive evaluation of how using different types of context to learn skip-gram word embeddings affects performance on a wide range of intrinsic and extrinsic NLP tasks…
SimLex-999: Evaluating Semantic Models With (Genuine) Similarity Estimation
TLDR
SimLex-999 is presented, a gold standard resource for evaluating distributional semantic models that improves on existing resources in several important ways, and explicitly quantifies similarity rather than association or relatedness so that pairs of entities that are associated but not actually similar have a low rating.
Distributed Representations of Words and Phrases and their Compositionality
TLDR
This paper presents a simple method for finding phrases in text, and shows that learning good vector representations for millions of phrases is possible and describes a simple alternative to the hierarchical softmax called negative sampling.
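
For orientation, the negative-sampling objective described in the entry above replaces the full softmax over the vocabulary: for an input word $w_I$, an observed context word $w_O$, and $k$ negative samples drawn from a noise distribution $P_n(w)$, the skip-gram model maximizes

\log \sigma\left({v'_{w_O}}^{\top} v_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\left[\log \sigma\left(-{v'_{w_i}}^{\top} v_{w_I}\right)\right]

where $v$ and $v'$ are the input and output vector representations of words and $\sigma$ is the logistic sigmoid.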