Sound-Word2Vec: Learning Word Representations Grounded in Sounds
@inproceedings{Vijayakumar2017SoundWord2VecLW,
  title     = {Sound-Word2Vec: Learning Word Representations Grounded in Sounds},
  author    = {Ashwin K. Vijayakumar and Ramakrishna Vedantam and Devi Parikh},
  booktitle = {EMNLP},
  year      = {2017}
}
To be able to interact better with humans, it is crucial for machines to understand sound – a primary modality of human perception. Prior work has used sound to learn embeddings that improve generic semantic similarity assessment. In this work, we treat sound as a first-class citizen, studying downstream textual tasks which require aural grounding. To this end, we propose sound-word2vec – a new embedding scheme that learns specialized word embeddings grounded in sounds. For example, we…
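A minimal sketch of how such a sound-grounded embedding scheme can be trained, assuming a skip-gram-style objective in which tags that co-occur on the same sound clip predict one another (consistent with the freesound.org setup in the references below); the class name, dimensions, and warm-start step are illustrative assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoundWord2Vec(nn.Module):
    """Skip-gram-style sketch: a tag on a sound clip predicts the
    other tags on the same clip, grounding word embeddings in audition."""
    def __init__(self, vocab_size, dim=300, pretrained=None):
        super().__init__()
        self.emb_in = nn.Embedding(vocab_size, dim)   # word embeddings of interest
        self.emb_out = nn.Embedding(vocab_size, dim)  # context ("output") embeddings
        if pretrained is not None:                    # warm-start from text-only word2vec
            self.emb_in.weight.data.copy_(pretrained)

    def forward(self, center, context):
        # score every vocabulary word as a context for `center`
        logits = self.emb_in(center) @ self.emb_out.weight.t()
        return F.cross_entropy(logits, context)

# toy usage: tags 0, 1, 2 co-occur on one clip, so they predict each other
model = SoundWord2Vec(vocab_size=1000)
loss = model(torch.tensor([0, 0, 1]), torch.tensor([1, 2, 0]))
loss.backward()
```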
21 Citations
Learning Neural Audio Embeddings for Grounding Semantics in Auditory Perception
- Computer ScienceJ. Artif. Intell. Res.
- 2017
This paper examines grounding semantic representations in raw auditory data, using standard evaluations for multi-modal semantics, and shows how they can be applied to tasks where auditory perception is relevant, including two unsupervised categorization experiments.
Word2vec to behavior: morphology facilitates the grounding of language in machines.
- Computer Science2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
- 2019
It is shown that inducing such an alignment between motoric and linguistic similarities can be facilitated or hindered by the mechanical structure of the robot, and points to future, large scale methods that find and exploit relationships between action, language, and robot structure.
Visually Grounded Learning of Keyword Prediction from Untranscribed Speech
- Computer ScienceINTERSPEECH
- 2017
This work uses an image-to-words multi-label visual classifier to tag images with soft textual labels, and then trains a neural network to map from the speech to these soft targets, and shows that the resulting speech system is able to predict which words occur in an utterance without seeing any parallel speech and text.
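A minimal sketch of the soft-target training step described above, assuming per-word soft labels from an image tagger and a binary cross-entropy objective; the encoder architecture, feature sizes, and vocabulary size are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative stand-in for a speech encoder: maps utterance features to
# per-word logits over a 1000-word vocabulary.
speech_encoder = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 1000))

def keyword_loss(speech_feats, soft_targets):
    # sigmoid outputs = independent per-word presence probabilities,
    # matched against the visual tagger's soft labels in [0, 1]
    word_probs = torch.sigmoid(speech_encoder(speech_feats))
    return F.binary_cross_entropy(word_probs, soft_targets)

# toy usage: a batch of 2 utterances with soft visual targets
loss = keyword_loss(torch.randn(2, 512), torch.rand(2, 1000))
```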
Learning Multimodal Word Representations by Explicitly Embedding Syntactic and Phonetic Information
- Computer ScienceIEEE Access
- 2020
This article proposes an effective multimodal word representation model that uses two gate mechanisms to explicitly embed syntactic and phonetic information into multi-modal representations and uses supervised learning to train the model.
Effect of Text Color on Word Embeddings
- Computer Science, LinguisticsDAS
- 2020
This paper quantifies the color distribution of words from the book cover images and analyzes the correlation between the color and meaning of the word to verify the usefulness of text color in understanding the meanings of words, especially in identifying synonyms and antonyms.
Using Pause Information for More Accurate Entity Recognition
- Computer ScienceNLP4CONVAI
- 2021
It is demonstrated that this linguistic observation on pauses can be used to improve accuracy in machine-learned language understanding tasks, and the proposed novel embeddings improve the relative error rate by up to 8% consistently across three domains for French, without any added annotation or alignment costs to the parser.
A Synchronized Word Representation Method With Dual Perceptual Information
- Computer ScienceIEEE Access
- 2020
A language model is proposed that trains on dual perceptual information synchronously to enhance word representations, adopting an attention model to utilize both textual and phonetic perceptual information in unsupervised learning tasks.
Improving Visually Grounded Sentence Representations with Self-Attention
- Computer ScienceArXiv
- 2017
The results on transfer tasks show that self-attentive encoders are better for visual grounding, as they exploit specific words with strong visual associations.
Grounding ‘Grounding’ in NLP
- Computer ScienceFINDINGS
- 2021
This work investigates the gap between definitions of “grounding” in NLP and Cognitive Science, and presents ways to both create new tasks or repurpose existing ones to make advancements towards achieving a more complete sense of grounding.
Enhanced Double-Carrier Word Embedding via Phonetics and Writing
- Linguistics, Computer ScienceACM Trans. Asian Low Resour. Lang. Inf. Process.
- 2020
This work proposes double-carrier word embedding (DCWE), which can be applied to most languages; Chinese, English, and Spanish are selected as examples, and the models are evaluated through word similarity and text classification experiments.
References
SHOWING 1-10 OF 34 REFERENCES
VisualWord2Vec (Vis-W2V): Learning Visually Grounded Word Embeddings Using Abstract Scenes
- Computer Science2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2016
A model is proposed to learn visually grounded word embeddings (vis-w2v) that capture visual notions of semantic relatedness, showing improvements over text-only word embeddings (word2vec) on three tasks: common-sense assertion classification, visual paraphrasing, and text-based image retrieval.
Discriminative acoustic word embeddings: Recurrent neural network-based approaches
- Computer Science2016 IEEE Spoken Language Technology Workshop (SLT)
- 2016
This paper presents new discriminative embedding models based on recurrent neural networks (RNNs) and considers training losses that have been successful in prior work, in particular a cross entropy loss for word classification and a contrastive loss that explicitly aims to separate same-word and different-word pairs in a “Siamese network” training setting.
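A minimal sketch of the contrastive ("Siamese") loss variant described above, pulling same-word pairs together and pushing different-word pairs apart; the cosine-distance choice and margin value are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(emb_a, emb_b, same_word, margin=0.5):
    """Siamese-style loss sketch: minimize distance for same-word pairs,
    push different-word pairs at least `margin` apart (cosine distance)."""
    dist = 1.0 - F.cosine_similarity(emb_a, emb_b)           # (batch,)
    pos = same_word * dist                                    # same-word pairs
    neg = (1 - same_word) * torch.clamp(margin - dist, min=0.0)
    return (pos + neg).mean()

# toy usage with random stand-ins for RNN embeddings;
# same_word is 1.0 for matching pairs, 0.0 otherwise
a, b = torch.randn(4, 128), torch.randn(4, 128)
loss = contrastive_loss(a, b, same_word=torch.tensor([1., 0., 1., 0.]))
```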
Sound-based distributional models
- Computer ScienceIWCS
- 2015
The first results of the effort to build a perceptually grounded semantic model based on sound data collected from freesound.org show that the models are able to capture semantic relatedness, with the tag-based model scoring higher than the sound-based and combined models.
Multi-view Recurrent Neural Acoustic Word Embeddings
- Computer ScienceICLR
- 2017
This work takes a multi-view approach to learning acoustic word embeddings, jointly learning to embed acoustic sequences and their corresponding character sequences, and studies the effect of different loss variants, including fixed-margin and cost-sensitive losses.
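A minimal sketch of a fixed-margin multi-view loss, assuming in-batch negatives: each acoustic embedding should be closer to its own word's character-sequence embedding than to the other words' in the batch. The cost-sensitive variants the paper also studies are not shown; margin and shapes are illustrative:

```python
import torch

def multiview_margin_loss(acoustic_emb, char_emb, margin=0.4):
    """Fixed-margin loss sketch over two views of the same words."""
    sim = acoustic_emb @ char_emb.t()                 # (B, B) cross-view similarities
    pos = sim.diag().unsqueeze(1)                     # matched pairs on the diagonal
    hinge = torch.clamp(margin + sim - pos, min=0.0)  # margin violations by negatives
    hinge.fill_diagonal_(0.0)                         # don't penalize the positives
    return hinge.mean()

# toy usage: batch of 8 words, 128-d embeddings from each view
loss = multiview_margin_loss(torch.randn(8, 128), torch.randn(8, 128))
```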
Combining Language and Vision with a Multimodal Skip-gram Model
- Computer ScienceNAACL
- 2015
Since they propagate visual information to all words, the MMSKIP-GRAM models discover intriguing visual properties of abstract words, paving the way to realistic implementations of embodied theories of meaning.
Multimodal Distributional Semantics
- Computer ScienceJ. Artif. Intell. Res.
- 2014
This work proposes a flexible architecture to integrate text- and image-based distributional information, and shows in a set of empirical tests that the integrated model is superior to the purely text-based approach, and it provides somewhat complementary semantic information with respect to the latter.
Audio Word2Vec: Unsupervised Learning of Audio Segment Representations Using Sequence-to-Sequence Autoencoder
- Computer ScienceINTERSPEECH
- 2016
This paper proposes unsupervised learning of Audio Word2Vec from audio data without human annotation using a sequence-to-sequence autoencoder (SA), which significantly outperforms conventional Dynamic Time Warping (DTW) based approaches at much lower computational cost.
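A minimal sketch of a sequence-to-sequence autoencoder over audio frames, assuming MFCC inputs, GRU encoder/decoder, and teacher forcing; layer sizes and the single-layer setup are illustrative assumptions:

```python
import torch
import torch.nn as nn

class AudioSeq2SeqAE(nn.Module):
    """Seq2seq autoencoder sketch: an RNN encoder compresses an audio
    segment (MFCC frames) into a fixed vector, a decoder reconstructs
    the frames; the bottleneck vector is the segment embedding."""
    def __init__(self, n_mfcc=39, hidden=256):
        super().__init__()
        self.encoder = nn.GRU(n_mfcc, hidden, batch_first=True)
        self.decoder = nn.GRU(n_mfcc, hidden, batch_first=True)
        self.project = nn.Linear(hidden, n_mfcc)

    def forward(self, frames):                 # frames: (batch, time, n_mfcc)
        _, z = self.encoder(frames)            # z: (1, batch, hidden) segment embedding
        out, _ = self.decoder(frames, z)       # teacher-forced reconstruction
        return self.project(out), z.squeeze(0)

# toy usage: train to reconstruct the input frames
model = AudioSeq2SeqAE()
frames = torch.randn(2, 100, 39)
recon, embedding = model(frames)
loss = nn.functional.mse_loss(recon, frames)
```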
The Role of Context Types and Dimensionality in Learning Word Embeddings
- Computer ScienceNAACL
- 2016
We provide the first extensive evaluation of how using different types of context to learn skip-gram word embeddings affects performance on a wide range of intrinsic and extrinsic NLP tasks. Our…
SimLex-999: Evaluating Semantic Models With (Genuine) Similarity Estimation
- Computer ScienceCL
- 2015
SimLex-999 is presented, a gold standard resource for evaluating distributional semantic models that improves on existing resources in several important ways, and explicitly quantifies similarity rather than association or relatedness so that pairs of entities that are associated but not actually similar have a low rating.
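A minimal sketch of the standard intrinsic evaluation such resources support: Spearman correlation between human similarity ratings and cosine similarities of the learned embeddings. The function name and data format here are illustrative assumptions:

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate_similarity(embeddings, pairs):
    """Rank-correlate model similarities with human ratings.
    `embeddings` maps word -> vector; `pairs` is [(w1, w2, rating), ...]."""
    model_scores, human_scores = [], []
    for w1, w2, rating in pairs:
        v1, v2 = embeddings[w1], embeddings[w2]
        cos = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
        model_scores.append(cos)
        human_scores.append(rating)
    return spearmanr(model_scores, human_scores).correlation
```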
Distributed Representations of Words and Phrases and their Compositionality
- Computer ScienceNIPS
- 2013
This paper presents a simple method for finding phrases in text, and shows that learning good vector representations for millions of phrases is possible and describes a simple alternative to the hierarchical softmax called negative sampling.
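A minimal sketch of the negative-sampling objective described above, which raises the score of an observed (center, context) pair and lowers the scores of k sampled noise words, avoiding a full softmax over the vocabulary; dimensionality and the number of negatives are illustrative:

```python
import torch
import torch.nn.functional as F

def sgns_loss(center_vec, context_vec, negative_vecs):
    """Skip-gram with negative sampling: one positive pair vs. k negatives."""
    pos = F.logsigmoid(torch.dot(center_vec, context_vec))   # observed pair
    neg = F.logsigmoid(-negative_vecs @ center_vec).sum()    # k noise words
    return -(pos + neg)

# toy usage: 100-d vectors, 5 sampled negatives
loss = sgns_loss(torch.randn(100), torch.randn(100), torch.randn(5, 100))
```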