Corpus ID: 3948398

Sensory-Aware Multimodal Fusion for Word Semantic Similarity Estimation

Georgios Paraskevopoulos, Giannis Karamanolakis, Elias Iosif, Aggelos Pikrakis, Alexandros Potamianos
Traditional semantic models are disembodied from human perception and action. In this work, we address this problem by grounding semantic representations of words in the acoustic and visual modalities. Specifically, we estimate multimodal word representations via the fusion of the auditory and visual modalities with the text modality. We employ middle and late fusion of representations, with modality weights assigned to each of the unimodal representations. We also propose a fusion…
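The two fusion schemes mentioned in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's actual implementation: the weight values, function names, and the use of cosine similarity are assumptions for the sake of the example.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def middle_fusion(text_vec, audio_vec, visual_vec, weights=(0.6, 0.2, 0.2)):
    """Middle fusion: scale each unimodal embedding by its modality weight,
    then concatenate into one multimodal representation.
    (Weights here are illustrative, not the paper's tuned values.)"""
    w_t, w_a, w_v = weights
    return np.concatenate([w_t * text_vec, w_a * audio_vec, w_v * visual_vec])

def late_fusion(sim_text, sim_audio, sim_visual, weights=(0.6, 0.2, 0.2)):
    """Late fusion: compute a similarity score per modality first,
    then combine the scores as a weighted sum."""
    w_t, w_a, w_v = weights
    return w_t * sim_text + w_a * sim_audio + w_v * sim_visual
```

In middle fusion a single similarity is computed on the fused vectors; in late fusion each modality produces its own similarity score and the scores are combined afterward.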



Deep embodiment: grounding semantics in perceptual modalities
This thesis shows that transferred convolutional neural network representations outperform the traditional bag-of-visual-words method for obtaining visual features, and that these representations can be applied successfully to various natural language processing tasks.
Combining Language and Vision with a Multimodal Skip-gram Model
Since they propagate visual information to all words, the MMSKIP-GRAM models discover intriguing visual properties of abstract words, paving the way to realistic implementations of embodied theories of meaning.
Multi- and Cross-Modal Semantics Beyond Vision: Grounding in Auditory Perception
This work grounds semantic representations in raw auditory data and evaluates them with standard benchmarks for multi-modal semantics, measuring conceptual similarity and relatedness, as well as a zero-shot learning task that maps between the linguistic and auditory modalities.
Crossmodal Network-Based Distributional Semantic Models
This work proposes the crossmodal extension of a two-tier text-based model, where semantic representations are encoded in the first layer, while the second layer is used for computing similarity between words.
Multimodal Distributional Semantics
This work proposes a flexible architecture for integrating text- and image-based distributional information, and shows in a set of empirical tests that the integrated model is superior to the purely text-based approach and provides somewhat complementary semantic information with respect to it.
Learning Image Embeddings using Convolutional Neural Networks for Improved Multi-Modal Semantics
We construct multi-modal concept representations by concatenating a skip-gram linguistic representation vector with a visual concept representation vector computed using the feature extraction layers of a deep convolutional neural network.
Audio-Based Distributional Representations of Meaning Using a Fusion of Feature Encodings
This work constructs an ADSM model to compute the distance between words (a lexical semantic similarity task); the model is shown to significantly outperform the state-of-the-art results reported in the literature.
Sound-based distributional models
The first results of this effort to build a perceptually grounded semantic model from sound data collected from freesound.org show that the models are able to capture semantic relatedness, with the tag-based model scoring higher than the sound-based model and the combined model.
Audio-based Distributional Semantic Models for Music Auto-tagging and Similarity Measurement
Acoustic-semantic models are shown to outperform the state-of-the-art for this task and to produce high-quality tags for audio/music clips.
Distributional Semantics in Technicolor
While visual models built with state-of-the-art computer vision techniques perform worse than textual models on general tasks, they are as good as or better than textual models at capturing the meaning of words with visual correlates, such as color terms, even in a nontrivial task involving nonliteral uses of such words.