Learning Image Embeddings using Convolutional Neural Networks for Improved Multi-Modal Semantics
@inproceedings{Kiela2014LearningIE,
  title     = {Learning Image Embeddings using Convolutional Neural Networks for Improved Multi-Modal Semantics},
  author    = {Douwe Kiela and L{\'e}on Bottou},
  booktitle = {EMNLP},
  year      = {2014}
}
We construct multi-modal concept representations by concatenating a skip-gram linguistic representation vector with a visual concept representation vector computed using the feature extraction layers of a deep convolutional neural network (CNN) trained on a large labeled object recognition dataset. Experimental results are reported on the WordSim353 and MEN semantic relatedness evaluation tasks. We use visual features computed using either ImageNet or ESP Game images.
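A minimal sketch of the concatenation step described in the abstract, assuming L2-normalised inputs; the vector dimensions and the mixing weight `alpha` are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def multimodal_vector(linguistic_vec, visual_vec, alpha=0.5):
    """Concatenate an L2-normalised skip-gram vector with an L2-normalised
    CNN feature vector. The relative weighting `alpha` is an assumption for
    illustration; the paper's exact weighting scheme may differ."""
    lin = linguistic_vec / np.linalg.norm(linguistic_vec)
    vis = visual_vec / np.linalg.norm(visual_vec)
    return np.concatenate([alpha * lin, (1.0 - alpha) * vis])

# Toy usage: a 300-d skip-gram vector and a 4096-d CNN (fc7-style) vector.
skipgram = np.random.randn(300)
cnn_feat = np.random.randn(4096)
concept = multimodal_vector(skipgram, cnn_feat)
print(concept.shape)  # (4396,)
```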
199 Citations
Multimodal Skipgram Using Convolutional Pseudowords
- ArXiv
- 2015
This work introduces a simplified training objective for learning multimodal embeddings with the skip-gram architecture by introducing convolutional "pseudowords": embeddings composed of the additive combination of distributed word representations and image features from convolutional neural networks projected into the multi-modal space.
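A rough sketch of the additive "pseudoword" idea described above, assuming image features are mapped into the word-embedding space by a projection matrix; the dimensions and the randomly initialised `W_proj` are hypothetical stand-ins (in the cited work the projection would be learned jointly with the skip-gram objective).

```python
import numpy as np

rng = np.random.default_rng(0)
d_word, d_image = 300, 4096

# Hypothetical projection matrix mapping CNN image features into the
# word-embedding space; here it is random, purely for illustration.
W_proj = rng.normal(scale=0.01, size=(d_word, d_image))

def pseudoword(word_vec, image_feat):
    """Additive combination of a word embedding and projected image features."""
    return word_vec + W_proj @ image_feat

combined = pseudoword(rng.normal(size=d_word), rng.normal(size=d_image))
print(combined.shape)  # (300,)
```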
Comparing Data Sources and Architectures for Deep Visual Representation Learning in Semantics
- EMNLP
- 2016
This study systematically compares deep visual representation learning techniques, experimenting with three well-known network architectures, and explores the optimal number of images and the multi-lingual applicability of multi-modal semantics.
Incorporating visual features into word embeddings: A bimodal autoencoder-based approach
- IWCS
- 2017
A novel bimodal autoencoder model for multimodal representation learning that enhances linguistic feature vectors by incorporating the corresponding visual features, together with an investigation of the potential efficacy of the enhanced word embeddings in discriminating antonyms and synonyms from vaguely related words.
Learning Multi-Modal Word Representation Grounded in Visual Context
- AAAI
- 2018
This work explores various choices for what can serve as a visual context and presents an end-to-end method to integrate visual context elements in a multimodal skip-gram model and provides experiments and extensive analysis of the obtained results.
Learning Fused Representations for Large-Scale Multimodal Classification
- IEEE Sensors Letters
- 2019
A novel multimodal approach that fuses an image with an encoded text description to obtain an information-enriched image; results indicate that the joint representation of encoded text and image in feature space improves multimodal classification performance while aiding interpretability.
Learning Neural Audio Embeddings for Grounding Semantics in Auditory Perception
- J. Artif. Intell. Res.
- 2017
This paper examines grounding semantic representations in raw auditory data, using standard evaluations for multi-modal semantics, and shows how they can be applied to tasks where auditory perception is relevant, including two unsupervised categorization experiments.
A Probabilistic Model for Joint Learning of Word Embeddings from Texts and Images
- EMNLP
- 2018
A novel probabilistic model is proposed to formalize how linguistic and perceptual inputs can work in concert to explain the observed word-context pairs in a text corpus and attains equally competitive or stronger results when compared to other state-of-the-art multimodal models.
Visual Bilingual Lexicon Induction with Transferred ConvNet Features
- EMNLP
- 2015
By applying features from a convolutional neural network to the task of bilingual lexicon induction using image-based features, state-of-the-art performance is obtained on a standard dataset, a 79% relative improvement over previous work which uses bags of visual words based on SIFT features.
Optimizing Visual Representations in Semantic Multi-modal Models with Dimensionality Reduction, Denoising and Contextual Information
- GSCL
- 2017
This paper improves visual representations for multi-modal semantic models, by (i) applying standard dimensionality reduction and denoising techniques, and by (ii) proposing a novel technique …
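As a rough illustration of step (i) above, the sketch below applies PCA to a matrix of CNN image features; the dimensionalities are assumptions, and the paper's specific denoising and contextual-information steps are not shown.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for a matrix of CNN features (one row per image); 4096 input
# dimensions and 128 target components are illustrative choices only.
visual_feats = np.random.randn(500, 4096)

pca = PCA(n_components=128)
reduced = pca.fit_transform(visual_feats)
print(reduced.shape)  # (500, 128)
```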
Image and Encoded Text Fusion for Multi-Modal Classification
- 2018 Digital Image Computing: Techniques and Applications (DICTA)
- 2018
This paper presents a novel multi-modal approach that fuses images and text descriptions to improve multi-modal classification performance in real-world scenarios, and evaluates the approach against two well-known multi-modal strategies, namely early fusion and late fusion.
References
SHOWING 1-10 OF 48 REFERENCES
DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition
- ICML
- 2014
DeCAF, an open-source implementation of deep convolutional activation features, along with all associated network parameters, is released to enable vision researchers to conduct experiments with deep representations across a range of visual concept learning paradigms.
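A hedged sketch of the general idea (taking activations from an intermediate layer of a pretrained network as generic image features), using a recent torchvision AlexNet as a stand-in; DeCAF itself was a Caffe-era implementation, and the file name `example.jpg` is a placeholder.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Load an ImageNet-pretrained AlexNet and keep everything up to the
# penultimate fully connected layer, giving 4096-d activation features.
model = models.alexnet(weights=models.AlexNet_Weights.DEFAULT)
model.eval()
extractor = torch.nn.Sequential(
    model.features,
    model.avgpool,
    torch.nn.Flatten(),
    *list(model.classifier.children())[:-1],
)

preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    feat = extractor(img)  # shape: (1, 4096)
```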
Learning Grounded Meaning Representations with Autoencoders
- ACL
- 2014
A new model is introduced which uses stacked autoencoders to learn higher-level embeddings from textual and visual input and which outperforms baselines and related models on similarity judgments and concept categorization.
Learning and Transferring Mid-level Image Representations Using Convolutional Neural Networks
- 2014 IEEE Conference on Computer Vision and Pattern Recognition
- 2014
This work designs a method to reuse layers trained on the ImageNet dataset to compute mid-level image representation for images in the PASCAL VOC dataset, and shows that despite differences in image statistics and tasks in the two datasets, the transferred representation leads to significantly improved results for object and action classification.
DeViSE: A Deep Visual-Semantic Embedding Model
- NIPS
- 2013
This paper presents a new deep visual-semantic embedding model trained to identify visual objects using both labeled image data as well as semantic information gleaned from unannotated text and shows that the semantic information can be exploited to make predictions about tens of thousands of image labels not observed during training.
Multimodal learning with deep Boltzmann machines
- J. Mach. Learn. Res.
- 2012
A Deep Boltzmann Machine is proposed for learning a generative model of multimodal data and it is shown that the model can be used to create fused representations by combining features across modalities, which are useful for classification and information retrieval.
Improving Multi-Modal Representations Using Image Dispersion: Why Less is Sometimes More
- ACL
- 2014
An unsupervised method to determine whether to include perceptual input for a concept is proposed, and it is shown that it significantly improves the ability of multi-modal models to learn and represent word meanings.
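A rough sketch of the dispersion idea described above (an unsupervised filter for deciding whether perceptual input should be included for a concept), assuming the metric is the mean pairwise cosine distance over a concept's image feature vectors; the decision threshold used in the paper is not shown.

```python
import numpy as np

def image_dispersion(image_feats):
    """Mean pairwise cosine distance between the image feature vectors of a
    concept. Concepts with low dispersion (visually consistent images) are
    better candidates for including perceptual input."""
    X = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    sims = X @ X.T
    iu = np.triu_indices(len(X), k=1)
    return float(np.mean(1.0 - sims[iu]))

# Toy usage with random stand-ins for CNN features of one concept's images.
feats = np.random.randn(20, 4096)
print(image_dispersion(feats))
```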
Grounded Compositional Semantics for Finding and Describing Images with Sentences
- Transactions of the Association for Computational Linguistics
- 2014
The DT-RNN model, which uses dependency trees to embed sentences into a vector space in order to retrieve images described by those sentences, outperforms other recursive and recurrent neural networks, kernelized CCA, and a bag-of-words baseline on the tasks of finding an image that fits a sentence description and vice versa.
Visualizing and Understanding Convolutional Networks
- ECCV
- 2014
A novel visualization technique is introduced that gives insight into the function of intermediate feature layers and the operation of the classifier in large convolutional network models; it is also used in a diagnostic role to find model architectures that outperform Krizhevsky et al. on the ImageNet classification benchmark.
Online multimodal deep similarity learning with application to image retrieval
- ACM Multimedia
- 2013
This paper proposes a novel framework of online multimodal deep similarity learning (OMDSL), which aims to optimally integrate multiple deep neural networks pretrained with stacked denoising autoencoders to improve similarity search in multimedia information retrieval tasks.
CNN Features Off-the-Shelf: An Astounding Baseline for Recognition
- 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops
- 2014
A series of experiments conducted for different recognition tasks using the publicly available code and model of the OverFeat network which was trained to perform object classification on ILSVRC13 suggest that features obtained from deep learning with convolutional nets should be the primary candidate in most visual recognition tasks.