Learning Image Embeddings using Convolutional Neural Networks for Improved Multi-Modal Semantics

@inproceedings{Kiela2014LearningIE,
  title={Learning Image Embeddings using Convolutional Neural Networks for Improved Multi-Modal Semantics},
  author={Douwe Kiela and L{\'e}on Bottou},
  booktitle={EMNLP},
  year={2014}
}
We construct multi-modal concept representations by concatenating a skip-gram linguistic representation vector with a visual concept representation vector computed using the feature extraction layers of a deep convolutional neural network (CNN) trained on a large labeled object recognition dataset. […] Experimental results are reported on the WordSim353 and MEN semantic relatedness evaluation tasks. We use visual features computed using either ImageNet or ESP Game images.
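
As a rough illustration of the construction described in the abstract, the sketch below concatenates an L2-normalized skip-gram vector with an L2-normalized visual vector pooled from per-image CNN features, and scores word pairs by cosine similarity as in the WordSim353 and MEN evaluations. The function names, the mean-pooling choice, and the normalization are illustrative assumptions, not necessarily the paper's exact pipeline.

import numpy as np

def l2_normalize(v):
    """Scale a vector to unit L2 norm so neither modality dominates the concatenation."""
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

def multimodal_vector(skipgram_vec, cnn_features):
    """Build a multi-modal concept representation.

    skipgram_vec : 1-D array, the linguistic (skip-gram) embedding of a concept.
    cnn_features : 2-D array (n_images x d), CNN feature-layer activations for
                   images depicting the concept, aggregated here by mean pooling.
    """
    linguistic = l2_normalize(np.asarray(skipgram_vec, dtype=float))
    visual = l2_normalize(np.asarray(cnn_features, dtype=float).mean(axis=0))
    return np.concatenate([linguistic, visual])

def cosine(u, v):
    """Cosine similarity, the usual pair score for semantic relatedness benchmarks."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

The pair scores produced this way would then be compared against the human relatedness ratings of WordSim353 or MEN, typically with Spearman's rank correlation.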


Citations

Multimodal Skipgram Using Convolutional Pseudowords
TLDR
This work introduces a simplified training objective for learning multimodal embeddings with the skip-gram architecture by introducing convolutional "pseudowords": embeddings composed of the additive combination of distributed word representations and image features from convolutional neural networks projected into the multimodal space.
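
The "pseudoword" construction summarized above amounts to adding a linearly projected image feature to a word embedding. A minimal sketch under assumed names and dimensions; the projection matrix W, the 300/4096 sizes, and the random initialization are illustrative, not taken from the cited paper.

import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM, CNN_DIM = 300, 4096                           # illustrative sizes
W = rng.normal(scale=0.01, size=(EMBED_DIM, CNN_DIM))    # projection into the word space (learned in practice)

def pseudoword(word_vec, cnn_feature, projection=W):
    """Additive combination of a word embedding and a projected image feature."""
    return word_vec + projection @ cnn_feature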
Comparing Data Sources and Architectures for Deep Visual Representation Learning in Semantics
TLDR
This study systematically compares deep visual representation learning techniques, experimenting with three well-known network architectures, and explores the optimal number of images and the multi-lingual applicability of multi-modal semantics.
Incorporating visual features into word embeddings: A bimodal autoencoder-based approach
TLDR
A novel bimodal autoencoder model for multimodal representation learning that enhances linguistic feature vectors by incorporating the corresponding visual features, together with an investigation of the potential efficacy of the enhanced word embeddings in discriminating antonyms and synonyms from vaguely related words.
Learning Multi-Modal Word Representation Grounded in Visual Context
TLDR
This work explores various choices for what can serve as visual context, presents an end-to-end method to integrate visual context elements into a multimodal skip-gram model, and provides experiments and extensive analysis of the obtained results.
Learning Fused Representations for Large-Scale Multimodal Classification
TLDR
A novel multimodal approach that fuses an image with an encoded text description to obtain an information-enriched image, indicating that the joint representation of encoded text and image in feature space improves multimodal classification performance while aiding interpretability.
Learning Neural Audio Embeddings for Grounding Semantics in Auditory Perception
TLDR
This paper examines grounding semantic representations in raw auditory data, using standard evaluations for multi-modal semantics, and shows how they can be applied to tasks where auditory perception is relevant, including two unsupervised categorization experiments.
A Probabilistic Model for Joint Learning of Word Embeddings from Texts and Images
TLDR
A novel probabilistic model is proposed to formalize how linguistic and perceptual inputs can work in concert to explain the observed word-context pairs in a text corpus; it attains competitive or stronger results compared to other state-of-the-art multimodal models.
Visual Bilingual Lexicon Induction with Transferred ConvNet Features
TLDR
By applying image-based features from a convolutional neural network to the task of bilingual lexicon induction, state-of-the-art performance is obtained on a standard dataset, with a 79% relative improvement over previous work that uses bags of visual words based on SIFT features.
Optimizing Visual Representations in Semantic Multi-modal Models with Dimensionality Reduction, Denoising and Contextual Information
This paper improves visual representations for multi-modal semantic models by (i) applying standard dimensionality reduction and denoising techniques, and by (ii) proposing a novel technique based on contextual information.
Image and Encoded Text Fusion for Multi-Modal Classification
TLDR
This paper presents a novel multi-modal approach that fuses images and text descriptions to improve multi-modal classification performance in real-world scenarios, and evaluates the approach against two well-known multi-modal strategies, namely early fusion and late fusion.
...

References

SHOWING 1-10 OF 48 REFERENCES
DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition
TLDR
DeCAF, an open-source implementation of deep convolutional activation features, along with all associated network parameters, is released to enable vision researchers to conduct experiments with deep representations across a range of visual concept learning paradigms.
Learning Grounded Meaning Representations with Autoencoders
TLDR
A new model is introduced which uses stacked autoencoders to learn higher-level embeddings from textual and visual input and which outperforms baselines and related models on similarity judgments and concept categorization.
Learning and Transferring Mid-level Image Representations Using Convolutional Neural Networks
TLDR
This work designs a method to reuse layers trained on the ImageNet dataset to compute mid-level image representations for images in the PASCAL VOC dataset, and shows that, despite differences in image statistics and tasks between the two datasets, the transferred representation leads to significantly improved results for object and action classification.
DeViSE: A Deep Visual-Semantic Embedding Model
TLDR
This paper presents a new deep visual-semantic embedding model trained to identify visual objects using both labeled image data and semantic information gleaned from unannotated text, and shows that the semantic information can be exploited to make predictions about tens of thousands of image labels not observed during training.
Multimodal learning with deep Boltzmann machines
TLDR
A Deep Boltzmann Machine is proposed for learning a generative model of multimodal data and it is shown that the model can be used to create fused representations by combining features across modalities, which are useful for classification and information retrieval.
Improving Multi-Modal Representations Using Image Dispersion: Why Less is Sometimes More
TLDR
An unsupervised method to determine whether to include perceptual input for a concept is proposed, and it is shown that it significantly improves the ability of multi-modal models to learn and represent word meanings.
Grounded Compositional Semantics for Finding and Describing Images with Sentences
TLDR
The DT-RNN model, which uses dependency trees to embed sentences into a vector space in order to retrieve images described by those sentences, outperforms other recursive and recurrent neural networks, kernelized CCA, and a bag-of-words baseline on the tasks of finding an image that fits a sentence description and vice versa.
Visualizing and Understanding Convolutional Networks
TLDR
A novel visualization technique is introduced that gives insight into the function of intermediate feature layers and the operation of the classifier in large Convolutional Network models; used in a diagnostic role, it finds model architectures that outperform Krizhevsky et al. on the ImageNet classification benchmark.
Online multimodal deep similarity learning with application to image retrieval
TLDR
This paper proposes a novel framework of online multimodal deep similarity learning (OMDSL), which aims to optimally integrate multiple deep neural networks pretrained with stacked denoising autoencoders to improve similarity search in multimedia information retrieval tasks.
CNN Features Off-the-Shelf: An Astounding Baseline for Recognition
TLDR
A series of experiments conducted for different recognition tasks using the publicly available code and model of the OverFeat network, which was trained to perform object classification on ILSVRC13, suggests that features obtained from deep learning with convolutional nets should be the primary candidate in most visual recognition tasks.
...