Corpus ID: 261138

DeViSE: A Deep Visual-Semantic Embedding Model

@inproceedings{Frome2013DeViSEAD,
  title={DeViSE: A Deep Visual-Semantic Embedding Model},
  author={Andrea Frome and Gregory S. Corrado and Jonathon Shlens and Samy Bengio and Jeffrey Dean and Marc'Aurelio Ranzato and Tomas Mikolov},
  booktitle={NIPS},
  year={2013}
}
Modern visual recognition systems are often limited in their ability to scale to large numbers of object categories. […]

Key Result: Semantic knowledge improves such zero-shot predictions, achieving hit rates of up to 18% across thousands of novel labels never seen by the visual model.
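
The mechanism behind this result: DeViSE trains a deep visual model (with a hinge rank loss) to predict the skip-gram word vector of an image's label, so at test time any label that has a word vector can be scored, whether or not the visual model ever saw it. A minimal numpy sketch of that inference step, not the authors' implementation (all names are illustrative):

```python
import numpy as np

def zero_shot_predict(image_embedding, label_vectors, top_k=5):
    """Rank candidate labels by cosine similarity between an image
    embedding (already projected into the word-vector space) and the
    word vector of each label, including labels the visual model
    never saw during training."""
    v = image_embedding / np.linalg.norm(image_embedding)
    scores = {
        label: float(v @ (w / np.linalg.norm(w)))
        for label, w in label_vectors.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Toy usage: random vectors stand in for a trained visual model's
# projected output and for word vectors of candidate labels.
rng = np.random.default_rng(0)
labels = {name: rng.standard_normal(300) for name in ["tiger", "shark", "tiger shark"]}
print(zero_shot_predict(rng.standard_normal(300), labels, top_k=2))
```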

Citations

Learning Deep Representations of Fine-Grained Visual Descriptions
TLDR
This model achieves strong performance on zero-shot text-based image retrieval and significantly outperforms the attribute-based state-of-the-art for zero-shot classification on the Caltech-UCSD Birds 200-2011 dataset.
Semantic Web and Zero-Shot Learning of Large Scale Visual Classes
TLDR
This paper proposes to leverage the interlinking of knowledge bases published as Linked Open Data to provide different semantic feature representations of visual classes in a large-scale setting and finds that some of them outperform the standard word embedding representations by a significant margin.
Zero-Shot Recognition based on Semantic Embeddings and Deep Clustering
TLDR
A model that can discover new objects in images even if no training data is available for the object class is introduced; it considers two aspects, namely the accuracy of the mapping into the semantic space and the suitability of the resulting embeddings for clustering.
Generalized Zero-shot Learning with Multi-source Semantic Embeddings for Scene Recognition
TLDR
This paper focuses on zero-shot scene recognition, a more challenging setting with hundreds of categories whose differences can be subtle and often localized in certain objects or regions, and proposes a feature generation framework whose novel components include multiple sources of semantic information that can enhance scene discrimination.
Semantic Attributes for Transfer Learning in Visual Recognition
TLDR
This thesis investigates how to perform visual recognition when few or no training samples are available, and introduces novel solutions for modeling and transferring semantic attributes and for automatically associating them with visual categories.
Visually grounded word embeddings for zero-shot learning of visual categories
TLDR
This paper represents visual classes as words and proposes to learn their high-level visual feature representations from the co-occurrence statistics of their labels in large text corpora, which can be seen as implicitly learning visually grounded word embeddings.
Learning Robust Visual-Semantic Embeddings
TLDR
An end-to-end learning framework is introduced that extracts more robust multi-modal representations across domains, together with a novel unsupervised-data adaptation inference technique for constructing more comprehensive embeddings for both labeled and unlabeled data.
Contrastive Learning of Visual-Semantic Embeddings
TLDR
This paper proposes two loss functions based on normalized cross-entropy for learning a joint visual-semantic embedding with batch contrastive training, evaluated on cross-modal image-to-text and text-to-image retrieval using the MS-COCO and Flickr30K datasets.
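
The paper's two specific losses are not reproduced here; the sketch below shows the standard symmetric normalized cross-entropy (InfoNCE-style) objective that such batch-contrastive training builds on, with matched image/text pairs on the diagonal as positives (names are illustrative):

```python
import numpy as np

def symmetric_nce_loss(img, txt, temperature=0.07):
    """Batch-contrastive loss over paired image/text embeddings:
    row i of `img` is paired with row i of `txt`; matched pairs are
    positives, every other pair in the batch is a negative."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)   # L2-normalize rows
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temperature                       # (B, B) similarities

    def nce(m):  # mean normalized cross-entropy, positives on the diagonal
        m = m - m.max(axis=1, keepdims=True)                 # numerical stability
        log_softmax = m - np.log(np.exp(m).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_softmax))

    return (nce(logits) + nce(logits.T)) / 2                 # both retrieval directions
```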
Open-World Visual Recognition Using Knowledge Graphs
TLDR
This work proposes a two-step approach that utilizes information from knowledge graphs to make semantically meaningful predictions for images belonging to previously unknown class labels, and increases performance for visual recognition by a factor of six compared to the baseline.
Predicting Deep Zero-Shot Convolutional Neural Networks Using Textual Descriptions
TLDR
A new model is presented that can classify unseen categories from their textual description; it takes advantage of the architecture of CNNs to learn features at different layers, rather than just learning an embedding space for both modalities, as is common with existing approaches.

References

Showing 1-10 of 27 references
Zero-Shot Learning Through Cross-Modal Transfer
TLDR
This work introduces a model that can recognize objects in images even if no training data is available for the object class, and uses novelty detection methods to differentiate unseen classes from seen classes.
Zero-Shot Learning by Convex Combination of Semantic Embeddings
TLDR
A simple method is proposed for constructing an image embedding system from any existing image classifier and a semantic word embedding model that contains the $n$ class labels in its vocabulary; it outperforms state-of-the-art methods on the ImageNet zero-shot learning task.
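
The convex-combination step (ConSE) is simple enough to sketch: the classifier's top predictions over seen classes are mapped into the embedding space as a probability-weighted average of their word vectors. A hedged numpy sketch under those assumptions, not the paper's code:

```python
import numpy as np

def conse_embed(probs, class_vectors, top_t=10):
    """Map a classifier's output into the word-embedding space as a
    convex combination of the word vectors of its top-T predicted
    seen classes, weighted by renormalized softmax probabilities.

    probs:         (n,) softmax output of the existing n-way classifier.
    class_vectors: (n, d) word vectors for the n seen class labels.
    """
    top = np.argsort(probs)[-top_t:]           # the T most likely seen classes
    weights = probs[top] / probs[top].sum()    # renormalize: weights sum to 1
    return weights @ class_vectors[top]        # (d,) point in embedding space

# Zero-shot prediction then ranks *unseen* labels by cosine similarity
# between this vector and their word vectors.
```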
Evaluating knowledge transfer and zero-shot learning in a large-scale setting
TLDR
An extensive evaluation of three popular approaches to knowledge transfer (KT) on a recently proposed large-scale dataset, the ImageNet Large Scale Visual Recognition Competition 2010 dataset, finding that none of the KT methods improves over one-vs-all classification but that they prove valuable for zero-shot learning, especially hierarchical and direct-similarity-based KT.
Zero-shot Learning with Semantic Output Codes
TLDR
A semantic output code classifier is introduced which utilizes a knowledge base of semantic properties of the label space $Y$ to extrapolate to novel classes, and can often predict words that people are thinking about from functional magnetic resonance images of their neural activity, even without training examples for those words.
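
The idea reduces to two stages: regress from input features to a vector of semantic properties, then pick the class whose knowledge-base property vector is nearest. A minimal sketch assuming a learned linear map (the paper's exact regressor is not reproduced; names are illustrative):

```python
import numpy as np

def soc_predict(x, W, class_properties):
    """Semantic-output-code prediction: map input features to a
    predicted semantic-property vector, then return the class whose
    knowledge-base property vector is nearest in Euclidean distance.

    x:                (f,) input features (e.g. an fMRI image).
    W:                (f, p) learned linear map from features to properties.
    class_properties: dict of {class: (p,) property vector}; may include
                      classes that had no training examples.
    """
    s = x @ W  # predicted semantic properties
    return min(class_properties,
               key=lambda c: np.linalg.norm(class_properties[c] - s))
```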
Visual and semantic similarity in ImageNet
TLDR
The insights gained from this analysis enable building a novel distance function between images that assesses whether they belong to the same basic-level category; it goes beyond direct visual distance by also exploiting semantic similarity measured through ImageNet.
ImageNet Large Scale Visual Recognition Challenge
TLDR
The creation of this benchmark dataset and the advances in object recognition that it has enabled are described, and state-of-the-art computer vision accuracy is compared with human accuracy.
ImageNet: A large-scale hierarchical image database
TLDR
A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.
Label Embedding Trees for Large Multi-Class Tasks
TLDR
An algorithm is proposed for learning a tree structure of classifiers which, by optimizing the overall tree loss, provides superior accuracy to existing tree-labeling methods, together with a method that learns to embed labels in a low-dimensional space, which is faster than non-embedding approaches and has superior accuracy to existing embedding approaches.
The Importance of Encoding Versus Training with Sparse Coding and Vector Quantization
TLDR
This work investigates the reasons for the success of sparse coding over vector quantization (VQ) by decoupling the training and encoding phases, allowing the contributions of each to be separated in a controlled way; it shows not only that fast VQ algorithms can be used for training, but that randomly chosen exemplars from the training set work just as well.
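
One encoder in the family this paper studies is the soft-threshold nonlinearity: given a dictionary (which can simply be randomly chosen, normalized training exemplars), each code is max(0, x·d − α). A small numpy sketch under that assumption (parameter names are illustrative):

```python
import numpy as np

def soft_threshold_encode(X, D, alpha=0.25):
    """Encode data against a dictionary with the soft-threshold
    nonlinearity max(0, x . d - alpha), which yields sparse codes
    regardless of how the dictionary was obtained.

    X: (n, f) data points.  D: (k, f) unit-norm dictionary rows.
    """
    return np.maximum(0.0, X @ D.T - alpha)   # (n, k) sparse feature codes

# Hypothetical usage: the "dictionary" is just k randomly chosen,
# normalized exemplars from the training set.
rng = np.random.default_rng(0)
patches = rng.standard_normal((1000, 64))
D = patches[rng.choice(1000, size=128, replace=False)]
D = D / np.linalg.norm(D, axis=1, keepdims=True)
codes = soft_threshold_encode(patches, D)
```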
ImageNet classification with deep convolutional neural networks
TLDR
A large, deep convolutional neural network was trained to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes and employed a recently developed regularization method called "dropout" that proved to be very effective.