Learning Models for Object Recognition from Natural Language Descriptions

@inproceedings{Wang2009LearningMF,
  title={Learning Models for Object Recognition from Natural Language Descriptions},
  author={Josiah Wang and Katja Markert and Mark Everingham},
  booktitle={BMVC},
  year={2009}
}
We investigate the task of learning models for visual object recognition from natural language descriptions alone. The approach contributes to the recognition of fine-grained object categories, such as animal and plant species, where it may be difficult to collect many images for training, but where textual descriptions of visual attributes are readily available. As an example we tackle recognition of butterfly species, learning models from descriptions in an online nature guide. We propose…
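The idea sketched in the abstract, recognizing a category from attributes described in text rather than from many training images, can be illustrated with a toy matcher: attribute terms mined from a description are compared against attributes detected in an image. The species names, attribute vocabulary, and overlap scoring below are illustrative assumptions, not the paper's actual method.

```python
# Toy sketch (not the paper's implementation): score a test image's detected
# visual attributes against attribute sets mined from textual descriptions.
from typing import Dict, Set

# Attributes extracted (e.g., by keyword matching) from nature-guide text.
# Both species and attributes here are hypothetical examples.
species_attributes: Dict[str, Set[str]] = {
    "peacock": {"orange", "black spots", "eyespot"},
    "monarch": {"orange", "black veins", "white spots"},
}

def score(detected: Set[str], described: Set[str]) -> float:
    """Jaccard overlap between detected and described attribute sets."""
    if not described:
        return 0.0
    return len(detected & described) / len(detected | described)

def classify(detected: Set[str]) -> str:
    """Pick the species whose textual description best matches the detections."""
    return max(species_attributes, key=lambda s: score(detected, species_attributes[s]))

print(classify({"orange", "black veins"}))  # → monarch
```

The point of the sketch is only the data flow: the "model" for each class comes entirely from text, and the image contributes a set of detected attributes.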

Citations

Combining visual recognition and computational linguistics : linguistic knowledge for visual recognition and natural language descriptions of visual content
This thesis presents a novel approach for automatic video description and shows the benefits of extracting linguistic knowledge for object and activity recognition as well as the advantage of visual recognition for understanding activity descriptions.
Discriminative object categorization with external semantic knowledge
The goal is to learn a discriminative model for categorization that leverages semantic knowledge for object recognition, with a special focus on the semantic relationship among different categories and concepts.
Learning Transferable Visual Models From Natural Language Supervision
It is demonstrated that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
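The pre-training task summarized above, predicting which caption goes with which image in a batch, can be sketched as a symmetric contrastive loss over an image-text similarity matrix. Random vectors stand in for encoder outputs, and the temperature value is an arbitrary assumption; this is an illustration of the objective, not the actual implementation.

```python
# Minimal sketch of a symmetric contrastive objective: matched (image i,
# caption i) pairs lie on the diagonal of the similarity matrix.
import numpy as np

rng = np.random.default_rng(0)
batch, dim = 4, 8
img = rng.normal(size=(batch, dim))   # stand-in image embeddings
txt = rng.normal(size=(batch, dim))   # stand-in text embeddings

# L2-normalize so the dot product is cosine similarity.
img /= np.linalg.norm(img, axis=1, keepdims=True)
txt /= np.linalg.norm(txt, axis=1, keepdims=True)

temperature = 0.07                    # assumed value, for illustration only
logits = img @ txt.T / temperature    # (batch, batch) similarity matrix

def cross_entropy(logits: np.ndarray, targets: np.ndarray) -> float:
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-logp[np.arange(len(targets)), targets].mean())

# Push each image toward its own caption and each caption toward its image.
targets = np.arange(batch)
loss = (cross_entropy(logits, targets) + cross_entropy(logits.T, targets)) / 2
print(round(loss, 3))
```

Scalability comes from the fact that every batch of N pairs yields N correct and N² − N incorrect pairings as free supervision.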
Write a Classifier: Predicting Visual Classifiers from Unstructured Text
This work proposes and investigates two baseline formulations, based on regression and domain transfer, that predict a linear classifier, and proposes a new constrained optimization formulation that combines a regression function and a knowledge transfer function with additional constraints to predict the parameters of a linear classifier.
Identifying and learning visual attributes for object recognition
  • K. Wan, S. Roy · 2010 IEEE International Conference on Image Processing, 2010
This work proposes an attribute-centric approach for visual object recognition and introduces a novel algorithm that iteratively refines the distributions of patches over attributes using a nearest-neighbor attribute classifier.
A Multi-class Approach - Building a Visual Classifier based on Textual Descriptions using Zero-Shot Learning
This paper introduces a visual classifier which uses a concept of transfer learning, namely Zero-Shot Learning (ZSL), and standard Natural Language Processing techniques to train a classifier by mapping labelled images to their textual description instead of training it for specific classes.
Learning to Generate Scene Graph from Natural Language Supervision
This paper proposes one of the first methods that learns from image-sentence pairs to extract a graphical representation of localized objects and their relationships within an image, known as a scene graph, and designs a Transformer-based model to predict these "pseudo" labels via a masked token prediction task.
Harvesting Training Images for Fine-Grained Object Categories Using Visual Descriptions
Visual descriptions from nature guides are used as a novel augmentation to the well-known use of category names: they are included both in the query process, to find potential category images, and in image reranking, where an image is ranked more highly if the web-page text surrounding it is similar to the visual descriptions.
Relative Attributes for Enhanced Human-Machine Communication
Overall, it is found that relative attributes enhance the precision of communication between humans and computer vision algorithms, providing the richer language needed to fluidly "teach" a system about visual concepts.
BabyTalk: Understanding and Generating Simple Image Descriptions
The proposed system to automatically generate natural language descriptions from images is very effective at producing relevant sentences for images and generates descriptions that are notably more true to the specific image content than previous work.

References

Showing 1-10 of 23 references
Learning object categories from Google's image search
A new model, TSI-pLSA, is developed, which extends pLSA (as applied to visual words) to include spatial information in a translation and scale invariant manner, and can handle the high intra-class variability and large proportion of unrelated images returned by search engines.
Object Recognition as Machine Translation: Learning a Lexicon for a Fixed Image Vocabulary
This work shows how to cluster words that individually are difficult to predict into clusters that can be predicted well; for example, the distinction between "train" and "locomotive" cannot be predicted with the current set of features, but the underlying concept can.
Beyond Nouns: Exploiting Prepositions and Comparative Adjectives for Learning Visual Classifiers
This work simultaneously learns the visual features defining "nouns" and the differential visual features defining such "binary relationships" using an EM-based approach, and constrains the correspondence problem by exploiting additional language constructs to improve the learning process from weakly labeled data.
Describing objects by their attributes
This paper proposes to shift the goal of recognition from naming to describing, and introduces a novel feature selection method for learning attributes that generalize well across categories.
OPTIMOL: Automatic Online Picture Collection via Incremental Model Learning
  • Li-Jia Li, Li Fei-Fei · 2007 IEEE Conference on Computer Vision and Pattern Recognition, 2007
This paper presents a novel object recognition algorithm that performs automatic dataset collecting and incremental model learning simultaneously, and adapts a non-parametric latent topic model and proposes an incremental learning framework.
Learning to detect unseen object classes by between-class attribute transfer
The experiments show that by using an attribute layer it is indeed possible to build an object detection system that requires no training images of the target classes; the authors also assemble a new large-scale dataset, "Animals with Attributes", of over 30,000 animal images matching the 50 classes in Osherson's classic table of how strongly humans associate 85 semantic attributes with animal classes.
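The attribute-layer idea summarized above can be sketched as direct attribute prediction: per-attribute probabilities estimated from an image are combined under each unseen class's binary attribute signature, so no training images of the target classes are needed. The classes, attributes, and probability values below are invented for illustration and do not come from the cited paper.

```python
# Toy sketch of attribute-based zero-shot classification: combine independent
# per-attribute predictions under each class's binary attribute signature.
import numpy as np

# Binary attribute signatures for classes never seen at training time.
# Attribute order: has_stripes, has_hooves, is_aquatic (all hypothetical).
signatures = {
    "zebra":   np.array([1, 1, 0]),
    "dolphin": np.array([0, 0, 1]),
}

def signature_score(attr_prob: np.ndarray, signature: np.ndarray) -> float:
    """Probability of the signature under independent attribute predictions."""
    p = np.where(signature == 1, attr_prob, 1.0 - attr_prob)
    return float(np.prod(p))

def classify(attr_prob: np.ndarray) -> str:
    """Pick the unseen class whose signature best explains the predictions."""
    return max(signatures, key=lambda c: signature_score(attr_prob, signatures[c]))

# Per-attribute probabilities from (hypothetical) attribute classifiers.
print(classify(np.array([0.9, 0.8, 0.1])))  # → zebra
```

The attribute classifiers themselves are trained only on seen classes; the signatures transfer that knowledge to the unseen ones.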
Learning realistic human actions from movies
A new method for video classification that builds upon and extends several recent ideas, including local space-time features, space-time pyramids, and multi-channel non-linear SVMs, is presented and shown to improve state-of-the-art results on the standard KTH action dataset.
Learning Color Names from Real-World Images
Experimental results show that color names learned from real-world images significantly outperform color names learned from labelled color chips on retrieval and classification.
Harvesting Image Databases from the Web
A multi-modal approach employing text, metadata, and visual features is used to gather many high-quality images from the Web, automatically generating a large number of images for a specified object class.
Caltech-256 Object Category Dataset
A challenging set of 256 object categories containing a total of 30607 images is introduced and the clutter category is used to train an interest detector which rejects uninformative background regions.