Learning Visual N-Grams from Web Data

  • Ang Li, A. Jabri, Armand Joulin, Laurens van der Maaten
  • Published 29 December 2016
  • Computer Science
  • 2017 IEEE International Conference on Computer Vision (ICCV)
Real-world image recognition systems need to recognize tens of thousands of classes that constitute a plethora of visual concepts. The traditional approach of annotating thousands of images per class for training is infeasible in such a scenario, prompting the use of webly supervised data. This paper explores the training of image-recognition systems on large numbers of images and associated user comments, without using manually labeled images. In particular, we develop visual n-gram models… 

Learning Transferable Visual Models From Natural Language Supervision

It is demonstrated that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.

Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision

A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss, and it is shown that the scale of the corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.

Webly Supervised Joint Embedding for Cross-Modal Image-Text Retrieval

This work proposes a two-stage approach for the task that can augment a typical supervised pair-wise ranking loss based formulation with weakly-annotated web images to learn a more robust visual-semantic embedding.

A Fistful of Words: Learning Transferable Visual Models from Bag-of-Words Supervision

This paper shows that a simple Bag-of-Words (BoW) caption could be used as a replacement for most of the image captions in the dataset and observes that this approach improves the zero-shot classification performance when combined with word balancing.

Exploring the Limits of Weakly Supervised Pretraining

This paper presents a unique study of transfer learning with large convolutional networks trained to predict hashtags on billions of social media images and shows improvements on several image classification and object detection tasks, and reports the highest ImageNet-1k single-crop, top-1 accuracy to date.

VirTex: Learning Visual Representations from Textual Annotations

VirTex is proposed – a pretraining approach using semantically dense captions to learn visual representations that match or exceed those learned on ImageNet – supervised or unsupervised – despite using up to ten times fewer images.

Data Efficient Language-supervised Zero-shot Recognition with Optimal Transport Distillation

OTTER (Optimal TransporT distillation for Efficient zero-shot Recognition), which uses online entropic optimal transport to find a soft image-text match as labels for contrastive learning and achieves strong performance with only 3M image text pairs.

Learning Visual Representations with Caption Annotations

It is argued that captioned images are easily crawlable and can be exploited to supervise the training of visual representations. The proposed hybrid models, with dedicated visual and textual encoders, show that the visual representations learned as a by-product of solving this task transfer well to a variety of target tasks.

Grounding natural language phrases in images and video

This dissertation introduces a new dataset which provides ground-truth annotations of the location of noun phrase chunks in image captions, and introduces a model which learns a set of models, each of which captures a different concept that is useful in the task.

Learning Video Representations from Textual Web Supervision

This work proposes a data collection process and uses it to collect 70M video clips, and trains a model to pair each video with its associated text, which leads to improvements over from-scratch training on all benchmarks and outperforms many methods for self-supervised and webly supervised video representation learning.



Learning Visual Features from Large Weakly Supervised Data

This paper trains convolutional networks on a dataset of 100 million Flickr photos and comments, and shows that these networks produce features that perform well in a range of vision problems.

Harvesting Mid-level Visual Concepts from Large-Scale Internet Images

This paper proposes a fully automatic algorithm which harvests visual concepts from a large number of Internet images using text-based queries, and shows significant improvement over the competing systems in image classification, including those with strong supervision.

DeViSE: A Deep Visual-Semantic Embedding Model

This paper presents a new deep visual-semantic embedding model trained to identify visual objects using both labeled image data as well as semantic information gleaned from unannotated text and shows that the semantic information can be exploited to make predictions about tens of thousands of image labels not observed during training.

Phrase-based Image Captioning

This paper presents a simple model that is able to generate descriptive sentences given a sample image and proposes a simple language model that can produce relevant descriptions for a given test image using the phrases inferred.

Learning Object Categories From Internet Image Searches

A simple approach to learning models of visual object categories from images gathered from Internet image search engines, derived from the probabilistic latent semantic analysis technique for text document analysis, that can be used to automatically learn object models from these data.

From captions to visual concepts and back

This paper uses multiple instance learning to train visual detectors for words that commonly occur in captions, including many different parts of speech such as nouns, verbs, and adjectives, and develops a maximum-entropy language model.

Learning from massive noisy labeled data for image classification

A general framework to train CNNs with only a limited number of clean labels and millions of easily obtained noisy labels is introduced. The relationships between images, class labels, and label noise are modeled with a probabilistic graphical model, which is further integrated into an end-to-end deep learning system.

Webly Supervised Learning of Convolutional Networks

  • Xinlei Chen, A. Gupta
  • Computer Science
  • 2015 IEEE International Conference on Computer Vision (ICCV)
  • 2015
This work uses easy images to train an initial visual representation and then adapts this initial CNN to harder, more realistic images by leveraging the structure of data and categories, and demonstrates the strength of webly supervised learning by localizing objects in web images and training an R-CNN style detector.

Keywords to visual categories: Multiple-instance learning for weakly supervised object categorization

This work proposes an unsupervised approach to construct discriminative models for categories specified simply by their names, and shows that multiple-instance learning enables the recovery of robust category models from images returned by keyword-based search engines.

Composing Simple Image Descriptions using Web-scale N-grams

A simple yet effective approach to automatically compose image descriptions given computer vision based inputs and using web-scale n-grams, which indicates that it is viable to generate simple textual descriptions that are pertinent to the specific content of an image, while permitting creativity in the description -- making for more human-like annotations than previous approaches.