nocaps: novel object captioning at scale

Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, and Peter Anderson. International Conference on Computer Vision.
Image captioning models have achieved impressive results on datasets containing limited visual concepts and large amounts of paired image-caption training data. However, if these models are to ever function in the wild, a much larger variety of visual concepts must be learned, ideally from less supervision. To encourage the development of image captioning models that can learn visual concepts from alternative data sources, such as object detection datasets, we present the first large-scale… 

Partially-supervised novel object captioning leveraging context from paired data

Partially-Supervised Novel Object Captioning is agnostic to model architecture; it focuses on a training approach that combines existing fully paired image-caption data with images carrying only novel-object detection labels (partially paired data).

VIVO: Surpassing Human Performance in Novel Object Captioning with Visual Vocabulary Pre-Training

The results show that the VIsual VOcabulary pre-training model can not only generate fluent image captions that describe novel objects, but also identify the locations of these objects.

Learning to Select: A Fully Attentive Approach for Novel Object Captioning

This paper presents a novel approach for NOC that learns to select the most relevant objects of an image, regardless of their adherence to the training set, and to constrain the generative process of a language model accordingly.

Leveraging Human Attention in Novel Object Captioning

The Attention-based Novel Object Captioner (ANOC) is presented, which introduces a gating mechanism that adaptively combines human attention with self-learned machine attention, together with a Constrained Self-Critical Sequence Training method that addresses exposure bias while maintaining the constraints of novel object descriptions.

Switchable Novel Object Captioner

This paper introduces the zero-shot novel object captioning task, where the machine generates descriptions about novel objects without extra training sentences, and proposes a Switchable LSTM that incorporates knowledge from the object memory into sentence generation.

Self-Distillation for Few-Shot Image Captioning

An ensemble-based self-distillation method is proposed that allows image captioning models to be trained with unpaired images and captions, along with a simple yet effective pseudo-feature generation method based on gradient descent.

Image-Caption Pair Replacement Algorithm towards Semi-supervised Novel Object Captioning

  • Yang Yang
  • 2022 7th International Conference on Intelligent Computing and Signal Processing (ICSP), 2022
A bounding-box scaling algorithm is proposed to address the strict resolution and aspect-ratio replacement condition between novel objects and source objects, together with a two-stage semantic graph structure that reduces phrase-collocation errors in context by relying on co-occurring semantic adjacency associations.

A Baseline for Detecting Out-of-Distribution Examples in Image Captioning

This work analyzes the effectiveness of the caption's likelihood score at detecting and rejecting OOD images, which implies that the relatedness between the input image and the generated caption is encapsulated within the score.

Captioning Images with Diverse Objects

The Novel Object Captioner (NOC) is proposed, a deep visual semantic captioning model that can describe a large number of object categories not present in existing image-caption datasets, taking advantage of external sources, labeled images from object recognition datasets, and semantic knowledge extracted from unannotated text.

Partially-Supervised Image Captioning

This work proposes a novel algorithm for training sequence models, such as recurrent neural networks, on partially-specified sequences which it represents using finite state automata and shows that it can train a captioning model to describe new visual concepts from the Open Images dataset while maintaining competitive COCO evaluation scores.

Deep Compositional Captioning: Describing Novel Object Categories without Paired Training Data

The Deep Compositional Captioner (DCC) is proposed to generate descriptions of novel objects that are not present in paired image-sentence datasets, by leveraging large object recognition datasets and external text corpora and by transferring knowledge between semantically similar concepts.

Guided Open Vocabulary Image Captioning with Constrained Beam Search

This work uses constrained beam search to force the inclusion of selected tag words in the output, together with fixed, pretrained word embeddings that facilitate vocabulary expansion to previously unseen tag words, achieving state-of-the-art results for out-of-domain captioning on MSCOCO (and improved results for in-domain captioning).
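A minimal sketch of the forced-inclusion idea, under stated assumptions: the scoring model, token strings, and names such as `step_fn` are all illustrative, not the paper's actual interface. Hypotheses are grouped into separate beams keyed by which required words they already contain (the states of a small finite-state machine), so a caption containing every tag word survives pruning:

```python
import math

def constrained_beam_search(step_fn, bos, eos, required, beam_size=2, max_len=6):
    """Toy constrained beam search: one beam bank per subset of satisfied
    constraint words, so at least one finished caption includes every
    required (tag) word. step_fn(prefix) -> {next_token: log_probability}."""
    required = frozenset(required)
    banks = {frozenset(): [(0.0, [bos])]}   # satisfied-subset -> beam
    finished = []
    for _ in range(max_len):
        new_banks = {}
        for satisfied, beam in banks.items():
            for logp, seq in beam:
                for tok, tok_logp in step_fn(tuple(seq)).items():
                    new_sat = satisfied | ({tok} & required)
                    cand = (logp + tok_logp, seq + [tok])
                    if tok == eos:
                        if new_sat == required:  # only accept complete captions
                            finished.append(cand)
                        continue
                    new_banks.setdefault(new_sat, []).append(cand)
        # prune each bank independently so constrained hypotheses survive
        banks = {s: sorted(b, reverse=True)[:beam_size]
                 for s, b in new_banks.items()}
    return max(finished)[1] if finished else None

# Demo: a fixed unigram "language model" that prefers to stop early.
VOCAB_LOGPROBS = {t: math.log(p) for t, p in
                  {"dog": 0.5, "zebra": 0.1, "<eos>": 0.4}.items()}

def step_fn(prefix):
    return VOCAB_LOGPROBS

caption = constrained_beam_search(step_fn, "<s>", "<eos>", {"zebra"})
# the forced tag word "zebra" appears despite its low probability
```

Unconstrained search would end the caption immediately; banking hypotheses by satisfied constraints is what keeps the low-probability tag word alive until a complete, constraint-satisfying caption is found.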

Decoupled Novel Object Captioner

The Decoupled Novel Object Captioner (DNOC) framework is proposed, which fully decouples the language sequence model from the object descriptions; experimental results on the held-out MSCOCO dataset demonstrate DNOC's ability to describe novel concepts.

Incorporating Copying Mechanism in Image Captioning for Learning Novel Objects

A new architecture is presented that incorporates a copying mechanism into the convolutional neural network (CNN) plus recurrent neural network (RNN) image captioning framework for describing novel objects in captions, with superior results reported compared to state-of-the-art deep models.

Rich Image Captioning in the Wild

An image caption system is presented that addresses the new challenges of automatically describing images in the wild by developing a deep vision model that detects a broad range of visual concepts, an entity recognition model that identifies celebrities and landmarks, and a confidence model for the caption output.

Neural Baby Talk

A novel framework for image captioning is introduced that produces natural language explicitly grounded in entities found by object detectors in the image, reaching state-of-the-art performance on both the COCO and Flickr30k datasets.

Self-Critical Sequence Training for Image Captioning

Inspired by the encoder/decoder paradigm recently introduced for machine translation using recurrent neural networks, three systems are described that use a deep convolutional neural network (CNN) to encode the input image and an LSTM RNN decoder to generate the output caption.
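The self-critical training rule named in the title can be sketched in a few lines: the reward of the model's own greedy (test-time) caption serves as the baseline for a sampled caption. Here a toy unigram-recall reward stands in for the CIDEr metric the paper actually optimizes, and all names are illustrative:

```python
def scst_loss(sample_logprobs, sampled, greedy, reference, reward_fn):
    """Self-critical sequence training (sketch): only sampled captions that
    beat the model's own greedy output receive a positive learning signal."""
    advantage = reward_fn(sampled, reference) - reward_fn(greedy, reference)
    # REINFORCE surrogate: minimizing this raises the probability of sampled
    # captions with positive advantage (and lowers it otherwise).
    return -advantage * sum(sample_logprobs)

def unigram_recall(candidate, reference):
    """Stand-in reward (the paper optimizes CIDEr): fraction of reference
    words that appear in the candidate caption."""
    return sum(w in candidate for w in reference) / len(reference)

# The sampled caption matches the reference better than the greedy one, so
# its advantage is positive and its log-probabilities get reinforced.
loss = scst_loss([-1.0, -1.0, -1.0],
                 sampled=["a", "dog", "runs"],
                 greedy=["a", "cat"],
                 reference=["a", "dog", "runs"],
                 reward_fn=unigram_recall)
```

Using the model's own greedy decode as the baseline requires no learned value network and directly normalizes the reward against the caption the system would produce at test time.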

Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning

We present a new dataset of image caption annotations, Conceptual Captions, which contains an order of magnitude more images than the MS-COCO dataset (Lin et al., 2014) and represents a wider variety…