nocaps: novel object captioning at scale

@inproceedings{Agrawal2019nocapsNO,
  title={nocaps: novel object captioning at scale},
  author={Harsh Agrawal and Karan Desai and Yufei Wang and Xinlei Chen and Rishabh Jain and Mark Johnson and Dhruv Batra and Devi Parikh and Stefan Lee and Peter Anderson},
  booktitle={IEEE/CVF International Conference on Computer Vision (ICCV)},
  year={2019},
  pages={8947-8956}
}
Image captioning models have achieved impressive results on datasets containing limited visual concepts and large amounts of paired image-caption training data. However, if these models are to ever function in the wild, a much larger variety of visual concepts must be learned, ideally from less supervision. To encourage the development of image captioning models that can learn visual concepts from alternative data sources, such as object detection datasets, we present the first large-scale… 
Partially-supervised novel object captioning leveraging context from paired data
TLDR
Partially-Supervised Novel Object Captioning is agnostic to model architecture; it focuses on a training approach that combines existing fully paired image-caption data with images carrying only object detection labels for the novel objects (partially paired data).
Learning to Select: A Fully Attentive Approach for Novel Object Captioning
TLDR
This paper presents a novel approach for NOC that learns to select the most relevant objects of an image, regardless of their adherence to the training set, and to constrain the generative process of a language model accordingly.
VIVO: Visual Vocabulary Pre-Training for Novel Object Captioning
TLDR
The results show that the Visual Vocabulary pre-training (VIVO) model can not only generate fluent image captions that describe novel objects but also identify the locations of these objects.
Leveraging Human Attention in Novel Object Captioning
TLDR
The Attention-based Novel Object Captioner (ANOC) is presented, which introduces a gating mechanism that adaptively combines human attention with self-learned machine attention, together with a Constrained Self-Critical Sequence Training method that addresses exposure bias while maintaining the constraints required for novel object descriptions.
Self-Distillation for Few-Shot Image Captioning
TLDR
An ensemble-based self-distillation method is proposed that allows image captioning models to be trained with unpaired images and captions, along with a simple yet effective pseudo-feature generation method based on gradient descent.
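The pseudo-feature idea lends itself to a compact sketch: freeze a captioner's scoring function and recover a visual feature by gradient ascent on a caption's likelihood. The `CaptionScorer` below is a hypothetical stand-in for that scoring function, not the paper's model, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a frozen captioner's scoring head: maps a visual
# feature to the log-likelihood of one fixed target caption.
class CaptionScorer(nn.Module):
    def __init__(self, feat_dim=2048):
        super().__init__()
        self.head = nn.Linear(feat_dim, 1)

    def forward(self, feature):
        return self.head(feature).squeeze(-1)  # higher = caption more likely

scorer = CaptionScorer()
for p in scorer.parameters():
    p.requires_grad_(False)  # the captioner stays frozen

# Pseudo feature: start from noise, ascend the caption's likelihood.
feature = torch.randn(1, 2048, requires_grad=True)
opt = torch.optim.SGD([feature], lr=0.1)
for _ in range(100):
    opt.zero_grad()
    loss = -scorer(feature).mean()  # maximize caption log-likelihood
    loss.backward()
    opt.step()
# `feature` can now serve as a pseudo visual feature paired with the caption.
```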
Switchable Novel Object Captioner.
TLDR
This paper introduces the zero-shot novel object captioning task, where the machine generates descriptions about novel objects without extra training sentences, and proposes a Switchable LSTM that incorporates knowledge from the object memory into sentence generation.
A Baseline for Detecting Out-of-Distribution Examples in Image Captioning
TLDR
The effectiveness of the caption's likelihood score at detecting and rejecting OOD images is analyzed, implying that the relatedness between the input image and the generated caption is encapsulated within the score.
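A minimal sketch of such a likelihood-based rejection rule, assuming the decoder exposes the per-token log-probabilities of its generated caption; the threshold and the numbers below are purely illustrative.

```python
def caption_ood_score(token_logprobs):
    """Length-normalized log-likelihood of the generated caption; lower
    scores suggest the input image is out-of-distribution."""
    return sum(token_logprobs) / len(token_logprobs)

def is_ood(token_logprobs, threshold=-2.5):
    return caption_ood_score(token_logprobs) < threshold

# Illustrative per-token log-probs from a hypothetical decoder.
in_dist = [-0.3, -0.8, -0.5, -1.1, -0.4]  # mean ~ -0.62 -> keep
ood = [-2.9, -3.4, -4.1, -2.7, -3.8]      # mean ~ -3.38 -> reject
print(is_ood(in_dist), is_ood(ood))       # False True
```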
Automated testing of image captioning systems
TLDR
MetaIC, the first metamorphic testing approach for validating IC systems, is proposed; its core idea is that the object names in a caption should exhibit directional changes after object insertion, and the approach can be further generalized to detect label errors in the training dataset.
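The metamorphic relation itself reduces to a small check, sketched below under strong simplifying assumptions: the captions before and after object insertion are already given, and object-name extraction is reduced to a keyword lookup rather than real NLP.

```python
def object_names(caption, vocabulary):
    """Toy extractor: intersect caption tokens with a known object vocabulary."""
    return {w for w in caption.lower().split() if w in vocabulary}

def violates_relation(before, after, inserted, vocabulary):
    """After pasting `inserted` into the image, its name should appear in the
    new caption and the originally mentioned objects should be preserved."""
    objs_before = object_names(before, vocabulary)
    objs_after = object_names(after, vocabulary)
    return not (inserted in objs_after and objs_before <= objs_after)

vocab = {"dog", "frisbee", "cat", "grass"}
print(violates_relation("a dog catching a frisbee",
                        "a dog and a cat on grass",
                        "cat", vocab))  # True: "frisbee" vanished from the caption
```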
ClipCap: CLIP Prefix for Image Captioning
TLDR
This paper uses the CLIP encoding as a prefix to the caption by employing a simple mapping network, and then fine-tunes a language model to generate the image captions, allowing a lighter architecture with fewer trainable parameters.
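A sketch of the prefix idea under assumed dimensions (a 512-d CLIP embedding, a 768-d GPT-2 token space, prefix length 10); ClipCap's actual mapping network is an MLP or transformer trained end-to-end, and this stand-in only shows the shape of the computation.

```python
import torch
import torch.nn as nn

class PrefixMapper(nn.Module):
    """Maps one CLIP image embedding to `prefix_len` pseudo word embeddings."""
    def __init__(self, clip_dim=512, lm_dim=768, prefix_len=10):
        super().__init__()
        self.prefix_len, self.lm_dim = prefix_len, lm_dim
        hidden = lm_dim * prefix_len // 2
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, lm_dim * prefix_len),
        )

    def forward(self, clip_embed):                       # (batch, clip_dim)
        out = self.mlp(clip_embed)                       # (batch, lm_dim * prefix_len)
        return out.view(-1, self.prefix_len, self.lm_dim)

mapper = PrefixMapper()
prefix = mapper(torch.randn(2, 512))
print(prefix.shape)  # torch.Size([2, 10, 768]); prepended to caption token embeddings
```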

References

Showing 1-10 of 60 references
Captioning Images with Diverse Objects
TLDR
The Novel Object Captioner (NOC) is proposed, a deep visual semantic captioning model that can describe a large number of object categories not present in existing image-caption datasets by taking advantage of external sources: labeled images from object recognition datasets and semantic knowledge extracted from unannotated text.
Partially-Supervised Image Captioning
TLDR
This work proposes a novel algorithm for training sequence models, such as recurrent neural networks, on partially-specified sequences which it represents using finite state automata and shows that it can train a captioning model to describe new visual concepts from the Open Images dataset while maintaining competitive COCO evaluation scores.
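A toy illustration of the finite-state-automaton view of a partially-specified caption, assuming the constraint is simply "the caption mentions a given word"; the real construction handles richer constraints, but the recognizer below captures the core representation.

```python
# Minimal FSA for the partial specification "the caption mentions 'zebra'":
# state 0 = not yet seen, state 1 = seen (accepting). The set of captions the
# FSA accepts is exactly the partially-specified sequence.
def make_contains_fsa(word):
    def step(state, token):
        return 1 if state == 1 or token == word else 0
    return step, 0, {1}  # (transition fn, start state, accepting states)

step, start, accepting = make_contains_fsa("zebra")
state = start
for tok in "a zebra grazing in a field".split():
    state = step(state, tok)
print(state in accepting)  # True: this caption satisfies the constraint
```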
Deep Compositional Captioning: Describing Novel Object Categories without Paired Training Data
TLDR
The Deep Compositional Captioner (DCC) is proposed to address the task of generating descriptions of novel objects that are not present in paired image-sentence datasets, by leveraging large object recognition datasets and external text corpora and by transferring knowledge between semantically similar concepts.
Guided Open Vocabulary Image Captioning with Constrained Beam Search
TLDR
This work uses constrained beam search to force the inclusion of selected tag words in the output, together with fixed, pretrained word embeddings that facilitate vocabulary expansion to previously unseen tag words, achieving state-of-the-art results for out-of-domain captioning on MSCOCO (and improved results for in-domain captioning).
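A minimal sketch of the constraint bookkeeping in constrained beam search: beams are grouped by which required tag words they have emitted so far (the states of the underlying finite-state machine), and a hypothesis may only terminate once all constraints are satisfied. The scoring function, vocabulary, and beam size here are toy placeholders, not the paper's decoder.

```python
import math

def constrained_beam_search(logprob_fn, required, eos, beam_size=2, max_len=6):
    """logprob_fn(prefix) -> {token: logprob}; every token in `required`
    must appear in the output before `eos` is allowed."""
    beams = {frozenset(): [(0.0, [])]}  # constraint state -> [(score, tokens)]
    finished = []
    for _ in range(max_len):
        new_beams = {}
        for state, pool in beams.items():
            for score, toks in pool:
                for tok, lp in logprob_fn(toks).items():
                    nstate = state | ({tok} & required)
                    cand = (score + lp, toks + [tok])
                    if tok == eos:
                        if nstate == required:  # may only finish when satisfied
                            finished.append(cand)
                    else:
                        new_beams.setdefault(nstate, []).append(cand)
        # keep the best `beam_size` hypotheses per constraint state
        beams = {s: sorted(p, reverse=True)[:beam_size]
                 for s, p in new_beams.items()}
    return max(finished) if finished else None

# Toy uniform scorer over a 4-word vocabulary; "zebra" is forced into the output.
vocab = ["a", "zebra", "grazing", "<eos>"]
uniform = lambda prefix: {w: math.log(1 / len(vocab)) for w in vocab}
print(constrained_beam_search(uniform, required=frozenset({"zebra"}), eos="<eos>"))
```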
Decoupled Novel Object Captioner
TLDR
The Decoupled Novel Object Captioner (DNOC) framework is proposed, which can fully decouple the language sequence model from the object descriptions; experimental results on the held-out MSCOCO dataset demonstrate the ability of DNOC to describe novel concepts.
Incorporating Copying Mechanism in Image Captioning for Learning Novel Objects
TLDR
A new architecture is presented that incorporates copying into the convolutional neural network plus recurrent neural network (CNN+RNN) image captioning framework for describing novel objects in captions; superior results are reported when compared to state-of-the-art deep models.
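A sketch of the copying gate in isolation, assuming the decoder already yields a generation distribution over the vocabulary and a copy distribution over detected object labels scattered into the same vocabulary; the gate merely mixes the two. All names and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class CopyGate(nn.Module):
    """Mixes a generation distribution with a copy distribution over detected
    object words: p(w) = g * p_copy(w) + (1 - g) * p_gen(w)."""
    def __init__(self, hidden_dim=512):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, 1)

    def forward(self, hidden, p_gen, p_copy):
        g = torch.sigmoid(self.gate(hidden))  # (batch, 1): copy probability
        return g * p_copy + (1.0 - g) * p_gen  # (batch, vocab)

gate = CopyGate()
hidden = torch.randn(2, 512)                         # decoder hidden state
p_gen = torch.softmax(torch.randn(2, 100), dim=-1)   # vocab distribution
p_copy = torch.softmax(torch.randn(2, 100), dim=-1)  # detected-label pointer,
                                                     # scattered into the vocab
mixed = gate(hidden, p_gen, p_copy)
print(mixed.sum(dim=-1))  # each row still sums to 1
```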
Rich Image Captioning in the Wild
TLDR
An image captioning system is presented that addresses the new challenges of automatically describing images in the wild by developing a deep vision model that detects a broad range of visual concepts, an entity recognition model that identifies celebrities and landmarks, and a confidence model for the caption output.
Self-Critical Sequence Training for Image Captioning
TLDR
Self-critical sequence training (SCST), a form of the REINFORCE algorithm, is proposed for optimizing captioning models directly on non-differentiable test metrics such as CIDEr, using the reward of the model's own test-time (greedy) inference output as the baseline rather than estimating a separate reward normalizer.
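A minimal sketch of the self-critical policy-gradient step under assumed shapes; the log-probabilities and reward values below are placeholders where the decoder's sampled tokens and a CIDEr scorer would plug in.

```python
import torch

def scst_loss(sample_logprobs, sample_reward, greedy_reward):
    """Self-critical REINFORCE step: the greedy caption's reward is the baseline.
    sample_logprobs: (batch, seq_len) log-probs of the sampled caption's tokens.
    sample_reward, greedy_reward: (batch,) sentence rewards, e.g. CIDEr scores."""
    advantage = (sample_reward - greedy_reward).unsqueeze(1)  # (batch, 1)
    return -(advantage * sample_logprobs).sum(dim=1).mean()

# Illustrative shapes only; real rewards come from a metric scorer.
raw = torch.randn(4, 12, requires_grad=True)  # stand-in for decoder scores
logprobs = raw.log_softmax(dim=-1)            # placeholder per-token log-probs
loss = scst_loss(logprobs,
                 torch.tensor([0.9, 0.4, 0.7, 0.2]),  # sampled-caption rewards
                 torch.tensor([0.6, 0.6, 0.6, 0.6]))  # greedy-caption rewards
loss.backward()  # gradient pushes up samples that beat the greedy baseline
```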
Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning
We present a new dataset of image caption annotations, Conceptual Captions, which contains an order of magnitude more images than the MS-COCO dataset (Lin et al., 2014) and represents a wider variety of both images and image caption styles.
From captions to visual concepts and back
TLDR
This paper uses multiple instance learning to train visual detectors for words that commonly occur in captions, including many different parts of speech such as nouns, verbs, and adjectives, and develops a maximum-entropy language model.