Quality Estimation for Image Captions Based on Large-scale Human Evaluations

Authors: Tomer Levinboim, Ashish V. Thapliyal, Piyush Sharma, Radu Soricut
Automatic image captioning has improved significantly over the last few years, but the problem is far from solved, with state-of-the-art models still often producing low-quality captions when used in the wild. In this paper, we focus on the task of Quality Estimation (QE) for image captions, which attempts to model caption quality from a human perspective and *without* access to ground-truth references, so that it can be applied at prediction time to detect low-quality captions…
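The abstract describes QE as scoring an (image, caption) pair without references so low-quality outputs can be filtered at prediction time. The following is a minimal illustrative sketch of that interface, not the paper's actual model: the function names, the logistic-regression scorer, and the feature representation are all hypothetical stand-ins.

```python
import math

# Hypothetical sketch of reference-free caption Quality Estimation (QE):
# a simple logistic-regression scorer over concatenated image and caption
# features predicts whether a human rater would accept the caption.
def caption_quality(image_features, caption_features, weights, bias):
    """Return a quality score in (0, 1) for an (image, caption) pair."""
    score = bias
    for w, x in zip(weights, image_features + caption_features):
        score += w * x
    return 1.0 / (1.0 + math.exp(-score))

def filter_captions(scored_captions, threshold=0.5):
    """Keep only captions whose estimated quality clears the threshold,
    mimicking prediction-time filtering of low-quality captions."""
    return [(cap, q) for cap, q in scored_captions if q >= threshold]
```

The key property QE requires is that `caption_quality` never consults ground-truth reference captions, so the same filter can run on unlabeled images in the wild.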


UMIC: An Unreferenced Metric for Image Captioning via Contrastive Learning
UMIC, an Unreferenced Metric for Image Captioning that evaluates captions without reference captions, is introduced; it is built on Vision-and-Language BERT and trained via contrastive learning to discriminate negative captions.
NoisyActions2M: A Multimedia Dataset for Video Understanding from Noisy Labels
This work creates a benchmark dataset consisting of around 2 million videos with associated user-generated annotations and other meta information, and utilizes the collected dataset for action classification and demonstrates its usefulness with existing small-scale annotated datasets, UCF101 and HMDB51.
SMURF: SeMantic and linguistic UndeRstanding Fusion for Caption Evaluation via Typicality Analysis
This work introduces “typicality”, a new formulation of evaluation rooted in information theory, which is uniquely suited for problems lacking a definite ground truth, and develops a novel semantic comparison, SPARCS, as well as referenceless fluency evaluation metrics.
Cross-modal Language Generation using Pivot Stabilization for Web-scale Language Coverage
An approach called Pivot-Language Generation Stabilization (PLuGS) is presented, which leverages at training time both existing English annotations and their machine-translated versions (silver data); at run time, it generates an English caption first and then a corresponding target-language caption.
Image-text discourse coherence relation discoveries on multi-image and multi-text documents
Occurrences of image-text pairs are common in social media, such as image captions, image annotations, and cooking instructions. However, the links between images and texts as well as the …
Achieving Common Ground in Multi-modal Dialogue
This tutorial highlights a number of achievements of recent computational research in coordinating complex content, shows how these results lead to rich and challenging opportunities for doing grounding in more flexible and powerful ways, and canvasses relevant insights from the literature on human–human conversation.
Reinforcing an Image Caption Generator Using Off-Line Human Feedback
The empirical evidence indicates that the proposed policy gradient method learns to generalize the human raters' judgments to a previously unseen set of images, as judged by a different set of human judges, and additionally under a different, multi-dimensional side-by-side human evaluation procedure.


Learning to Evaluate Image Captioning
This work proposes a novel learning-based discriminative evaluation metric that is directly trained to distinguish between human and machine-generated captions, and proposes a data augmentation scheme to explicitly incorporate pathological transformations as negative examples during training.
SPICE: Semantic Propositional Image Caption Evaluation
There is considerable interest in the task of automatically generating image captions. However, evaluation is challenging. Existing automatic evaluation metrics are primarily sensitive to n-gram …
Re-evaluating Automatic Metrics for Image Captioning
This paper provides an in-depth evaluation of the existing image captioning metrics through a series of carefully designed experiments and explores the utilization of the recently proposed Word Mover’s Distance document metric for the purpose of image captioning.
Show and Tell: Lessons Learned from the 2015 MSCOCO Image Captioning Challenge
A generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image is presented.
CIDEr: Consensus-based image description evaluation
A novel paradigm for evaluating image descriptions that uses human consensus is proposed, and a new automated metric is evaluated that captures human judgment of consensus better than existing metrics across sentences generated by various sources.
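The CIDEr entry above describes a consensus-based metric. As a rough illustration of the idea (and only that — this simplification omits CIDEr's multi-n-gram averaging, stemming, and Gaussian length penalty, and the function names are invented), a consensus score can be sketched as the cosine similarity between TF-IDF weighted n-gram vectors of a candidate and its references, averaged over references:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def tfidf_vector(tokens, n, doc_freq, num_docs):
    """Map each n-gram to term frequency weighted by a smoothed IDF."""
    counts = ngrams(tokens, n)
    total = sum(counts.values()) or 1
    return {g: (c / total) * math.log((1 + num_docs) / (1 + doc_freq.get(g, 0)))
            for g, c in counts.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(u[g] * v.get(g, 0.0) for g in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def consensus_score(candidate, references, n=1, doc_freq=None, num_docs=1):
    """Average TF-IDF n-gram similarity of a candidate to its references."""
    doc_freq = doc_freq or {}
    cand = tfidf_vector(candidate.split(), n, doc_freq, num_docs)
    sims = [cosine(cand, tfidf_vector(r.split(), n, doc_freq, num_docs))
            for r in references]
    return sum(sims) / len(sims)
```

The IDF down-weighting is what makes this a consensus measure: n-grams that appear across many reference sets contribute little, so agreement on distinctive content dominates the score.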
Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning
We present a new dataset of image caption annotations, Conceptual Captions, which contains an order of magnitude more images than the MS-COCO dataset (Lin et al., 2014) and represents a wider variety …
Decoupled Box Proposal and Featurization with Ultrafine-Grained Semantic Labels Improve Image Captioning and Visual Question Answering
This paper examines the effect of decoupling box proposal and featurization for downstream tasks and demonstrates that this leads to effective transfer learning and improved image captioning and visual question answering models, as measured on publicly available benchmarks.
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
A combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions is proposed, demonstrating the broad applicability of this approach to VQA.
VIFIDEL: Evaluating the Visual Fidelity of Image Descriptions
A novel image-aware metric, VIFIDEL, is proposed that estimates the faithfulness of a generated caption with respect to the content of the actual image, based on the semantic similarity between labels of objects depicted in the image and words in the description.
Informative Image Captioning with External Sources of Information
This work introduces a multimodal, multi-encoder model based on Transformer that ingests both image features and multiple sources of entity labels and demonstrates that it can learn to control the appearance of these entity labels in the output, resulting in captions that are both fluent and informative.