Corpus ID: 233296711

CLIPScore: A Reference-free Evaluation Metric for Image Captioning

Authors: Jack Hessel, Ariel Holtzman, Maxwell Forbes, Ronan Joseph Le Bras, Yejin Choi
Image captioning has conventionally relied on reference-based automatic evaluation, in which machine captions are compared against captions written by humans. This contrasts with the reference-free manner in which humans assess caption quality. In this paper, we report the surprising empirical finding that CLIP (Radford et al., 2021), a cross-modal model pretrained on 400M image+caption pairs from the web, can be used for robust automatic evaluation of image captioning without the need for…
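The paper's metric is a rescaled, clipped cosine similarity between CLIP embeddings of the image and the candidate caption: CLIP-S(c, v) = w · max(cos(c, v), 0), with w = 2.5. A minimal sketch, assuming the CLIP embeddings have already been computed with some CLIP checkpoint (obtaining them is outside this snippet):

```python
import numpy as np

def clip_score(image_emb, caption_emb, w=2.5):
    """Reference-free CLIPScore: w * max(cos(image, caption), 0).

    `image_emb` and `caption_emb` are precomputed CLIP embeddings of the
    image and the candidate caption; w = 2.5 is the rescaling constant
    used in the paper."""
    image_emb = np.asarray(image_emb, dtype=float)
    caption_emb = np.asarray(caption_emb, dtype=float)
    cos = image_emb @ caption_emb / (
        np.linalg.norm(image_emb) * np.linalg.norm(caption_emb))
    # Negative cosine similarities are clipped to zero before rescaling.
    return w * max(cos, 0.0)

print(clip_score([1.0, 0.0], [1.0, 0.0]))  # identical embeddings -> 2.5
print(clip_score([1.0, 0.0], [0.0, 1.0]))  # orthogonal embeddings -> 0.0
```

No reference captions appear anywhere in the computation, which is what distinguishes this from n-gram metrics such as BLEU or CIDEr.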


From Show to Tell: A Survey on Image Captioning
This survey provides a comprehensive overview of image captioning approaches, from visual encoding and text generation to training strategies, datasets, and evaluation metrics, and quantitatively compares many relevant state-of-the-art approaches to identify the most impactful technical innovations in architectures and training strategies.
Towards Generating and Evaluating Iconographic Image Captions of Artworks
The overall results suggest that the model can generate meaningful captions with stronger relevance to the art-historical context, particularly in comparison to captions obtained from models trained only on natural image datasets.
ImaginE: An Imagination-Based Automatic Evaluation Metric for Natural Language Generation
Experiments spanning several text generation tasks demonstrate that ImaginE shows great potential for introducing multi-modal information into NLG evaluation, and in many circumstances improves existing automatic metrics' correlations with human similarity judgments.
Unifying Multimodal Transformer for Bi-directional Image and Text Generation
  • Yupan Huang, Hongwei Xue, Bei Liu, Yutong Lu, 2021
We study the joint learning of image-to-text and text-to-image generation, which are naturally bi-directional tasks. Typical existing works design two separate task-specific models for each task, …


TIGEr: Text-to-Image Grounding for Image Caption Evaluation
The metric's effectiveness in caption evaluation is comprehensively assessed by measuring the correlation between human judgments and metric scores; the empirical tests show that TIGEr is more consistent with human judgments than existing alternative metrics.
Show, Tell and Discriminate: Image Captioning by Self-retrieval with Partially Labeled Data
This work proposes an image captioning framework with a self-retrieval module as training guidance, which encourages generating discriminative captions. It demonstrates the effectiveness of the proposed retrieval-guided method on the COCO and Flickr30k captioning datasets and shows superior captioning performance with more discriminative captions.
UMIC: An Unreferenced Metric for Image Captioning via Contrastive Learning
UMIC, an Unreferenced Metric for Image Captioning that does not require reference captions to evaluate image captions, is introduced; it is built on Vision-and-Language BERT and trained to discriminate negative captions via contrastive learning.
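Training a scorer to rank a ground-truth caption above synthetic negatives, as described here, is typically done with a softmax cross-entropy over caption scores. A minimal sketch of such a contrastive objective (an illustrative InfoNCE-style loss, not UMIC's actual architecture or training recipe):

```python
import numpy as np

def contrastive_loss(pos_score, neg_scores):
    """InfoNCE-style loss: pushes the positive caption's score
    above the scores of the negative captions."""
    logits = np.concatenate(([pos_score], np.asarray(neg_scores, dtype=float)))
    logits -= logits.max()  # subtract max for numerical stability
    # Cross-entropy with the positive caption (index 0) as the target class.
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())

# When the positive already scores well above the negatives, the loss is small.
print(contrastive_loss(5.0, [0.5, -1.0]))
```

Minimizing this loss over many (image, true caption, corrupted captions) triples yields a scorer whose output can be used directly as a reference-free caption metric at test time.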
Learning to Evaluate Image Captioning
This work proposes a novel learning-based discriminative evaluation metric that is directly trained to distinguish between human- and machine-generated captions, together with a data augmentation scheme that explicitly incorporates pathological transformations as negative examples during training.
FAIEr: Fidelity and Adequacy Ensured Image Caption Evaluation
Evaluating different metrics on HM-MSCOCO shows that FAIEr is highly consistent with human judgment, demonstrating that the metric's evaluation strategy more accurately reflects human evaluation intentions.
Improving Image Captioning Evaluation by Considering Inter References Variance
A novel metric based on BERTScore is proposed that achieves high human correlation in system-level tasks; experimental results show that the metric achieves state-of-the-art correlation with human judgment.
Re-evaluating Automatic Metrics for Image Captioning
This paper provides an in-depth evaluation of the existing image captioning metrics through a series of carefully designed experiments, and explores the use of the recently proposed Word Mover's Distance document metric for image captioning evaluation.
SPICE: Semantic Propositional Image Caption Evaluation
There is considerable interest in the task of automatically generating image captions. However, evaluation is challenging. Existing automatic evaluation metrics are primarily sensitive to n-gram …
Good News, Everyone! Context Driven Entity-Aware Captioning for News Images
This work proposes a novel captioning method that leverages contextual information provided by the text of news articles associated with an image; it selectively draws information from the article guided by visual cues, and dynamically extends the output dictionary to out-of-vocabulary named entities that appear in the context source.
ViLBERTScore: Evaluating Image Caption Using Vision-and-Language BERT
Experimental results on three benchmark datasets show that the proposed evaluation metric correlates significantly better with human judgments than all existing metrics.