TIGEr: Text-to-Image Grounding for Image Caption Evaluation

@inproceedings{Jiang2019TIGErTG,
  title={TIGEr: Text-to-Image Grounding for Image Caption Evaluation},
  author={Ming Jiang and Qiuyuan Huang and Lei Zhang and Xin Wang and Pengchuan Zhang and Zhe Gan and Jana Diesner and Jianfeng Gao},
  booktitle={EMNLP},
  year={2019}
}
This paper presents a new metric called TIGEr for the automatic evaluation of image captioning systems. Popular metrics, such as BLEU and CIDEr, are based solely on text matching between reference captions and machine-generated captions, which can lead to biased evaluations because references may not fully cover the image content and natural language is inherently ambiguous. Building upon a machine-learned text-image grounding model, TIGEr makes it possible to evaluate caption quality not only based on …
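As a rough illustration only (a minimal sketch under stated assumptions, not the paper's exact formulation), a grounding-based score can compare where a candidate caption and the human reference captions attend in an image. The grounding model ground_fn below is a hypothetical stand-in for a learned text-image grounding model such as stacked cross attention.

# Illustrative sketch of a grounding-based caption score; not TIGEr's exact
# algorithm. ground_fn is assumed to map a caption to a normalized attention
# distribution over the image's regions.
from typing import Callable, Sequence
import numpy as np

def grounding_score(candidate: str,
                    references: Sequence[str],
                    ground_fn: Callable[[str], np.ndarray]) -> float:
    # Average cosine similarity between the candidate's region-attention
    # distribution and the distributions induced by the references.
    cand = ground_fn(candidate)
    sims = []
    for ref in references:
        r = ground_fn(ref)
        sims.append(cand @ r / (np.linalg.norm(cand) * np.linalg.norm(r) + 1e-8))
    return float(np.mean(sims))

# Toy usage with a dummy grounding model over four image regions.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    def dummy_ground(caption: str) -> np.ndarray:
        w = rng.random(4)  # stand-in for learned attention weights
        return w / w.sum()
    print(grounding_score("a dog on the grass",
                          ["a brown dog runs on a lawn"], dummy_ground))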

Citations

CLIPScore: A Reference-free Evaluation Metric for Image Captioning
TLDR: Reports the surprising empirical finding that CLIP, a cross-modal model pretrained on 400M image+caption pairs from the web, can be used for robust automatic evaluation of image captioning without the need for references.
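CLIPScore is, roughly, a rescaled cosine similarity between CLIP's image and text embeddings (the paper reports using w * max(cos, 0) with w = 2.5). The sketch below uses the Hugging Face transformers CLIP wrappers and is an illustration under those assumptions, not the authors' reference implementation.

# Hedged sketch of a CLIP-based, reference-free caption score in the spirit
# of CLIPScore. The model checkpoint and the rescaling constant are assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_caption_score(image_path: str, caption: str, w: float = 2.5) -> float:
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    # Cosine similarity between the image and caption embeddings,
    # clipped at zero and rescaled, following the CLIPScore formulation.
    cos = torch.nn.functional.cosine_similarity(img_emb, txt_emb).item()
    return w * max(cos, 0.0)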
ViLBERTScore: Evaluating Image Caption Using Vision-and-Language BERT
TLDR: Experimental results on three benchmark datasets show that the proposed evaluation metric correlates significantly better with human judgments than all existing metrics.
WEmbSim: A Simple yet Effective Metric for Image Captioning
TLDR: This work proposes an effective metric, WEmbSim, which beats complex measures such as SPICE, CIDEr, and WMD at system-level correlation with human judgments and sets a new baseline against which any more complex metric must be justified.
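If, as its name suggests, WEmbSim scores a candidate by the cosine similarity between mean word embeddings of the candidate and the references (an assumption here, not a detail stated in the summary above), a minimal version looks like the sketch below; the embedding table emb is a hypothetical stand-in for pretrained vectors such as GloVe or word2vec.

# Hedged sketch of a mean-of-word-embeddings similarity score. The mechanism
# and the aggregation over references are assumptions for illustration.
import numpy as np

def mean_embedding(tokens, emb, dim=300):
    vecs = [emb[t] for t in tokens if t in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def wembsim_like(candidate, references, emb, dim=300):
    c = mean_embedding(candidate.lower().split(), emb, dim)
    scores = []
    for ref in references:
        r = mean_embedding(ref.lower().split(), emb, dim)
        denom = np.linalg.norm(c) * np.linalg.norm(r) + 1e-8
        scores.append(float(c @ r / denom))
    return max(scores)  # max over references; mean is another common choice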
Contrastive Semantic Similarity Learning for Image Captioning Evaluation with Intrinsic Auto-encoder
TLDR: This work proposes a learning-based metric for image captioning, called Intrinsic Image Captioning Evaluation (ICE), and develops three progressive model structures to learn sentence-level representations: a single-branch model, a dual-branch model, and a triple-branch model.
Can Audio Captions Be Evaluated with Image Caption Metrics?
  • Zelin Zhou, Zhiling Zhang, Xuenan Xu, Zeyu Xie, Mengyue Wu, Kenny Q. Zhu
  • Computer Science, Engineering
  • ArXiv
  • 2021
TLDR: A metric named FENSE is proposed, which combines the strength of Sentence-BERT in capturing similarity with a novel Error Detector that penalizes erroneous sentences for robustness, and which outperforms current metrics by 14-25% accuracy.
FAIEr: Fidelity and Adequacy Ensured Image Caption Evaluation
TLDR: Results of evaluating different metrics on HM-MSCOCO show the high consistency of FAIEr with human judgment, demonstrating that the metric's evaluation strategy can more accurately reveal human evaluation intentions.
SMURF: SeMantic and linguistic UndeRstanding Fusion for Caption Evaluation via Typicality Analysis
TLDR: This work introduces "typicality", a new formulation of evaluation rooted in information theory that is uniquely suited to problems lacking a definite ground truth, and develops a novel semantic comparison, SPARCS, as well as referenceless fluency evaluation metrics.
ImaginE: An Imagination-Based Automatic Evaluation Metric for Natural Language Generation
TLDR: Experiments spanning several text generation tasks demonstrate that adding machine-generated imagination with ImaginE shows great potential for introducing multi-modal information into NLG evaluation and improves existing automatic metrics' correlations with human similarity judgments in many circumstances.
From Show to Tell: A Survey on Image Captioning
TLDR: This work provides a comprehensive overview of image captioning approaches, from visual encoding and text generation to training strategies, datasets, and evaluation metrics, and quantitatively compares many relevant state-of-the-art approaches to identify the most impactful technical innovations in architectures and training strategies.
Evaluation of Text Generation: A Survey
TLDR: This paper surveys evaluation methods of natural language generation (NLG) systems that have been developed in the last few years, with a focus on the evaluation of recently proposed NLG tasks and neural NLG models.

References

Showing 1-10 of 37 references
SPICE: Semantic Propositional Image Caption Evaluation
There is considerable interest in the task of automatically generating image captions. However, evaluation is challenging. Existing automatic evaluation metrics are primarily sensitive to n-gram overlap…
Learning to Evaluate Image Captioning
TLDR: This work proposes a novel learning-based discriminative evaluation metric that is directly trained to distinguish between human- and machine-generated captions, along with a data augmentation scheme that explicitly incorporates pathological transformations as negative examples during training.
From captions to visual concepts and back
TLDR: This paper uses multiple instance learning to train visual detectors for words that commonly occur in captions, including many different parts of speech such as nouns, verbs, and adjectives, and develops a maximum-entropy language model.
Comparing Automatic Evaluation Measures for Image Description
TLDR: The main finding is that unigram BLEU has a weak correlation with human judgements, while Meteor has the strongest correlation.
Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics (Extended Abstract)
TLDR: This work proposes to frame sentence-based image annotation as the task of ranking a given pool of captions, and introduces a new benchmark collection, consisting of 8,000 images that are each paired with five different captions which provide clear descriptions of the salient entities and events.
CIDEr: Consensus-based image description evaluation
TLDR: A novel paradigm for evaluating image descriptions based on human consensus is proposed, along with a new automated metric that captures human judgment of consensus better than existing metrics across sentences generated by various sources.
Stacked Cross Attention for Image-Text Matching
TLDR: Stacked Cross Attention is proposed to discover the full latent alignments, using both image regions and words in a sentence as context to infer image-text similarity, and achieves state-of-the-art results on the MS-COCO and Flickr30K datasets.
ROUGE: A Package for Automatic Evaluation of Summaries
TLDR: Four different ROUGE measures are introduced: ROUGE-N, ROUGE-L, ROUGE-W, and ROUGE-S, which are included in the ROUGE summarization evaluation package, along with their evaluations.
Semantic Compositional Networks for Visual Captioning
  • Zhe Gan, Chuang Gan, +5 authors L. Deng
  • Computer Science
  • 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2017
TLDR: Experimental results show that the proposed method significantly outperforms prior state-of-the-art approaches across multiple evaluation metrics.
A learning approach to improving sentence-level MT evaluation
TLDR: A novel method that classifies translations as machine- or human-produced, rather than directly predicting numerical human judgments, eliminates the need for labor-intensive user studies as a source of training data and is shown to significantly improve upon current automatic metrics.