Learning to Generate Grounded Image Captions without Localization Supervision
@article{Ma2019LearningTG,
  title={Learning to Generate Grounded Image Captions without Localization Supervision},
  author={Chih-Yao Ma and Yannis Kalantidis and G. Al-Regib and P{\'e}ter Vajda and Marcus Rohrbach and Z. Kira},
  journal={ArXiv},
  year={2019},
  volume={abs/1906.00283}
}
When generating a sentence description for an image, it frequently remains unclear how well the generated caption is grounded in the image or if the model hallucinates based on priors in the dataset and/or the language model. [...] In this work, we propose a novel cyclical training regimen that forces the model to localize each word in the image after the sentence decoder generates it and then reconstruct the sentence from the localized image region(s) to match the ground-truth.
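The decode, localize, reconstruct cycle described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the dot-product localizer, the softmax reconstructor, the two-word vocabulary, and all names (`localize`, `attend`, `reconstruct`, `cycle_loss`) are assumptions made for illustration, and the "generated" words are supplied by hand rather than by a trained decoder.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def localize(word_vec, regions):
    # Hypothetical localizer: soft attention over image regions for one word.
    return softmax([dot(word_vec, r) for r in regions])

def attend(weights, regions):
    # Weighted average of region features under the attention weights.
    dim = len(regions[0])
    return [sum(w * r[d] for w, r in zip(weights, regions)) for d in range(dim)]

def reconstruct(region_vec, vocab):
    # Hypothetical reconstructor: word distribution from the localized feature.
    words = list(vocab)
    probs = softmax([dot(region_vec, vocab[w]) for w in words])
    return dict(zip(words, probs))

def cycle_loss(generated, ground_truth, regions, vocab):
    # For each generated word: localize it, then score how well the localized
    # region reconstructs the ground-truth word (mean cross-entropy).
    loss = 0.0
    for gen_word, gt_word in zip(generated, ground_truth):
        weights = localize(vocab[gen_word], regions)
        region_vec = attend(weights, regions)
        probs = reconstruct(region_vec, vocab)
        loss += -math.log(probs[gt_word])
    return loss / len(ground_truth)

# Toy data (assumed): two regions and two words with aligned embeddings.
vocab = {"dog": [4.0, 0.0], "ball": [0.0, 4.0]}
regions = [[1.0, 0.0], [0.0, 1.0]]

grounded = cycle_loss(["dog", "ball"], ["dog", "ball"], regions, vocab)
mismatched = cycle_loss(["ball", "dog"], ["dog", "ball"], regions, vocab)
print(grounded < mismatched)
```

In this toy setting the reconstruction loss is lower when each generated word attends to the region that supports the ground-truth word. That is the intuition behind using reconstruction as a training signal when no localization labels are available.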
4 Citations
More Grounded Image Captioning by Distilling Image-Text Matching Model
- Computer Science
- 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2020
- 8
Thoracic Disease Identification and Localization using Distance Learning and Region Verification
- Computer Science
- BMVC
- 2020
- 2