Corpus ID: 173990332

Learning to Generate Grounded Image Captions without Localization Supervision

  • Chih-Yao Ma, Yannis Kalantidis, G. Al-Regib, Péter Vajda, Marcus Rohrbach, Z. Kira
  • Published 2019
  • Computer Science
  • ArXiv
  • When generating a sentence description for an image, it frequently remains unclear how well the generated caption is grounded in the image or if the model hallucinates based on priors in the dataset and/or the language model. [...] In this work, we propose a novel cyclical training regimen that forces the model to localize each word in the image after the sentence decoder generates it and then reconstruct the sentence from the localized image region(s) to match the ground-truth.
    4 Citations


    More Grounded Image Captioning by Distilling Image-Text Matching Model
    • 8
    • Highly Influenced
    Comprehensive Image Captioning via Scene Graph Decomposition
    Sub-Instruction Aware Vision-and-Language Navigation
    • 5
    • Highly Influenced
    Thoracic Disease Identification and Localization using Distance Learning and Region Verification
    • 2


    Generating Descriptions with Grounded and Co-referenced People
    • 37
    Grounding of Textual Phrases in Images by Reconstruction
    • 293
    Neural Baby Talk
    • 207
    Show and tell: A neural image caption generator
    • 3,635
    Attention Correctness in Neural Image Captioning
    • 142
    Jointly Localizing and Describing Events for Dense Video Captioning
    • 58
    Adversarial Inference for Multi-Sentence Video Description
    • 11
    Captioning Images with Diverse Objects
    • 96
    Weakly-Supervised Visual Grounding of Phrases with Linguistic Structures
    • 71
    Mind's eye: A recurrent visual representation for image caption generation
    • 387