Learning to Collocate Visual-Linguistic Neural Modules for Image Captioning

@article{Yang2022LearningTC,
  title={Learning to Collocate Visual-Linguistic Neural Modules for Image Captioning},
  author={Xu Yang and Hanwang Zhang and Chongyang Gao and Jianfei Cai},
  journal={International Journal of Computer Vision},
  year={2022},
  volume={131},
  pages={82-100}
}
Humans tend to decompose a sentence into different parts, such as something does something at someplace, and then fill each part with certain content. Inspired by this, we follow the principle of modular design to propose a novel image captioner: learning to Collocate Visual-Linguistic Neural Modules (CVLNM). Unlike the widely used neural module networks in VQA, where the language (i.e., the question) is fully observable, the task of collocating visual-linguistic modules is more challenging. This is because the…
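
To make the modular-design principle above concrete, the sketch below shows one minimal way to softly collocate a small set of visual-linguistic modules: a controller reads the decoder state and produces soft weights that mix the module outputs. The module names (object, attribute, relation, function), the tensor shapes, and the mean-pooled visual summary are all assumptions made for illustration; this is not the authors' CVLNM implementation.

import torch
import torch.nn as nn

class SoftModuleCollocation(nn.Module):
    """Illustrative sketch: mix the outputs of several visual modules with
    soft weights predicted from the decoder state (hypothetical design)."""

    def __init__(self, feat_dim=1024, hidden_dim=512, num_modules=4):
        super().__init__()
        # One tiny sub-network per module (object / attribute / relation / function).
        self.modules_list = nn.ModuleList(
            [nn.Linear(feat_dim, hidden_dim) for _ in range(num_modules)]
        )
        # Controller: decoder state -> soft weights over modules.
        self.controller = nn.Linear(hidden_dim, num_modules)

    def forward(self, region_feats, decoder_state):
        # region_feats: (num_regions, feat_dim), decoder_state: (hidden_dim,)
        pooled = region_feats.mean(dim=0)                                   # crude visual summary
        outputs = torch.stack([torch.tanh(m(pooled)) for m in self.modules_list])  # (M, hidden_dim)
        weights = torch.softmax(self.controller(decoder_state), dim=-1)     # (M,)
        return (weights.unsqueeze(-1) * outputs).sum(dim=0)                 # collocated feature

# Toy usage with random tensors.
colloc = SoftModuleCollocation()
fused = colloc(torch.randn(36, 1024), torch.randn(512))
print(fused.shape)  # torch.Size([512])

In a full captioner the weights would be re-predicted at every decoding step, so different words can rely on different modules.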

References

Showing 1-10 of 94 references

Learning to Collocate Neural Modules for Image Captioning

This work proposes a novel framework, learning to Collocate Neural Modules (CNM), to generate the "inner pattern" connecting the visual encoder and the language decoder, and achieves a new state-of-the-art 127.9 CIDEr-D on the Karpathy split and a single-model 126.0 c40 on the official server.

Learning to Assemble Neural Module Tree Networks for Visual Grounding

A novel modular network called Neural Module Tree network (NMTree) is developed that regularizes visual grounding along the dependency parsing tree of the sentence, where each node is a neural module that calculates visual attention according to its linguistic feature, and the grounding score is accumulated in a bottom-up direction as needed.
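
As a rough illustration of the bottom-up accumulation described above, the sketch below scores candidate regions at each dependency-tree node from its word embedding and adds in the children's scores. The tree structure, the dot-product scorer, and the simple additive merge are assumptions; NMTree itself uses typed modules and learned merging.

import torch
import torch.nn as nn

class TreeNode:
    """A dependency-parse node holding a word embedding and its children (toy structure)."""
    def __init__(self, word_emb, children=None):
        self.word_emb = word_emb          # (emb_dim,)
        self.children = children or []

def bottom_up_score(node, region_feats, proj):
    """Illustrative bottom-up grounding: score regions for this node's word
    and accumulate the children's scores (a simplification of NMTree)."""
    # region_feats: (num_regions, feat_dim); proj maps the word embedding into the visual space.
    score = region_feats @ proj(node.word_emb)                 # (num_regions,)
    for child in node.children:
        score = score + bottom_up_score(child, region_feats, proj)
    return score

# Toy usage: a two-level tree (e.g. "dog" with child "sofa") and 5 candidate regions.
emb_dim, feat_dim = 300, 2048
proj = nn.Linear(emb_dim, feat_dim, bias=False)
leaf = TreeNode(torch.randn(emb_dim))
root = TreeNode(torch.randn(emb_dim), children=[leaf])
regions = torch.randn(5, feat_dim)
print(bottom_up_score(root, regions, proj))                    # grounding score per region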

Show and tell: A neural image caption generator

This paper presents a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image.

Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning

This paper proposes a novel adaptive attention model with a visual sentinel that sets a new state of the art for image captioning by a significant margin.
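
The sentinel mechanism can be summarized as mixing the attended visual context c_t with a "visual sentinel" s_t carried by the language model: adaptive context = beta_t * s_t + (1 - beta_t) * c_t. The sketch below uses a simple sigmoid gate to produce beta_t; the paper instead derives beta_t by extending the attention distribution over the sentinel, so treat this as a simplified stand-in with assumed dimensions.

import torch
import torch.nn as nn

class SentinelGate(nn.Module):
    """Sketch of the adaptive-attention idea: mix the attended visual context c_t
    with a visual sentinel s_t using a gate beta_t (simplified from the paper)."""

    def __init__(self, hidden_dim=512):
        super().__init__()
        self.gate = nn.Linear(2 * hidden_dim, 1)

    def forward(self, visual_context, sentinel):
        # visual_context c_t and sentinel s_t: both (hidden_dim,)
        beta = torch.sigmoid(self.gate(torch.cat([visual_context, sentinel], dim=-1)))
        return beta * sentinel + (1.0 - beta) * visual_context   # adaptive context

gate = SentinelGate()
c_hat = gate(torch.randn(512), torch.randn(512))
print(c_hat.shape)  # torch.Size([512])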

Explainability by Parsing: Neural Module Tree Networks for Natural Language Visual Grounding

A novel modular network called Neural Module Tree network (NMTree) is developed that regularizes visual grounding along the dependency parsing tree of the sentence, where each node is a module network that calculates or accumulates the grounding score in a bottom-up direction as needed.

MAttNet: Modular Attention Network for Referring Expression Comprehension

This work proposes to decompose expressions into three modular components related to subject appearance, location, and relationship to other objects, which allows the model to flexibly adapt to expressions containing different types of information in an end-to-end framework.
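
The modular decomposition can be illustrated with a toy scorer: three placeholder modules produce scores for subject appearance, location, and relationship, and expression-dependent weights combine them into one matching score. The dimensions, the 5-d box-geometry input, and the linear scorers are assumptions; MAttNet's real modules use attention and much richer inputs.

import torch
import torch.nn as nn

class ModularScorer(nn.Module):
    """Sketch of MAttNet-style scoring: three placeholder module scores are combined
    with expression-dependent weights (the real modules are much richer)."""

    def __init__(self, expr_dim=512, feat_dim=2048):
        super().__init__()
        self.weight_head = nn.Linear(expr_dim, 3)               # weights over subj/loc/rel
        self.subj = nn.Linear(feat_dim, 1)
        self.loc = nn.Linear(5, 1)                               # e.g. normalized box geometry
        self.rel = nn.Linear(feat_dim, 1)

    def forward(self, expr_emb, appearance, box_geom, context):
        w = torch.softmax(self.weight_head(expr_emb), dim=-1)    # (3,)
        scores = torch.cat([self.subj(appearance),
                            self.loc(box_geom),
                            self.rel(context)], dim=-1)          # (3,)
        return (w * scores).sum()                                # overall matching score

scorer = ModularScorer()
s = scorer(torch.randn(512), torch.randn(2048), torch.randn(5), torch.randn(2048))
print(float(s))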

Neural Baby Talk

A novel framework for image captioning is introduced that can produce natural language explicitly grounded in entities that object detectors find in the image, reaching state-of-the-art performance on both the COCO and Flickr30k datasets.
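
The grounding idea can be pictured as a template-filling step: the generated caption contains slot tokens that are replaced by labels from an object detector. The slot syntax and data layout below are invented for illustration and are not Neural Baby Talk's actual interface.

# Illustrative sketch: fill caption slots with detected object labels (toy names).
def fill_template(template_tokens, detections):
    """Replace slot markers like '<region_1>' with the detected object's class label."""
    out = []
    for tok in template_tokens:
        if tok.startswith("<region_") and tok.endswith(">"):
            idx = int(tok[len("<region_"):-1])
            out.append(detections[idx]["label"])     # ground the slot in a detection
        else:
            out.append(tok)
    return " ".join(out)

detections = {1: {"label": "dog", "box": (34, 50, 120, 200)},
              2: {"label": "frisbee", "box": (150, 40, 60, 60)}}
template = ["a", "<region_1>", "is", "catching", "a", "<region_2>"]
print(fill_template(template, detections))   # "a dog is catching a frisbee"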

Learning to Reason: End-to-End Module Networks for Visual Question Answering

End-to-End Module Networks are proposed, which learn to reason by directly predicting instance-specific network layouts without the aid of a parser, and achieve an error reduction of nearly 50% relative to state-of-the-art attentional approaches.
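
A minimal sketch of layout prediction, under assumptions: a recurrent controller greedily decodes a short sequence of module tokens from a question encoding. The toy module vocabulary, the fixed number of steps, and the feed-forward of the hidden state are illustrative choices, not the End-to-End Module Networks implementation, which also assembles and executes the predicted layout.

import torch
import torch.nn as nn

MODULES = ["find", "relocate", "and", "answer"]   # toy module vocabulary

class LayoutPredictor(nn.Module):
    """Sketch of layout prediction: greedily decode a sequence of module tokens
    from a question encoding (a simplification of End-to-End Module Networks)."""

    def __init__(self, q_dim=512, hidden=256, steps=3):
        super().__init__()
        self.steps = steps
        self.init = nn.Linear(q_dim, hidden)
        self.cell = nn.GRUCell(hidden, hidden)
        self.head = nn.Linear(hidden, len(MODULES))

    def forward(self, question_emb):
        h = torch.tanh(self.init(question_emb)).unsqueeze(0)   # (1, hidden)
        x = torch.zeros_like(h)
        layout = []
        for _ in range(self.steps):
            h = self.cell(x, h)
            idx = self.head(h).argmax(dim=-1).item()
            layout.append(MODULES[idx])
            x = h                                              # feed the state forward (toy choice)
        return layout

pred = LayoutPredictor()
print(pred(torch.randn(512)))   # e.g. ['find', 'relocate', 'answer']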

Exploring Visual Relationship for Image Captioning

This paper introduces a new design that explores the connections between objects for image captioning under the umbrella of the attention-based encoder-decoder framework, integrating both semantic and spatial object relationships into the image encoder.
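
One way to picture integrating object relationships into the image encoder is a single GCN-style update: each region feature is refreshed by aggregating its neighbors over a given relationship graph. The adjacency matrix and the single untyped edge type below are assumptions; the paper builds separate semantic and spatial graphs with typed relations.

import torch
import torch.nn as nn

class RelationAwareEncoder(nn.Module):
    """Sketch of relation-aware region encoding: each region feature is updated by
    aggregating its neighbors over a (given) relationship graph, GCN-style."""

    def __init__(self, feat_dim=2048):
        super().__init__()
        self.self_proj = nn.Linear(feat_dim, feat_dim)
        self.neigh_proj = nn.Linear(feat_dim, feat_dim)

    def forward(self, region_feats, adjacency):
        # region_feats: (N, feat_dim); adjacency: (N, N) with 1 where a relation holds.
        neighbor_sum = adjacency @ self.neigh_proj(region_feats)
        return torch.relu(self.self_proj(region_feats) + neighbor_sum)

enc = RelationAwareEncoder()
feats = torch.randn(6, 2048)
adj = (torch.rand(6, 6) > 0.7).float()           # toy relationship graph
print(enc(feats, adj).shape)                     # torch.Size([6, 2048])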

M2: Meshed-Memory Transformer for Image Captioning

The architecture improves both the image encoding and the language generation steps: it learns a multi-level representation of the relationships between image regions that integrates learned a priori knowledge, and uses mesh-like connectivity at the decoding stage to exploit both low- and high-level features.
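
The memory component can be sketched as attention whose keys and values are extended with learned memory slots, letting queries retrieve a priori knowledge that is not present among the image regions. The shapes, the number of slots, and the single-head formulation below are assumptions, not the authors' code.

import torch
import torch.nn as nn

class MemoryAugmentedAttention(nn.Module):
    """Sketch of memory-augmented attention: learned memory slots are appended to
    the keys and values so attention can also retrieve a priori knowledge."""

    def __init__(self, dim=512, num_memory=40):
        super().__init__()
        self.mem_k = nn.Parameter(torch.randn(num_memory, dim) * 0.02)
        self.mem_v = nn.Parameter(torch.randn(num_memory, dim) * 0.02)

    def forward(self, queries, keys, values):
        # queries: (Q, dim); keys/values: (N, dim) region features.
        k = torch.cat([keys, self.mem_k], dim=0)                 # (N + M, dim)
        v = torch.cat([values, self.mem_v], dim=0)
        attn = torch.softmax(queries @ k.t() / k.shape[-1] ** 0.5, dim=-1)
        return attn @ v                                          # (Q, dim)

att = MemoryAugmentedAttention()
regions = torch.randn(36, 512)
out = att(torch.randn(10, 512), regions, regions)
print(out.shape)  # torch.Size([10, 512])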