Show, Edit and Tell: A Framework for Editing Image Captions

@inproceedings{Sammani2020ShowEA,
  title={Show, Edit and Tell: A Framework for Editing Image Captions},
  author={Fawaz Sammani and Luke Melas-Kyriazi},
  booktitle={2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2020},
  pages={4807-4815}
}
Most image captioning frameworks generate captions directly from images, learning a mapping from visual features to natural language. However, editing existing captions can be easier than generating new ones from scratch. Intuitively, when editing captions, a model is not required to learn information that is already present in the caption (i.e. sentence structure), enabling it to focus on fixing details (e.g. replacing repetitive words). This paper proposes a novel approach to image captioning…
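The editing intuition above can be illustrated with a toy sketch. This is not the paper's model: a real caption editor scores each decision with a learned network conditioned on image features, whereas here the "corrections" are a hypothetical position-indexed replacement table. The point is only that most words of a draft caption are copied through unchanged (reusing sentence structure), and only the flagged positions are rewritten (fixing details).

```python
def edit_caption(draft: str, replacements: dict[int, str]) -> str:
    """Rewrite a draft caption word by word.

    Positions listed in `replacements` are swapped out (fixing details);
    every other word is copied through unchanged (reusing structure).
    """
    words = draft.split()
    return " ".join(replacements.get(i, w) for i, w in enumerate(words))


# A draft with a repetitive word; the editor only needs to fix position 7.
draft = "a man riding a horse on a horse"
edited = edit_caption(draft, {7: "beach"})
print(edited)  # → a man riding a horse on a beach
```

Note how little the editor has to do compared with generating the sentence from scratch: seven of the eight output words are verbatim copies of the draft.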
ReFormer: The Relational Transformer for Image Captioning
  • Xuewen Yang, Yingru Liu, Xin Wang
  • Computer Science
  • ArXiv
  • 2021
TLDR: ReFormer combines the objective of scene graph generation with that of image captioning in a single modified Transformer model, allowing it to generate not only better image captions, with the benefit of strong relational image features, but also scene graphs that explicitly describe pair-wise relationships.
Fusion Models for Improved Visual Captioning
TLDR: A generic multimodal model-fusion framework for caption generation and emendation, proposing different fusion strategies to integrate a pretrained Auxiliary Language Model (AuxLM) into traditional encoder-decoder visual captioning frameworks.
Weakly Supervised Content Selection for Improved Image Captioning
TLDR: A compositional model that uses skeletons as a knob to control properties of the generated caption, such as length, content, and gender expression; it generates significantly better captions on out-of-domain test images, as judged by human annotators.
Neighbours Matter: Image Captioning with Similar Images
TLDR: An image captioning model based on KNN graphs composed of the input image and its similar images, where each node denotes an image or a caption, together with an attention-in-attention (AiA) model that refines the node representations.
A Survey on Recent Advances in Image Captioning
Image captioning, an interdisciplinary research field of computer vision and natural language processing, has attracted extensive attention. Image captioning aims to produce reasonable and accurate…
Non-Autoregressive Video Captioning with Iterative Refinement
TLDR: A non-autoregressive video captioning (NAVC) model with iterative refinement, which exploits external auxiliary scoring information to help the model focus on inappropriate words more accurately during refinement.
Emerging Trends of Multimodal Research in Vision and Language
TLDR: A detailed overview of the latest research trends in visual and language modalities, covering their task formulations and approaches to problems in semantic perception and content generation.
Iterative Shrinking for Referring Expression Grounding Using Deep Reinforcement Learning
TLDR: An iterative shrinking mechanism to localize the target, where the shrinking direction is decided by a reinforcement-learning agent that comprehensively considers all contents within the current image patch.
Journalistic Guidelines Aware News Image Captioning
The task of news article image captioning aims to generate descriptive and informative captions for news article images. Unlike conventional image captions that simply describe the content of the…
Non-Autoregressive Coarse-to-Fine Video Captioning
TLDR: This paper proposes a non-autoregressive decoding-based model with a coarse-to-fine captioning procedure that achieves state-of-the-art performance, generates diverse descriptions, and obtains high inference efficiency.
