Non-Autoregressive Image Captioning with Counterfactuals-Critical Multi-Agent Learning

@article{Guo2020NonAutoregressiveIC,
  title={Non-Autoregressive Image Captioning with Counterfactuals-Critical Multi-Agent Learning},
  author={Longteng Guo and Jing Liu and Xinxin Zhu and Xingjian He and Jie Jiang and Hanqing Lu},
  journal={ArXiv},
  year={2020},
  volume={abs/2005.04690}
}
Most image captioning models are autoregressive, i.e. they generate each word by conditioning on previously generated words, which leads to heavy latency during inference. Recently, non-autoregressive decoding has been proposed in machine translation to speed up inference by generating all words in parallel. Typically, these models use the word-level cross-entropy loss to optimize each word independently. However, such a learning process fails to consider the sentence-level consistency…
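The contrast the abstract draws between the two decoding regimes can be sketched in a few lines. This toy (all names illustrative, no real captioning model implied) shows only the control flow: the autoregressive loop needs one model call per word, each waiting on the previous one, while the non-autoregressive variant could issue all position predictions in a single parallel call.

```python
def toy_predict_next(prefix):
    """Stand-in for one decoder step: the next word depends only on
    the prefix length here, so both decoders give the same caption."""
    vocab = ["a", "dog", "runs", "fast", "<eos>"]
    return vocab[len(prefix) % len(vocab)]

def autoregressive_decode(length):
    """One word at a time; step t cannot start before step t-1 ends."""
    caption = []
    for _ in range(length):
        caption.append(toy_predict_next(caption))
    return caption  # `length` sequential model calls

def non_autoregressive_decode(length):
    """Every position is predicted from a fixed (empty) prefix, so all
    positions are independent and could run in one parallel call."""
    return [toy_predict_next([None] * i) for i in range(length)]
```

The word-level cross-entropy loss mentioned above corresponds to training each of those independent position predictions separately, which is exactly what forfeits sentence-level consistency.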
Partially Non-Autoregressive Image Captioning
TLDR: A partially non-autoregressive model, named PNAIC, is introduced, which considers a caption as a series of concatenated word groups and is capable of generating accurate captions as well as preventing common incoherent errors.
Self-Distillation for Few-Shot Image Captioning
  • Xianyu Chen, Ming Jiang, Qi Zhao
  • Computer Science
  • 2021 IEEE Winter Conference on Applications of Computer Vision (WACV)
  • 2021
TLDR: An ensemble-based self-distillation method that allows image captioning models to be trained with unpaired images and captions is proposed, along with a simple yet effective pseudo-feature generation method based on gradient descent.
Contrastive Semantic Similarity Learning for Image Captioning Evaluation with Intrinsic Auto-encoder
TLDR: This work proposes a learning-based metric for image captioning, called Intrinsic Image Captioning Evaluation (ICE), and develops three progressive model structures to learn sentence-level representations: a single-branch model, a dual-branch model, and a triple-branch model.
PIMNet: A Parallel, Iterative and Mimicking Network for Scene Text Recognition
TLDR: A Parallel, Iterative and Mimicking Network (PIMNet) is proposed to balance accuracy and efficiency, adopting a parallel attention mechanism to predict the text faster and an iterative generation mechanism to make the predictions more accurate.
From Show to Tell: A Survey on Image Captioning
TLDR: This work provides a comprehensive overview of image captioning approaches, from visual encoding and text generation to training strategies, datasets, and evaluation metrics, and quantitatively compares many relevant state-of-the-art approaches to identify the most impactful technical innovations in architectures and training strategies.
UFC-BERT: Unifying Multi-Modal Controls for Conditional Image Synthesis
TLDR: A new two-stage architecture, UFC-BERT, unifies any number of multi-modal controls and adopts non-autoregressive generation at the second stage to enhance the holistic consistency of the synthesized image, to support preserving specified image blocks, and to improve the synthesis speed.
M6-UFC: Unifying Multi-Modal Controls for Conditional Image Synthesis
TLDR: A new two-stage architecture, M6-UFC, is proposed to unify any number of multi-modal controls, enhance the holistic consistency of the synthesized image, support preserving specified image blocks, and improve the synthesis speed.
Emerging Trends of Multimodal Research in Vision and Language
TLDR: A detailed overview of the latest research trends in the visual and language modalities is presented, covering their applications, task formulations, and approaches to problems in semantic perception and content generation.
Semi-Autoregressive Image Captioning
  • Xu Yan, Zhengcong Fei, Zekang Li, Shuhui Wang, Qingming Huang, Qi Tian
  • Computer Science
  • 2021
TLDR: Experimental results on the MS COCO benchmark demonstrate that the proposed Semi-Autoregressive Image Captioning (SAIC) model outperforms preceding non-autoregressive image captioning models while obtaining a competitive inference speedup.
Multimodal research in vision and language: A review of current and emerging trends
Abstract: Deep Learning and its applications have cascaded impactful research and development with a diverse range of modalities present in the real-world data. More recently, this has enhanced…

References

SHOWING 1-10 OF 39 REFERENCES
Masked Non-Autoregressive Image Captioning
TLDR: This paper proposes masked non-autoregressive decoding, which masks several ratios of the input sequences during training and, during inference, generates captions in parallel over several stages, moving from a totally masked sequence to a totally unmasked sequence in a compositional manner.
Self-Critical Sequence Training for Image Captioning
TLDR: This paper considers the problem of optimizing image captioning systems using reinforcement learning, and shows that by carefully optimizing systems on the test metrics of the MSCOCO task, significant gains in performance can be realized.
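The self-critical idea summarized above can be sketched as a training signal: a sampled caption is rewarded by how much its sentence-level score exceeds that of the model's own greedy (test-time) decode, which serves as the baseline. A minimal sketch, assuming a sentence-level reward is supplied as `reward_fn`; the toy word-overlap reward stands in for a real metric such as CIDEr.

```python
def scst_advantage(reward_fn, sampled_caption, greedy_caption, reference):
    """Self-critical advantage: the greedy decode is its own baseline,
    so sampled captions only get a positive signal when they beat the
    model's test-time output."""
    return reward_fn(sampled_caption, reference) - reward_fn(greedy_caption, reference)

def toy_reward(caption, reference):
    """Illustrative reward: fraction of distinct reference words recovered
    (a stand-in for CIDEr, not the metric used in the paper)."""
    return len(set(caption) & set(reference)) / len(set(reference))
```

In actual SCST training this advantage weights the log-probability of the sampled caption in a policy-gradient loss; the sketch isolates only the baseline construction.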
Fast Image Caption Generation with Position Alignment
  • Z. Fei
  • Computer Science
  • ArXiv
  • 2019
TLDR: This work introduces an inference strategy that regards position information as a latent variable to guide further sentence generation, achieving better performance than general non-autoregressive captioning models and performance comparable to autoregressive image captioning models with a significant speedup.
Non-Autoregressive Neural Machine Translation
TLDR: A model is introduced that avoids the autoregressive property and produces its outputs in parallel, allowing an order of magnitude lower latency during inference, and achieves near-state-of-the-art performance on WMT 2016 English-Romanian.
Non-Autoregressive Neural Machine Translation with Enhanced Decoder Input
TLDR: This paper proposes two methods to enhance the decoder inputs of NAT models: one directly leverages a phrase table generated by conventional SMT approaches to translate source tokens to target tokens, and the other transforms source-side word embeddings to target-side word embeddings through sentence-level alignment and word-level adversarial learning.
Non-Autoregressive Machine Translation with Auxiliary Regularization
TLDR: This paper addresses the issues of repeated translations and incomplete translations in NAT models by improving the quality of decoder hidden representations via two auxiliary regularization terms in the training process.
Counterfactual Critic Multi-Agent Training for Scene Graph Generation
TLDR: CMAT is a multi-agent policy gradient method that frames objects as cooperative agents and directly maximizes a graph-level metric as the reward, using a counterfactual baseline that disentangles the agent-specific reward by fixing the predictions of the other agents.
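The counterfactual baseline described in that summary can be sketched in a few lines: an agent's contribution to the shared reward is isolated by replacing only its own prediction with a default while the other agents' predictions stay fixed. A rough sketch in the spirit of CMAT; `team_reward`, the default token, and the list-of-predictions interface are all illustrative assumptions, not the paper's implementation.

```python
def counterfactual_advantage(team_reward, predictions, agent_idx, default="background"):
    """Shared reward minus the reward with only agent `agent_idx`'s
    prediction swapped for a default; the other agents are held fixed,
    so the difference attributes credit to that single agent."""
    actual = team_reward(predictions)
    counterfactual = list(predictions)
    counterfactual[agent_idx] = default
    return actual - team_reward(counterfactual)
```

An agent whose prediction does not change the team reward receives zero advantage, which is what disentangles per-agent credit from the single graph-level metric.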
Entangled Transformer for Image Captioning
TLDR: A Transformer-based sequence modeling framework built only with attention layers and feedforward layers enables the Transformer to exploit semantic and visual information simultaneously, and achieves state-of-the-art performance on the MSCOCO image captioning dataset.
Show and Tell: Lessons Learned from the 2015 MSCOCO Image Captioning Challenge
TLDR: A generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation, and that can be used to generate natural sentences describing an image, is presented.
MSCap: Multi-Style Image Captioning With Unpaired Stylized Text
TLDR: An adversarial learning network is proposed for the task of multi-style image captioning (MSCap), using a standard factual image caption dataset and a multi-stylized language corpus without paired images, to enable more natural and human-like captions.