Corpus ID: 235458223

Semi-Autoregressive Transformer for Image Captioning

  • Yuanen Zhou, Yong Zhang, Zhenzhen Hu, Meng Wang
  • Published 2021
  • Computer Science
  • ArXiv
Current state-of-the-art image captioning models adopt autoregressive decoders, i.e. they generate each word by conditioning on previously generated words, which leads to heavy latency during inference. To tackle this issue, non-autoregressive image captioning models have recently been proposed to significantly accelerate the speed of inference by generating all words in parallel. However, these non-autoregressive models inevitably suffer from large generation quality degradation since they…
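The latency trade-off described in the abstract can be illustrated with a minimal, framework-free sketch. Here `toy_model` is a hypothetical stand-in for a trained decoder, not the paper's model; the point is only the dependency structure of the two decoding schemes:

```python
def toy_model(position, prefix=None):
    # Hypothetical stand-in for a trained decoder: returns the word at
    # `position`, optionally conditioned on the already-generated `prefix`.
    vocab = ["a", "man", "riding", "a", "horse"]
    return vocab[position]

def autoregressive_decode(length):
    # One sequential step per word: step t must wait for steps 0..t-1,
    # which is the source of the inference latency described above.
    words = []
    for t in range(length):
        words.append(toy_model(t, prefix=words))
    return words

def non_autoregressive_decode(length):
    # All positions predicted in one parallel step: fast, but positions
    # cannot condition on each other, which degrades caption quality.
    return [toy_model(t) for t in range(length)]
```

In a real model the autoregressive loop costs one decoder forward pass per word, while the non-autoregressive version needs only one pass total.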

Partially Non-Autoregressive Image Captioning
A partially non-autoregressive model, named PNAIC, is introduced, which considers a caption as a series of concatenated word groups and is capable of generating accurate captions as well as preventing common incoherent errors.
Non-Autoregressive Image Captioning with Counterfactuals-Critical Multi-Agent Learning
This paper proposes a Non-Autoregressive Image Captioning (NAIC) model with a novel training paradigm: Counterfactuals-critical Multi-Agent Learning (CMAL), which formulates NAIC as a multi-agent reinforcement learning system where positions in the target sequence are viewed as agents that learn to cooperatively maximize a sentence-level reward.
Fast Image Caption Generation with Position Alignment
  • Z. Fei
  • Computer Science
  • ArXiv
  • 2019
This work introduces an inference strategy that regards position information as a latent variable to guide further sentence generation; it achieves better performance than general NA captioning models, while achieving performance comparable to autoregressive image captioning models with a significant speedup.
Meshed-Memory Transformer for Image Captioning
The architecture improves both the image encoding and the language generation steps: it learns a multi-level representation of the relationships between image regions integrating learned a priori knowledge, and uses a mesh-like connectivity at the decoding stage to exploit low- and high-level features.
Semi-Autoregressive Neural Machine Translation
A novel model for fast sequence generation — the semi-autoregressive Transformer (SAT), which keeps the autoregressive property globally but relaxes it locally, and is thus able to produce multiple successive words in parallel at each time step.
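The group-wise scheme above can be sketched as follows, again using a hypothetical `toy_model` stand-in for the decoder (illustrative names, not the SAT implementation): the caption is generated autoregressively across groups but in parallel within each group.

```python
import math

def toy_model(position, prefix=None):
    # Hypothetical stand-in for a trained decoder (not the SAT model itself).
    vocab = ["a", "man", "riding", "a", "horse"]
    return vocab[position]

def semi_autoregressive_decode(length, group_size=2):
    # Autoregressive ACROSS groups, parallel WITHIN a group: a caption of
    # `length` words takes ceil(length / group_size) sequential steps
    # instead of `length`.
    words, steps = [], 0
    while len(words) < length:
        start = len(words)
        # Every position in the group conditions on the same prefix, so
        # the group's words could be produced in one parallel decoder pass.
        group = [toy_model(p, prefix=words[:start])
                 for p in range(start, min(start + group_size, length))]
        words.extend(group)
        steps += 1
    assert steps == math.ceil(length / group_size)
    return words, steps
```

With `group_size=1` this degenerates to fully autoregressive decoding, and with `group_size=length` to fully non-autoregressive decoding, which is why the semi-autoregressive setting interpolates between the two.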
Non-Autoregressive Neural Machine Translation with Enhanced Decoder Input
This paper proposes two methods to enhance the decoder inputs so as to improve NAT models: one directly leverages a phrase table generated by conventional SMT approaches to translate source tokens to target tokens, and the other transforms source-side word embeddings to target-side words through sentence-level alignment and word-level adversary learning.
Non-Autoregressive Machine Translation with Auxiliary Regularization
This paper proposes to address the issues of repeated translations and incomplete translations in NAT models by improving the quality of decoder hidden representations via two auxiliary regularization terms in the training process of an NAT model.
An Empirical Study of Language CNN for Image Captioning
This paper introduces a language CNN model suitable for statistical language modeling tasks, and shows that it achieves image captioning performance competitive with state-of-the-art methods.
Self-Critical Sequence Training for Image Captioning
This paper considers the problem of optimizing image captioning systems using reinforcement learning, and shows that by carefully optimizing systems using the test metrics of the MSCOCO task, significant gains in performance can be realized.
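The self-critical objective amounts to a REINFORCE update whose baseline is the reward of the model's own greedy decode. A minimal sketch of the per-caption loss (names are illustrative; in practice the reward would be a test metric such as CIDEr, and the log-probability would come from the decoder):

```python
def scst_loss(sampled_reward, greedy_reward, sampled_logprob):
    # Self-critical baseline: the greedy caption's reward. Sampled captions
    # that beat the greedy decode are reinforced (the loss pushes their
    # log-probability up); samples that score worse are suppressed.
    advantage = sampled_reward - greedy_reward
    return -advantage * sampled_logprob
```

Because the baseline is produced by the model itself, no separately learned value function is needed, which is the main practical appeal of the method.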
Unified Vision-Language Pre-Training for Image Captioning and VQA
VLP is the first reported model that achieves state-of-the-art results on both vision-language generation and understanding tasks, as disparate as image captioning and visual question answering, across three challenging benchmark datasets: COCO Captions, Flickr30k Captions, and VQA 2.0.