Corpus ID: 211817758

XGPT: Cross-modal Generative Pre-Training for Image Captioning

@article{Xia2020XGPTCG,
  title={XGPT: Cross-modal Generative Pre-Training for Image Captioning},
  author={Qiaolin Xia and H. Huang and Nan Duan and Dongdong Zhang and Lei Ji and Zhifang Sui and Edward Cui and Taroon Bharti and Xin Liu and M. Zhou},
  journal={ArXiv},
  year={2020},
  volume={abs/2003.01473}
}
While many BERT-based cross-modal pre-trained models produce excellent results on downstream understanding tasks like image-text retrieval and VQA, they cannot be applied to generation tasks directly. In this paper, we propose XGPT, a new method of Cross-modal Generative Pre-Training for Image Captioning that is designed to pre-train text-to-image caption generators through three novel generation tasks, including Image-conditioned Masked Language Modeling (IMLM), Image-conditioned Denoising…
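
Of the pre-training tasks named above, Image-conditioned Masked Language Modeling (IMLM) can be illustrated with a short, self-contained sketch: caption tokens are randomly replaced with a mask id, and the model, conditioned on the image's region features, is trained to recover the original tokens at the masked positions. The sketch below covers only the masking and label-preparation step; all identifiers (mask_caption_tokens, MASK_ID, the example token ids) are hypothetical, and the 15% masking rate and ignore-index label convention are assumptions borrowed from BERT-style MLM rather than details stated in this excerpt.

    import random

    MASK_ID = 103        # hypothetical [MASK] token id (BERT-style vocabulary assumed)
    IGNORE_INDEX = -100  # label value ignored by the loss at unmasked positions

    def mask_caption_tokens(token_ids, mask_prob=0.15, seed=None):
        """Randomly mask caption tokens for image-conditioned MLM (IMLM sketch).

        Returns (masked_ids, labels): the caption decoder would see masked_ids
        together with the image region features; the loss is computed only at
        positions where labels != IGNORE_INDEX.
        """
        rng = random.Random(seed)
        masked_ids, labels = [], []
        for tok in token_ids:
            if rng.random() < mask_prob:
                masked_ids.append(MASK_ID)   # replace the token with [MASK]
                labels.append(tok)           # predict the original token here
            else:
                masked_ids.append(tok)
                labels.append(IGNORE_INDEX)  # no loss on unmasked positions
        return masked_ids, labels

    # Example: caption "a dog runs on the grass" as hypothetical token ids
    caption = [1037, 3899, 3216, 2006, 1996, 5568]
    masked, labels = mask_caption_tokens(caption, mask_prob=0.15, seed=0)
    print(masked)
    print(labels)

In a full pipeline, the masked caption would be fed to the generator together with the detected image region features, and cross-entropy would be computed only at the masked positions.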

Citations

Iconographic Image Captioning for Artworks
WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training
Kaleido-BERT: Vision-Language Pre-training on Fashion Domain
Pretrained Language Models for Text Generation: A Survey
Pre-Trained Models: Past, Present and Future (Xu Han, Zhengyan Zhang, +19 authors, Jun Zhu. ArXiv, 2021)
Modeling Text-visual Mutual Dependency for Multi-modal Dialog Generation
