• Corpus ID: 235351128

VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning

Jun Chen, Han Guo, Kai Yi, Boyang Li, Mohamed Elhoseiny
The limited availability of annotated data often hinders real-world applications of machine learning. To efficiently learn from small quantities of multimodal data, we leverage the linguistic knowledge from a large pretrained language model (PLM) and quickly adapt it to new domains of image captioning. To effectively utilize a pretrained model, it is critical to balance the visual input and prior linguistic knowledge from pretraining. We propose VisualGPT, which employs a novel self…
Flamingo: a Visual Language Model for Few-Shot Learning
It is demonstrated that a single Flamingo model can achieve a new state of the art for few-shot learning, simply by prompting the model with task-specific examples.
Multimodal Few-Shot Learning with Frozen Language Models
The resulting system is a multimodal few-shot learner, with the surprising ability to learn a variety of new tasks when conditioned on examples, represented as a sequence of multiple interleaved image and text embeddings.
VC-GPT: Visual Conditioned GPT for End-to-End Generative Vision-and-Language Pre-training
A novel G-VLP framework, Visual Conditioned GPT (VC-GPT), is proposed, which achieves either the best or the second-best performance across all evaluation metrics over the previous works which consume around 30 times more distinct images during cross-modal pre-training.
A Survey of Pretrained Language Models Based Text Generation
This survey presents the recent advances achieved in the topic of PLMs for text generation and introduces three key points of applying PLMs to text generation: how to encode the input data as representations preserving input semantics which can be fused into PLMs.
Language Models Can See: Plugging Visual Controls in Text Generation
A training-free framework for plugging in visual controls in the generation process and enabling LMs to perform multimodal tasks (e.g., image captioning) in a zero-shot manner, which outperforms the state-of-the-art method by notable margins with a nearly 27 times decoding speedup.
FS-COCO: Towards Understanding of Freehand Sketches of Common Objects in Context
This work collects sketches that convey scene content well but can be drawn within a few minutes by a person with any level of sketching skill, and proposes a hierarchical sketch decoder, which is leveraged in a sketch-specific “pretext” task.
Pretrained Language Models for Text Generation: A Survey
This paper presents an overview of the major advances achieved in the topic of pretrained language models for text generation, and discusses how to adapt existing PLMs to model different input data and satisfy special properties in the generated text.
Transflower: probabilistic autoregressive dance generation with multimodal attention
This work presents a novel probabilistic autoregressive architecture that models the distribution over future poses with a normalizing flow conditioned on previous poses as well as music context, using a multimodal transformer encoder.


Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning
This paper proposes a novel adaptive attention model with a visual sentinel that sets the new state-of-the-art by a significant margin on image captioning.
Captioning Images with Diverse Objects
The Novel Object Captioner (NOC) is proposed, a deep visual semantic captioning model that can describe a large number of object categories not present in existing image-caption datasets, taking advantage of external sources, labeled images from object recognition datasets, and semantic knowledge extracted from unannotated text.
Exploring Visual Relationship for Image Captioning
This paper introduces a new design that explores the connections between objects for image captioning within an attention-based encoder-decoder framework, integrating both semantic and spatial object relationships into the image encoder.
Attend to You: Personalized Image Captioning with Context Sequence Memory Networks
This work proposes a novel captioning model named Context Sequence Memory Network (CSMN), and shows the effectiveness of the three novel features of CSMN and its performance enhancement for personalized image captioning over state-of-the-art captioning models.
Kaleido-BERT: Vision-Language Pre-training on Fashion Domain
  • Mingchen Zhuge, D. Gao, L. Shao
  • Computer Science
    2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2021
A new vision-language (VL) pre-training model dubbed Kaleido-BERT is presented, which introduces a novel kaleido strategy for fashion cross-modality representations from transformers and designs alignment-guided masking to focus more on image-text semantic relations.
Improving Language Understanding by Generative Pre-Training
The general task-agnostic model outperforms discriminatively trained models that use architectures specifically crafted for each task, improving upon the state of the art in 9 out of the 12 tasks studied.
Show and Tell: Lessons Learned from the 2015 MSCOCO Image Captioning Challenge
A generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image is presented.
Unsupervised Image Captioning
  • Yang Feng, Lin Ma, Wei Liu, Jiebo Luo
  • Computer Science
    2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2019
This paper makes the first attempt to train an image captioning model in an unsupervised manner, and requires an image set, a sentence corpus, and an existing visual concept detector.
Pointing Novel Objects in Image Captioning
This paper presents Long Short-Term Memory with Pointing (LSTM-P), a new architecture that facilitates vocabulary expansion and produces novel objects via a pointing mechanism by augmenting standard deep captioning architectures with object learners.
CNN+CNN: Convolutional Decoders for Image Captioning
This paper proposes a framework that only employs convolutional neural networks (CNNs) to generate captions and achieves comparable scores of BLEU-1,2,3,4 and METEOR, and higher scores of CIDEr.