Towards Multimodal Vision-Language Models Generating Non-Generic Text

@inproceedings{Robbins2022TowardsMV,
  title={Towards Multimodal Vision-Language Models Generating Non-Generic Text},
  author={Wes Robbins},
  booktitle={ICON},
  year={2022}
}
  • Wes Robbins
  • Published in ICON, 28 June 2022
  • Computer Science
Vision-language models can assess visual context in an image and generate descriptive text. While the generated text may be accurate and syntactically correct, it is often overly general. To address this, recent work has used optical character recognition to supplement visual information with text extracted from an image. In this work, we contend that vision-language models can benefit from information that can be extracted from an image but is not used by current models. We modify previous… 
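
The core idea in the abstract, supplementing region-level visual features with OCR-extracted text before generating a caption, can be pictured with a short sketch. The code below is illustrative only: the class name, feature dimensions, and fusion strategy are assumptions, not the paper's actual architecture.

```python
# Illustrative sketch: fuse detected-region features with OCR token embeddings
# into a single sequence that a caption decoder could attend over.
# All shapes and layer choices here are assumptions.
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    def __init__(self, region_dim=2048, ocr_dim=300, hidden_dim=768):
        super().__init__()
        # Project both modalities into a shared embedding space.
        self.region_proj = nn.Linear(region_dim, hidden_dim)
        self.ocr_proj = nn.Linear(ocr_dim, hidden_dim)

    def forward(self, region_feats, ocr_feats):
        # region_feats: (batch, num_regions, region_dim) from an object detector
        # ocr_feats:    (batch, num_ocr_tokens, ocr_dim) from an OCR system
        regions = self.region_proj(region_feats)
        ocr = self.ocr_proj(ocr_feats)
        # Concatenate along the sequence axis so a decoder can attend to both.
        return torch.cat([regions, ocr], dim=1)

fusion = MultimodalFusion()
seq = fusion(torch.randn(2, 36, 2048), torch.randn(2, 10, 300))
print(seq.shape)  # torch.Size([2, 46, 768])
```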

References

Showing 1-10 of 37 references

Simple is not Easy: A Simple Strong Baseline for TextVQA and TextCaps

This work argues that a simple attention mechanism can do the same or even a better job than state-of-the-art (SOTA) models without any bells and whistles; it sets a new baseline for these two OCR-text-related applications and aims to inspire new thinking about multi-modality encoder design.
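
As a rough picture of what a "simple attention" over OCR-token features might look like, the snippet below implements plain scaled dot-product attention; the function name and shapes are assumptions for illustration, not the baseline's actual code.

```python
# Minimal sketch: scaled dot-product attention from a pooled query
# (e.g. a question encoding) over OCR-token features. Illustrative only.
import math
import torch
import torch.nn.functional as F

def simple_attention(query, ocr_feats):
    # query:     (batch, hidden)
    # ocr_feats: (batch, num_tokens, hidden)
    scores = torch.bmm(ocr_feats, query.unsqueeze(-1)).squeeze(-1)
    scores = scores / math.sqrt(query.size(-1))
    weights = F.softmax(scores, dim=-1)  # attention weights over OCR tokens
    return torch.bmm(weights.unsqueeze(1), ocr_feats).squeeze(1)

context = simple_attention(torch.randn(2, 768), torch.randn(2, 10, 768))
print(context.shape)  # torch.Size([2, 768])
```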

UNITER: UNiversal Image-TExt Representation Learning

UNITER, a UNiversal Image-TExt Representation learned through large-scale pre-training over four image-text datasets, is introduced; it can power heterogeneous downstream V+L tasks with joint multimodal embeddings.

Show and tell: A neural image caption generator

This paper presents a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image.

Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks

This paper proposes a new learning method Oscar (Object-Semantics Aligned Pre-training), which uses object tags detected in images as anchor points to significantly ease the learning of alignments.

TextCaps: a Dataset for Image Captioning with Reading Comprehension

A novel dataset, TextCaps, with 145k captions for 28k images, challenges a model to recognize text, relate it to its visual context, and decide what part of the text to copy or paraphrase, requiring spatial, semantic, and visual reasoning between multiple text tokens and visual entities, such as objects.
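
The copy-or-paraphrase decision described here is commonly handled with a pointer-style copy gate that mixes a vocabulary distribution with a distribution over OCR tokens. The sketch below is a generic illustration of that idea under assumed names and shapes, not the TextCaps baseline itself.

```python
# Generic copy-gate sketch: at each decoding step, split probability mass
# between generating a vocabulary word and copying an OCR token.
# Class name, vocabulary size, and shapes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CopyGate(nn.Module):
    def __init__(self, hidden_dim=768, vocab_size=30522):
        super().__init__()
        self.vocab_head = nn.Linear(hidden_dim, vocab_size)
        self.gate = nn.Linear(hidden_dim, 1)

    def forward(self, dec_state, ocr_feats):
        # dec_state: (batch, hidden); ocr_feats: (batch, num_ocr, hidden)
        p_vocab = F.softmax(self.vocab_head(dec_state), dim=-1)
        # Pointer scores: similarity between decoder state and each OCR token.
        ptr_scores = torch.bmm(ocr_feats, dec_state.unsqueeze(-1)).squeeze(-1)
        p_copy = F.softmax(ptr_scores, dim=-1)
        # The gate decides how much mass goes to copying OCR text.
        g = torch.sigmoid(self.gate(dec_state))
        return (1 - g) * p_vocab, g * p_copy

p_vocab, p_copy = CopyGate()(torch.randn(2, 768), torch.randn(2, 10, 768))
```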

Multi-Modal Graph Neural Network for Joint Reasoning on Vision and Scene Text

A novel VQA approach represents an image as a graph consisting of three sub-graphs that depict the visual, semantic, and numeric modalities, respectively, and introduces three aggregators that guide message passing from one sub-graph to another, using context from the different modalities to refine the node features.
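
A bare-bones illustration of cross-graph message passing through an aggregator is sketched below; the node counts, dimensions, and update rule are assumptions for exposition, not the MM-GNN implementation.

```python
# Sketch of one cross-graph aggregator: nodes of a target sub-graph (e.g. visual)
# are refined with attention-weighted messages from a source sub-graph
# (e.g. numeric OCR nodes). Shapes and update rule are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossGraphAggregator(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.message = nn.Linear(dim, dim)

    def forward(self, target_nodes, source_nodes):
        # target_nodes: (num_target, dim), source_nodes: (num_source, dim)
        attn = F.softmax(target_nodes @ source_nodes.t(), dim=-1)
        messages = attn @ self.message(source_nodes)
        # Residual update keeps the original node features.
        return target_nodes + messages

visual_nodes = CrossGraphAggregator()(torch.randn(36, 512), torch.randn(10, 512))
```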

TAP: Text-Aware Pre-training for Text-VQA and Text-Caption

This paper proposes Text-Aware Pre-training (TAP) for Text-VQA and Text-Caption tasks and builds OCR-CC, a large-scale scene-text-related image-text dataset based on the Conceptual Captions dataset, which contains 1.4 million images with scene text.

Structured Multimodal Attentions for TextVQA

  • Chenyu Gao, Qi Zhu, Qi Wu
  • Computer Science
    IEEE Transactions on Pattern Analysis and Machine Intelligence
  • 2022
This paper proposes an end-to-end structured multimodal attention (SMA) neural network that outperforms SOTA models on the TextVQA dataset and on two tasks of the ST-VQA dataset among all models except the pre-training-based TAP.

Multimodal Neural Language Models

This work introduces two multimodal neural language models: models of natural language that can be conditioned on other modalities and applied to image-text modelling, which can generate sentence descriptions for images without the use of templates, structured prediction, and/or syntactic trees.

Informative Image Captioning with External Sources of Information

This work introduces a multimodal, multi-encoder model based on the Transformer that ingests both image features and multiple sources of entity labels, and demonstrates that it can learn to control the appearance of these entity labels in the output, resulting in captions that are both fluent and informative.