Finding Domain-Specific Grounding in Noisy Visual-Textual Documents

@article{Yauney2020FindingDG,
  title={Finding Domain-Specific Grounding in Noisy Visual-Textual Documents},
  author={Gregory Yauney and Jack Hessel and David Mimno},
  journal={ArXiv},
  year={2020},
  volume={abs/2010.16363}
}
Images can give us insights into the contextual meanings of words, but current image-text grounding approaches require detailed annotations. Such granular annotation is rare, expensive, and unavailable in most domain-specific contexts. In contrast, unlabeled multi-image, multi-sentence documents are abundant. Can lexical grounding be learned from such documents, even though they have significant lexical and visual overlap? Working with a case study dataset of real estate listings, we… 

References (showing 10 of 30)

Unsupervised Discovery of Multimodal Links in Multi-image, Multi-sentence Documents

TLDR
It is found that a structured training objective based on identifying whether collections of images and sentences co-occur in documents can suffice to predict links between specific sentences and specific images within the same document at test time.
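
The objective described in this TLDR lends itself to a brief illustration. The sketch below is a minimal, hypothetical PyTorch rendering of a document-level co-occurrence loss: pool pairwise sentence-image similarities into a single document score and push a document's own images above images drawn from another document. The function names, the max-then-mean pooling, and the single negative document are assumptions made for illustration, not the authors' released code.

# Hypothetical sketch of a document-level co-occurrence objective
# (illustrative only; not the authors' implementation).
import torch
import torch.nn.functional as F

def doc_match_score(sent_emb, img_emb):
    """Pool pairwise sentence-image similarities into one document score.

    sent_emb: (n_sentences, d) L2-normalized sentence embeddings
    img_emb:  (n_images, d)    L2-normalized image embeddings
    """
    sims = sent_emb @ img_emb.t()               # (n_sentences, n_images)
    # One simple pooling choice: each sentence attends to its best image.
    return sims.max(dim=1).values.mean()

def doc_match_loss(sent_emb, img_emb, neg_img_emb, margin=0.2):
    """Max-margin loss: a document's own images should outscore
    images taken from a different document."""
    pos = doc_match_score(sent_emb, img_emb)
    neg = doc_match_score(sent_emb, neg_img_emb)
    return F.relu(margin + neg - pos)

# Toy usage with random features standing in for encoder outputs.
sents = F.normalize(torch.randn(5, 128), dim=1)
imgs = F.normalize(torch.randn(7, 128), dim=1)
neg_imgs = F.normalize(torch.randn(6, 128), dim=1)
loss = doc_match_loss(sents, imgs, neg_imgs)

At test time, the same pairwise similarity matrix can be read off to link individual sentences to individual images inside a document, which is the within-document prediction the TLDR refers to.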

Quantifying the Visual Concreteness of Words and Topics in Multimodal Datasets

TLDR
This work gives an algorithm for automatically computing the visual concreteness of words and topics within multimodal datasets, and shows that concreteness predicts the capacity of machine learning algorithms to learn textual/visual relationships.
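
As a rough illustration of the nearest-neighbor intuition behind visual concreteness (the exact features and normalization in the cited paper may differ), one plausible score asks how much more often a word-associated image's visual neighbors share that word than chance would predict:

# Illustrative nearest-neighbor concreteness score; the cited paper's
# exact formulation may differ.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def concreteness(image_feats, image_words, word, k=10):
    """Average fraction of each word-associated image's k nearest
    neighbors (in visual-feature space) that share the word,
    normalized by the word's base rate in the dataset.

    image_feats: (n_images, d) array of visual features
    image_words: list of sets of words associated with each image
    """
    has_word = np.array([word in ws for ws in image_words])
    if has_word.sum() < 2:
        return 0.0
    nn = NearestNeighbors(n_neighbors=min(k + 1, len(image_feats)))
    nn.fit(image_feats)
    _, idx = nn.kneighbors(image_feats[has_word])
    neighbor_hits = has_word[idx[:, 1:]]        # drop self-match in column 0
    observed = neighbor_hits.mean()
    expected = has_word.mean()                  # chance level
    return observed / expected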

Microsoft COCO: Common Objects in Context

We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding.

Automatic image annotation and retrieval using cross-media relevance models

TLDR
The approach shows the usefulness of using formal information retrieval models for the task of image annotation and retrieval by assuming that regions in an image can be described using a small vocabulary of blobs.

Deep Visual-Semantic Alignments for Generating Image Descriptions

  • A. Karpathy, Li Fei-Fei
  • Computer Science
  • IEEE Transactions on Pattern Analysis and Machine Intelligence
  • 2017
TLDR
A model that generates natural language descriptions of images and their regions based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding is presented.
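
The alignment at the heart of this model can be summarized in a few lines: every word picks its best-matching image region, and the summed similarities give an image-sentence score that is trained with a bidirectional max-margin ranking objective over matched versus mismatched pairs, much like the document-level loss sketched earlier. The snippet below is a simplified, hypothetical version of that scoring function, not the authors' implementation.

# Simplified region-word alignment score (illustrative sketch).
import torch

def alignment_score(region_emb, word_emb):
    """region_emb: (n_regions, d) CNN features of image regions,
    word_emb:   (n_words, d)   BiRNN states of sentence words.
    Each word contributes the similarity of its best-matching region."""
    sims = word_emb @ region_emb.t()            # (n_words, n_regions)
    return sims.max(dim=1).values.sum()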

Exploring the Limits of Weakly Supervised Pretraining

TLDR
This paper presents a unique study of transfer learning with large convolutional networks trained to predict hashtags on billions of social media images and shows improvements on several image classification and object detection tasks, and reports the highest ImageNet-1k single-crop, top-1 accuracy to date.

“Caption” as a Coherence Relation: Evidence and Implications

We study verbs in image–text corpora, contrasting caption corpora, where texts are explicitly written to characterize image content, with depiction corpora, where texts and images may stand in more…

From Word Embeddings To Document Distances

TLDR
It is demonstrated on eight real-world document classification data sets, in comparison with seven state-of-the-art baselines, that the Word Mover's Distance metric leads to unprecedentedly low k-nearest neighbor document classification error rates.
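
Exact WMD solves an optimal-transport problem between the two documents' word distributions; a common cheap lower bound, the relaxed WMD, lets each word travel entirely to its nearest word in the other document. The sketch below shows that relaxation with plain numpy (the embedding lookup is a placeholder dict); gensim's KeyedVectors.wmdistance computes the exact distance if an optimal-transport solver is installed.

# Relaxed Word Mover's Distance: a lower bound on WMD obtained by
# letting each word flow entirely to its nearest counterpart.
# Illustrative sketch; exact WMD solves a full optimal-transport problem.
import numpy as np

def relaxed_wmd(doc_a, doc_b, embeddings):
    """doc_a, doc_b: lists of tokens; embeddings: dict token -> vector."""
    A = np.array([embeddings[w] for w in doc_a if w in embeddings])
    B = np.array([embeddings[w] for w in doc_b if w in embeddings])
    if len(A) == 0 or len(B) == 0:
        return np.inf
    # Pairwise Euclidean distances between the two documents' words.
    dists = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    # Each direction: mean distance from a word to its nearest neighbor
    # in the other document; take the larger of the two directions.
    return max(dists.min(axis=1).mean(), dists.min(axis=0).mean())

Plugging such a distance into a k-nearest-neighbor classifier over documents gives the classification setup the TLDR describes.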

Sort Story: Sorting Jumbled Images and Captions into Stories

TLDR
This work proposes the task of sequencing: given a jumbled set of aligned image-caption pairs that belong to a story, sort them so that the output sequence forms a coherent story. It presents multiple approaches, via unary and pairwise predictions and their ensemble-based combinations, achieving strong results.
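
A generic way to decode such pairwise predictions into an ordering, shown below as an illustrative sketch rather than the paper's exact ensemble, is to rank each item by the total probability that it precedes the others (the matrix name p_before is assumed here):

# Generic sketch: turn pairwise "i comes before j" probabilities into a
# full ordering by sorting items on their total precedence score.
# Illustrative only; the cited paper compares several decoding strategies.
import numpy as np

def order_from_pairwise(p_before):
    """p_before[i, j] = predicted probability that item i precedes item j.
    Returns item indices in predicted story order."""
    scores = p_before.sum(axis=1) - np.diag(p_before)   # ignore self-pairs
    return list(np.argsort(-scores))

# Toy example with 3 jumbled image-caption pairs.
p = np.array([[0.5, 0.9, 0.8],
              [0.1, 0.5, 0.6],
              [0.2, 0.4, 0.5]])
print(order_from_pairwise(p))   # -> [0, 1, 2]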

Let Your Photos Talk: Generating Narrative Paragraph for Photo Stream via Bidirectional Attention Recurrent Neural Networks

TLDR
This paper proposes a novel joint learning model that attends to the discovered semantic relations to produce a sentence sequence consistent with the photo stream, and integrates the two learning components into a single optimization formulation so that the network is trained end-to-end.