Connecting Vision and Language with Localized Narratives

@article{PontTuset2020ConnectingVA,
  title={Connecting Vision and Language with Localized Narratives},
  author={J. Pont-Tuset and J. Uijlings and Soravit Changpinyo and Radu Soricut and V. Ferrari},
  journal={ArXiv},
  year={2020},
  volume={abs/1912.03098}
}

Abstract

We propose Localized Narratives, a new form of multimodal image annotations connecting vision and language. We ask annotators to describe an image with their voice while simultaneously hovering their mouse over the region they are describing. Since the voice and the mouse pointer are synchronized, we can localize every single word in the description. This dense visual grounding takes the form of a mouse trace segment per word and is unique to our data. We annotated 849k images with Localized Narratives.
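
The per-word grounding described in the abstract can be recovered by slicing the synchronized mouse trace by each word's speech time span. The following is a minimal Python sketch under assumed data layouts (per-word start/end timestamps in seconds, trace points as (x, y, t) dictionaries); it illustrates the idea only and is not the released annotation format.

# Illustrative sketch: assign to each word the trace points recorded
# while that word was being spoken. The data layout here is an assumption.
from typing import Dict, List

def trace_segment_per_word(timed_words: List[Dict], trace: List[Dict]) -> List[Dict]:
    """Return, for each word, the mouse-trace points within its time span."""
    grounded = []
    for w in timed_words:
        segment = [p for p in trace if w["start"] <= p["t"] <= w["end"]]
        grounded.append({"word": w["word"], "trace_segment": segment})
    return grounded

# Toy example: two timed words and a three-point trace in normalized image coordinates.
words = [{"word": "a", "start": 0.0, "end": 0.2},
         {"word": "dog", "start": 0.2, "end": 0.6}]
points = [{"x": 0.40, "y": 0.60, "t": 0.1},
          {"x": 0.45, "y": 0.62, "t": 0.3},
          {"x": 0.50, "y": 0.65, "t": 0.5}]
print(trace_segment_per_word(words, points))

Because the voice and the pointer are recorded simultaneously, the timestamps alone provide the link between each word and its trace segment.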

Citations

ArtEmis: Affective Language for Visual Art
Fine-Grained Grounding for Multimodal Speech Recognition
Understanding Guided Image Captioning Performance across Domains
PanGEA: The Panoramic Graph Environment Annotation Toolkit
Human-like Controllable Image Captioning with Verb-specific Semantic Roles
StacMR: Scene-Text Aware Cross-Modal Retrieval
Adversarial Text-to-Image Synthesis: A Review
