A Diagram is Worth a Dozen Images

@article{Kembhavi2016ADI,
  title={A Diagram is Worth a Dozen Images},
  author={Aniruddha Kembhavi and Mike Salvato and Eric Kolve and Minjoon Seo and Hannaneh Hajishirzi and Ali Farhadi},
  journal={ArXiv},
  year={2016},
  volume={abs/1603.07396}
}
Diagrams are common tools for representing complex concepts, relationships and events, often when it would be difficult to portray the same information with natural images. [...] We define syntactic parsing of diagrams as learning to infer DPGs for diagrams, and study semantic interpretation and reasoning of diagrams in the context of diagram question answering. We devise an LSTM-based method for syntactic parsing of diagrams and introduce a DPG-based attention model for diagram question answering.
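The abstract names two components: an LSTM-based parser that infers a Diagram Parse Graph (DPG), and an attention model that weights DPG nodes when answering a question. The paper's code is not shown on this page, so the following is only a minimal PyTorch-style sketch of the second idea, attention over DPG node features conditioned on a question encoding; the class name `DPGAttentionQA`, all tensor dimensions, and the single-layer design are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DPGAttentionQA(nn.Module):
    """Hypothetical sketch: attend over DPG node features given a question.

    Not the authors' model; shapes and structure are assumed for illustration.
    """
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256, node_dim=256, n_answers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.q_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.att = nn.Linear(hidden_dim + node_dim, 1)    # scores each DPG node
        self.out = nn.Linear(hidden_dim + node_dim, n_answers)

    def forward(self, question_ids, node_feats):
        # question_ids: (B, T) token ids; node_feats: (B, N, node_dim) DPG node features
        _, (h, _) = self.q_lstm(self.embed(question_ids))
        q = h[-1]                                          # (B, hidden_dim) question encoding
        q_exp = q.unsqueeze(1).expand(-1, node_feats.size(1), -1)
        scores = self.att(torch.cat([q_exp, node_feats], dim=-1)).squeeze(-1)
        alpha = torch.softmax(scores, dim=1)               # attention weights over DPG nodes
        ctx = (alpha.unsqueeze(-1) * node_feats).sum(dim=1)
        return self.out(torch.cat([q, ctx], dim=-1))       # answer logits
```

The design choice sketched here, pooling graph nodes with question-conditioned attention before classifying, is one straightforward reading of "DPG-based attention"; the paper itself should be consulted for the actual architecture.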
Dynamic Graph Generation Network: Generating Relational Knowledge from Diagrams
TLDR: A unified diagram-parsing network for generating knowledge from diagrams is proposed, combining an object detector with a recurrent neural network designed for graphical structures, building on dynamic memory and graph theory.
RL-CSDia: Representation Learning of Computer Science Diagrams
TLDR: A novel dataset of graphic diagrams, Computer Science Diagrams (CSDia), containing more than 1,200 diagrams with exhaustive annotations of objects and relations, is constructed, and the effectiveness of the proposed Diagram Parsing Net (DPN) for diagram understanding is shown.
Diag2graph: Representing Deep Learning Diagrams In Research Papers As Knowledge Graphs
TLDR: Diag2Graph, an end-to-end framework for parsing deep-learning diagram figures, is introduced; it represents the parsed components as a deep knowledge graph and enables powerful search and retrieval of architectural details in research papers.
Look, Read and Enrich - Learning from Scientific Figures and their Captions
TLDR: This paper investigates what can be learnt by looking at a large number of figures and reading their captions, introduces a figure-caption correspondence learning task that makes use of these observations, and demonstrates the positive impact of the resulting features on other tasks involving scientific text and figures.
Enhancing the AI2 Diagrams Dataset Using Rhetorical Structure Theory
TLDR: The proposed annotation schema is based on Rhetorical Structure Theory (RST), which has previously been used to describe the multimodal structure of diagrams and entire documents; the resulting AI2D-RST dataset supports research on multimodality and artificial intelligence.
DynGraph: Visual Question Answering via Dynamic Scene Graphs
TLDR: This work proposes a structured approach to VQA based on dynamic graphs learned automatically from the input; the model can be trained end-to-end and requires no additional training labels in the form of predefined graphs or relations.
MoQA - A Multi-modal Question Answering Architecture
TLDR: The shortcomings of the model are discussed, and the large gap to human performance is explained by exploring the distribution of the classes of mistakes the model makes.
Visual question answering: A survey of methods and datasets
TLDR: The state of the art is examined by comparing modern approaches to VQA, including the common approach of combining convolutional and recurrent neural networks to map images and questions into a common feature space.
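The "common approach" this survey refers to is easy to state in code: encode the image with a CNN, encode the question with a recurrent network, and fuse the two encodings before classifying over candidate answers. Below is a minimal PyTorch-style sketch of that pattern; the name `JointEmbedVQA`, the elementwise-product fusion, and the use of precomputed CNN features are assumed illustrative choices, not any specific paper's model.

```python
import torch
import torch.nn as nn

class JointEmbedVQA(nn.Module):
    """Minimal CNN+RNN joint-embedding VQA baseline (illustrative sketch)."""
    def __init__(self, vocab_size, n_answers, embed_dim=300, hidden_dim=1024, img_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)  # question encoder
        self.img_proj = nn.Linear(img_dim, hidden_dim)               # project CNN features
        self.classifier = nn.Linear(hidden_dim, n_answers)

    def forward(self, question_ids, img_feats):
        # question_ids: (B, T) token ids; img_feats: (B, img_dim) from a pretrained CNN
        _, (h, _) = self.rnn(self.embed(question_ids))
        q = h[-1]                                   # (B, hidden_dim) question embedding
        v = torch.tanh(self.img_proj(img_feats))    # (B, hidden_dim) image embedding
        return self.classifier(q * v)               # elementwise-product fusion, then classify
```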
Data Interpretation over Plots
TLDR: This work proposes the VOES pipeline and combines it with SAN-VQA to form a hybrid model, SAN-VOES, which reaches an accuracy of 54%, the highest among all the models the authors trained.
Structured Set Matching Networks for One-Shot Part Labeling
TLDR: The Structured Set Matching Network (SSMN), a structured prediction model that incorporates convolutional neural networks, is introduced for the problem of one-shot part labeling: labeling multiple parts of an object in a target image given only a single source image of that category.

References

Showing 1-10 of 61 references
Diagram Understanding in Geometry Questions
TLDR: This paper presents a method for diagram understanding that identifies visual elements in a diagram while maximizing agreement between textual and visual data, and shows that the method's objective function is submodular.
Bringing Semantics into Focus Using Visual Abstraction
TLDR: This paper creates 1,002 sets of 10 semantically similar abstract scenes with corresponding written descriptions and thoroughly analyzes this dataset to discover semantically important features, the relations of words to visual features, and methods for measuring semantic similarity.
Yin and Yang: Balancing and Answering Binary Visual Questions
TLDR: This paper addresses binary Visual Question Answering on abstract scenes as visual verification of the concepts inquired about in the questions, converting each question to a tuple that concisely summarizes the visual concept to be detected in the image.
Learning Common Sense through Visual Abstraction
TLDR: The use of human-generated abstract scenes made from clipart for learning common sense is explored, and it is shown that the commonsense knowledge the authors learn is complementary to what can be learnt from textual sources.
Modeling Biological Processes for Reading Comprehension
TLDR: This paper focuses on a new reading comprehension task that requires complex reasoning over a single document, and demonstrates that answering questions via predicted structures substantially improves accuracy over baselines that use shallower representations.
Solving Geometry Problems: Combining Text and Diagram Interpretation
TLDR: GEOS is introduced, the first automated system to solve unaltered SAT geometry questions by combining text understanding and diagram interpretation, and it is shown that by integrating textual and visual information, GEOS boosts the accuracy of dependency and semantic parsing of the question text.
Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question Answering
TLDR: The mQA model, which is able to answer questions about the content of an image, is presented; it contains four components: a Long Short-Term Memory (LSTM) to extract the question representation, a Convolutional Neural Network (CNN) to extract the visual representation, an LSTM for storing the linguistic context in an answer, and a fusing component to combine the information from the first three components and generate the answer.
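Since this TLDR enumerates mQA's four components, a compact sketch can make the data flow concrete. The following is a minimal PyTorch-style layout of those four parts, not the authors' code; the class name `MiniMQA`, the layer sizes, and the concatenation-based fusion are assumptions for illustration.

```python
import torch
import torch.nn as nn

class MiniMQA(nn.Module):
    """Hypothetical sketch of mQA's four-component layout (not the authors' code)."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, img_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.q_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)  # (1) question LSTM
        self.img_proj = nn.Linear(img_dim, hidden_dim)                  # (2) CNN features, projected
        self.a_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)  # (3) answer-context LSTM
        self.fuse = nn.Linear(3 * hidden_dim, vocab_size)               # (4) fusing component

    def forward(self, question_ids, img_feats, answer_prefix_ids):
        # question_ids: (B, Tq); img_feats: (B, img_dim) from a pretrained CNN;
        # answer_prefix_ids: (B, Ta) answer tokens generated so far.
        _, (hq, _) = self.q_lstm(self.embed(question_ids))
        a_out, _ = self.a_lstm(self.embed(answer_prefix_ids))
        v = self.img_proj(img_feats)                       # (B, hidden_dim)
        # Fuse question, image, and answer context at every answer step.
        fused = torch.cat(
            [hq[-1].unsqueeze(1).expand_as(a_out),
             v.unsqueeze(1).expand_as(a_out),
             a_out], dim=-1)
        return self.fuse(fused)                            # next-token logits, (B, Ta, vocab)
```

At generation time the model would be run step by step, feeding each predicted token back in as part of `answer_prefix_ids`; that generative decoding is what distinguishes this layout from the classification-style baseline sketched earlier.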
Extraction, layout analysis and classification of diagrams in PDF documents
TLDR: Separating a set of bar graphs from non-bar graphs gathered from 20,000 biology research papers gave a classification accuracy of 91.7%.
VQA: Visual Question Answering
We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer.
Visual7W: Grounded Question Answering in Images
TLDR: A semantic link between textual descriptions and image regions is established by object-level grounding, enabling a new type of QA with visual answers in addition to the textual answers used in previous work; a novel LSTM model with spatial attention is proposed to tackle the 7W QA tasks.