A Diagram is Worth a Dozen Images

@article{Kembhavi2016ADI,
  title={A Diagram is Worth a Dozen Images},
  author={Aniruddha Kembhavi and Michael Salvato and Eric Kolve and Minjoon Seo and Hannaneh Hajishirzi and Ali Farhadi},
  journal={ArXiv},
  year={2016},
  volume={abs/1603.07396}
}
Diagrams are common tools for representing complex concepts, relationships and events, often when it would be difficult to portray the same information with natural images. […] Key Method: We define syntactic parsing of diagrams as learning to infer Diagram Parse Graphs (DPGs) for diagrams, and we study semantic interpretation and reasoning of diagrams in the context of diagram question answering. We devise an LSTM-based method for syntactic parsing of diagrams and introduce a DPG-based attention model for diagram question answering.
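Since the abstract only names the two components (an LSTM-based parser that infers DPGs, and a DPG-based attention model for question answering), the following is a minimal, hypothetical sketch of what attending over DPG relations when scoring answer choices could look like. All class names, dimensions, and the encoding scheme (tokenized DPG relations matched against question-plus-choice statements) are illustrative assumptions, not the authors' released implementation.

# Hypothetical sketch of DPG-based attention for diagram QA.
# Assumption: each DPG relation and each question+answer-choice
# statement is a token sequence; choices are scored by how well
# relation evidence supports them. Not the paper's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DPGAttentionQA(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 100, hidden_dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # One LSTM encodes tokenized DPG relations; another encodes
        # question + answer-choice statements (shared sizes assumed).
        self.rel_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.stmt_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def encode(self, lstm: nn.LSTM, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len) -> final hidden state: (batch, hidden_dim)
        _, (h, _) = lstm(self.embed(tokens))
        return h[-1]

    def forward(self, relations: torch.Tensor, statements: torch.Tensor) -> torch.Tensor:
        # relations:  (num_relations, seq_len) tokenized DPG relations of one diagram
        # statements: (num_choices, seq_len) question combined with each answer choice
        r = F.normalize(self.encode(self.rel_lstm, relations), dim=-1)    # (R, H)
        s = F.normalize(self.encode(self.stmt_lstm, statements), dim=-1)  # (C, H)
        att = torch.softmax(s @ r.t(), dim=-1)  # attention of each choice over relations
        support = att @ r                       # relation evidence per choice: (C, H)
        return (s * support).sum(-1)            # higher score = better-supported choice

model = DPGAttentionQA(vocab_size=1000)
scores = model(torch.randint(0, 1000, (6, 12)), torch.randint(0, 1000, (4, 20)))
print(scores.argmax().item())  # index of the predicted answer choice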

Citations

Computer Science Diagram Understanding with Topology Parsing
TLDR
This paper constructs the first dataset of geometric-type diagrams in the Computer Science field, which feature more abstract expressions and complex logical relations, and proposes the Diagram Parsing Net (DPN), which focuses on analyzing the topological structure and textual information of diagrams.
Dynamic Graph Generation Network: Generating Relational Knowledge from Diagrams
TLDR
A unified diagram-parsing network for generating knowledge from diagrams is proposed, built on an object detector and a recurrent neural network designed for graphical structures and drawing on dynamic memory and graph theory.
RL-CSDia: Representation Learning of Computer Science Diagrams
TLDR
A novel dataset of graphic diagrams named Computer Science Diagrams (CSDia), containing more than 1,200 diagrams with exhaustive annotations of objects and relations, is constructed, and the effectiveness of the proposed Diagram Parsing Net (DPN) on diagram understanding is shown.
IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning
TLDR
A new challenge of Icon Question Answering (IconQA), with the goal of answering a question in an icon-image context, is introduced, and a strong IconQA baseline, Patch-TRM, is developed that applies a pyramid cross-modal Transformer with input diagram embeddings pre-trained on the icon dataset.
Diag2graph: Representing Deep Learning Diagrams In Research Papers As Knowledge Graphs
TLDR
Diag2Graph, an end-to-end framework for parsing deep-learning diagram figures, is introduced; the parsed content is represented in the form of a deep knowledge graph, enabling powerful search and retrieval of architectural details in research papers.
Hierarchical Multi-Task Learning for Diagram Question Answering with Multi-Modal Transformer
TLDR
This paper proposes a novel structural parsing-integrated Hierarchical Multi-Task Learning (HMTL) model for diagram question answering based on a multi-modal transformer framework and demonstrates the effectiveness of the proposed HMTL over other state-of-the-art methods.
Look, Read and Enrich - Learning from Scientific Figures and their Captions
TLDR
This paper investigates what can be learnt by looking at a large number of figures and reading their captions, introduces a figure-caption correspondence learning task that builds on these observations, and demonstrates the positive impact of the resulting features on other tasks involving scientific text and figures.
Enhancing the AI2 Diagrams Dataset Using Rhetorical Structure Theory
TLDR
The proposed annotation schema is based on Rhetorical Structure Theory (RST), which has previously been used to describe the multimodal structure of diagrams and entire documents, and the resulting AI2D-RST dataset is intended for research on multimodality and artificial intelligence.
DynGraph: Visual Question Answering via Dynamic Scene Graphs
TLDR
This work proposes a structured approach to VQA based on dynamic graphs learned automatically from the input; the model can be trained end-to-end and requires no additional training labels in the form of predefined graphs or relations.

References

Showing 1-10 of 61 references
Diagram Understanding in Geometry Questions
TLDR
This paper presents a method for diagram understanding that identifies visual elements in a diagram while maximizing agreement between textual and visual data, and shows that the method's objective function is submodular.
Bringing Semantics into Focus Using Visual Abstraction
TLDR
This paper creates 1,002 sets of 10 semantically similar abstract scenes with corresponding written descriptions and thoroughly analyzes this dataset to discover semantically important features, the relations of words to visual features and methods for measuring semantic similarity.
Yin and Yang: Balancing and Answering Binary Visual Questions
TLDR
This paper addresses binary Visual Question Answering on abstract scenes as visual verification of concepts inquired about in the questions, by converting the question to a tuple that concisely summarizes the visual concept to be detected in the image.
Learning Common Sense through Visual Abstraction
TLDR
The use of human-generated abstract scenes made from clipart for learning common sense is explored, and it is shown that the commonsense knowledge learned in this way is complementary to what can be learnt from textual sources.
Modeling Biological Processes for Reading Comprehension
TLDR
This paper focuses on a new reading comprehension task that requires complex reasoning over a single document, and demonstrates that answering questions via predicted structures substantially improves accuracy over baselines that use shallower representations.
Solving Geometry Problems: Combining Text and Diagram Interpretation
TLDR
GEOS is introduced, the first automated system to solve unaltered SAT geometry questions by combining text understanding and diagram interpretation, and it is shown that by integrating textual and visual information, GEOS boosts the accuracy of dependency and semantic parsing of the question text.
Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question
TLDR
The mQA model, which can answer questions about the content of an image, is presented; it contains four components: a Long Short-Term Memory (LSTM) to encode the question, a Convolutional Neural Network (CNN) to extract the visual representation, an LSTM for storing the linguistic context in an answer, and a fusing component that combines the information from the first three components and generates the answer (a minimal sketch of this four-component design follows this reference list).
Extraction, Layout Analysis and Classification of Diagrams in PDF Documents
TLDR
Separating a set of bar graphs from non-bar-graphs gathered from 20,000 biology research papers gave a classification accuracy of 91.7%.
VQA: Visual Question Answering
We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer.
Visual7W: Grounded Question Answering in Images
TLDR
A semantic link between textual descriptions and image regions is established through object-level grounding, enabling a new type of QA with visual answers in addition to the textual answers used in previous work, and a novel LSTM model with spatial attention is proposed to tackle the 7W QA tasks.
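As promised above, here is a minimal, hypothetical sketch of the four-component mQA-style design summarized in the Multilingual Image Question reference (question LSTM, image CNN, answer-context LSTM, fusion layer). The component sizes and the tiny CNN are illustrative assumptions, not the authors' released architecture.

# Hypothetical sketch of a four-component mQA-style answer generator.
# The fusion layer combines question, image, and answer-prefix context
# to predict the next answer word. Sizes are illustrative assumptions.
import torch
import torch.nn as nn

class MQASketch(nn.Module):
    def __init__(self, vocab_size: int = 1000, hidden: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.q_lstm = nn.LSTM(hidden, hidden, batch_first=True)  # 1) question encoder
        self.cnn = nn.Sequential(                                 # 2) image feature extractor
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, hidden))
        self.a_lstm = nn.LSTM(hidden, hidden, batch_first=True)  # 3) answer-context LSTM
        self.fuse = nn.Linear(3 * hidden, vocab_size)             # 4) fusion -> word logits

    def forward(self, question, image, answer_prefix):
        _, (hq, _) = self.q_lstm(self.embed(question))            # (1, B, H)
        img = self.cnn(image)                                     # (B, H)
        out, _ = self.a_lstm(self.embed(answer_prefix))           # (B, T, H)
        T = out.size(1)
        ctx = torch.cat([hq[-1].unsqueeze(1).expand(-1, T, -1),
                         img.unsqueeze(1).expand(-1, T, -1), out], dim=-1)
        return self.fuse(ctx)                                     # logits per answer position

model = MQASketch()
logits = model(torch.randint(0, 1000, (2, 8)),   # tokenized questions
               torch.rand(2, 3, 64, 64),         # images
               torch.randint(0, 1000, (2, 5)))   # answer prefixes
print(logits.shape)  # torch.Size([2, 5, 1000])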