Learning to Reason: End-to-End Module Networks for Visual Question Answering

@inproceedings{Hu2017LearningTR,
  title={Learning to Reason: End-to-End Module Networks for Visual Question Answering},
  author={Ronghang Hu and Jacob Andreas and Marcus Rohrbach and Trevor Darrell and Kate Saenko},
  booktitle={2017 IEEE International Conference on Computer Vision (ICCV)},
  year={2017},
  pages={804--813}
}
Natural language questions are inherently compositional, and many are most easily answered by reasoning about their decomposition into modular sub-problems. For example, to answer “is there an equal number of balls and boxes?” we can look for balls, look for boxes, count them, and compare the results. The recently proposed Neural Module Network (NMN) architecture [3, 2] implements this approach to question answering by parsing questions into linguistic substructures and assembling question… 
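The decomposition described above can be sketched as a few toy modules composed into a layout. This is only a minimal illustration of the idea; the `find`/`count`/`compare_equal` names and the list-of-labels scene representation are assumptions for this sketch, not the paper's actual implementation (which uses learned neural modules over image features):

```python
# Toy sketch of NMN-style modular reasoning for the question
# "is there an equal number of balls and boxes?".
# Module names and the scene representation are illustrative assumptions,
# not the paper's implementation.

def find(objects, label):
    """Attention module: a (hard) mask over objects matching a label.
    In the real model this is a soft attention map over image regions."""
    return [1.0 if obj == label else 0.0 for obj in objects]

def count(attention):
    """Count module: sum the attention weights to estimate a count."""
    return sum(attention)

def compare_equal(a, b):
    """Answer module: compare two counts."""
    return "yes" if a == b else "no"

# The predicted layout composes the modules end to end.
scene = ["ball", "box", "ball", "tree", "box"]
answer = compare_equal(count(find(scene, "ball")),
                       count(find(scene, "box")))
print(answer)  # -> yes (two balls, two boxes)
```

In the paper, the layout itself is predicted from the question by a learned parser, so the same small inventory of modules can be rewired per question.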


Neural Module Networks for Reasoning over Text
TLDR
This work extends Neural module networks by introducing modules that reason over a paragraph of text, performing symbolic reasoning over numbers and dates in a probabilistic and differentiable manner, and proposing an unsupervised auxiliary loss to help extract arguments associated with the events in text.
Self-Adaptive Neural Module Transformer for Visual Question Answering
TLDR
A novel NMN called the Self-Adaptive Neural Module Transformer (SANMT) is presented, which adaptively adjusts both the question feature encoding and the layout decoding by considering intermediate Q&A results, and encodes those intermediate results together with the given question features via a novel transformer module to generate a dynamic question feature embedding that evolves over reasoning steps.
Linguistically Driven Graph Capsule Network for Visual Question Reasoning
TLDR
This work proposes a hierarchical compositional reasoning model called the "Linguistically driven Graph Capsule Network", where the compositional process is guided by the linguistic parse tree, inspired by the property of a capsule network that can carve a tree structure inside a regular convolutional neural network (CNN).
Interpretable Visual Question Answering by Reasoning on Dependency Trees
TLDR
A novel neural network model that performs global reasoning on a dependency tree parsed from the question and is capable of building an interpretable visual question answering (VQA) system that gradually derives image cues following question-driven parse-tree reasoning.
LRTA: A Transparent Neural-Symbolic Reasoning Framework with Modular Supervision for Visual Question Answering
TLDR
This work proposes LRTA (Look, Read, Think, Answer), a transparent neural-symbolic reasoning framework for visual question answering that solves the problem step by step, as humans do, and provides a human-readable justification at each step.
Linguistically Routing Capsule Network for Out-of-distribution Visual Question Answering
TLDR
The proposed routing method improves current VQA models on the out-of-distribution (OOD) split without losing performance on the in-domain test data.
Explainability by Parsing: Neural Module Tree Networks for Natural Language Visual Grounding
TLDR
A novel modular network called the Neural Module Tree network (NMTree) is developed, which regularizes visual grounding along the dependency parse tree of the sentence; each node is a module network that calculates or accumulates the grounding score in a bottom-up direction.
Break It Down: A Question Understanding Benchmark
TLDR
This work introduces a Question Decomposition Meaning Representation (QDMR) for questions, and demonstrates the utility of QDMR by showing that it can be used to improve open-domain question answering on the HotpotQA dataset, and can be deterministically converted to a pseudo-SQL formal language, which can alleviate annotation in semantic parsing applications.
Auto-Parsing Network for Image Captioning and Visual Question Answering
TLDR
A Probabilistic Graphical Model (PGM), parameterized by the attention operations on each self-attention layer, is imposed to incorporate a sparsity assumption, and a PGM probability-based parsing algorithm is developed that can discover the hidden structure of the input during inference.
Learning to Reason on Tree Structures for Knowledge-Based Visual Question Answering
TLDR
This paper proposes a novel reasoning model, a question-guided tree structure with a knowledge base (QGTSKB), to address collaborative reasoning in knowledge-based visual question answering, and achieves superior performance over existing methods on the VQA v2.0 and CLEVR datasets.
...

References

SHOWING 1-10 OF 32 REFERENCES
Neural Module Networks
TLDR
A procedure for constructing and learning neural module networks, which compose collections of jointly-trained neural "modules" into deep networks for question answering, and uses linguistic structures to dynamically instantiate modular networks (with reusable components for recognizing dogs, classifying colors, etc.).
Learning to Compose Neural Networks for Question Answering
TLDR
A question answering model that applies to both images and structured knowledge bases that uses natural language strings to automatically assemble neural networks from a collection of composable modules that achieves state-of-the-art results on benchmark datasets.
Dynamic Memory Networks for Visual and Textual Question Answering
TLDR
The new DMN+ model improves the state of the art on both the Visual Question Answering dataset and the bAbI-10k text question-answering dataset without supporting-fact supervision.
Revisiting Visual Question Answering Baselines
TLDR
The results suggest that a key problem of current VQA systems lies in the lack of visual grounding and localization of concepts that occur in the questions and answers, and a simple alternative model based on binary classification is developed.
VQA: Visual Question Answering
We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer…
Stacked Attention Networks for Image Question Answering
TLDR
A multiple-layer SAN is developed in which an image is queried multiple times to infer the answer progressively, with the SAN locating the relevant visual clues that lead to the answer layer by layer.
CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning
TLDR
This work presents a diagnostic dataset that tests a range of visual reasoning abilities and uses this dataset to analyze a variety of modern visual reasoning systems, providing novel insights into their abilities and limitations.
Modeling Relationships in Referential Expressions with Compositional Modular Networks
TLDR
This paper presents a modular deep architecture capable of analyzing referential expressions into their component parts, identifying entities and relationships mentioned in the input expression and grounding them all in the scene.
Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank
TLDR
This work introduces a Sentiment Treebank that includes fine-grained sentiment labels for 215,154 phrases in the parse trees of 11,855 sentences and presents new challenges for sentiment compositionality, and proposes the Recursive Neural Tensor Network to address them.
Hierarchical Co-Attention for Visual Question Answering
TLDR
This paper presents a novel co-attention model for VQA that jointly reasons about image and question attention via a novel 1-dimensional convolutional neural network (CNN) model and outperforms all reported methods.
...