Are You Smarter Than a Sixth Grader? Textbook Question Answering for Multimodal Machine Comprehension

Aniruddha Kembhavi, Minjoon Seo, Dustin Schwenk, Jonghyun Choi, Ali Farhadi, Hannaneh Hajishirzi
2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

We introduce the task of Multi-Modal Machine Comprehension (M3C), which aims at answering multimodal questions given a context of text, diagrams and images. [...] We extend state-of-the-art methods for textual machine comprehension and visual question answering to the TQA dataset. Our experiments show that these models do not perform well on TQA. The presented dataset opens new challenges for research in question answering and reasoning across multiple modalities.

MoQA - A Multi-modal Question Answering Architecture

The paper discusses the model's shortcomings and explains the large gap to human performance by examining the distribution of the classes of mistakes that the model makes.

Towards Solving Multimodal Comprehension

This paper evaluates M3C using a textual cloze style question-answering task and highlights an inherent bias in the question answer generation method that enables a naive baseline to cheat by learning from only answer choices, and proposes an algorithm capable of modifying the given dataset to remove the bias elements.

Looking Beyond the Surface: A Challenge Set for Reading Comprehension over Multiple Sentences

The dataset is the first to study multi-sentence inference at scale, with an open-ended set of question types that requires reasoning skills, and finds human solvers to achieve an F1-score of 88.1%.

Textbook Question Answering Under Instructor Guidance with Memory Networks

This work proposes a novel approach of Instructor Guidance with Memory Networks (IGMN) which conducts the TQA task by finding contradictions between the candidate answers and their corresponding context, and builds the Contradiction Entity-Relationship Graph (CERG) to extend the passage-level multi-modal contradictions to an essay level.

XTQA: Span-Level Explanations of the Textbook Question Answering

A novel architecture for span-level eXplanations of TQA (XTQA) is devised, based on a proposed coarse-to-fine-grained algorithm, which provides students not only with the answers but also with the span-level evidence for choosing them.

RecipeQA: A Challenge Dataset for Multimodal Comprehension of Cooking Recipes

This work introduces RecipeQA, a dataset for multimodal comprehension of cooking recipes, a set of comprehension and reasoning tasks that require joint understanding of images and text, capturing the temporal flow of events and making sense of procedural knowledge.

ISAAQ - Mastering Textbook Questions with Pre-trained Transformers and Bottom-Up and Top-Down Attention

This paper taps on the potential of transformer language models and bottom-up and top-down attention to tackle the language and visual understanding challenges this task entails, and relies on pre-trained transformers, fine-tuning and ensembling.

Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering

This work designs language models that learn to generate lectures and explanations as a chain of thought (CoT), mimicking the multi-hop reasoning process when answering ScienceQA questions; it also explores the upper bound of GPT-3 and shows that CoT helps language models learn from less data.

MoCA: Incorporating Multi-stage Domain Pretraining and Cross-guided Multimodal Attention for Textbook Question Answering

A novel model named MoCA is proposed, which incorporates multi-stage domain pretraining and multimodal cross attention for the TQA task, together with a heuristic generation algorithm to employ the terminology corpus.

Relation-Aware Fine-Grained Reasoning Network for Textbook Question Answering

A relation-aware fine-grained reasoning (RAFR) network is proposed that performs fine-grained reasoning over the nodes of relation-based diagram graphs and applies graph attention networks to learn diagram representations.



MCTest: A Challenge Dataset for the Open-Domain Machine Comprehension of Text

MCTest is presented, a freely available set of stories and associated questions intended for research on the machine comprehension of text that requires machines to answer multiple-choice reading comprehension questions about fictional stories, directly tackling the high-level goal of open-domain machine comprehension.

Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks

This work argues for the usefulness of a set of proxy tasks that evaluate reading comprehension via question answering, and classifies these tasks into skill sets so that researchers can identify (and then rectify) the failings of their systems.

SQuAD: 100,000+ Questions for Machine Comprehension of Text

A strong logistic regression model is built, which achieves an F1 score of 51.0%, a significant improvement over a simple baseline (20%).

Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question

The mQA model, which answers questions about the content of an image, is presented. It contains four components: a Long Short-Term Memory (LSTM), a Convolutional Neural Network (CNN), an LSTM for storing the linguistic context in an answer, and a fusing component that combines the information from the first three components and generates the answer.

VQA: Visual Question Answering

We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language

Dynamic Memory Networks for Visual and Textual Question Answering

The new DMN+ model improves the state of the art on both the Visual Question Answering dataset and the bAbI-10k text question-answering dataset without supporting fact supervision.

Bidirectional Attention Flow for Machine Comprehension

The BiDAF network is introduced, a multi-stage hierarchical process that represents the context at different levels of granularity and uses a bi-directional attention flow mechanism to obtain a query-aware context representation without early summarization.

A Thorough Examination of the CNN/Daily Mail Reading Comprehension Task

A thorough examination of this new reading comprehension task is presented, in which over a million training examples were created by pairing CNN and Daily Mail news articles with their summarized bullet points, showing that a neural network can be trained to give good performance on this task.

Ask Your Neurons: A Neural-Based Approach to Answering Questions about Images

We address a question answering task on real-world images that is set up as a Visual Turing Test. By combining latest advances in image representation and natural language processing, we propose

Visual7W: Grounded Question Answering in Images

This work establishes a semantic link between textual descriptions and image regions by object-level grounding, which enables a new type of QA with visual answers in addition to the textual answers used in previous work, and proposes a novel LSTM model with spatial attention to tackle the 7W QA tasks.