Are You Smarter Than a Sixth Grader? Textbook Question Answering for Multimodal Machine Comprehension

@article{Kembhavi2017AreYS,
  title={Are You Smarter Than a Sixth Grader? Textbook Question Answering for Multimodal Machine Comprehension},
  author={Aniruddha Kembhavi and Minjoon Seo and Dustin Schwenk and Jonghyun Choi and Ali Farhadi and Hannaneh Hajishirzi},
  journal={2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2017},
  pages={5376-5384}
}
We introduce the task of Multi-Modal Machine Comprehension (M3C), which aims at answering multimodal questions given a context of text, diagrams and images. [...] Key Method: We extend state-of-the-art methods for textual machine comprehension and visual question answering to the TQA dataset. Our experiments show that these models do not perform well on TQA. The presented dataset opens new challenges for research in question answering and reasoning across multiple modalities.
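To make the multiple-choice setup concrete, here is a minimal sketch of an M3C-style scorer in PyTorch: encode the textual context, fuse it with pooled image/diagram features, and score each candidate answer against the fused context. Every module name and dimension below is an illustrative assumption, not the architecture from the paper.

```python
# Minimal sketch of a multiple-choice M3C scorer: score each candidate
# answer against an encoded multimodal context and pick the argmax.
# All names and sizes are illustrative, not from the paper.
import torch
import torch.nn as nn

class M3CScorer(nn.Module):
    def __init__(self, vocab_size: int, dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.text_enc = nn.LSTM(dim, dim, batch_first=True)
        self.img_proj = nn.Linear(2048, dim)   # e.g. pooled CNN features
        self.score = nn.Bilinear(dim, dim, 1)  # context vs. answer

    def encode_text(self, ids: torch.Tensor) -> torch.Tensor:
        _, (h, _) = self.text_enc(self.embed(ids))
        return h[-1]                            # (batch, dim)

    def forward(self, context_ids, image_feats, answer_ids_list):
        # Fuse textual context with image/diagram features by summation.
        ctx = self.encode_text(context_ids) + self.img_proj(image_feats)
        scores = [self.score(ctx, self.encode_text(a)) for a in answer_ids_list]
        return torch.cat(scores, dim=1)         # (batch, n_choices)
```

Given the returned logits, `logits.argmax(dim=1)` yields the predicted choice index; training would use a standard cross-entropy loss over the choices.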
MoQA - A Multi-modal Question Answering Architecture
TLDR
The shortcomings of the model are discussed, and the large gap to human performance is explained by exploring the distribution of the classes of mistakes the model makes.
Towards Solving Multimodal Comprehension
TLDR
This paper evaluates M3C using a textual cloze-style question-answering task and highlights an inherent bias in the question-answer generation method that enables a naive baseline to cheat by learning from the answer choices alone; it proposes an algorithm that modifies the given dataset to remove these bias elements.
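The answer-only "cheating" baseline described above can be approximated with a simple probe: train a classifier that never sees the question or context, only the answer choices. The sketch below uses scikit-learn; the feature representation and model are assumptions for illustration, not the authors' algorithm.

```python
# Hedged sketch of an answer-only bias probe: a classifier that sees only
# the answer choices. Accuracy well above chance signals dataset bias.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def answer_only_baseline(train_choices, train_labels, test_choices, test_labels):
    # Each example's candidate answers are flattened into one string.
    vec = TfidfVectorizer()
    X_train = vec.fit_transform([" | ".join(c) for c in train_choices])
    X_test = vec.transform([" | ".join(c) for c in test_choices])
    clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)
    # Compare the returned accuracy against chance, i.e. 1 / n_choices.
    return clf.score(X_test, test_labels)
```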
Looking Beyond the Surface: A Challenge Set for Reading Comprehension over Multiple Sentences
TLDR
The dataset is the first to study multi-sentence inference at scale, with an open-ended set of question types that requires reasoning skills, and finds human solvers to achieve an F1-score of 88.1%.
Textbook Question Answering Under Instructor Guidance with Memory Networks
TLDR
This work proposes a novel approach, Instructor Guidance with Memory Networks (IGMN), which addresses the TQA task by finding contradictions between the candidate answers and their corresponding context, and builds a Contradiction Entity-Relationship Graph (CERG) to extend passage-level multi-modal contradictions to the essay level.
RecipeQA: A Challenge Dataset for Multimodal Comprehension of Cooking Recipes
Understanding and reasoning about cooking recipes is a fruitful research direction towards enabling machines to interpret procedural text. In this work, we introduce RecipeQA, a dataset for multimodal comprehension of cooking recipes.
ISAAQ - Mastering Textbook Questions with Pre-trained Transformers and Bottom-Up and Top-Down Attention
TLDR
This paper taps the potential of transformer language models and bottom-up and top-down attention to tackle the language and visual understanding challenges the task entails, relying on pre-trained transformers, fine-tuning, and ensembling.
Hierarchical Multi-Task Learning for Diagram Question Answering with Multi-Modal Transformer
Zhaoquan Yuan, Xiao Peng, Xiao Wu, Changsheng Xu. ACM Multimedia, 2021.
TLDR
This paper proposes a novel structural parsing-integrated Hierarchical Multi-Task Learning (HMTL) model for diagram question answering based on a multi-modal transformer framework and demonstrates the effectiveness of the proposed HMTL over other state-of-the-art methods.
MoCA: Incorporating Multi-stage Domain Pretraining and Cross-guided Multimodal Attention for Textbook Question Answering
TLDR
A novel model named MoCA is proposed, which incorporates multi-stage domain pretraining and multimodal cross attention for the TQA task and proposes a heuristic generation algorithm to employ the terminology corpus.
Textbook Question Answering with Multi-modal Context Graph Understanding and Self-supervised Open-set Comprehension
TLDR
A novel algorithm for solving the textbook question answering (TQA) task is introduced which describes more realistic QA problems compared to other recent tasks and a novel self-supervised open-set learning process without any annotations is introduced.
Answering Questions about Data Visualizations using Efficient Bimodal Fusion
TLDR
This work proposes a novel chart question answering (CQA) algorithm called parallel recurrent fusion of image and language (PReFIL), which first learns bimodal embeddings by fusing question and image features and then intelligently aggregates these learned embeddings to answer the given question.
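The fuse-then-aggregate idea can be sketched as follows: tile the question embedding over the image feature map, fuse with pointwise convolutions, then aggregate the fused cells with a recurrent layer. The layer sizes and choices below are assumptions for illustration, not the published PReFIL configuration.

```python
# Rough sketch of bimodal fuse-then-aggregate for chart QA. Dimensions
# and layers are assumptions, not the published PReFIL architecture.
import torch
import torch.nn as nn

class FuseAggregate(nn.Module):
    def __init__(self, q_dim=1024, img_ch=256, fused=512, n_answers=100):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(img_ch + q_dim, fused, kernel_size=1), nn.ReLU(),
            nn.Conv2d(fused, fused, kernel_size=1), nn.ReLU(),
        )
        self.aggregate = nn.GRU(fused, fused, batch_first=True,
                                bidirectional=True)
        self.classify = nn.Linear(2 * fused, n_answers)

    def forward(self, q, img):                  # q: (B, Q), img: (B, C, H, W)
        B, _, H, W = img.shape
        # Tile the question vector over every spatial location.
        q_map = q[:, :, None, None].expand(B, q.size(1), H, W)
        fused = self.fuse(torch.cat([img, q_map], dim=1))   # (B, F, H, W)
        cells = fused.flatten(2).transpose(1, 2)            # (B, H*W, F)
        _, h = self.aggregate(cells)                        # (2, B, F)
        return self.classify(torch.cat([h[0], h[1]], dim=1))
```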

References

MCTest: A Challenge Dataset for the Open-Domain Machine Comprehension of Text
TLDR
MCTest is presented, a freely available set of stories and associated questions intended for research on the machine comprehension of text that requires machines to answer multiple-choice reading comprehension questions about fictional stories, directly tackling the high-level goal of open-domain machine comprehension.
Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks
TLDR
This work argues for the usefulness of a set of proxy tasks that evaluate reading comprehension via question answering, and classify these tasks into skill sets so that researchers can identify (and then rectify) the failings of their systems.
SQuAD: 100,000+ Questions for Machine Comprehension of Text
TLDR
A strong logistic regression model is built that achieves an F1 score of 51.0%, a significant improvement over a simple baseline (20%) but well below human performance.
Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question Answering
TLDR
The mQA model, which answers questions about the content of an image, is presented; it contains four components: a Long Short-Term Memory (LSTM) that encodes the question, a Convolutional Neural Network (CNN) that extracts the image representation, an LSTM for storing the linguistic context of the answer, and a fusing component that combines the information from the first three components to generate the answer.
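A rough sketch of that four-component layout follows, with illustrative sizes and wiring; the original model's details (pretrained CNN, weight sharing between embedding layers) are simplified away here.

```python
# Sketch of the four-component mQA layout from the summary: a question
# LSTM, a CNN image encoder (its pooled output assumed precomputed), an
# answer-prefix LSTM, and a fusion layer predicting the next answer word.
import torch
import torch.nn as nn

class MQASketch(nn.Module):
    def __init__(self, vocab, dim=512, img_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.q_lstm = nn.LSTM(dim, dim, batch_first=True)  # component 1
        self.img_fc = nn.Linear(img_dim, dim)              # component 2 (CNN out)
        self.a_lstm = nn.LSTM(dim, dim, batch_first=True)  # component 3
        self.fuse = nn.Linear(3 * dim, vocab)              # component 4

    def forward(self, q_ids, img_feat, a_prefix_ids):
        _, (hq, _) = self.q_lstm(self.embed(q_ids))
        out_a, _ = self.a_lstm(self.embed(a_prefix_ids))
        img = self.img_fc(img_feat)
        # Predict the next answer word at each step from all three signals.
        T = out_a.size(1)
        ctx = torch.cat([hq[-1].unsqueeze(1).expand(-1, T, -1),
                         img.unsqueeze(1).expand(-1, T, -1),
                         out_a], dim=2)
        return self.fuse(ctx)                              # (B, T, vocab)
```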
VQA: Visual Question Answering
We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer.
Dynamic Memory Networks for Visual and Textual Question Answering
TLDR
The new DMN+ model improves the state of the art on both the Visual Question Answering (VQA) dataset and the bAbI-10k text question-answering dataset without supporting-fact supervision.
Bidirectional Attention Flow for Machine Comprehension
TLDR
The BiDAF network is introduced: a multi-stage hierarchical process that represents the context at different levels of granularity and uses a bi-directional attention flow mechanism to obtain a query-aware context representation without early summarization.
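In simplified form, the bidirectional attention step can be written as a single function: a similarity matrix between context and query drives context-to-query and query-to-context attention, and the results are concatenated with the context rather than summarized early. The dot-product similarity below is a simplification of BiDAF's trainable similarity function.

```python
# Simplified bidirectional attention between a context H and a query U.
# Returns the query-aware context representation [H; c2q; H*c2q; H*q2c].
import torch

def bidaf_attention(H: torch.Tensor, U: torch.Tensor) -> torch.Tensor:
    # H: context (B, T, d), U: query (B, J, d)
    S = torch.bmm(H, U.transpose(1, 2))          # similarity (B, T, J)
    # Context-to-query: each context word attends over query words.
    c2q = torch.bmm(S.softmax(dim=2), U)         # (B, T, d)
    # Query-to-context: weight context words by their best query match.
    b = S.max(dim=2).values.softmax(dim=1)       # (B, T)
    q2c = torch.bmm(b.unsqueeze(1), H)           # (B, 1, d)
    q2c = q2c.expand(-1, H.size(1), -1)          # tile over T
    return torch.cat([H, c2q, H * c2q, H * q2c], dim=2)  # (B, T, 4d)
```

Note that the output keeps a vector per context position, which is what lets downstream layers locate an answer span without early summarization.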
MovieQA: Understanding Stories in Movies through Question-Answering
TLDR
The MovieQA dataset, which aims to evaluate automatic story comprehension from both video and text, is introduced and existing QA techniques are extended to show that question-answering with such open-ended semantics is hard.
A Thorough Examination of the CNN/Daily Mail Reading Comprehension Task
TLDR
A thorough examination of this reading comprehension task is conducted by creating over a million training examples that pair CNN and Daily Mail news articles with their summarized bullet points, and by showing that a neural network can be trained to give good performance on the task.
Ask Your Neurons: A Neural-Based Approach to Answering Questions about Images
We address a question answering task on real-world images that is set up as a Visual Turing Test. By combining latest advances in image representation and natural language processing, we propose Neural-Image-QA, an end-to-end formulation to this problem for which all parts are trained jointly.