Corpus ID: 233219849

MultiModalQA: Complex Question Answering over Text, Tables and Images

@article{Talmor2021MultiModalQACQ,
  title={MultiModalQA: Complex Question Answering over Text, Tables and Images},
  author={Alon Talmor and Ori Yoran and Amnon Catav and Daniel Lahav and Yizhong Wang and Akari Asai and Gabriel Ilharco and Hannaneh Hajishirzi and Jonathan Berant},
  journal={ArXiv},
  year={2021},
  volume={abs/2104.06039}
}
When answering complex questions, people can seamlessly combine information from visual, textual and tabular sources. While interest in models that reason over multiple pieces of evidence has surged in recent years, there has been relatively little work on question answering models that reason across multiple modalities. In this paper, we present MULTIMODALQA (MMQA): a challenging question answering dataset that requires joint reasoning over text, tables and images. We create MMQA using a new… 
MuMuQA: Multimedia Multi-Hop News Question Answering via Cross-Media Knowledge Extraction and Grounding
TLDR
A new QA evaluation benchmark with 1,384 questions over news articles that require cross-media grounding of objects in images onto text is presented, along with a novel multimedia data augmentation framework, based on cross-media knowledge extraction and synthetic question-answer generation, that automatically augments data to provide weak supervision for this task.
Multi-Instance Training for Question Answering Across Table and Linked Text
TLDR
MITQA, a new TextTableQA system that explicitly models the different but closely related probability spaces of table row selection and text span selection, is proposed, achieving a 21% absolute improvement on both EM and F1 scores over previously published results.
FeTaQA: Free-form Table Question Answering
TLDR
This work introduces FeTaQA, a new dataset with 10K Wikipedia-based pairs that yields a more challenging table question answering setting because it requires generating free-form text answers after retrieval, inference, and integration of multiple discontinuous facts from a structured knowledge source.
WebQA: Multihop and Multimodal QA
TLDR
This work shows that existing multi-modal transformers and visual representations do not perform well on open-domain visual queries and proposes to bridge this gap between the natural language and computer vision communities with WebQA.
Multi-modal Retrieval of Tables and Texts Using Tri-encoder Models
TLDR
This paper creates a new multi-modal dataset based on text and table datasets from related work and compares the retrieval performance of different encoding schemata, finding that dense vector embeddings of transformer models outperform sparse embeddings on four out of six evaluation datasets.
SituatedQA: Incorporating Extra-Linguistic Contexts into QA
TLDR
This study introduces SituatedQA, an open-retrieval QA dataset where systems must produce the correct answer to a question given its temporal or geographical context, and shows that existing models struggle with producing answers that are frequently updated or from uncommon locations.
Turning Tables: Generating Examples from Semi-structured Tables for Endowing Language Models with Reasoning Skills
TLDR
This work proposes to leverage semi-structured tables to automatically generate question-paragraph pairs at scale, where answering the question requires reasoning over multiple facts in the paragraph, and adds a pre-training step over this synthetic data, which includes examples that require 16 different reasoning skills.
QA Dataset Explosion: A Taxonomy of NLP Resources for Question Answering and Reading Comprehension
TLDR
The largest survey to date of question answering and reading comprehension resources, providing an overview of the formats and domains of current datasets and highlighting lacunae for future work.
DUE: End-to-End Document Understanding Benchmark
Understanding documents with rich layouts plays a vital role in digitization and hyper-automation but remains a challenging topic in the NLP research community. Additionally, the lack of a commonly…
MuSiQue: Multi-hop Questions via Single-hop Question Composition
TLDR
This work proposes a bottom-up, semi-automatic process of constructing multi-hop questions via composition of single-hop questions, and uses this process to construct a new multi-hop QA dataset, MuSiQue-Ans, which is challenging for state-of-the-art QA models.

References

Showing 1-10 of 39 references
HybridQA: A Dataset of Multi-Hop Question Answering over Tabular and Textual Data
TLDR
HybridQA is presented, a new large-scale question-answering dataset that requires reasoning on heterogeneous information and can serve as a challenging benchmark for studying question answering with heterogeneous information.
ManyModalQA: Modality Disambiguation and QA over Diverse Inputs
TLDR
ManyModalQA, a new multimodal question answering challenge in which an agent must answer a question by considering three distinct modalities (text, images, and tables), is presented, with the expectation that existing datasets and approaches will transfer for most of the training.
Constructing Datasets for Multi-hop Reading Comprehension Across Documents
TLDR
A novel task to encourage the development of models for text understanding across multiple documents and to investigate the limits of existing methods, in which a model learns to seek and combine evidence, effectively performing multi-hop (alias multi-step) inference.
Towards VQA Models That Can Read
TLDR
A novel model architecture is introduced that reads text in the image, reasons about it in the context of the image and the question, and predicts an answer which might be a deduction based on the text and the image or composed of the strings found in the images.
GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering
We introduce GQA, a new dataset for real-world visual reasoning and compositional question answering, seeking to address key shortcomings of previous VQA datasets. We have developed a strong and…
HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering
TLDR
It is shown that HotpotQA is challenging for the latest QA systems, and the supporting facts enable models to improve performance and make explainable predictions.
VQA: Visual Question Answering
We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer.
The Web as a Knowledge-Base for Answering Complex Questions
TLDR
This paper proposes to decompose complex questions into a sequence of simple questions and compute the final answer from the sequence of answers, empirically demonstrating that question decomposition improves performance from 20.8 to 27.5 precision@1 on this new dataset.
Yin and Yang: Balancing and Answering Binary Visual Questions
TLDR
This paper addresses binary Visual Question Answering on abstract scenes as visual verification of concepts inquired about in the question, converting the question to a tuple that concisely summarizes the visual concept to be detected in the image.
DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs
TLDR
A new reading comprehension benchmark, DROP, which requires Discrete Reasoning Over the content of Paragraphs, and presents a new model that combines reading comprehension methods with simple numerical reasoning to achieve 51% F1.