WebQA: Multihop and Multimodal QA

Yingshan Chang, Mridu Baldevraj Narang, Hisami Suzuki, Guihong Cao, Jianfeng Gao, Yonatan Bisk
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Scaling Visual Question Answering (VQA) to the open-domain and multi-hop nature of web searches requires fundamental advances in visual representation learning, knowledge aggregation, and language generation. In this work, we introduce WebQA, a challenging new benchmark that proves difficult for large-scale state-of-the-art models, which lack language-groundable visual representations for novel objects and the ability to reason, yet trivial for humans. WebQA mirrors the way humans use the web…

WebQA: A Multimodal Multihop NeurIPS Challenge

The challenge for the community is to create unified multimodal reasoning models that can answer questions regardless of the source modality, moving us closer to digital assistants that search through not only text-based knowledge, but also the richer visual trove of information.

Multimodal Multihop Source Retrieval for Web Question Answering

This work analyzes the task of information source selection through the lens of Graph Convolution Neural Networks and proposes three independent methods that explore different graph arrangements and weighted connections across nodes to capture the multimodal and multihop aspects of information retrieval.

K-LITE: Learning Transferable Visual Models with External Knowledge

This paper proposes K-LITE, a simple strategy to leverage external knowledge to build transferable visual systems, and proposes knowledge-augmented models that show signs of improvement in transfer learning performance over existing methods.

Universal Multi-Modality Retrieval with One Unified Embedding Space

Vision-Language Universal Search achieves the state-of-the-art on the multi-modality open-domain question answering benchmark, WebQA, and outperforms all retrieval models in each single modality task.

Modern Question Answering Datasets and Benchmarks: A Survey

Two of the most common QA tasks, textual question answering and visual question answering, are introduced separately, covering the most representative datasets, and some current challenges of QA research are given.

VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers

The functionalities of VL-InterpreT are demonstrated through the analysis of KD-VLP, an end-to-end pretraining vision-language multimodal transformer-based model, in the tasks of Visual Commonsense Reasoning (VCR) and WebQA, two visual question answering benchmarks.

Yuan 1.0: Large-S…

With this method, Yuan 1.0, the current largest singleton language model with 245B parameters, achieves excellent performance on thousands of GPUs during training, and state-of-the-art results on natural language processing tasks.

Un jeu de données pour répondre à des questions visuelles à propos d’entités nommées en utilisant des bases de connaissances [A Dataset for Answering Visual Questions about Named Entities Using Knowledge Bases]

In the general context of multimodal processing, we are interested in the task of answering visual questions about named entities using knowledge bases (KVQAE).

Re-Imagen: Retrieval-Augmented Text-to-Image Generator

The Retrieval-Augmented Text-to-Image Generator (Re-Imagen), a generative model that uses retrieved information to produce high-fidelity and faithful images, even for rare or unseen entities, is presented.



OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge

This paper addresses the task of knowledge-based visual question answering and provides a benchmark, called OK-VQA, where the image content is not sufficient to answer the questions, encouraging methods that rely on external knowledge resources.

MultiModalQA: Complex Question Answering over Text, Tables and Images

This paper creates MMQA, a challenging question answering dataset that requires joint reasoning over text, tables and images, and defines a formal language that allows it to take questions that can be answered from a single modality, and combine them to generate cross-modal questions.

Unifying Vision-and-Language Tasks via Text Generation

This work proposes a unified framework that learns different tasks in a single architecture with the same language modeling objective, i.e., multimodal conditional text generation, where the models learn to generate labels in text based on the visual and textual inputs.

VQA: Visual Question Answering

We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language

An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA

This work proposes PICa, a simple yet effective method that Prompts GPT-3 via the use of Image Captions, for knowledge-based VQA, and treats GPT-3 as an implicit and unstructured KB that can jointly acquire and process relevant knowledge.

ManyModalQA: Modality Disambiguation and QA over Diverse Inputs

A new multimodal question answering challenge, ManyModalQA, in which an agent must answer a question by considering three distinct modalities: text, images, and tables, is presented, with the expectation that existing datasets and approaches will be transferred for most of the training.

In Defense of Grid Features for Visual Question Answering

This paper revisits grid features for VQA, and finds they can work surprisingly well, running more than an order of magnitude faster with the same accuracy (e.g. if pre-trained in a similar fashion).

Unified Vision-Language Pre-Training for Image Captioning and VQA

VLP is the first reported model that achieves state-of-the-art results on both vision-language generation and understanding tasks, as disparate as image captioning and visual question answering, across three challenging benchmark datasets: COCO Captions, Flickr30k Captions and VQA 2.0.

HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering

It is shown that HotpotQA is challenging for the latest QA systems, and the supporting facts enable models to improve performance and make explainable predictions.

Constructing Datasets for Multi-hop Reading Comprehension Across Documents

A novel task is proposed to encourage the development of models for text understanding across multiple documents and to investigate the limits of existing methods, in which a model learns to seek and combine evidence, effectively performing multihop (alias multi-step) inference.