ELI5: Long Form Question Answering

  Angela Fan, Yacine Jernite, Ethan Perez, David Grangier, Jason Weston, Michael Auli

We introduce the first large-scale corpus for long-form question answering, a task requiring elaborate and in-depth answers to open-ended questions. We provide a large set of web documents to help answer the question. Automatic and human evaluations show that an abstractive model trained with a multi-task objective outperforms conventional Seq2Seq, language modeling, as well as a strong extractive baseline. However, our best model is still far from human performance since raters prefer gold…

NLQuAD: A Non-Factoid Long Question Answering Data Set

NLQuAD’s samples exceed the input limitation of most pre-trained Transformer-based models, encouraging future research on long-sequence language models; experiments show that Longformer outperforms the other architectures, but results remain far behind the human upper bound.

Question Answering with Long Multiple-Span Answers

This work presents MASH-QA, a Multiple Answer Spans Healthcare Question Answering dataset from the consumer health domain, and proposes MultiCo, a neural architecture that captures the relevance among multiple answer spans through query-based contextualized sentence selection, forming the answer to a given question.

Hurdles to Progress in Long-form Question Answering

The task formulation raises fundamental challenges in evaluation and dataset creation that currently preclude meaningful modeling progress; the work also designs a new system that relies on sparse attention and contrastive retriever learning to achieve state-of-the-art performance on the ELI5 LFQA dataset.

Query Refinement Prompts for Closed-Book Long-Form Question Answering

Query refinement prompts are defined that encourage LLMs to explicitly express the multifacetedness of questions and to generate long-form answers covering multiple facets; these prompts outperform fully finetuned models in the closed-book setting and achieve results comparable to retrieve-then-generate open-book models.

GooAQ: Open Question Answering with Diverse Answer Types

GOOAQ is presented, a large-scale dataset collected from Google questions and answers, containing 3 million questions with diverse answer types ranging from factual short answers to snippets to collections; it is shown that 94% of the mined answers are accurate, enabling fine-tuning of a pre-trained language model for answering GOOAQ questions.

QuALITY: Question Answering with Long Input Texts, Yes!

QuALITY, a multiple-choice QA dataset with English context passages averaging about 5,000 tokens, much longer than typical current models can process, is introduced to enable building and testing models for long-document comprehension.

How Do We Answer Complex Questions: Discourse Structure of Long-form Answers

An ontology of six sentence-level functional roles for long-form answers is developed, finding that annotators agree less with each other on model-generated answers than on human-written ones.

AnswerSumm: A Manually-Curated Dataset and Pipeline for Answer Summarization

This work introduces a novel dataset of 4,631 CQA threads for answer summarization curated by professional linguists, along with a novel unsupervised approach for multi-perspective data augmentation that boosts summarization performance according to automatic evaluation.

CCQA: A New Web-Scale Question Answering Dataset for Model Pre-Training

A novel open-domain question-answering dataset based on the Common Crawl project is proposed; it achieves promising results in zero-shot, low-resource, and tuned settings across multiple tasks, models, and benchmarks.

Generation-Focused Table-Based Intermediate Pre-training for Free-Form Question Answering

An intermediate pre-training framework, Generation-focused Table-based Intermediate Pre-training (GENTAP), is presented; it jointly learns representations of natural language questions and tables, enhancing question understanding and table representation abilities for complex questions.

QuAC: Question Answering in Context

QuAC introduces challenges not found in existing machine comprehension datasets: its questions are often more open-ended, unanswerable, or only meaningful within the dialog context, as it shows in a detailed qualitative evaluation.

SQuAD: 100,000+ Questions for Machine Comprehension of Text

A strong logistic regression model is built, which achieves an F1 score of 51.0%, a significant improvement over a simple baseline (20%).
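The F1 scores quoted here follow the standard SQuAD metric: token-level overlap between the predicted and gold answer strings. A minimal sketch (simplified: it omits the official script's normalization of articles and punctuation):

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1, SQuAD-style (without official answer normalization)."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("in the park", "the park"))  # ~0.8
```

The dataset-level score is this value averaged over questions, taking the maximum over the gold answers for each question.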

Reading Wikipedia to Answer Open-Domain Questions

This approach combines a search component based on bigram hashing and TF-IDF matching with a multi-layer recurrent neural network model trained to detect answers in Wikipedia paragraphs; experiments indicate that both modules are highly competitive with existing counterparts.
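The retrieval component described here ranks documents by sparse lexical overlap with the query. A minimal unigram TF-IDF scoring sketch over a toy corpus (illustrative only; the paper's system additionally hashes bigrams and uses its own weighting):

```python
import math
from collections import Counter

def tfidf_scores(query: str, docs: list[str]) -> list[float]:
    """Score each document against the query with unigram TF-IDF."""
    tokenized = [d.lower().split() for d in docs]
    n = len(docs)
    df = Counter()  # document frequency of each term
    for toks in tokenized:
        df.update(set(toks))

    def idf(t: str) -> float:
        # smoothed inverse document frequency
        return math.log((1 + n) / (1 + df[t])) + 1

    query_terms = query.lower().split()
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        scores.append(sum(tf[t] * idf(t) for t in query_terms))
    return scores

docs = ["the cat sat on the mat", "dogs chase squirrels", "the stock market fell"]
print(tfidf_scores("cat mat", docs))  # first document scores highest
```

In practice the scoring is done with sparse matrix operations over the whole corpus rather than a per-document loop.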

CoQA: A Conversational Question Answering Challenge

CoQA is introduced, a novel dataset for building Conversational Question Answering systems and it is shown that conversational questions have challenging phenomena not present in existing reading comprehension datasets (e.g., coreference and pragmatic reasoning).

HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering

It is shown that HotpotQA is challenging for the latest QA systems, and the supporting facts enable models to improve performance and make explainable predictions.

TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

It is shown that, in comparison to other recently introduced large-scale datasets, TriviaQA has relatively complex, compositional questions, has considerable syntactic and lexical variability between questions and corresponding answer-evidence sentences, and requires more cross-sentence reasoning to find answers.

SearchQA: A New Q&A Dataset Augmented with Context from a Search Engine

It is shown that there is a meaningful gap between the human and machine performances, which suggests that the proposed dataset could well serve as a benchmark for question-answering.

NewsQA: A Machine Comprehension Dataset

NewsQA, a challenging machine comprehension dataset of over 100,000 human-generated question-answer pairs, is presented and analysis confirms that NewsQA demands abilities beyond simple word matching and recognizing textual entailment.

Know What You Don’t Know: Unanswerable Questions for SQuAD

SQuADRUn is a new dataset that combines the existing Stanford Question Answering Dataset (SQuAD) with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones.

Bidirectional Attention Flow for Machine Comprehension

The BiDAF network is introduced, a multi-stage hierarchical process that represents the context at different levels of granularity and uses a bi-directional attention flow mechanism to obtain a query-aware context representation without early summarization.
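The attention flow above can be illustrated in its context-to-query direction: each context position attends over the query tokens and receives a weighted mix of their vectors. A minimal sketch with a plain dot-product similarity (BiDAF itself uses a trained trilinear similarity and also computes a query-to-context direction):

```python
import numpy as np

def context_to_query_attention(C: np.ndarray, Q: np.ndarray) -> np.ndarray:
    """C: (T, d) context vectors; Q: (J, d) query vectors.
    Returns (T, d): for each context position, a softmax-weighted
    mix of query vectors. Dot-product similarity stands in for
    BiDAF's trilinear function."""
    S = C @ Q.T                              # (T, J) similarity matrix
    A = np.exp(S - S.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)        # row-wise softmax over query tokens
    return A @ Q                             # attended query vector per context token

rng = np.random.default_rng(0)
C = rng.normal(size=(5, 4))  # 5 context tokens, dimension 4
Q = rng.normal(size=(3, 4))  # 3 query tokens
U = context_to_query_attention(C, Q)
print(U.shape)  # (5, 4)
```

The key point "without early summarization" is visible here: the query is never collapsed into a single vector; every context position gets its own query-conditioned representation.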