Hurdles to Progress in Long-form Question Answering

@inproceedings{Krishna2021HurdlesTP,
  title={Hurdles to Progress in Long-form Question Answering},
  author={Kalpesh Krishna and Aurko Roy and Mohit Iyyer},
  booktitle={NAACL},
  year={2021}
}
The task of long-form question answering (LFQA) involves retrieving documents relevant to a given question and using them to generate a paragraph-length answer. While many models have recently been proposed for LFQA, we show in this paper that the task formulation raises fundamental challenges regarding evaluation and dataset creation that currently preclude meaningful modeling progress. To demonstrate these challenges, we first design a new system that relies on sparse attention and… 
Survey of Hallucination in Natural Language Generation
TLDR
This survey serves to facilitate collaborative efforts among researchers in tackling the challenge of hallucinated texts in NLG by providing a broad overview of the research progress and challenges in the hallucination problem in NLG.
GooAQ: Open Question Answering with Diverse Answer Types
TLDR
GOOAQ is presented, a large-scale dataset collected from Google questions and answers, containing 3 million questions with diverse answer types ranging from factual short answers to snippets to collections, and it is shown that 94% of the mined answers are accurate, enabling fine-tuning a pre-trained language model for answering GOOAQ questions.
iLFQA: A Platform for Efficient and Accurate Long-Form Question Answering
TLDR
iLFQA is presented as an open-domain, deployable, and accurate open-source long-form question answering platform and the source code and implementation details are made available for the benefit of researchers and practitioners in this field.
A Survey of Knowledge-Enhanced Text Generation
TLDR
A comprehensive review of the research on knowledge-enhanced text generation over the past five years is presented, which includes two parts: (i) general methods and architectures for integrating knowledge into text generation; (ii) specific techniques and applications according to different forms of knowledge data.
New Methods & Metrics for LFQA tasks
TLDR
This work addresses critical bottlenecks in LFQA modeling, contributing natural language inference/generation methods and metrics that make significant strides toward alleviating them.
RELiC: Retrieving Evidence for Literary Claims
TLDR
A RoBERTa-based dense passage retriever is implemented for the novel task of literary evidence retrieval, in which models are given an excerpt of literary analysis surrounding a masked quotation and asked to retrieve the quoted passage from the set of all passages in the work.
Towards Human-Centred Explainability Benchmarks For Text Classification
TLDR
This position paper proposes to extend text classification benchmarks to evaluate the explainability of text classifiers, and to ground these benchmarks in human-centred applications, for example by using social media or gamification, or by learning explainability metrics from human judgements.
Generation-focused Table-based Intermediate Pre-training for Free-form Question Answering
TLDR
An intermediate pre-training framework, Generation-focused Table-based Intermediate Pre-training (GENTAP), that jointly learns representations of natural language questions and tables that enhance the question understanding and table representation abilities for complex questions is presented.
SQuALITY: Building a Long-Document Summarization Dataset the Hard Way
TLDR
This work hires highly-qualified contractors to read stories and write original summaries from scratch, and uses this protocol to collect SQuALITY, a dataset of question-focused summaries built on the same public-domain short stories as the multiple-choice dataset QuALITY (Pang et al., 2021b).
RankGen: Improving Text Generation with Large Ranking Models
TLDR
Analysis reveals that RankGen outputs are more relevant to the prefix and improve continuity and coherence compared to baselines, and the model checkpoints, code, and human preferences are open source.
...
...

References

SHOWING 1-10 OF 54 REFERENCES
Dense Passage Retrieval for Open-Domain Question Answering
TLDR
This work shows that retrieval can be practically implemented using dense representations alone, where embeddings are learned from a small number of questions and passages by a simple dual-encoder framework.
KILT: a Benchmark for Knowledge Intensive Language Tasks
TLDR
It is found that a shared dense vector index coupled with a seq2seq model is a strong baseline, outperforming more tailor-made approaches for fact checking, open-domain question answering and dialogue, and yielding competitive results on entity linking and slot filling, by generating disambiguated text.
Evaluation of Text Generation: A Survey
TLDR
This paper surveys evaluation methods of natural language generation (NLG) systems that have been developed in the last few years, with a focus on the evaluation of recently proposed NLG tasks and neural NLG models.
REALM: Retrieval-Augmented Language Model Pre-Training
TLDR
The effectiveness of Retrieval-Augmented Language Model pre-training (REALM) is demonstrated by fine-tuning on the challenging task of Open-domain Question Answering (Open-QA) and is found to outperform all previous methods by a significant margin, while also providing qualitative benefits such as interpretability and modularity.
ELI5: Long Form Question Answering
TLDR
This work introduces the first large-scale corpus for long form question answering, a task requiring elaborate and in-depth answers to open-ended questions, and shows that an abstractive model trained with a multi-task objective outperforms conventional Seq2Seq, language modeling, as well as a strong extractive baseline.
Natural Questions: A Benchmark for Question Answering Research
TLDR
The Natural Questions corpus, a question answering data set, is presented, introducing robust metrics for the purposes of evaluating question answering systems; demonstrating high human upper bounds on these metrics; and establishing baseline results using competitive methods drawn from related literature.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TLDR
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Question and Answer Test-Train Overlap in Open-Domain Question Answering Datasets
TLDR
A detailed study of the test sets of three popular open-domain benchmark datasets finds that 30% of test-set questions have a near-duplicate paraphrase in their corresponding train sets, and that simple nearest-neighbor models outperform a BART closed-book QA model.
Pre-training via Paraphrasing
TLDR
It is shown that fine-tuning gives strong performance on a range of discriminative and generative tasks in many languages, making MARGE the most generally applicable pre-training method to date.
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
TLDR
A general-purpose fine-tuning recipe for retrieval-augmented generation (RAG) -- models which combine pre-trained parametric and non-parametric memory for language generation, and finds that RAG models generate more specific, diverse and factual language than a state-of-the-art parametric-only seq2seq baseline.
...
...