Corpus ID: 245218982

QuALITY: Question Answering with Long Input Texts, Yes!

Richard Yuanzhe Pang, Alicia Parrish, Nitish Joshi, Nikita Nangia, Jason Phang, Angelica Chen, Vishakh Padmakumar, Johnny Ma, Jana Thompson, He He, Sam Bowman
To enable building and testing models on long-document comprehension, we introduce QuALITY, a multiple-choice QA dataset with context passages in English that have an average length of about 5,000 tokens, much longer than typical current models can process. Unlike in prior work with passages, our questions are written and validated by contributors who have read the entire passage, rather than relying on summaries or excerpts. In addition, only half of the questions are answerable by annotators… 
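To make the task format concrete, the sketch below shows a multiple-choice long-document QA record and accuracy scoring; the field names here are illustrative assumptions, not QuALITY's exact schema.

```python
# Sketch of a multiple-choice long-document QA record and accuracy scoring.
# Field names are illustrative, not QuALITY's exact schema.

def accuracy(examples, predict):
    """Fraction of questions where the predicted option index matches gold."""
    correct = sum(predict(ex) == ex["gold_label"] for ex in examples)
    return correct / len(examples)

examples = [
    {
        "article": "A ~5,000-token passage would go here...",
        "question": "What does the narrator regret?",
        "options": ["Leaving home", "Selling the ship", "Lying", "Nothing"],
        "gold_label": 2,  # index of the correct option
    },
]

# A trivial baseline that always picks the first option.
first_option = lambda ex: 0
print(accuracy(examples, first_option))  # 0.0 on this single example
```

Because only a subset of questions is answerable under strict time constraints, evaluations typically report accuracy on the full set and on the hard subset separately.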
Utilizing Evidence Spans via Sequence-Level Contrastive Learning for Long-Context Question Answering
This work equips long-range transformers with an additional sequence-level contrastive objective, applied during finetuning, to better identify supporting evidence spans.
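A sequence-level contrastive objective of this kind can be sketched as an InfoNCE-style loss that pulls a question representation toward the evidence-span representation and away from non-evidence spans. The vectors and similarity function below are assumptions for illustration, not the paper's exact formulation.

```python
import math

def info_nce(query, positive, negatives, temperature=0.1):
    """Contrastive (InfoNCE) loss: pull the query toward the evidence-span
    representation and push it away from non-evidence spans.
    Vectors are plain lists; similarity is a dot product."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    logits = [dot(query, positive) / temperature] + [
        dot(query, n) / temperature for n in negatives
    ]
    m = max(logits)  # subtract the max for numerical stability
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denom - logits[0]  # -log softmax probability of the positive

# A question representation close to the evidence span, far from distractors,
# yields a near-zero loss.
q = [1.0, 0.0]
evidence = [0.9, 0.1]
distractors = [[-1.0, 0.0], [0.0, -1.0]]
loss = info_nce(q, evidence, distractors)
```

In practice the span representations would come from pooled transformer hidden states, and this loss is added to the usual answer-prediction objective during finetuning.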
MuLD: The Multitask Long Document Benchmark
This work presents MuLD, a new long-document benchmark consisting only of documents over 10,000 tokens, which requires models to capture long-term dependencies in the text; experiments show that models with increased context length are better able to solve the tasks.
Teaching language models to support answers with verified quotes
This work uses reinforcement learning from human preferences to train “open-book” QA models that generate answers whilst also citing specific evidence for their claims, which aids in the appraisal of correctness.
The NLP Task Effectiveness of Long-Range Transformers
It is found that the attention of long-range transformers has advantages in content selection and query-guided decoding, but also previously unrecognized drawbacks, such as insufficient attention to distant tokens.
Unifying Language Learning Paradigms
UL2 achieves state-of-the-art performance on 50 well-established supervised NLP tasks spanning language generation, language understanding, text classification, question answering, commonsense reasoning, long-text reasoning, structured knowledge grounding, and information retrieval.
Token Dropping for Efficient BERT Pretraining
A simple but effective “token dropping” method is developed to accelerate the pretraining of transformer models, such as BERT, without degrading its performance on downstream tasks.
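The core idea can be sketched as a selection step: keep only the "hard" tokens for the middle transformer layers and let the rest bypass them. The ranking criterion below (running per-token loss) and the keep ratio are illustrative assumptions, not the paper's exact recipe.

```python
def select_tokens_to_keep(token_losses, keep_ratio=0.5):
    """Token-dropping sketch: keep the tokens with the highest running
    masked-LM loss (the 'hard' tokens) for the middle transformer layers;
    the dropped ('easy') tokens skip those layers and rejoin at the end.
    Returns indices to keep, in original order."""
    n_keep = max(1, int(len(token_losses) * keep_ratio))
    ranked = sorted(range(len(token_losses)),
                    key=lambda i: token_losses[i], reverse=True)
    return sorted(ranked[:n_keep])

losses = [0.1, 2.3, 0.05, 1.7, 0.2, 0.9]
print(select_tokens_to_keep(losses, keep_ratio=0.5))  # [1, 3, 5]
```

Because the dropped tokens still pass through the first and last layers, the full sequence is reconstructed before the output heads, which is why downstream quality is preserved while middle-layer compute shrinks.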
SQuALITY: Building a Long-Document Summarization Dataset the Hard Way
Summarization datasets are often assembled either by scraping naturally occurring public-domain summaries, which are nearly always in difficult-to-work-with technical domains, or by using approximate…
Single-Turn Debate Does Not Help Humans Answer Hard Reading-Comprehension Questions
It is found that explanations in this setup do not improve human accuracy, though a baseline condition shows that providing human-selected text snippets does improve accuracy.
ELI5: Long Form Question Answering
This work introduces the first large-scale corpus for long form question answering, a task requiring elaborate and in-depth answers to open-ended questions, and shows that an abstractive model trained with a multi-task objective outperforms conventional Seq2Seq, language modeling, as well as a strong extractive baseline.
SCROLLS: Standardized CompaRison Over Long Language Sequences
This work introduces SCROLLS, a suite of tasks that require reasoning over long texts; the authors examine existing long-text datasets and handpick ones where the text is naturally long, prioritizing tasks that involve synthesizing information across the input.
QA Dataset Explosion: A Taxonomy of NLP Resources for Question Answering and Reading Comprehension
The largest survey of the field to date, providing an overview of the various formats and domains of current question answering and reading comprehension resources, and highlighting remaining gaps for future work.
Quiz-Style Question Generation for News Stories
This work proposes a series of novel techniques for applying large pre-trained Transformer encoder-decoder models, namely PEGASUS and T5, to the tasks of question-answer generation and distractor generation, and shows that these models outperform strong baselines using both automated metrics and human raters.
HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering
It is shown that HotpotQA is challenging for the latest QA systems, and the supporting facts enable models to improve performance and make explainable predictions.
Asking and Answering Questions to Evaluate the Factual Consistency of Summaries
This work proposes QAGS (pronounced “kags”), an automatic evaluation protocol designed to identify factual inconsistencies in generated summaries, and argues it is a promising tool for producing usable, factually consistent text.
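QAGS-style protocols answer the same questions against both the summary and the source and compare the two answers; the comparison step can be sketched as a SQuAD-style token F1, as below. The question-generation and QA models are assumed to exist upstream, and the data layout here is illustrative.

```python
from collections import Counter

def token_f1(pred, gold):
    """Token-level F1 between two answer strings (SQuAD-style)."""
    p, g = pred.lower().split(), gold.lower().split()
    common = Counter(p) & Counter(g)  # multiset intersection of tokens
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

def qags_score(qa_pairs):
    """Consistency sketch: for questions generated from the summary, compare
    the answer found in the summary against the answer found in the source;
    a high average F1 suggests the summary is factually consistent."""
    scores = [token_f1(ans_summary, ans_source)
              for _question, ans_summary, ans_source in qa_pairs]
    return sum(scores) / len(scores)

pairs = [
    ("Who won?", "the home team", "the home team"),  # consistent
    ("When?", "on friday", "on saturday"),           # inconsistent detail
]
print(round(qags_score(pairs), 2))  # 0.75
```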
TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension
It is shown that, in comparison to other recently introduced large-scale datasets, TriviaQA has relatively complex, compositional questions, has considerable syntactic and lexical variability between questions and corresponding answer-evidence sentences, and requires more cross sentence reasoning to find answers.
The NarrativeQA Reading Comprehension Challenge
A new dataset and set of tasks in which the reader must answer questions about stories by reading entire books or movie scripts are presented, designed so that successfully answering their questions requires understanding the underlying narrative rather than relying on shallow pattern matching or salience.
Know What You Don’t Know: Unanswerable Questions for SQuAD
This work introduces SQuAD 2.0, a dataset that combines the existing Stanford Question Answering Dataset (SQuAD) with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones.
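Systems evaluated on data with unanswerable questions typically add an explicit no-answer decision; one common scheme (not from this paper itself) abstains when a null score beats the best span score by a tuned threshold, sketched below with illustrative score names.

```python
def predict_answer(best_span_score, null_score, span_text, threshold=0.0):
    """No-answer decision sketch for SQuAD 2.0-style evaluation: predict the
    empty string (unanswerable) when the null score exceeds the best span
    score by more than a threshold tuned on dev data."""
    if null_score - best_span_score > threshold:
        return ""  # abstain: question judged unanswerable
    return span_text

print(predict_answer(3.2, 1.1, "in 1859"))  # answerable -> "in 1859"
print(predict_answer(0.4, 2.7, "in 1859"))  # unanswerable -> ""
```

Tuning the threshold trades precision on answerable questions against recall on unanswerable ones, which is exactly the behavior the adversarially written questions are designed to stress.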
SearchQA: A New Q&A Dataset Augmented with Context from a Search Engine
It is shown that there is a meaningful gap between human and machine performance, suggesting that the proposed dataset could serve well as a benchmark for question answering.