BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

@inproceedings{Clark2019BoolQET,
  title={BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions},
  author={Christopher Clark and Kenton Lee and Ming-Wei Chang and Tom Kwiatkowski and Michael Collins and Kristina Toutanova},
  booktitle={North American Chapter of the Association for Computational Linguistics},
  year={2019}
}
In this paper we study yes/no questions that are naturally occurring, meaning that they are generated in unprompted and unconstrained settings. [...] The best model achieves 80.4% accuracy, compared to 90% accuracy for human annotators (and a 62% majority baseline), leaving a significant gap for future work.
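Each BoolQ example pairs a naturally occurring question with a Wikipedia passage and a boolean answer. As a minimal sketch (not code from the paper), assuming the dataset is published on the Hugging Face hub under the id `boolq` with question / passage / boolean-answer fields, the task format and the kind of majority-class baseline quoted above look roughly like this:

```python
# Minimal sketch, assuming the Hugging Face `datasets` library and the "boolq" dataset id.
from collections import Counter

from datasets import load_dataset

boolq = load_dataset("boolq")              # splits: "train" and "validation"

example = boolq["train"][0]
print(example["question"])                 # an unprompted, naturally occurring yes/no question
print(example["passage"][:200])            # the paired Wikipedia passage
print(example["answer"])                   # True or False

# Majority-class baseline: always predict the most frequent label.
labels = boolq["validation"]["answer"]
label, count = Counter(labels).most_common(1)[0]
print(f"always answer {label}: accuracy {count / len(labels):.3f}")
```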

Citations

Transfer Learning on Natural YES/NO Questions

This work provides a simple and effective method to improve a model's ability to answer natural yes/no questions. Results on the BoolQ dataset show the method is competitive with other recently published approaches, suggesting that transferring from related datasets through first-stage multi-task learning preserves information that benefits the main yes/no task.
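A minimal sketch of this staged-transfer idea, assuming the Hugging Face `transformers` and `datasets` libraries and the `boolq` dataset id; the checkpoint name below is a placeholder, and in a two-stage recipe it would be replaced by a model already fine-tuned on a related dataset such as an entailment corpus.

```python
# Hedged sketch: fine-tune a sentence-pair classifier on BoolQ. For staged
# transfer, START_FROM would point at a checkpoint already fine-tuned on a
# related task (e.g. an entailment dataset) rather than the plain LM below.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

START_FROM = "bert-base-uncased"   # placeholder; swap in a stage-one checkpoint

tokenizer = AutoTokenizer.from_pretrained(START_FROM)
model = AutoModelForSequenceClassification.from_pretrained(START_FROM, num_labels=2)

boolq = load_dataset("boolq")

def encode(batch):
    # Standard sentence-pair encoding: question as segment A, passage as segment B.
    enc = tokenizer(batch["question"], batch["passage"], truncation=True, max_length=256)
    enc["labels"] = [int(a) for a in batch["answer"]]   # True/False -> 1/0
    return enc

encoded = boolq.map(encode, batched=True, remove_columns=boolq["train"].column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="boolq-classifier",
                           per_device_train_batch_size=16,
                           num_train_epochs=3),
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    tokenizer=tokenizer,               # enables dynamic padding in the default collator
)
trainer.train()
print(trainer.evaluate())
```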

New Protocols and Negative Results for Textual Entailment Data Collection

Four alternative protocols are proposed, each aimed at improving either the ease with which annotators can produce sound training examples or the quality and diversity of those examples, and it is observed that all four new protocols reduce previously observed issues with annotation artifacts.

Natural Perturbation for Robust Question Answering

It is found that when natural perturbations are moderately cheaper to create, it is more effective to train models using them: such models exhibit higher robustness and better generalization, while retaining performance on the original BoolQ dataset.

DaNetQA: a yes/no Question Answering Dataset for the Russian Language

A reproducible approach to DaNetQA creation is presented, and transfer-learning methods for task and language transfer are investigated, using English-to-Russian translation together with multilingual language fine-tuning.

FQuAD2.0: French Question Answering and Learning When You Don’t Know

This work introduces FQuAD2.0, which extends FQuAD with 17,000+ unanswerable questions written adversarially to resemble answerable ones, and benchmarks several models on this dataset, finding that the best model achieves an F1 score of 82.3% on the classification task.

Understanding tables with intermediate pre-training

This work adapts TAPAS (Herzig et al., 2020), a table-based BERT model, to recognize entailment, and creates a balanced dataset of millions of automatically created training examples which are learned in an intermediate step prior to fine-tuning.
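As a rough illustration of the resulting table-entailment model, a TAPAS classifier can be queried through the `transformers` library with a pandas table and a candidate statement. The checkpoint name and the label index below are assumptions based on the publicly released TabFact-fine-tuned TAPAS models, not details taken from the summary above.

```python
# Hedged sketch: score whether a statement is entailed by a small table,
# assuming the public TabFact-fine-tuned TAPAS checkpoint is available.
import pandas as pd
import torch
from transformers import TapasForSequenceClassification, TapasTokenizer

CHECKPOINT = "google/tapas-base-finetuned-tabfact"   # assumed public checkpoint

tokenizer = TapasTokenizer.from_pretrained(CHECKPOINT)
model = TapasForSequenceClassification.from_pretrained(CHECKPOINT)

# TAPAS expects every cell as a string.
table = pd.DataFrame({"Model": ["Majority", "Human"], "Accuracy": ["62", "90"]})
statement = ["Human accuracy is higher than the majority baseline."]

inputs = tokenizer(table=table, queries=statement, padding="max_length", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Assumption: for the TabFact head, index 1 corresponds to "entailed".
prob_entailed = torch.softmax(logits, dim=-1)[0, 1].item()
print(f"P(entailed) = {prob_entailed:.2f}")
```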

Training Question Answering Models from Synthetic Data

This work synthesizes questions and answers from a synthetic corpus generated by an 8.3-billion-parameter GPT-2 model and trains state-of-the-art question answering networks entirely on model-generated data, achieving higher accuracy than when using the SQuAD1.1 training set questions alone.
...

References

Showing 1-10 of 42 references

QuAC: Question Answering in Context

QuAC introduces challenges not found in existing machine comprehension datasets: its questions are often more open-ended, unanswerable, or only meaningful within the dialog context, as shown in a detailed qualitative evaluation.

Know What You Don’t Know: Unanswerable Questions for SQuAD

SQuAD 2.0 is a new dataset that combines the existing Stanford Question Answering Dataset (SQuAD) with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones.

SQuAD: 100,000+ Questions for Machine Comprehension of Text

A strong logistic regression model is built, which achieves an F1 score of 51.0%, a significant improvement over a simple baseline (20%).

Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks

This work argues for the usefulness of a set of proxy tasks that evaluate reading comprehension via question answering, and classify these tasks into skill sets so that researchers can identify (and then rectify) the failings of their systems.

TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

It is shown that, in comparison to other recently introduced large-scale datasets, TriviaQA has relatively complex, compositional questions, has considerable syntactic and lexical variability between questions and corresponding answer-evidence sentences, and requires more cross-sentence reasoning to find answers.

Annotation Artifacts in Natural Language Inference Data

It is shown that a simple text categorization model can correctly classify about 67% of SNLI and 53% of MultiNLI examples from the hypothesis alone, and that specific linguistic phenomena such as negation and vagueness are highly correlated with certain inference classes.
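A toy sketch of such a hypothesis-only baseline, using scikit-learn with a handful of made-up NLI hypotheses purely for illustration; in practice one would train on SNLI or MultiNLI hypotheses and their gold labels.

```python
# Hedged sketch of a hypothesis-only baseline: a plain bag-of-words classifier
# trained on hypotheses alone, with no premise. The examples are illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

hypotheses = [
    "The man is not sleeping.",     # negation often correlates with contradiction
    "A person is outdoors.",        # vague, general wording often correlates with entailment
    "The woman is eating dinner.",
]
labels = ["contradiction", "entailment", "neutral"]

# Bag-of-words features over the hypothesis only, then logistic regression.
clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)),
                    LogisticRegression(max_iter=1000))
clf.fit(hypotheses, labels)

# Surface cues alone drive the prediction; no premise is ever seen.
print(clf.predict(["The dog is not running."]))
```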

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

The GLUE benchmark comprises nine diverse NLU tasks, an auxiliary dataset for probing models' understanding of specific linguistic phenomena, and an online platform for evaluating and comparing models; it favors models that can represent linguistic knowledge in a way that facilitates sample-efficient learning and effective knowledge transfer across tasks.

Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference

There is substantial room for improvement in NLI systems, and the HANS dataset, which contains many examples where these heuristics fail, can motivate and measure progress in this area.

CoQA: A Conversational Question Answering Challenge

CoQA is introduced, a novel dataset for building Conversational Question Answering systems and it is shown that conversational questions have challenging phenomena not present in existing reading comprehension datasets (e.g., coreference and pragmatic reasoning).

Sentence Encoders on STILTs: Supplementary Training on Intermediate Labeled-data Tasks

Supplementary training on data-rich supervised tasks, such as natural language inference, yields additional performance improvements on the GLUE benchmark, along with reduced variance across random restarts in this setting.