Corpus ID: 230437876

NeurIPS 2020 EfficientQA Competition: Systems, Analyses and Lessons Learned

@inproceedings{Min2020NeurIPS2E,
  title={NeurIPS 2020 EfficientQA Competition: Systems, Analyses and Lessons Learned},
  author={Sewon Min and Jordan L. Boyd-Graber and Chris Alberti and Danqi Chen and Eunsol Choi and Michael Collins and Kelvin Guu and Hannaneh Hajishirzi and Kenton Lee and Jennimaria Palomaki and Colin Raffel and Adam Roberts and Tom Kwiatkowski and Patrick Lewis and Yuxiang Wu and Heinrich Kuttler and Linqing Liu and Pasquale Minervini and Pontus Stenetorp and Sebastian Riedel and Sohee Yang and Minjoon Seo and Gautier Izacard and Fabio Petroni and Lucas Hosseini and Nicola De Cao and Edouard Grave and Ikuya Yamada and Sonse Shimaoka and Masatoshi Suzuki and Shumpei Miyawaki and Shun Sato and Ryo Takahashi and Jun Suzuki and Martin Fajcik and Martin Docekal and Karel Ondrej and Pavel Smrz and Hao Cheng and Yelong Shen and Xiaodong Liu and Pengcheng He and Weizhu Chen and Jianfeng Gao and Barlas Oğuz and Xilun Chen and Vladimir Karpukhin and Stanislav Peshterliev and Dmytro Okhonko and M. Schlichtkrull and Sonal Gupta and Yashar Mehdad and Wen-tau Yih},
  booktitle={Neural Information Processing Systems},
  year={2020}
}
We review the EfficientQA competition from NeurIPS 2020. The competition focused on open-domain question answering (QA), where systems take natural language questions as input and return natural language answers. The aim of the competition was to build systems that can predict correct answers while also satisfying strict on-disk memory budgets. These memory budgets were designed to encourage contestants to explore the trade-off between storing large, redundant, retrieval corpora or the… 

PAQ: 65 Million Probably-Asked Questions and What You Can Do With Them

A new QA-pair retriever, RePAQ, is introduced to complement PAQ, and it is found that PAQ preempts and caches test questions, enabling RePAQ to match the accuracy of recent retrieve-and-read models while being significantly faster.

Question and Answer Test-Train Overlap in Open-Domain Question Answering Datasets

A detailed study of the test sets of three popular open-domain benchmark datasets finds that 30% of test-set questions have a near-duplicate paraphrase in their corresponding train sets, and that simple nearest-neighbor models outperform a BART closed-book QA model.

A Survey for Efficient Open Domain Question Answering

This paper walks through ODQA models and distills the core techniques for efficiency, giving quantitative analyses of memory cost, processing speed, and accuracy, along with an overall comparison.

An Efficient Memory-Augmented Transformer for Knowledge-Intensive NLP Tasks

The Efficient Memory-Augmented Transformer (EMAT) is proposed: it encodes external knowledge into a key-value memory and exploits fast maximum inner product search for memory querying, running substantially faster across the board and producing more accurate results on WoW and ELI5.
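
As a rough illustration of the key-value memory lookup described above (not EMAT's actual implementation), the sketch below answers a query by exact maximum inner product search over stored key vectors; the dimensions, data, and NumPy scan are illustrative assumptions, with the exact scan standing in for a fast MIPS index such as FAISS.

```python
# Sketch of a key-value memory queried via maximum inner product search (MIPS).
# Random vectors and an exact NumPy scan stand in for learned embeddings and a
# fast approximate MIPS index; nothing here is EMAT's actual code.
import numpy as np

rng = np.random.default_rng(0)
d, n_entries = 128, 10_000

keys = rng.standard_normal((n_entries, d)).astype(np.float32)    # index the memory
values = rng.standard_normal((n_entries, d)).astype(np.float32)  # stored knowledge

def query_memory(query: np.ndarray, top_k: int = 4) -> np.ndarray:
    """Return the top-k value vectors whose keys score highest (inner product)."""
    scores = keys @ query                    # inner product with every key
    top = np.argpartition(-scores, top_k)[:top_k]
    top = top[np.argsort(-scores[top])]      # order the top-k by score
    return values[top]

retrieved = query_memory(rng.standard_normal(d).astype(np.float32))
print(retrieved.shape)  # (4, 128): value vectors to be fused into the reader model
```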

Efficiently Controlling Multiple Risks with Pareto Testing

This work proposes a hyper-parameter selection procedure that combines multi-objective optimization with statistical testing, providing guarantees on multiple error constraints while optimizing auxiliary objectives such as inference efficiency.

Improving Question Answering with Generation of NQ-like Questions

An algorithm is proposed to automatically generate shorter questions, resembling the day-to-day human communication found in the Natural Questions dataset, from longer trivia questions in the Quizbowl dataset by converting between the styles of the two datasets, improving the scalability of training data while maintaining data quality for QA systems.

Investigating the Failure Modes of the AUC metric and Exploring Alternatives for Evaluating Systems in Safety Critical Applications

Experiments with ten models show that newer and larger pre-trained models do not necessarily perform better at selective answering, and three alternative metrics are proposed that could help develop better models tailored for safety-critical applications.

Bridging the Training-Inference Gap for Dense Phrase Retrieval

This work proposes an efficient way of validating dense retrievers using a small subset of the entire corpus, and uses it to evaluate various training strategies, including unifying contrastive loss terms and using hard negatives for phrase retrieval, which largely reduces the training-inference discrepancy.
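
The subset-based validation idea can be sketched roughly as follows: sample a fraction of the corpus, keep the gold passages for the validation questions so accuracy remains measurable, and report top-k retrieval accuracy on that subset. The embeddings and sizes below are hypothetical placeholders, not the paper's setup.

```python
# Rough sketch of validating a dense retriever on a small corpus subset.
# Random embeddings stand in for trained question/passage encoders.
import numpy as np

rng = np.random.default_rng(0)
d, corpus_size, subset_size, top_k = 32, 5_000, 500, 5

corpus = rng.standard_normal((corpus_size, d)).astype(np.float32)    # passage embeddings
questions = rng.standard_normal((50, d)).astype(np.float32)          # question embeddings
gold_ids = rng.integers(0, corpus_size, size=len(questions))         # gold passage per question

# Sample a subset, but always include the gold passages so top-k accuracy is defined.
subset_ids = np.unique(np.concatenate(
    [rng.choice(corpus_size, size=subset_size, replace=False), gold_ids]))
subset = corpus[subset_ids]

hits = 0
for q, gold in zip(questions, gold_ids):
    scores = subset @ q                               # rank subset passages by inner product
    top = subset_ids[np.argsort(-scores)[:top_k]]     # ids of the top-k passages
    hits += int(gold in top)
print(f"top-{top_k} retrieval accuracy on the subset: {hits / len(questions):.2f}")
```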

Evaluate&Evaluation on the Hub: Better Best Practices for Data and Model Measurements

Evaluate is a library to support best practices for measurements, metrics, and comparisons of data and models, and Evaluation on the Hub is a platform that enables the large-scale evaluation of over 75,000 models and 11,000 datasets on the Hugging Face Hub, for free, at the click of a button.
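
For readers unfamiliar with the library, a minimal usage sketch follows; it assumes the evaluate package is installed and the exact_match metric is available on the Hub, and the toy predictions and references are invented for illustration.

```python
# Minimal usage sketch of the Evaluate library for a QA-style comparison.
# Assumes `pip install evaluate`; the examples are toy data, not a benchmark.
import evaluate

exact_match = evaluate.load("exact_match")            # load a metric from the Hub

predictions = ["the Eiffel Tower", "Marie Curie", "1969"]
references = ["Eiffel Tower", "Marie Curie", "1969"]

results = exact_match.compute(predictions=predictions, references=references)
print(results)  # dict holding the exact-match score for the toy examples
```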

MIA 2022 Shared Task: Evaluating Cross-lingual Open-Retrieval Question Answering for 16 Diverse Languages

The results of the Workshop on Multilingual Information Access (MIA) 2022 Shared Task, which evaluated cross-lingual open-retrieval question answering (QA) systems in 16 typologically diverse languages, are presented; the best system obtains particularly significant improvements in Tamil.

References

SHOWING 1-10 OF 64 REFERENCES

REALM: Retrieval-Augmented Language Model Pre-Training

The effectiveness of Retrieval-Augmented Language Model pre-training (REALM) is demonstrated by fine-tuning on the challenging task of open-domain question answering (Open-QA), where it outperforms all previous methods by a significant margin while also providing qualitative benefits such as interpretability and modularity.

Generation-Augmented Retrieval for Open-Domain Question Answering

It is shown that generating diverse contexts for a query is beneficial, as fusing their results consistently yields better retrieval accuracy; because sparse and dense representations are often complementary, GAR can also be easily combined with DPR to achieve even better performance.

Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering

Interestingly, it is observed that the performance of this method improves significantly when increasing the number of retrieved passages, evidence that sequence-to-sequence models offer a flexible framework to efficiently aggregate and combine evidence from multiple passages.

AmbigQA: Answering Ambiguous Open-domain Questions

This paper introduces AmbigQA, a new open-domain question answering task which involves predicting a set of question-answer pairs, where every plausible answer is paired with a disambiguated rewrite of the original question.

RoBERTa: A Robustly Optimized BERT Pretraining Approach

It is found that BERT was significantly undertrained and, when trained with an improved recipe, can match or exceed the performance of every model published after it; the best model achieves state-of-the-art results on GLUE, RACE, and SQuAD.

Natural Questions: A Benchmark for Question Answering Research

The Natural Questions corpus, a question answering dataset, is presented, along with robust metrics for evaluating question answering systems, demonstrations of high human upper bounds on these metrics, and baseline results using competitive methods drawn from the related literature.

ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

The contextual representations learned by the proposed replaced token detection pre-training task substantially outperform the ones learned by methods such as BERT and XLNet given the same model size, data, and compute.

Dense Passage Retrieval for Open-Domain Question Answering

This work shows that retrieval can be practically implemented using dense representations alone, where embeddings are learned from a small number of questions and passages by a simple dual-encoder framework.
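
The dual-encoder framework this summary refers to can be sketched roughly as below; the hash-seeded toy encoders stand in for DPR's trained BERT question and passage encoders, and the exact inner-product scan stands in for an approximate nearest-neighbour index, so the ranking is illustrative only.

```python
# Sketch of dual-encoder dense retrieval: separate question and passage encoders
# map text to vectors, and passages are ranked by inner product with the question
# embedding. Toy hash-seeded encoders replace DPR's BERT towers.
import hashlib
import numpy as np

d = 64

def toy_encode(text: str, tower: str) -> np.ndarray:
    """Deterministic toy stand-in encoder: hashes the text to seed a random vector."""
    seed = int.from_bytes(hashlib.md5((tower + text).encode()).digest()[:4], "little")
    v = np.random.default_rng(seed).standard_normal(d).astype(np.float32)
    return v / np.linalg.norm(v)

passages = [
    "The Eiffel Tower is located in Paris.",
    "Marie Curie won two Nobel Prizes.",
    "Apollo 11 landed on the Moon in 1969.",
]
# Offline: encode every passage once and stack the embeddings into a dense index.
index = np.stack([toy_encode(p, tower="passage") for p in passages])

# Online: encode the question and rank passages by inner-product similarity.
q_vec = toy_encode("Where is the Eiffel Tower?", tower="question")
scores = index @ q_vec
print(passages[int(np.argmax(scores))])  # top-ranked passage under the toy encoders
```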

How Much Knowledge Can You Pack into the Parameters of a Language Model?

It is shown that this approach scales surprisingly well with model size and outperforms models that explicitly look up knowledge on the open-domain variants of Natural Questions and WebQuestions.

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.
...