Corpus ID: 245502964

A Survey on non-English Question Answering Dataset

Andrea Chandra, Affandy Fahrizain, Ibrahim, Simon Willyanto Laufried
Research on question answering datasets and models has gained a lot of attention in the research community. Many researchers release their own question answering datasets along with their models, and the area has seen tremendous progress. The aim of this survey is to recognize, summarize, and analyze the existing datasets that researchers have released, especially non-English datasets, as well as resources such as research code and evaluation metrics. In…

A review of public datasets in question answering research
This work surveys the available datasets and provides a simple, multi-faceted classification of those datasets and also provides a wishlist of datasets whose release could benefit question answering research in the future.
FQuAD: French Question Answering Dataset
The present work introduces the French Question Answering Dataset (FQuAD), a French Native Reading Comprehension dataset of questions and answers on a set of Wikipedia articles that consists of 25,000+ samples for the 1.0 and 1.1 versions.
GermanQuAD and GermanDPR: Improving Non-English Question Answering and Passage Retrieval
This paper presents GermanQuAD, a dataset of 13,722 extractive question/answer pairs. To improve the reproducibility of the dataset creation approach and foster QA research on other languages, it summarizes lessons learned and evaluates reformulation of question/answer pairs as a way to speed up the annotation process.
QA Dataset Explosion: A Taxonomy of NLP Resources for Question Answering and Reading Comprehension
The largest survey to date of the field of question answering and reading comprehension, providing an overview of the various formats and domains of the current resources and highlighting the current lacunae for future work.
MKQA: A Linguistically Diverse Benchmark for Multilingual Open Domain Question Answering
Multilingual Knowledge Questions and Answers is introduced, an open-domain question answering evaluation set comprising 10k question-answer pairs aligned across 26 typologically diverse languages, making results comparable across languages and independent of language-specific passages.
RuBQ 2.0: An Innovated Russian Question Answering Dataset
The second version of RuBQ, a Russian dataset for knowledge base question answering (KBQA) over Wikidata, is described. It is suitable for the evaluation of KBQA, machine reading comprehension (MRC), hybrid question answering, and semantic parsing.
Neural Learning for Question Answering in Italian
This paper explores the possibility of acquiring a large-scale, although lower-quality, dataset for an open-domain factoid question answering system in Italian and describes the dataset and the experiments, inspired by an equivalent counterpart for English.
XQA: A Cross-lingual Open-domain Question Answering Dataset
A novel dataset XQA is constructed that consists of a training set in English as well as development and test sets in eight other languages and provides several baseline systems for cross-lingual OpenQA, showing that the multilingual BERT model achieves the best results in almost all target languages.
PeCoQ: A Dataset for Persian Complex Question Answering over Knowledge Graph
This paper introduces PeCoQ, a dataset for Persian question answering that contains 10,000 complex questions and answers extracted from the Persian knowledge graph, FarsBase, and discusses the dataset's characteristics and describes the methodology for building it.
RuBQ: A Russian Dataset for Question Answering over Wikidata
RuBQ, the first Russian knowledge base question answering (KBQA) dataset, is presented, which consists of 1,500 Russian questions of varying complexity, their English machine translations, SPARQL queries to Wikidata, reference answers, as well as a Wikidata sample of triples containing entities with Russian labels.