Multi-CPR: A Multi Domain Chinese Dataset for Passage Retrieval

  title={Multi-CPR: A Multi Domain Chinese Dataset for Passage Retrieval},
  author={Dingkun Long and Qiong Gao and Kuan-sheng Zou and Guangwei Xu and Pengjun Xie and Rui Guo and Jianfeng Xu and Guanjun Jiang and Luxi Xing and P. Yang},
  journal={Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  • Published 7 March 2022
  • Computer Science
Passage retrieval is a fundamental task in information retrieval (IR) research that has drawn much attention recently. In the English field, the availability of large-scale annotated datasets (e.g., MS MARCO) and the emergence of deep pre-trained language models (e.g., BERT) have resulted in substantial improvements to existing passage retrieval systems. However, in the Chinese field, especially for specific domains, passage retrieval systems are still immature due to the lack of quality annotated datasets…


Retrieval Oriented Masking Pre-training Language Model for Dense Passage Retrieval

This work proposes an alternative retrieval-oriented masking (dubbed ROM) strategy in which more important tokens have a higher probability of being masked out, facilitating the language-model pre-training process.
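The importance-weighted masking idea can be sketched as weighted sampling without replacement over token positions. Everything below is an illustrative assumption: the helper name `rom_mask` and the IDF-like weights are stand-ins, not the paper's actual token-importance scoring.

```python
import random

def rom_mask(tokens, weights, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Sketch of retrieval-oriented masking: positions with higher
    importance weights are more likely to be masked. Weights are
    assumed given (e.g. IDF-like scores); ROM derives its own."""
    rng = random.Random(seed)
    k = max(1, round(mask_rate * len(tokens)))
    # Weighted sampling without replacement (Efraimidis-Spirakis keys):
    # the k largest values of u ** (1/w) pick position i with
    # probability proportional to its weight w.
    keys = sorted(
        ((rng.random() ** (1.0 / w), i) for i, w in enumerate(weights)),
        reverse=True,
    )
    masked = {i for _, i in keys[:k]}
    return [mask_token if i in masked else tok for i, tok in enumerate(tokens)]

tokens = ["dense", "retrieval", "maps", "queries", "to", "vectors"]
weights = [3.0, 5.0, 1.0, 4.0, 0.1, 2.0]  # toy importance scores
masked = rom_mask(tokens, weights)
```

With a 15% mask rate over six tokens, exactly one position is masked, and high-weight tokens like "retrieval" are the most likely targets.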

Disentangled Modeling of Domain and Relevance for Adaptable Dense Retrieval

A novel dense retrieval (DR) framework named Disentangled Dense Retrieval (DDR) is proposed to support effective and flexible domain adaptation for DR models; it enables a flexible training paradigm in which the REM is trained with supervision once and the DAMs are trained with unsupervised data.

Dense Passage Retrieval for Open-Domain Question Answering

This work shows that retrieval can be practically implemented using dense representations alone, where embeddings are learned from a small number of questions and passages by a simple dual-encoder framework.
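The dual-encoder framework can be sketched in a few lines; the character-count `embed` below is a toy stand-in for DPR's trained BERT question and passage encoders, and the passage texts are invented examples.

```python
import numpy as np

def embed(text, dim=32):
    """Toy deterministic encoder (character-count features, L2-normalized).
    In DPR this would be a trained BERT encoder, with separate encoders
    for questions and passages."""
    v = np.zeros(dim)
    for ch in text.lower():
        v[ord(ch) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

def retrieve(question, passages, k=2):
    """Dual-encoder retrieval: rank passages by inner product with the query."""
    q = embed(question)
    P = np.stack([embed(p) for p in passages])  # precomputed offline in practice
    scores = P @ q                              # maximum inner product search
    order = np.argsort(-scores)[:k]
    return [(passages[i], float(scores[i])) for i in order]

passages = [
    "Paris is the capital of France.",
    "The Great Wall of China is thousands of kilometres long.",
    "Dense retrieval encodes queries and passages into vectors.",
]
top = retrieve("What is the capital of France?", passages)
```

The key design point is that passage embeddings are independent of the query, so they can be indexed once and searched with approximate nearest-neighbor methods at query time.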

MS MARCO: A Human Generated MAchine Reading COmprehension Dataset

This new dataset aims to overcome a number of well-known weaknesses of previous publicly available datasets for reading comprehension and question answering, and is the most comprehensive real-world dataset of its kind in both quantity and quality.

Pre-training Methods in Information Retrieval

An overview of PTMs applied in different components of an IR system, including the retrieval component, the re-ranking component, and other components, is presented; some open challenges are discussed and several promising directions are highlighted.

Matches Made in Heaven: Toolkit and Large-Scale Datasets for Supervised Query Reformulation

This paper presents three large-scale query reformulation datasets, namely the Diamond, Platinum, and Gold datasets, based on the queries in the MS MARCO dataset; these are believed to be the first datasets for supervised query reformulation that offer perfect query reformulations for a large number of queries.

RocketQAv2: A Joint Training Method for Dense Passage Retrieval and Passage Re-ranking

A novel joint training approach for dense passage retrieval and passage re-ranking is proposed, which introduces dynamic listwise distillation: a unified listwise training approach designed for both the retriever and the re-ranker.
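The listwise distillation component can be illustrated as a KL divergence between the re-ranker's and the retriever's score distributions over one query's candidate list. This is a minimal sketch of the idea only, not RocketQAv2's full training loop (which also updates the re-ranker jointly); the scores are invented toy values.

```python
import numpy as np

def softmax(x):
    x = np.asarray(x, dtype=float)
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

def listwise_distill_loss(retriever_scores, reranker_scores):
    """KL(teacher || student) over one query's candidate list: the
    retriever (student) is pushed toward the re-ranker's (teacher's)
    soft relevance distribution."""
    p = softmax(reranker_scores)   # teacher distribution
    q = softmax(retriever_scores)  # student distribution
    return float(np.sum(p * (np.log(p) - np.log(q))))

# Toy scores for four candidate passages of one query:
loss = listwise_distill_loss([2.0, 0.5, 0.1, -1.0], [3.0, 1.0, -0.5, -2.0])
```

Because the target is a distribution over the whole candidate list rather than per-passage labels, the retriever learns relative preferences among candidates, which is what a ranker ultimately needs.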

Adversarial Retriever-Ranker for dense text retrieval

Experimental results show that AR2 consistently and significantly outperforms existing dense retrieval methods, achieving new state-of-the-art results on all evaluated benchmarks.

Encoder Adaptation of Dense Passage Retrieval for Open-Domain Question Answering

Different combinations of DPR's question and passage encoders, learned from five benchmark QA datasets, are inspected on both in-domain and out-of-domain questions to answer how an in-distribution question/passage encoder would generalize if paired with an OOD passage/question encoder from another domain.

Unsupervised Corpus Aware Language Model Pre-training for Dense Passage Retrieval

Recent research demonstrates the effectiveness of using fine-tuned language models (LMs) for dense retrieval. However, dense retrievers are hard to train, typically requiring heavily engineered fine-tuning pipelines to realize their full potential.

Efficient Passage Retrieval with Hashing for Open-domain Question Answering

BPR is a memory-efficient neural retrieval model that integrates a learning-to-hash technique into the state-of-the-art Dense Passage Retriever to represent the passage index using compact binary codes rather than continuous vectors.
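A minimal sketch of the binary-code idea follows. Sign-thresholding stands in for BPR's learned hash (in BPR the binarization is trained end-to-end with the retriever), and the embeddings are random toy data.

```python
import numpy as np

def binarize(vecs):
    """Sign-binarize continuous embeddings into 0/1 codes.
    BPR learns this hash end-to-end; thresholding at zero is a stand-in.
    A real index packs the bits (e.g. with np.packbits) for roughly
    32x less memory than float32 vectors."""
    return (np.asarray(vecs) > 0).astype(np.uint8)

def hamming(query_code, passage_codes):
    """Hamming distance from one code to each row of a code matrix."""
    return np.count_nonzero(query_code != passage_codes, axis=1)

rng = np.random.default_rng(0)
P = rng.standard_normal((4, 64))          # 4 toy passage embeddings
q = P[2] + 0.1 * rng.standard_normal(64)  # query close to passage 2

codes = binarize(P)
dists = hamming(binarize(q), codes)
best = int(np.argmin(dists))  # candidate generation; BPR then re-ranks
```

As in BPR, the cheap Hamming search is only the candidate-generation stage; the surviving candidates are then re-ranked with the continuous query vector.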

Condenser: a Pre-training Architecture for Dense Retrieval

This paper proposes pre-training toward a dense encoder with a novel Transformer architecture, Condenser, in which LM prediction CONditions on DENSE Representation; it improves over standard LM pre-training by large margins on various text retrieval and similarity tasks.