An Approach for Weakly-Supervised Deep Information Retrieval
@article{MacAvaney2017AnAF,
  title={An Approach for Weakly-Supervised Deep Information Retrieval},
  author={Sean MacAvaney and Kai Hui and Andrew Yates},
  journal={ArXiv},
  year={2017},
  volume={abs/1707.00189}
}
Recent developments in neural information retrieval models have been promising, but a problem remains: human relevance judgments are expensive to produce, while neural models require a considerable amount of training data. In an attempt to fill this gap, we present an approach that---given a weak training set of pseudo-queries, documents, and relevance information---filters the data to produce effective positive and negative query-document pairs. This allows large corpora to be used as neural IR…
17 Citations
Selective Weak Supervision for Neural Information Retrieval
- 2020
Computer Science
WWW
The classic IR intuition that anchor-document relations approximate query-document relevance is revisited and a reinforcement weak supervision selection method, ReInfoSelect, which learns to select anchor-document pairs that best weakly supervise the neural ranker (action), using the ranking performance on a handful of relevance labels as the reward.
Investigating Weak Supervision in Deep Ranking
- 2019
Computer Science
Data Inf. Manag.
A cascade ranking framework is proposed to combine the two weakly supervised relevance signals, which significantly promotes the ranking performance of neural ranking models and outperforms the best result in the last NTCIR-13 We Want Web (WWW) task.
Sogou-QCL: A New Dataset with Click Relevance Label
- 2018
Computer Science
SIGIR
A new dataset, Sogou-QCL, is presented, which contains 537,366 queries and five kinds of weak relevance labels for over 12 million query-document pairs and is applied to train recent neural ranking models and shows its potential to serve as weak supervision for ranking.
Embedding-based Zero-shot Retrieval through Query Generation
- 2020
Computer Science
ArXiv
This work considers the embedding-based two-tower architecture as the neural retrieval model and proposes a novel method for generating synthetic training data for retrieval, which produces remarkable results, significantly outperforming BM25 on 5 out of 6 datasets tested.
Context-aware ranking refinement with attentive semi-supervised autoencoders
- 2022
Computer Science
Soft Computing
This work proposes an attentive semi-supervised autoencoder to refine the ranked results using an optimized ranking-oriented reconstruction loss and devises hybrid listwise query constraints to capture the characteristics of relevant documents for different queries.
Neural Ranking with Weak Supervision for Open-Domain Question Answering : A Survey
- 2023
Computer Science
FINDINGS
This work provides a structured overview of standard WS signals used for training a NR model and divides them into three main categories based on their required resources.
Low-Resource Dense Retrieval for Open-Domain Question Answering: A Comprehensive Survey
- 2022
Computer Science
ArXiv
A thorough structured overview of mainstream techniques for low-resource DR is provided, which divides the techniques into three main categories: (1) only documents are needed; (2) documents and questions are needed; and (3) documents and question-answer pairs are needed.
Learning Domain-specific Semantic Representation from Weakly Supervised Data to Improve Research Dataset Retrieval
- 2022
Computer Science
ASIST
This work investigated the use of semantically rich information to retrieve relevant datasets and the benefits of using domain-specific dense vector representation as opposed to general representation, and proposed a fine-tuned model that can improve the NDCG@10 score.
Overcoming low-utility facets for complex answer retrieval
- 2018
Computer Science
Information Retrieval Journal
This work proposes two estimators of facet utility: the hierarchical structure of CAR queries, and facet frequency information from training data, and includes entity similarity scores using embeddings trained from a CAR knowledge graph, which captures the context of facets.
18 References
Neural Ranking Models with Weak Supervision
- 2017
Computer Science
SIGIR
This paper proposes to train a neural ranking model using weak supervision, where labels are obtained automatically without human annotators or any external resources, and suggests that supervised neural ranking models can greatly benefit from pre-training on large amounts of weakly labeled data that can be easily obtained from unsupervised IR models.
A Position-Aware Deep Model for Relevance Matching in Information Retrieval
- 2017
Computer Science
ArXiv
This work presents a novel model architecture consisting of convolutional layers to capture term dependencies and proximity among query term occurrences, followed by a recurrent layer to capture relevance over different query terms.
A Deep Relevance Matching Model for Ad-hoc Retrieval
- 2016
Computer Science
CIKM
A novel deep relevance matching model (DRMM) for ad-hoc retrieval that employs a joint deep architecture at the query term level for relevance matching and can significantly outperform some well-known retrieval models as well as state-of-the-art deep matching models.
PACRR: A Position-Aware Neural IR Model for Relevance Matching
- 2017
Computer Science
EMNLP
This work proposes a novel neural IR model named PACRR aiming at better modeling position-dependent interactions between a query and a document and yields better results under multiple benchmarks.
Learning to Match using Local and Distributed Representations of Text for Web Search
- 2017
Computer Science
WWW
This work proposes a novel document ranking model composed of two separate deep neural networks, one that matches the query and the document using a local representation, and another that matches them using learned distributed representations; matching with distributed representations complements matching with traditional local representations.
End-to-End Neural Ad-hoc Ranking with Kernel Pooling
- 2017
Computer Science
SIGIR
K-NRM uses a translation matrix that models word-level similarities via word embeddings, a new kernel-pooling technique that uses kernels to extract multi-level soft match features, and a learning-to-rank layer that combines those features into the final ranking score.
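The kernel-pooling step described above can be sketched roughly as follows (an illustrative NumPy version, not the authors' implementation; the kernel centers `mus` and width `sigma` are placeholder values):

```python
import numpy as np

def kernel_pooling(sim_matrix, mus, sigma=0.1):
    """Soft-match features from a |Q| x |D| cosine-similarity matrix.

    For each RBF kernel centered at mu, sum kernel activations over
    document terms, take the log, then sum over query terms, yielding
    one multi-level soft-match feature per kernel.
    """
    feats = []
    for mu in mus:
        k = np.exp(-((sim_matrix - mu) ** 2) / (2 * sigma ** 2))  # RBF kernel
        per_query = np.log(np.clip(k.sum(axis=1), 1e-10, None))   # pool over doc terms
        feats.append(per_query.sum())                              # sum over query terms
    return np.array(feats)

# toy example: similarity matrix for 2 query terms x 3 document terms
sim = np.array([[1.0, 0.2, -0.1],
                [0.3, 0.9,  0.0]])
mus = [-0.5, 0.0, 0.5, 1.0]  # kernel centers spanning [-1, 1]
features = kernel_pooling(sim, mus)
print(features.shape)  # → (4,)
```

In the full model, the resulting feature vector feeds a learning-to-rank layer that combines the per-kernel features into the final ranking score.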
Pseudo test collections for learning web search ranking functions
- 2011
Computer Science
SIGIR
Experiments carried out on TREC web track data show that learning to rank models trained using pseudo test collections outperform an unsupervised ranking function and are statistically indistinguishable from a model trained using manual judgments, demonstrating the usefulness of the approach in extracting reasonable quality training data "for free".
Cumulated gain-based evaluation of IR techniques
- 2002
Computer Science
TOIS
This article proposes several novel measures that compute the cumulative gain the user obtains by examining the retrieval result up to a given ranked position, and test results indicate that the proposed measures credit IR methods for their ability to retrieve highly relevant documents and allow testing of statistical significance of effectiveness differences.
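The cumulated-gain family of measures can be sketched as follows (a minimal illustration of the discounted, normalized variant using the common log2 discount at every rank, not the article's exact formulation, which leaves early ranks undiscounted):

```python
import math

def dcg(gains):
    """Discounted cumulated gain: the gain at rank i is divided by
    log2(i + 1), so highly relevant documents retrieved late in the
    ranking contribute less."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))

def ndcg(gains):
    """Normalized DCG: divide by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

# graded relevance labels (0-3) of the retrieved documents, in rank order
ranking = [3, 2, 3, 0, 1, 2]
print(round(ndcg(ranking), 4))  # → 0.9608
```

Because nDCG is normalized to [0, 1], it supports the paper's goal of comparing IR methods by their ability to place highly relevant documents early.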
A Study of MatchPyramid Models on Ad-hoc Retrieval
- 2016
Computer Science
ArXiv
The MatchPyramid models can significantly outperform several recently introduced deep matching models on the retrieval task, but still cannot compete with the traditional retrieval models, such as BM25 and language models.
Pseudo test collections for training and tuning microblog rankers
- 2013
Computer Science
SIGIR
This work describes a method for generating queries and relevance judgments for microblog search in an unsupervised way, and uses pseudo test collections as training sets in a learning to rank scenario.