An Approach for Weakly-Supervised Deep Information Retrieval

Sean MacAvaney, Kai Hui, Andrew Yates
Recent developments in neural information retrieval models have been promising, but a problem remains: human relevance judgments are expensive to produce, while neural models require a considerable amount of training data. In an attempt to fill this gap, we present an approach that, given a weak training set of pseudo-queries, documents, and relevance information, filters the data to produce effective positive and negative query-document pairs. This allows large corpora to be used as neural IR…

Selective Weak Supervision for Neural Information Retrieval

This work revisits the classic IR intuition that anchor-document relations approximate query-document relevance and proposes ReInfoSelect, a reinforcement weak supervision selection method that learns to select the anchor-document pairs that best weakly supervise the neural ranker (the action), using the ranking performance on a handful of relevance labels as the reward.

Investigating Weak Supervision in Deep Ranking

A cascade ranking framework is proposed to combine two kinds of weakly supervised relevance; it significantly improves the ranking performance of neural ranking models and outperforms the best result in the NTCIR-13 We Want Web (WWW) task.

Sogou-QCL: A New Dataset with Click Relevance Label

This work presents a new dataset, Sogou-QCL, which contains 537,366 queries and five kinds of weak relevance labels for over 12 million query-document pairs. Applied to training recent neural ranking models, it shows potential as a source of weak supervision for ranking.

Embedding-based Zero-shot Retrieval through Query Generation

This work considers the embedding-based two-tower architecture as the neural retrieval model and proposes a novel method for generating synthetic training data for retrieval, which produces remarkable results, significantly outperforming BM25 on 5 out of 6 datasets tested.

Context-aware ranking refinement with attentive semi-supervised autoencoders

This work proposes an attentive semi-supervised autoencoder to refine the ranked results using an optimized ranking-oriented reconstruction loss and devises hybrid listwise query constraints to capture the characteristics of relevant documents for different queries.

Neural Ranking with Weak Supervision for Open-Domain Question Answering: A Survey

This work provides a structured overview of standard WS signals used for training an NR model and divides them into three main categories based on the resources they require.

Low-Resource Dense Retrieval for Open-Domain Question Answering: A Comprehensive Survey

A thorough structured overview of mainstream techniques for low-resource DR is provided, which divides the techniques into three main categories: (1) only documents are needed; (2) only documents and questions are needed; and (3) documents and question-answer pairs are needed.

Learning Domain‐specific Semantic Representation from Weakly Supervised Data to Improve Research Dataset Retrieval

This work investigated the use of semantically rich information to retrieve relevant datasets and the benefits of using domain‐specific dense vector representation as opposed to general representation, and proposed a fine‐tuned model that can improve the NDCG@10 score.

Overcoming low-utility facets for complex answer retrieval

This work proposes two estimators of facet utility: the hierarchical structure of CAR queries, and facet frequency information from training data, and includes entity similarity scores using embeddings trained from a CAR knowledge graph, which captures the context of facets.

Neural Ranking Models with Weak Supervision

This paper proposes to train a neural ranking model using weak supervision, where labels are obtained automatically without human annotators or any external resources, and suggests that supervised neural ranking models can greatly benefit from pre-training on large amounts of weakly labeled data that can be easily obtained from unsupervised IR models.
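As a rough illustration of the idea summarized above, the following sketch turns scores from an unsupervised ranker into weakly labeled preference pairs. It is only a minimal sketch under stated assumptions: `weak_pairs`, `term_overlap`, and `min_margin` are hypothetical names introduced here, and the toy term-overlap scorer merely stands in for a real unsupervised IR model; none of this is taken from the paper.

```python
from itertools import combinations


def term_overlap(query, doc):
    """Toy unsupervised scorer (an assumption for illustration):
    counts document tokens that also occur in the query."""
    q_terms = set(query.lower().split())
    return sum(1 for t in doc.lower().split() if t in q_terms)


def weak_pairs(query, docs, scorer, min_margin=0.0):
    """Build (query, preferred_doc, other_doc) triples from weak scores.

    Pairs whose score difference is at most `min_margin` are skipped,
    since the weak signal gives little evidence for either ordering.
    """
    scored = [(doc, scorer(query, doc)) for doc in docs]
    pairs = []
    for (doc_a, s_a), (doc_b, s_b) in combinations(scored, 2):
        if abs(s_a - s_b) <= min_margin:
            continue
        if s_a > s_b:
            pairs.append((query, doc_a, doc_b))
        else:
            pairs.append((query, doc_b, doc_a))
    return pairs


docs = ["neural ranking models", "cooking pasta", "ranking"]
pairs = weak_pairs("neural ranking", docs, term_overlap)
```

The resulting triples could feed a pairwise ranking loss; raising `min_margin` trades training-set size for less noisy preferences.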

A Position-Aware Deep Model for Relevance Matching in Information Retrieval

This work presents a novel model architecture consisting of convolutional layers to capture term dependencies and proximity among query term occurrences, followed by a recurrent layer to capture relevance over different query terms.

A Deep Relevance Matching Model for Ad-hoc Retrieval

A novel deep relevance matching model (DRMM) for ad-hoc retrieval is proposed; it employs a joint deep architecture at the query term level for relevance matching and can significantly outperform some well-known retrieval models as well as state-of-the-art deep matching models.

PACRR: A Position-Aware Neural IR Model for Relevance Matching

This work proposes a novel neural IR model named PACRR, which aims to better model position-dependent interactions between a query and a document and yields better results on multiple benchmarks.

Learning to Match using Local and Distributed Representations of Text for Web Search

This work proposes a novel document ranking model composed of two separate deep neural networks, one that matches the query and the document using a local representation, and another that matches them using distributed representations. Matching with distributed representations complements matching with traditional local representations.

End-to-End Neural Ad-hoc Ranking with Kernel Pooling

K-NRM uses a translation matrix that models word-level similarities via word embeddings, a new kernel-pooling technique that uses kernels to extract multi-level soft match features, and a learning-to-rank layer that combines those features into the final ranking score.
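The three components named above can be sketched in plain Python. This is a minimal sketch, not the authors' implementation: it assumes cosine similarity for the translation matrix, RBF kernels for pooling, and a tanh scoring layer, and the function names, the `sigma` default, and the `1e-10` clamp inside the logarithm are choices made here for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def translation_matrix(query_vecs, doc_vecs):
    """One row per query term, one column per document term."""
    return [[cosine(q, d) for d in doc_vecs] for q in query_vecs]


def kernel_pooling(matrix, mus, sigma=0.1):
    """One soft-match feature per kernel mean in `mus`.

    Each RBF kernel softly counts the document terms whose similarity
    to a query term is near `mu`; log-counts are summed over query terms.
    """
    features = []
    for mu in mus:
        total = 0.0
        for row in matrix:
            soft_count = sum(
                math.exp(-((m - mu) ** 2) / (2 * sigma ** 2)) for m in row
            )
            # Clamp to avoid log(0); the floor value is an assumption.
            total += math.log(max(soft_count, 1e-10))
        features.append(total)
    return features


def ranking_score(features, weights, bias=0.0):
    """Combine the kernel features into a single score."""
    return math.tanh(sum(w * f for w, f in zip(weights, features)) + bias)


matrix = translation_matrix([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])
features = kernel_pooling(matrix, mus=[1.0, 0.5, 0.0])
```

Kernels centered near 1.0 behave like exact-match counters, while kernels at lower means pick up progressively weaker soft matches.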

Pseudo test collections for learning web search ranking functions

Experiments carried out on TREC web track data show that learning-to-rank models trained using pseudo test collections outperform an unsupervised ranking function and are statistically indistinguishable from a model trained using manual judgments, demonstrating the usefulness of the approach in extracting reasonable quality training data "for free".

Cumulated gain-based evaluation of IR techniques

This article proposes several novel measures that compute the cumulative gain the user obtains by examining the retrieval result up to a given ranked position, and test results indicate that the proposed measures credit IR methods for their ability to retrieve highly relevant documents and allow testing of statistical significance of effectiveness differences.
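The measures described above can be illustrated with a short sketch. It assumes graded gains listed in ranked order and a logarithmic discount that applies only from rank `base` onward; the function names are hypothetical, and computing the ideal ranking by sorting the same gain list is a simplification (an ideal ranking is normally built from all judged documents for the query).

```python
import math


def cumulated_gain(gains):
    """CG at each rank: running sum of the gains seen so far."""
    total, out = 0.0, []
    for g in gains:
        total += g
        out.append(total)
    return out


def discounted_cumulated_gain(gains, base=2):
    """DCG at each rank: gains at ranks >= base are divided by log_base(rank)."""
    total, out = 0.0, []
    for rank, g in enumerate(gains, start=1):
        if rank < base:
            total += g  # no discount before rank `base`
        else:
            total += g / math.log(rank, base)
        out.append(total)
    return out


def normalized_dcg(gains, base=2):
    """DCG divided by the DCG of the ideal (descending-gain) ordering."""
    actual = discounted_cumulated_gain(gains, base)
    ideal = discounted_cumulated_gain(sorted(gains, reverse=True), base)
    return [a / i if i > 0 else 0.0 for a, i in zip(actual, ideal)]


gains = [3, 2, 3, 0, 1, 2]
cg = cumulated_gain(gains)
dcg = discounted_cumulated_gain(gains)
ndcg = normalized_dcg(gains)
```

A larger `base` discounts late-ranked documents less sharply, which models a more patient user.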

A Study of MatchPyramid Models on Ad-hoc Retrieval

The MatchPyramid models can significantly outperform several recently introduced deep matching models on the retrieval task, but still cannot compete with the traditional retrieval models, such as BM25 and language models.

Pseudo test collections for training and tuning microblog rankers

This work describes a method for generating queries and relevance judgments for microblog search in an unsupervised way, and uses pseudo test collections as training sets in a learning-to-rank scenario.