An Approach for Weakly-Supervised Deep Information Retrieval

Sean MacAvaney, Kai Hui, Andrew Yates
Recent developments in neural information retrieval models have been promising, but a problem remains: human relevance judgments are expensive to produce, while neural models require a considerable amount of training data. In an attempt to fill this gap, we present an approach that, given a weak training set of pseudo-queries, documents, and relevance information, filters the data to produce effective positive and negative query-document pairs. This allows large corpora to be used as neural IR… 

Selective Weak Supervision for Neural Information Retrieval

The classic IR intuition that anchor-document relations approximate query-document relevance is revisited, and a reinforcement weak supervision selection method, ReInfoSelect, is proposed. It learns to select the anchor-document pairs that best weakly supervise the neural ranker (the action), using the ranking performance on a handful of relevance labels as the reward.

Investigating Weak Supervision in Deep Ranking

A cascade ranking framework is proposed to combine the two weakly supervised relevance signals; it significantly improves the ranking performance of neural ranking models and outperforms the best result in the last NTCIR-13 We Want Web (WWW) task.

Sogou-QCL: A New Dataset with Click Relevance Label

A new dataset, Sogou-QCL, is presented, which contains 537,366 queries and five kinds of weak relevance labels for over 12 million query-document pairs. Applied to training recent neural ranking models, it shows potential to serve as weak supervision for ranking.

Embedding-based Zero-shot Retrieval through Query Generation

This work considers the embedding-based two-tower architecture as the neural retrieval model and proposes a novel method for generating synthetic training data for retrieval, which produces remarkable results, significantly outperforming BM25 on 5 out of 6 datasets tested.

Context-aware ranking refinement with attentive semi-supervised autoencoders

This work proposes an attentive semi-supervised autoencoder to refine the ranked results using an optimized ranking-oriented reconstruction loss, and devises hybrid listwise query constraints to capture the characteristics of relevant documents for different queries.

Low-Resource Dense Retrieval for Open-Domain Question Answering: A Comprehensive Survey

A thorough structured overview of mainstream techniques for low-resource DR, dividing the techniques into three main categories based on their required resources, and highlighting the open issues and pros and cons.

Learning Domain‐specific Semantic Representation from Weakly Supervised Data to Improve Research Dataset Retrieval

This work investigated the use of semantically rich information to retrieve relevant datasets and the benefits of using domain‐specific dense vector representation as opposed to general representation, and proposed a fine‐tuned model that can improve the NDCG@10 score.

Context-Aware Document Term Weighting for Ad-Hoc Search

Experiments show that an index using HDCT weights significantly improved the retrieval accuracy compared to typical term-frequency and state-of-the-art embedding-based indexes.

Passage Ranking with Weak Supervision

This paper trains a BERT-based passage-ranking model (which achieves new state-of-the-art performances on two benchmark datasets with full supervision) in a weak supervision framework and considers two sources of weak supervision signals, unsupervised ranking functions and semantic feature similarities.

Report on the Second SIGIR Workshop on Neural Information Retrieval (Neu-IR'17)

The second SIGIR workshop on neural information retrieval (Neu-IR'17) took place on August 11, 2017, in Tokyo, Japan, and focused on resources for evaluation and reproducibility, including proposals for public benchmarking datasets and shared model repositories.



Neural Ranking Models with Weak Supervision

This paper proposes to train a neural ranking model using weak supervision, where labels are obtained automatically without human annotators or any external resources, and suggests that supervised neural ranking models can greatly benefit from pre-training on large amounts of weakly labeled data that can be easily obtained from unsupervised IR models.
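
As a rough illustration of obtaining labels from an unsupervised IR model, the sketch below scores documents with a small BM25 implementation and turns the resulting order into pairwise training examples. The choice of BM25, the parameter values, and the helper names `bm25_scores` and `weak_training_pairs` are assumptions made for illustration; this is not the paper's pipeline.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against a tokenized query with BM25.

    k1 and b are common default values, assumed here for illustration.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def weak_training_pairs(query, docs):
    """Return (higher_doc_index, lower_doc_index) pairs implied by BM25 order."""
    scores = bm25_scores(query, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    pairs = []
    for pos, i in enumerate(ranked):
        for j in ranked[pos + 1:]:
            if scores[i] > scores[j]:  # skip ties: they carry no preference
                pairs.append((i, j))
    return pairs

docs = [
    ["neural", "ranking", "model"],
    ["cooking", "pasta", "recipe"],
    ["neural", "network"],
]
pairs = weak_training_pairs(["neural", "ranking"], docs)
```

Each pair can then serve as a weak preference label for a pairwise ranking loss, with no human judgment involved.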

A Position-Aware Deep Model for Relevance Matching in Information Retrieval

This work presents a novel model architecture consisting of convolutional layers to capture term dependencies and proximity among query term occurrences, followed by a recurrent layer to capture relevance over different query terms.

A Deep Relevance Matching Model for Ad-hoc Retrieval

A novel deep relevance matching model (DRMM) for ad-hoc retrieval that employs a joint deep architecture at the query term level for relevance matching and can significantly outperform some well-known retrieval models as well as state-of-the-art deep matching models.

PACRR: A Position-Aware Neural IR Model for Relevance Matching

This work proposes a novel neural IR model named PACRR aiming at better modeling position-dependent interactions between a query and a document and yields better results under multiple benchmarks.

Learning to Match using Local and Distributed Representations of Text for Web Search

This work proposes a novel document ranking model composed of two separate deep neural networks, one that matches the query and the document using a local representation, and another that matches them using learned distributed representations. Matching with distributed representations complements matching with traditional local representations.

End-to-End Neural Ad-hoc Ranking with Kernel Pooling

K-NRM uses a translation matrix that models word-level similarities via word embeddings, a new kernel-pooling technique that uses kernels to extract multi-level soft match features, and a learning-to-rank layer that combines those features into the final ranking score.
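
The kernel-pooling step can be sketched in a few lines of plain Python. The version below assumes Gaussian (RBF) kernels over a precomputed query-by-document similarity matrix, one shared `sigma`, and a small clamp before the logarithm; the function name, the kernel centers, and the toy matrix are illustrative choices, not values taken from the paper.

```python
import math

def kernel_pooling(sim_matrix, mus, sigma=0.1):
    """Turn a similarity matrix into one soft-match feature per kernel.

    sim_matrix[i][j] is the similarity between query term i and
    document term j (for example, cosine similarity of embeddings).
    """
    features = []
    for mu in mus:
        total = 0.0
        for row in sim_matrix:
            # Soft count of document terms whose similarity is close to mu.
            soft_tf = sum(
                math.exp(-((s - mu) ** 2) / (2 * sigma ** 2)) for s in row
            )
            # Clamp to avoid log(0) when no document term is near mu.
            total += math.log(max(soft_tf, 1e-10))
        features.append(total)
    return features

# Toy example: 2 query terms, 2 document terms.
sim = [[1.0, 0.0],
       [0.5, 0.5]]
features = kernel_pooling(sim, mus=[1.0, 0.5, 0.0])
```

The resulting feature vector (one entry per kernel) is what a learning-to-rank layer would combine into a single score.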

Pseudo test collections for learning web search ranking functions

Experiments carried out on TREC web track data show that learning-to-rank models trained using pseudo test collections outperform an unsupervised ranking function and are statistically indistinguishable from a model trained using manual judgments, demonstrating the usefulness of the approach in extracting reasonable quality training data "for free".

Cumulated gain-based evaluation of IR techniques

This article proposes several novel measures that compute the cumulative gain the user obtains by examining the retrieval result up to a given ranked position, and test results indicate that the proposed measures credit IR methods for their ability to retrieve highly relevant documents and allow testing of statistical significance of effectiveness differences.
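
A minimal sketch of cumulated-gain measures follows, assuming graded relevance values listed in rank order and a logarithmic discount with base 2 that leaves ranks below the base undiscounted. The function names are illustrative, and the normalization here uses the ideal reordering of the same list rather than of all known relevant documents, which is a simplification.

```python
import math

def cumulated_gain(gains):
    """Running sum of relevance gains down the ranked list."""
    out, total = [], 0.0
    for g in gains:
        total += g
        out.append(total)
    return out

def discounted_cumulated_gain(gains, base=2):
    """Running sum where the gain at rank r >= base is divided by log_base(r)."""
    out, total = [], 0.0
    for rank, g in enumerate(gains, start=1):
        # Ranks below the log base are not discounted.
        total += g if rank < base else g / math.log(rank, base)
        out.append(total)
    return out

def normalized_dcg(gains, base=2):
    """DCG divided position by position by the DCG of the ideal ordering."""
    ideal = discounted_cumulated_gain(sorted(gains, reverse=True), base)
    actual = discounted_cumulated_gain(gains, base)
    return [a / i if i > 0 else 0.0 for a, i in zip(actual, ideal)]

gains = [3, 2, 3, 0, 1, 2]  # graded relevance of the top six results
cg = cumulated_gain(gains)
dcg = discounted_cumulated_gain(gains)
ndcg = normalized_dcg(gains)
```

Because highly relevant documents contribute larger gains and late ranks are discounted, a system that places them early scores higher than one that merely retrieves them somewhere in the list.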

A Study of MatchPyramid Models on Ad-hoc Retrieval

The MatchPyramid models can significantly outperform several recently introduced deep matching models on the retrieval task, but still cannot compete with the traditional retrieval models, such as BM25 and language models.

Pseudo test collections for training and tuning microblog rankers

This work describes a method for generating queries and relevance judgments for microblog search in an unsupervised way, and uses pseudo test collections as training sets in a learning-to-rank scenario.