Corpus ID: 222132914

Leveraging Semantic and Lexical Matching to Improve the Recall of Document Retrieval Systems: A Hybrid Approach

Saar Kuzi, Mingyang Zhang, Cheng Li, Michael Bendersky, Marc Najork
Search engines often follow a two-phase paradigm where in the first stage (the retrieval stage) an initial set of documents is retrieved and in the second stage (the re-ranking stage) the documents are re-ranked to obtain the final result list. While deep neural networks were shown to improve the performance of the re-ranking stage in previous works, there is little literature about using deep neural networks to improve the retrieval stage. In this paper, we study the merits of combining deep… 
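The two-phase paradigm described above can be illustrated with a minimal Python sketch. The abstract is truncated, so the merge strategy here (a simple order-preserving union of lexical and semantic candidates) is an assumption for illustration and not necessarily the authors' method. The function names, document IDs, and scores are all hypothetical.

```python
from typing import Callable, List


def hybrid_retrieve(lexical_hits: List[str], semantic_hits: List[str]) -> List[str]:
    """First stage: union of two candidate lists, keeping first-seen order."""
    seen = set()
    merged = []
    for doc_id in lexical_hits + semantic_hits:
        if doc_id not in seen:
            seen.add(doc_id)
            merged.append(doc_id)
    return merged


def rerank(candidates: List[str], scorer: Callable[[str], float]) -> List[str]:
    """Second stage: order the candidate set by a (hypothetical) re-ranker score."""
    return sorted(candidates, key=scorer, reverse=True)


# Toy data: hypothetical outputs of a lexical and a semantic retriever.
lexical_hits = ["d1", "d2", "d3"]
semantic_hits = ["d3", "d4"]
toy_scores = {"d1": 0.2, "d2": 0.9, "d3": 0.5, "d4": 0.7}

candidates = hybrid_retrieve(lexical_hits, semantic_hits)
final = rerank(candidates, lambda doc_id: toy_scores[doc_id])
```

In this toy run the semantic retriever contributes `d4`, which the lexical retriever missed; a larger candidate pool of this kind is how a hybrid first stage can raise recall before re-ranking.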

Semantic Models for the First-Stage Retrieval: A Comprehensive Review

The current landscape of the first-stage retrieval models under a unified framework is described to clarify the connection between classical term-based retrieval methods, early semantic retrieval methods, and neural semantic retrieval methods.

Early Stage Sparse Retrieval with Entity Linking

This work proposes boosting the performance of sparse retrievers by expanding both the queries and the documents with linked entities in two formats for the entity names: explicit and hashed, and adopts a run fusion approach to maximize the benefits of entity linking.

UnifieR: A Unified Retriever for Large-Scale Retrieval

A new learning framework, UnifieR, is proposed, which combines dense-vector and lexicon-based retrieval in one model with a dual-representing capability, and experiments on passage retrieval benchmarks verify its effectiveness in both paradigms.

Complement Lexical Retrieval Model with Semantic Residual Embeddings

Empirical evaluations demonstrate the advantages of CLEAR over state-of-the-art retrieval models, and show that it can substantially improve the end-to-end accuracy and efficiency of reranking pipelines.

Out-of-Domain Semantics to the Rescue! Zero-Shot Hybrid Retrieval Models

This work carefully selects five datasets, and proposes a simple yet effective framework to integrate lexical and deep retrieval models, demonstrating that these two models are complementary, even when the deep model is weaker in the out-of-domain setting.

LED: Lexicon-Enlightened Dense Retriever for Large-Scale Retrieval

This work proposes to make a dense retriever align with a well-performing lexicon-aware representation model and finds that the resulting improvement is complementary to the standard ranker distillation, which can further lift state-of-the-art performance.

A BERT-based Siamese-structured Retrieval Model

A BERT-based Siamese-structured retrieval model (BESS) is proposed that not only inherits the merits of pre-trained language models, but also can automatically generate extra information to compensate for the original query; a reinforcement learning strategy is introduced to make the model more robust.

Improving Transformer-Kernel Ranking Model Using Conformer and Query Term Independence

This work proposes a novel Conformer layer as an alternative approach to scale TK to longer input sequences and incorporates query term independence and explicit term matching to extend the model to the full retrieval setting.

Pseudo-relevance feedback based query expansion using boosting algorithm

A boosting query term method is proposed to reweight and strengthen the original query; it effectively identifies the most relevant keywords, even for short queries.

Low-Resource Dense Retrieval for Open-Domain Question Answering: A Comprehensive Survey

A thorough structured overview of mainstream techniques for low-resource dense retrieval (DR), dividing the techniques into three main categories based on their required resources, and highlighting the open issues and pros and cons.



Learning deep structured semantic models for web search using clickthrough data

A series of new latent semantic models with a deep structure that project queries and documents into a common low-dimensional space where the relevance of a document given a query is readily computed as the distance between them are developed.

Learning to Match using Local and Distributed Representations of Text for Web Search

This work proposes a novel document ranking model composed of two separate deep neural networks, one that matches the query and the document using a local representation, and another that matches them using learned distributed representations; matching with distributed representations complements matching with traditional local representations.

Context-Aware Sentence/Passage Term Importance Estimation For First Stage Retrieval

A Deep Contextualized Term Weighting framework that learns to map BERT's contextualized text representations to context-aware term weights for sentences and passages to improve the accuracy of first-stage retrieval algorithms.

A Latent Semantic Model with Convolutional-Pooling Structure for Information Retrieval

A new latent semantic model that incorporates a convolutional-pooling structure over word sequences to learn low-dimensional, semantic vector representations for search queries and Web documents is proposed.

Incorporating Query Term Independence Assumption for Efficient Retrieval and Ranking using Deep Neural Networks

By operating on each query term independently, these otherwise computationally intensive models become amenable to offline precomputation, dramatically reducing the cost of query evaluations employing state-of-the-art neural ranking models.
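A minimal Python sketch of why term independence enables offline precomputation: if the query-document score is a sum of per-term scores, each term-document score can be computed once and stored in an inverted index. The `toy_term_score` function is a hypothetical stand-in (plain term frequency) for the neural model, and the helper names and toy documents are assumptions for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def precompute_term_scores(
    docs: Dict[str, str],
    vocab: List[str],
    term_doc_score: Callable[[str, str], float],
) -> Dict[str, Dict[str, float]]:
    """Offline step: score every (term, document) pair once and store non-zero scores."""
    index: Dict[str, Dict[str, float]] = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in vocab:
            score = term_doc_score(term, text)
            if score > 0.0:
                index[term][doc_id] = score
    return index


def score_query(query_terms: List[str], index: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Online step: the query score is just the sum of precomputed per-term scores."""
    totals: Dict[str, float] = defaultdict(float)
    for term in query_terms:
        for doc_id, score in index.get(term, {}).items():
            totals[doc_id] += score
    return dict(totals)


def toy_term_score(term: str, text: str) -> float:
    """Hypothetical stand-in for an expensive neural term-document scorer."""
    return float(text.split().count(term))


docs = {"d1": "neural ranking with neural models", "d2": "term based ranking"}
index = precompute_term_scores(docs, ["neural", "ranking", "term"], toy_term_score)
scores = score_query(["neural", "ranking"], index)
```

At query time only dictionary lookups and additions remain, so the expensive model never runs online.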

Neural Vector Spaces for Unsupervised Information Retrieval

It is found that an unsupervised ensemble of multiple models trained with different hyperparameter values performs better than a single cross-validated model, and therefore NVSM can safely be used for ranking documents without supervised relevance judgments.

Off the Beaten Path: Let's Replace Term-Based Retrieval with k-NN Search

This work replaces the term-based search with a generic k-NN retrieval algorithm, where a similarity function can take into account subtle term associations, and demonstrates that an approximate algorithm can be nearly two orders of magnitude faster at the expense of only a small loss in accuracy.

Latent semantic indexing (LSI) fails for TREC collections

This paper finds that LSI yields poor retrieval accuracy on the TREC 2, 7, 8, and 2004 collections, and derives novel scoring methods that implement the ideas of query expansion and score regularization in the LSI framework.

End-to-End Neural Ad-hoc Ranking with Kernel Pooling

K-NRM uses a translation matrix that models word-level similarities via word embeddings, a new kernel-pooling technique that uses kernels to extract multi-level soft match features, and a learning-to-rank layer that combines those features into the final ranking score.
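The three components named above can be sketched in plain Python: a translation matrix of cosine similarities, RBF kernel pooling that sums log soft-match counts over query terms, and a final linear layer with a tanh. The embeddings, kernel means and widths, and layer weights below are hypothetical toy values; in the actual model they are learned end to end.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two word embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def translation_matrix(query_vecs, doc_vecs) -> List[List[float]]:
    """M[i][j] = similarity between query word i and document word j."""
    return [[cosine(q, d) for d in doc_vecs] for q in query_vecs]


def kernel_pooling(matrix, mus, sigmas) -> List[float]:
    """One soft-match feature per kernel: sum over query words of log soft-TF."""
    features = []
    for mu, sigma in zip(mus, sigmas):
        total = 0.0
        for row in matrix:
            soft_tf = sum(math.exp(-((m - mu) ** 2) / (2 * sigma ** 2)) for m in row)
            total += math.log(max(soft_tf, 1e-10))  # clamp to avoid log(0)
        features.append(total)
    return features


def ranking_score(features, weights, bias) -> float:
    """Learning-to-rank layer: tanh of a linear combination of kernel features."""
    return math.tanh(sum(w * f for w, f in zip(weights, features)) + bias)


# Toy example: one query word, two document words (hypothetical 2-d embeddings).
query_vecs = [[1.0, 0.0]]
doc_vecs = [[1.0, 0.0], [0.0, 1.0]]
M = translation_matrix(query_vecs, doc_vecs)
features = kernel_pooling(M, mus=[1.0, 0.0], sigmas=[0.1, 0.1])
score = ranking_score(features, weights=[0.5, -0.5], bias=0.0)
```

A kernel centered at `mu = 1.0` behaves like an exact-match counter, while kernels at lower means count progressively softer matches.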

A Deep Relevance Matching Model for Ad-hoc Retrieval

A novel deep relevance matching model (DRMM) for ad-hoc retrieval that employs a joint deep architecture at the query term level for relevance matching and can significantly outperform some well-known retrieval models as well as state-of-the-art deep matching models.