A study of smoothing methods for language models applied to Ad Hoc information retrieval

Authors: ChengXiang Zhai and John D. Lafferty
Venue: SIGIR '01
Language modeling approaches to information retrieval are attractive and promising because they connect the problem of retrieval with that of language model estimation, which has been studied extensively in other application areas such as speech recognition. The basic idea of these approaches is to estimate a language model for each document, and then rank documents by the likelihood of the query according to the estimated language model. A core problem in language model estimation is smoothing… 
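The ranking idea described in the abstract can be illustrated with a minimal sketch. It assumes tokenized documents and a unigram model, and it uses two widely known smoothing formulas, Jelinek-Mercer interpolation and Dirichlet prior smoothing. All function names and the default values of `lam` and `mu` are illustrative assumptions, not taken from the paper.

```python
import math
from collections import Counter


def collection_model(docs):
    """Maximum-likelihood unigram model over the whole collection."""
    counts = Counter()
    for doc in docs:
        counts.update(doc)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def jelinek_mercer(count, doc_len, p_coll, lam=0.5):
    """Linear interpolation of the document model with the collection model.

    lam=0.5 is an arbitrary illustrative default.
    """
    p_ml = count / doc_len if doc_len else 0.0
    return (1.0 - lam) * p_ml + lam * p_coll


def dirichlet(count, doc_len, p_coll, mu=2000.0):
    """Dirichlet prior smoothing; mu=2000 is an arbitrary illustrative default."""
    return (count + mu * p_coll) / (doc_len + mu)


def query_log_likelihood(query, doc, p_coll_model, smooth):
    """Log-likelihood of the query under the smoothed document model."""
    tf = Counter(doc)
    doc_len = len(doc)
    score = 0.0
    for w in query:
        p = smooth(tf[w], doc_len, p_coll_model.get(w, 0.0))
        if p == 0.0:
            # The term is unseen in the whole collection, so smoothing cannot help.
            return float("-inf")
        score += math.log(p)
    return score


def rank(query, docs, smooth=dirichlet):
    """Return document indices ordered by decreasing query likelihood."""
    p_coll = collection_model(docs)
    scored = [
        (query_log_likelihood(query, doc, p_coll, smooth), i)
        for i, doc in enumerate(docs)
    ]
    return [i for _, i in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]


docs = [
    ["language", "model", "smoothing"],
    ["speech", "recognition", "model"],
    ["image", "retrieval"],
]
ranking = rank(["language", "model"], docs)
```

Without smoothing, any document missing a single query term would receive probability zero for the whole query. Both functions avoid this by backing off to the collection model.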


A Study of Language Model for Image Retrieval

This paper investigates and discusses whether language model approaches can be adapted to content based image retrieval (CBIR), based on the “bag of visual words” image representation, and performs extensive studies over different smoothing methods, strategies, and parameters.

Language models and smoothing methods for information retrieval

A new language model based on an odds formula, which explicitly incorporates document length as a parameter, is presented. A new smoothing method called exponential smoothing is also introduced; it can be combined with most language models and improves the accuracy of the estimated language model.

Clusters, language models, and ad hoc information retrieval

A novel algorithmic framework is proposed in which information provided by document-based language models is enhanced by information drawn from clusters of similar documents, and a suite of new algorithms is developed.

Term-specific smoothing for the language modeling approach to information retrieval: the importance of a query term

The new language modeling approach is shown to explain a number of practical facts of today's information retrieval systems that are not very well explained by the current state of information retrieval theory, including stop words, mandatory terms, coordination level ranking and retrieval using phrases.

A Comparative Study of Probabalistic and Language Models for Information Retrieval

For ad hoc retrieval, the Dirichlet smoothing method was found to be significantly better than Okapi BM25, but for named-page finding Okapi BM25 was more effective than the language modelling methods.

Two-stage language models for information retrieval

Evaluation on five different databases and four types of queries indicates that the two-stage smoothing method with the proposed parameter estimation methods consistently gives retrieval performance that is close to, or better than, the best results achieved using a single smoothing method and exhaustive parameter search on the test data.

Title language model for information retrieval

In the experiments with four different TREC document collections, the title language model for information retrieval with the new smoothing method outperforms both the traditional language model and the vector space model for IR significantly.

GJM-2: A Special Case of General Jelinek-Mercer Smoothing Method for Language Modeling Approach to Ad Hoc IR

Experimental results show that using GJM-2 for the language modeling approach can achieve better retrieval performance than the three existing popular methods on both short and long queries.

A summary based language retrieval method

A summary-biased approach is proposed for studying the use of internal structures in estimating the document language model, based on the hypothesis that a query-biased summary presents the information that is most relevant to a query.

Risk Minimization and Language Modeling in Text Retrieval – Thesis Summary

This thesis presents a new general probabilistic framework for text retrieval based on Bayesian decision theory, and shows that it is possible to achieve excellent retrieval performance without any ad hoc parameter tuning by exploiting statistical estimation methods to set the retrieval parameters completely automatically.



A general language model for information retrieval

A new language model for information retrieval is presented, which is based on a range of data smoothing techniques, including the Good-Turing estimate, curve-fitting functions, and model combinations, and can be easily extended to incorporate probabilities of phrases such as word pairs and word triples.

Information retrieval as statistical translation

A simple, well motivated model of the document-to-query translation process is proposed, and an algorithm for learning the parameters of this model in an unsupervised manner from a collection of documents is described.

A hidden Markov model information retrieval system

A novel method for performing blind feedback in the HMM framework, a more complex HMM that models bigram production, and several other algorithmic refinements form a state-of-the-art retrieval system that ranked among the best on the TREC-7 ad hoc retrieval task.

Probabilistic Models in Information Retrieval

  • N. Fuhr
  • Computer Science
    Comput. J.
  • 1992
An introduction to and survey of probabilistic information retrieval (IR) is given: the probability-ranking principle shows that optimum retrieval quality can be achieved under certain assumptions; a conceptual model for IR, along with the corresponding event space, clarifies the interpretation of the probabilistic parameters involved.

On modeling information retrieval with probabilistic inference

This article examines and extends the logical models of information retrieval in the context of probability theory. The fundamental notions of term weights and relevance are given probabilistic

Estimation of probabilities from sparse data for the language model component of a speech recognizer

  • S. Katz
  • Computer Science
    IEEE Trans. Acoust. Speech Signal Process.
  • 1987
The model offers, via a nonlinear recursive procedure, a computation and space efficient solution to the problem of estimating probabilities from sparse data, and compares favorably to other proposed methods.

A hierarchical Dirichlet language model

A hierarchical probabilistic model whose predictions are similar to those of the popular language modelling procedure known as ‘smoothing’ is discussed, and the methods prove to be about equally accurate, with the hierarchical model using fewer computational resources.

A Non-Classical Logic for Information Retrieval

This paper is to be seen as describing a new theoretical framework for investigating information retrieval, and it is suggested that some attempt should be made to construct something like a naive model, using more than just keywords, of the content of each document in the system.

Pivoted document length normalization

Pivoted normalization is presented, a technique that can be used to modify any normalization function, thereby reducing the gap between the relevance and the retrieval probabilities, and two new normalization functions are presented: pivoted unique normalization and pivoted byte size normalization.
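As a rough illustration of how an existing normalization factor can be "pivoted", the sketch below applies the commonly cited linear form, (1 - slope) * pivot + slope * old_norm. The function name, the choice of the collection average as the pivot, and the slope value are assumptions made for this example only.

```python
def pivoted_norm(old_norm, pivot, slope):
    """Tilt an existing normalization factor around the pivot point.

    With slope < 1, documents whose old factor exceeds the pivot are
    penalized less than before, and documents below the pivot are
    penalized more. At old_norm == pivot the factor is unchanged.
    """
    return (1.0 - slope) * pivot + slope * old_norm


# Hypothetical old normalization factors (e.g. document lengths).
old_norms = [50.0, 100.0, 150.0]
pivot = sum(old_norms) / len(old_norms)  # assumption: pivot = collection average
slope = 0.25                             # assumption: illustrative value only
new_norms = [pivoted_norm(n, pivot, slope) for n in old_norms]
```

A term weight would then be divided by the pivoted factor in place of the original one.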

Okapi at TREC-3

During the course of TREC-1 the low-level search functions were split off into a separate Basic Search System (BSS) [2], but retrieval and ranking of documents was still done using the "classical"