Information Retrieval as Statistical Translation


We propose a new probabilistic approach to information retrieval based upon the ideas and methods of statistical machine translation. The central ingredient in this approach is a statistical model of how a user might distill or "translate" a given document into a query. To assess the relevance of a document to a user's query, we estimate the probability that the query would have been generated as a translation of the document, and factor in the user's general preferences in the form of a prior distribution over documents. We propose a simple, well-motivated model of the document-to-query translation process, and describe an algorithm for learning the parameters of this model in an unsupervised manner from a collection of documents. As we show, one can view this approach as a generalization and justification of the "language modeling" strategy recently proposed by Ponte and Croft. In a series of experiments on TREC data, a simple translation-based retrieval system performs well in comparison to conventional retrieval techniques. This prototype system only begins to tap the full potential of translation-based retrieval.
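
The scoring idea in the abstract, ranking documents by the probability that the query arose as a translation of the document, combined with a prior over documents, can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the function names, the hand-written `translation` table (the abstract says such parameters are learned from a document collection), the maximum-likelihood document word distribution, and the `floor` constant used in place of proper smoothing.

```python
import math
from collections import Counter


def doc_language_model(doc_tokens):
    """Maximum-likelihood word distribution l(w | d) for one document."""
    counts = Counter(doc_tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def query_log_likelihood(query_tokens, doc_tokens, translation, floor=1e-9):
    """log p(q | d), where each query word q_i contributes
    log sum_w t(q_i | w) * l(w | d).

    translation[w][q] is a hypothetical table of word-to-word translation
    probabilities t(q | w). The floor is an illustrative stand-in for
    smoothing so that unseen query words do not produce log(0).
    """
    l = doc_language_model(doc_tokens)
    score = 0.0
    for q in query_tokens:
        p = sum(translation.get(w, {}).get(q, 0.0) * p_w for w, p_w in l.items())
        score += math.log(max(p, floor))
    return score


def rank(query_tokens, docs, translation, log_prior=None):
    """Order document ids by log p(q | d) + log p(d), best first.

    With log_prior omitted, the prior over documents is treated as uniform.
    """
    scored = []
    for doc_id, tokens in docs.items():
        prior = log_prior.get(doc_id, 0.0) if log_prior else 0.0
        likelihood = query_log_likelihood(query_tokens, tokens, translation)
        scored.append((likelihood + prior, doc_id))
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]


# Toy data, invented for this example.
translation = {
    "car": {"car": 0.7, "automobile": 0.3},
    "engine": {"engine": 0.8, "motor": 0.2},
    "cat": {"cat": 0.9, "kitten": 0.1},
}
docs = {"d1": ["car", "engine"], "d2": ["cat", "cat"]}
ranking = rank(["automobile"], docs, translation)
```

In this toy setting the query word "automobile" never appears in either document, yet `d1` can still receive a non-trivial score because "car" translates to "automobile" with nonzero probability. Setting `translation[w]` to `{w: 1.0}` for every word reduces the score to plain query likelihood under the document's word distribution, which is one way to read the abstract's remark that the approach generalizes the language modeling strategy.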

DOI: 10.1145/3130348.3130371

