Probabilistic Latent Semantic Indexing

Abstract

Probabilistic Latent Semantic Indexing is a novel approach to automated document indexing which is based on a statistical latent class model for factor analysis of count data. Fitted from a training corpus of text documents by a generalization of the Expectation Maximization algorithm, the utilized model is able to deal with domain-specific synonymy as well as with polysemous words. In contrast to standard Latent Semantic Indexing (LSI) by Singular Value Decomposition, the probabilistic variant has a solid statistical foundation and defines a proper generative data model. Retrieval experiments on a number of test collections indicate substantial performance gains over direct term matching methods as well as over LSI. In particular, the combination of models with different dimensionalities has proven to be advantageous.
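As a rough illustration of the kind of latent class model the abstract describes, the sketch below fits a small aspect model to a document-term count matrix. It rests on several assumptions that the abstract does not spell out: the parameterization P(w|d) = sum over z of P(w|z) P(z|d), plain EM rather than the generalized variant mentioned above, and dense NumPy arrays that only suit toy corpora. The name fit_plsa and its parameters are hypothetical and chosen for this example.

```python
import numpy as np


def fit_plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit a latent class (aspect) model with plain EM.

    counts: (n_docs, n_words) array of term counts; every document is
            assumed to contain at least one word.
    Returns P(z|d) with shape (n_docs, n_topics), P(w|z) with shape
    (n_topics, n_words), and the log-likelihood at the start of each iteration.
    """
    counts = np.asarray(counts, dtype=float)
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    eps = 1e-12  # guards against division by zero and log(0)

    # Random initialization, normalized so that each row is a distribution.
    p_z_d = rng.random((n_docs, n_topics))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)

    log_liks = []
    for _ in range(n_iter):
        # E-step: posterior P(z|d,w) is proportional to P(z|d) * P(w|z).
        joint = p_z_d[:, None, :] * p_w_z.T[None, :, :]  # (docs, words, topics)
        p_w_d = joint.sum(axis=2)                        # mixture P(w|d)
        log_liks.append(float((counts * np.log(p_w_d + eps)).sum()))
        posterior = joint / (p_w_d[:, :, None] + eps)

        # M-step: re-estimate both factors from expected counts.
        expected = counts[:, :, None] * posterior        # (docs, words, topics)
        p_w_z = expected.sum(axis=0).T                   # (topics, words)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + eps
        p_z_d = expected.sum(axis=1)                     # (docs, topics)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + eps

    return p_z_d, p_w_z, log_liks


# Toy corpus: two documents about one vocabulary block, two about another.
toy_counts = np.array([
    [4, 3, 2, 0, 0, 0],
    [3, 5, 1, 0, 0, 1],
    [0, 0, 0, 5, 2, 3],
    [0, 1, 0, 2, 4, 4],
])
p_z_d, p_w_z, log_liks = fit_plsa(toy_counts, n_topics=2)
```

In this sketch, each row of p_z_d is a low-dimensional representation of a document that could be compared with a similarly projected query. The tempering step that a generalized EM variant would add, and any combination of models with different numbers of latent classes, are left out.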

DOI: 10.1145/3130348.3130370

