Topical N-Grams: Phrase and Topic Discovery, with an Application to Information Retrieval

Xuerui Wang, Andrew McCallum, Xing Wei. Seventh IEEE International Conference on Data Mining (ICDM 2007).
Most topic models, such as latent Dirichlet allocation, rely on the bag-of-words assumption. In the proposed model, successive bigrams form longer phrases. We present experiments showing meaningful phrases and more interpretable topics from the NIPS data and improved information retrieval performance on a TREC collection.
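The "successive bigrams form longer phrases" idea can be illustrated outside the full model: given a per-token bigram indicator (which the topical n-gram model samples during inference), consecutive flagged tokens are chained into one phrase. A minimal sketch in Python; the function name and the hard-coded flags are illustrative, not from the paper:

```python
def chain_phrases(tokens, bigram_flags):
    """Merge tokens into phrases. bigram_flags[i] == 1 means token i
    forms a bigram with the preceding token, so runs of flagged
    tokens chain into longer phrases."""
    phrases = []
    for token, flag in zip(tokens, bigram_flags):
        if flag and phrases:
            phrases[-1] = phrases[-1] + " " + token
        else:
            phrases.append(token)
    return phrases

tokens = ["the", "neural", "information", "processing", "systems", "conference"]
flags  = [0,     0,        1,             1,            1,          0]
print(chain_phrases(tokens, flags))
# ['the', 'neural information processing systems', 'conference']
```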

Scalable Topical Phrase Mining from Text Corpora

This work proposes a novel phrase mining framework that segments a document into single- and multi-word phrases, and a new topic model that operates on the induced document partition, discovering high-quality topical phrases with negligible extra cost over the bag-of-words topic model across a variety of datasets.

A Phrase Topic Model for Large-scale Corpus

This work proposes a phrase topic model based on LDA that integrates a regular-expression constraint, making topics more meaningful and interpretable at the cost of only a limited increase in vocabulary size.

Review Topic Discovery with Phrases using the Pólya Urn Model

This paper proposes to use the generalized Pólya urn (GPU) model for topic modelling, which gives superior results and naturally connects a phrase with its content words.

Enhancing Topical Word Semantic for Relevance Feature Selection

An innovative and effective extended random sets (ERS) model is presented to enhance the semantics of topical words; it significantly outperforms eight state-of-the-art baseline models on five standard performance measures.

LDA-PSTR: A Topic Modeling Method for Short Text

This paper applies frequent pattern mining to uncover statistically significant patterns which can explicitly capture semantic association and co-occurrences among corpus-level words, and proposes a new probabilistic topic model called LDA-PSTR.

SenU-PTM: a novel phrase-based topic model for short-text topic discovery by exploiting word embeddings

SenU-PTM reveals that modeling topics on sense units can solve the sparsity of short texts and improve the readability of topics at the same time.

Discovering latent topical phrases in document collections and networks with text components: Leveraging text mining and information network analysis for human oriented applications

A framework that generates high-quality topics represented by integrated lists of mixed-length phrases is described, along with an approach to constructing hierarchical topics that extends the phrase-centric approach to produce high-quality phrases at varying levels of granularity.

Bigram Anchor Words Topic Model

This paper offers an approach to accounting for bigrams (two-word phrases) in the construction of the Anchor Words Topic Model, a probabilistic topic model that extracts a set of topics from the collection and describes each document as a discrete probability distribution over topics.

Personalized Multi-Document Summarization using N-Gram Topic Model Fusion

A unified topic model that evolves from sentence-term and sentence-bigram co-occurrences in parallel is presented; it is built on a considerably simpler model than previous topic-modeling approaches to summarization.

Labeled Phrase Latent Dirichlet Allocation and its online learning algorithm

This paper proposes a novel topic model, called Labeled Phrase Latent Dirichlet Allocation (LPLDA), which regards each document as a mixture of phrases and partly considers the word order, and develops a batch inference algorithm based on Gibbs sampling technique for LPLDA.

Topic modeling: beyond bag-of-words

A hierarchical generative probabilistic model that incorporates both n-gram statistics and latent topic variables by extending a unigram topic model to include properties of a hierarchical Dirichlet bigram language model is explored.

Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model

A new probabilistic model is proposed that tempers this approach by representing each document as a combination of a background distribution over common words, a mixture distribution over general topics, and a distribution over words that are treated as being specific to that document.

LDA-based document models for ad-hoc retrieval

This paper proposes an LDA-based document model within the language modeling framework, and evaluates it on several TREC collections, and shows that improvements over retrieval using cluster-based models can be obtained with reasonable efficiency.
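The general form of such an LDA-based document model is a linear interpolation between a smoothed document language model and an LDA topic mixture, p(w|d) = λ·p_LM(w|d) + (1−λ)·Σ_z p(z|d)·p(w|z). The sketch below shows only this general form; the function name, arguments, and the weight λ = 0.7 are illustrative, not the paper's exact estimation details:

```python
def lbdm_prob(p_lm_wd, theta_d, phi_w, lam=0.7):
    """Interpolate a smoothed document language model with an LDA mixture:
    p(w|d) = lam * p_LM(w|d) + (1 - lam) * sum_z p(z|d) * p(w|z).
    theta_d[z] is p(z|d); phi_w[z] is p(w|z) for the query word w."""
    p_lda = sum(t * p for t, p in zip(theta_d, phi_w))
    return lam * p_lm_wd + (1 - lam) * p_lda

# Toy numbers: two topics, word favored by topic 0.
print(lbdm_prob(0.02, [0.6, 0.4], [0.05, 0.01]))
```

The LDA term rewards documents whose topic mixture makes the query word plausible even when the word itself is absent, which is where the reported gains over cluster-based models come from.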

Latent Dirichlet Allocation

An Analysis of Statistical and Syntactic Phrases

It is discovered that once a good basic ranking scheme is used, phrases do not have a major effect on precision at high ranks; phrases are more useful at lower ranks, where the connection between documents and relevance is more tenuous.

A study of smoothing methods for language models applied to information retrieval

Evaluation on five different databases and four types of queries indicates that the two-stage smoothing method with the proposed parameter estimation methods consistently gives retrieval performance close to or better than the best results achieved using a single smoothing method with exhaustive parameter search on the test data.
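Two-stage smoothing composes Dirichlet-prior smoothing (stage one, with pseudo-count μ, explaining the document) with Jelinek-Mercer interpolation (stage two, with weight λ, modeling query noise). A minimal sketch of the combined formula, with illustrative default values for μ and λ:

```python
def two_stage_prob(word, doc_counts, doc_len, p_collection, mu=1000.0, lam=0.5):
    """Two-stage smoothed document language model:
    p(w|d) = (1 - lam) * (c(w;d) + mu * p(w|C)) / (|d| + mu) + lam * p(w|C)
    Stage one: Dirichlet-prior smoothing of the document model (mu).
    Stage two: Jelinek-Mercer interpolation with the collection model (lam)."""
    p_c = p_collection[word]
    dirichlet = (doc_counts.get(word, 0) + mu * p_c) / (doc_len + mu)
    return (1 - lam) * dirichlet + lam * p_c

# Toy example: a 100-word document containing "retrieval" 3 times,
# with collection probability 0.001 for the word.
print(two_stage_prob("retrieval", {"retrieval": 3}, 100, {"retrieval": 0.001}))
```

The point of the two-stage parameter estimation methods in the paper is that μ and λ can be set automatically rather than tuned by exhaustive search.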

Automatic indexing using selective NLP and first-order thesauri

In an evaluation comparing CLARIT automatic indexing of ten full-text articles in the domain of artificial intelligence to the indexing of two human subjects, it was found that CLARIT performed as well as, and in some respects better than, the humans.

The effectiveness of a nonsyntactic approach to automatic phrase indexing for document retrieval

It is concluded that phrase indexing of this kind is not likely to prove an important method of enhancing the performance of automatic document indexing and retrieval systems in operational environments, and that a general syntactic analysis facility may be required.

Retrieving Collocations from Text: Xtract

A set of statistical techniques for retrieving and identifying collocations from large textual corpora is described, based on original filtering methods that allow the production of richer and higher-precision output.

Word Association Norms, Mutual Information and Lexicography

The proposed measure, the association ratio, estimates word association norms directly from computer readable corpora, making it possible to estimate norms for tens of thousands of words.