Topical N-Grams: Phrase and Topic Discovery, with an Application to Information Retrieval
@article{Wang2007TopicalNP, title={Topical N-Grams: Phrase and Topic Discovery, with an Application to Information Retrieval}, author={Xuerui Wang and Andrew McCallum and Xing Wei}, journal={Seventh IEEE International Conference on Data Mining (ICDM 2007)}, year={2007}, pages={697-702} }
Most topic models, such as latent Dirichlet allocation, rely on the bag-of-words assumption. [...] Successive bigrams form longer phrases. We present experiments showing meaningful phrases and more interpretable topics from the NIPS data and improved information retrieval performance on a TREC collection.
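The idea that successive bigrams chain into longer phrases can be illustrated with a minimal sketch. It assumes each token carries a binary indicator saying whether it forms a bigram with the previous token; the function name `merge_bigrams`, the `joins` list, and the sample sentence are hypothetical and are not taken from the paper, and the sketch covers only the merging step, not how such indicators would be inferred.

```python
from typing import List, Sequence


def merge_bigrams(tokens: Sequence[str], joins: Sequence[int]) -> List[str]:
    """Chain tokens into phrases.

    joins[i] == 1 is assumed to mean that tokens[i] forms a bigram with
    tokens[i - 1]; a run of consecutive 1s therefore yields a longer phrase.
    """
    if len(tokens) != len(joins):
        raise ValueError("tokens and joins must have the same length")
    phrases: List[List[str]] = []
    for token, join in zip(tokens, joins):
        if join and phrases:
            # Extend the phrase currently being built.
            phrases[-1].append(token)
        else:
            # Start a new phrase (also covers a stray 1 on the first token).
            phrases.append([token])
    return [" ".join(phrase) for phrase in phrases]


# Hypothetical example input, not data from the paper.
tokens = ["hidden", "markov", "model", "for", "speech", "recognition"]
joins = [0, 1, 1, 0, 0, 1]
print(merge_bigrams(tokens, joins))
```

Here two successive bigrams ("hidden markov" and "markov model") merge into the single three-word phrase "hidden markov model".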
512 Citations
Scalable Topical Phrase Mining from Text Corpora
- Computer Science
- Proc. VLDB Endow.
- 2014
This work proposes a novel phrase mining framework to segment a document into single and multi-word phrases, and a new topic model that operates on the induced document partition that discovers high quality topical phrases with negligible extra cost to the bag-of-words topic model in a variety of datasets.
A Phrase Topic Model for Large-scale Corpus
- Computer Science
- 2019 IEEE 4th International Conference on Cloud Computing and Big Data Analysis (ICCCBDA)
- 2019
This work proposes a phrase topic model based on the LDA model, which integrates a regular expression constraint condition, and makes the topic more meaningful and interpretable based on a limited increase in the dimensions of the vocabulary.
Review Topic Discovery with Phrases using the Pólya Urn Model
- Computer Science
- COLING
- 2014
This paper proposes to use the generalized Pólya Urn (GPU) model to solve the topic modelling problem, which gives superior results and enables the connection of a phrase with its content words naturally.
Enhancing Topical Word Semantic for Relevance Feature Selection
- Computer Science
- SML@IJCAI
- 2017
An innovative and effective extended random sets (ERS) model is presented to enhance the semantics of topical words; it significantly outperforms eight state-of-the-art baseline models on five standard performance measures.
LDA-PSTR: A Topic Modeling Method for Short Text
- Computer Science
- ADMA
- 2018
This paper applies frequent pattern mining to uncover statistically significant patterns which can explicitly capture semantic association and co-occurrences among corpus-level words, and proposes a new probabilistic topic model called LDA-PSTR.
SenU-PTM: a novel phrase-based topic model for short-text topic discovery by exploiting word embeddings
- Computer Science
- Data Technol. Appl.
- 2021
SenU-PTM reveals that modeling topics on sense units can solve the sparsity of short texts and improve the readability of topics at the same time.
Discovering latent topical phrases in document collections and networks with text components: Leveraging text mining and information network analysis for human oriented applications
- Computer Science
- 2014
A framework that generates high-quality topics represented by integrated lists of mixed-length phrases, and an approach to constructing hierarchical topics, which extends the phrase-centric approach to create high quality phrases at varying levels of granularity are described.
Bigram Anchor Words Topic Model
- Computer Science
- AIST
- 2016
This paper offers an approach to accounting for bigrams (two-word phrases) in the construction of the Anchor Words Topic Model, a probabilistic topic model that extracts a number of topics from the collection and describes each document as a discrete probability distribution over topics.
Personalized Multi-Document Summarization using N-Gram Topic Model Fusion
- Computer Science
- 2010
A unified topic model which evolves from sentence-term and sentence-bigram co-occurrences in parallel is presented, built on a considerably simpler model than previous topic modeling approaches to summarization.
Labeled Phrase Latent Dirichlet Allocation and its online learning algorithm
- Computer Science
- Data Mining and Knowledge Discovery
- 2018
This paper proposes a novel topic model, called Labeled Phrase Latent Dirichlet Allocation (LPLDA), which regards each document as a mixture of phrases and partly considers the word order, and develops a batch inference algorithm for LPLDA based on Gibbs sampling.
References
Topic modeling: beyond bag-of-words
- Computer Science
- ICML
- 2006
A hierarchical generative probabilistic model that incorporates both n-gram statistics and latent topic variables by extending a unigram topic model to include properties of a hierarchical Dirichlet bigram language model is explored.
Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model
- Computer Science
- NIPS
- 2006
A new probabilistic model is proposed that tempers this approach by representing each document as a combination of a background distribution over common words, a mixture distribution over general topics, and a distribution over words that are treated as being specific to that document.
LDA-based document models for ad-hoc retrieval
- Computer Science
- SIGIR
- 2006
This paper proposes an LDA-based document model within the language modeling framework and evaluates it on several TREC collections, showing that improvements over retrieval using cluster-based models can be obtained with reasonable efficiency.
An Analysis of Statistical and Syntactic Phrases
- Computer Science
- RIAO
- 1997
It is discovered that once a good basic ranking scheme is used, the use of phrases does not have a major effect on precision at high ranks, and phrases are more useful at lower ranks where the connection between documents and relevance is more tenuous.
A study of smoothing methods for language models applied to information retrieval
- Computer Science
- TOIS
- 2004
Evaluation on five different databases and four types of queries indicates that the two-stage smoothing method with the proposed parameter estimation methods consistently gives retrieval performance that is close to or better than the best results achieved using a single smoothing method and exhaustive parameter search on the test data.
Automatic indexing using selective NLP and first-order thesauri
- Computer Science
- RIAO
- 1991
In an evaluation comparing CLARIT automatic indexing of ten full-text articles in the domain of artificial intelligence to the indexing of two human subjects, it was found that CLARIT performed as well as, and in some respects better than, the humans.
The effectiveness of a nonsyntactic approach to automatic phrase indexing for document retrieval
- Computer Science
- 1989
It is not likely that phrase indexing of this kind will prove to be an important method of enhancing the performance of automatic document indexing and retrieval systems in operational environments, and a general syntactic analysis facility may be required.
Retrieving Collocations from Text: Xtract
- Computer Science
- Comput. Linguistics
- 1993
A set of techniques based on statistical methods for retrieving and identifying collocations from large textual corpora, based on some original filtering methods that allow the production of richer and higher-precision output are described.
Word Association Norms, Mutual Information and Lexicography
- Linguistics
- ACL
- 1989
The proposed measure, the association ratio, estimates word association norms directly from computer readable corpora, making it possible to estimate norms for tens of thousands of words.