Combining Thesaurus Knowledge and Probabilistic Topic Models

Natalia V. Loukachevitch, Michael Nokel and Kirill Ivanov

In this paper we present an approach for introducing thesaurus knowledge into probabilistic topic models. The main idea is based on the assumption that the frequencies of semantically related words and phrases that occur in the same texts should be enhanced: this gives them a greater contribution to the topics found in those texts. We have conducted experiments with several thesauri and found that for improving topic models, it is useful to utilize domain-specific…
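The frequency-enhancement idea above can be sketched as a pre-processing step before fitting any topic model: when two terms listed as related in a thesaurus co-occur in a document, their counts in that document are boosted. This is a simplified illustration with a hypothetical `RELATED` pair set and a uniform `weight`; the paper's actual weighting scheme may differ.

```python
from collections import Counter

# Hypothetical thesaurus: pairs of semantically related terms.
RELATED = {("neural", "network"), ("topic", "theme")}

def boost_related_counts(doc_tokens, weight=2.0):
    """Multiply the counts of related terms that co-occur in the same
    document, so they contribute more to the topics inferred from it.
    A sketch of the general idea, not the paper's exact formula."""
    counts = Counter(doc_tokens)
    boosted = dict(counts)
    for a, b in RELATED:
        if a in counts and b in counts:  # related terms co-occur here
            boosted[a] = counts[a] * weight
            boosted[b] = counts[b] * weight
    return boosted
```

The boosted counts can then replace the raw counts in the document-term matrix passed to LDA or NMF.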

Discovering Interpretable Topics by Leveraging Common Sense Knowledge

The Common Sense Topic Model (CSTM) is introduced, a novel and efficient approach that augments clustering with knowledge extracted from the ConceptNet knowledge graph and shows closer agreement with human judgement.

Topic Modelling of the Russian Corpus of Pikabu Posts: Author-Topic Distribution and Topic Labelling

A balanced and representative corpus of Russian posts with hashtags from the Pikabu social network is developed to build probabilistic topic models revealing the authors' interests and preferences, as well as correlations among topics within the corpus.

Analysis of Influence of Different Relations Types on the Quality of Thesaurus Application to Text Classification Problems

The results make it possible to evaluate how effectively thesaurus relations can be applied to the classification of raw texts and to determine under what conditions particular relation types have a greater or lesser effect.

Topic Modelling with NMF vs. Expert Topic Annotation: The Case Study of Russian Fiction

Experimental results showed that topic modelling via NMF should primarily be recommended for revealing topics that refer to the general background of literary texts rather than for detecting topics related to critical events or relations between characters.
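NMF-based topic modelling, as used in the study above, factors a nonnegative document-term matrix into document-topic and topic-term matrices. A minimal sketch using Lee-Seung multiplicative updates (a standard NMF algorithm, not necessarily the study's exact implementation):

```python
import numpy as np

def nmf_topics(X, k, iters=500, seed=0):
    """Minimal NMF via multiplicative updates: factor a nonnegative
    doc-term matrix X (docs x terms) into W (docs x k, document-topic
    weights) and H (k x terms, topic-term weights). A sketch of the
    general technique under the Frobenius-norm objective."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, k)) + 1e-3
    H = rng.random((k, m)) + 1e-3
    for _ in range(iters):
        # Multiplicative updates keep W and H nonnegative throughout.
        H *= (W.T @ X) / (W.T @ W @ H + 1e-9)
        W *= (X @ H.T) / (W @ H @ H.T + 1e-9)
    return W, H
```

Each row of `H` is a topic; its largest entries give the topic's top terms, which is what the study compares against expert annotation.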

Discovering coherent topics using general knowledge

A framework to leverage general knowledge in topic models, called GK-LDA, which effectively exploits the knowledge of lexical relations in dictionaries and is the first such model that can incorporate domain-independent knowledge.

Accounting ngrams and multi-word terms can improve topic models

A novel algorithm LDA-ITER is proposed that allows the incorporation of the most suitable ngrams into topic models, while maintaining similarities between them and words based on their component structure.

Visualizing Topics with Multi-Word Expressions

A new method for visualizing topics (the distributions over terms that are automatically extracted from large text corpora using latent variable models), based on a language model of arbitrary-length expressions, which outperforms the more standard use of $\chi^2$ and likelihood-ratio tests.

Incorporating Word Correlation Knowledge into Topic Modeling

A Markov Random Field regularized Latent Dirichlet Allocation model, which defines an MRF on the latent topic layer of LDA to encourage words labeled as similar to share the same topic label, and which can accommodate the subtlety that whether two words are similar depends on the topic in which they appear.

On collocations and topic models

It is shown that the Akaike information criterion is a more appropriate measure, which suggests that using a modest number of top-ranked bigrams is the optimal topic-modelling configuration, and that using the top 1000 bigrams results in improved topic quality over unigram tokenization.
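Selecting top-ranked bigrams and merging them into single tokens before topic modelling can be sketched as below. Ranking here is by raw corpus frequency as a simple stand-in; the study itself uses an information-criterion-based selection.

```python
from collections import Counter

def top_bigrams(docs, n):
    """Rank bigrams by corpus frequency and keep the top n.
    A frequency-based stand-in for the paper's AIC-based selection."""
    counts = Counter()
    for doc in docs:
        counts.update(zip(doc, doc[1:]))
    return {bg for bg, _ in counts.most_common(n)}

def merge_bigrams(doc, bigrams):
    """Re-tokenize a document, joining selected bigrams into single
    tokens (e.g. 'topic model' -> 'topic_model') so a topic model
    treats them as one term."""
    out, i = [], 0
    while i < len(doc):
        if i + 1 < len(doc) and (doc[i], doc[i + 1]) in bigrams:
            out.append(doc[i] + "_" + doc[i + 1])
            i += 2
        else:
            out.append(doc[i])
            i += 1
    return out
```

The merged token streams are then fed to an ordinary unigram topic model, which is how most of the bigram-tokenization configurations compared in such studies are built.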

Topical N-Grams: Phrase and Topic Discovery, with an Application to Information Retrieval

Most topic models, such as latent Dirichlet allocation, rely on the bag-of-words assumption. However, word order and phrases are often critical to capturing the meaning of text in many text mining tasks.

Improving Topic Coherence with Regularized Topic Models

This work proposes two methods to regularize the learning of topic models by creating a structured prior over words that reflects broad patterns in external data, making topic models more useful across a broader range of text data.

Topics in semantic representation.

This article analyzes the abstract computational problem underlying the extraction and use of gist, formulating this problem as a rational statistical inference that leads to a novel approach to semantic representation in which word meanings are represented in terms of a set of probabilistic topics.

Reading Tea Leaves: How Humans Interpret Topic Models

New quantitative methods for measuring semantic meaning in inferred topics are presented, showing that they capture aspects of the model that are undetected by previous measures of model quality based on held-out likelihood.

Topic modeling: beyond bag-of-words

A hierarchical generative probabilistic model that incorporates both n-gram statistics and latent topic variables by extending a unigram topic model to include properties of a hierarchical Dirichlet bigram language model is explored.