• Corpus ID: 18593743

Software Framework for Topic Modelling with Large Corpora

@inproceedings{Rehurek2010SoftwareFF,
  title={Software Framework for Topic Modelling with Large Corpora},
  author={Radim Rehurek and Petr Sojka},
  year={2010}
}
Large corpora are ubiquitous in today's world, and memory quickly becomes the limiting factor in practical applications of the Vector Space Model (VSM). We identify a gap in existing VSM implementations, namely their scalability and ease of use. We describe a Natural Language Processing software framework based on the idea of document streaming, i.e. processing corpora document after document, in a memory-independent fashion. In this framework, we implement several popular algorithms…
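The document-streaming idea in the abstract can be illustrated with a minimal Python sketch. This is not the framework's actual API: the names `stream_bow` and `document_frequencies`, the one-document-per-line input, and the whitespace tokenizer are all assumptions made for illustration. The point is only that each document is converted and consumed one at a time, so memory use depends on the vocabulary size rather than on the number of documents.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List, Tuple

# A sparse bag-of-words vector: (term id, term count) pairs.
SparseVector = List[Tuple[int, int]]


def stream_bow(lines: Iterable[str], vocab: Dict[str, int]) -> Iterator[SparseVector]:
    """Lazily convert one document per line into a sparse vector.

    Hypothetical helper: `vocab` maps tokens to integer ids and is
    extended in place as unseen tokens appear.
    """
    for line in lines:
        counts = Counter(line.lower().split())
        # Assign the next free integer id to each previously unseen token.
        for token in counts:
            vocab.setdefault(token, len(vocab))
        yield sorted((vocab[token], n) for token, n in counts.items())


def document_frequencies(vectors: Iterable[SparseVector]) -> Counter:
    """Count, in a single pass, how many documents contain each term id.

    Only the per-term totals are kept in memory, never the documents.
    """
    df: Counter = Counter()
    for vector in vectors:
        df.update(term_id for term_id, _ in vector)
    return df


# Toy input; in practice `lines` could be an open file handle,
# which Python also iterates line by line without loading it whole.
lines = ["human machine interface", "machine learning survey", "human human error"]
vocab: Dict[str, int] = {}
df = document_frequencies(stream_bow(lines, vocab))
```

Because `stream_bow` is a generator, the same pattern works unchanged when `lines` is a file object over a corpus too large to fit in RAM; any algorithm that needs several passes would simply re-open the source for each pass.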

A new evaluation framework for topic modeling algorithms based on synthetic corpora

TLDR
A new framework for the evaluation of topic modeling algorithms based on synthetic corpora containing an unambiguously defined ground truth topic structure is proposed, with the ability to quantify the agreement between the planted and inferred topic structures by comparing the assigned topic labels at the level of the tokens.

Learning Word Relatedness over Time

TLDR
This work introduces a temporal relationship model that is extracted from longitudinal data collections that supports the task of identifying, given two words, when they relate to each other and presents an algorithmic framework for this task.

Fast and Effective Approximations for Summarization and Categorization of Very Large Text Corpora

TLDR
This thesis contributes a hybrid approach wherein simple probability models provide dramatic dimensionality reduction to linear algebraic problems, resulting in computationally efficient solutions suitable for real-time human interaction.

An analysis of the coherence of descriptors in topic modeling

Review and Implementation of Topic Modeling in Hindi

TLDR
The challenges faced in developing topic models for Hindi are discussed; the results of topic modeling in Hindi appear promising and comparable to some results reported in the literature on English datasets.

Probabilistic Topic Models in Natural Language Processing

TLDR
This paper’s objective is to guide the reader from a casual understanding of basic statistical concepts, such as those typically acquired in undergraduate studies, to an understanding of topic models.

Mining and Learning from Multilingual Text Collections using Topic Models and Word Embeddings. (Explorer et Apprendre à partir de collections de textes multilingues à l'aide des modèles probabilistes latents et des réseaux profonds)

TLDR
It is demonstrated how, by adapting the transportation problem to estimate document distances, one can achieve important improvements in multi-class document classification; the use of word embeddings and neural networks for three text mining applications is also demonstrated.

Towards Better Topic Models For Contemporary Textual Documents Of Social Media

TLDR
This thesis focuses on improving the performance of topic models for contemporary documents and text corpora generated as social media posts. It proposes an improvement over previous efforts in this direction and introduces a new algorithm, sentence2cluster, which also helps in document categorization and organization.

Comparison of Embedding Techniques for Topic Modeling Coherence Measures

TLDR
This work evaluates the difference between two popular word embedding algorithms and their variants, using two distinct external reference corpora, to discover if these underlying choices have a substantial impact on the resulting coherence scores.

Top2Vec: Distributed Representations of Topics

TLDR
This model does not require stop-word lists, stemming or lemmatization, and it automatically finds the number of topics, and the resulting topic vectors are jointly embedded with the document and word vectors with distance between them representing semantic similarity.
...

References

Visualizing Topics with Multi-Word Expressions

TLDR
A new method for visualizing topics, the distributions over terms that are automatically extracted from large text corpora using latent variable models, based on a language model of arbitrary length expressions, which outperforms the more standard use of $\chi^2$ and likelihood ratio tests.

Latent Dirichlet Allocation

Reading Tea Leaves: How Humans Interpret Topic Models

TLDR
New quantitative methods for measuring semantic meaning in inferred topics are presented, showing that they capture aspects of the model that are undetected by previous measures of model quality based on held-out likelihood.

Finding scientific topics

  • T. Griffiths, M. Steyvers
  • Computer Science
    Proceedings of the National Academy of Sciences of the United States of America
  • 2004
TLDR
A generative model for documents is described, introduced by Blei, Ng, and Jordan, and a Markov chain Monte Carlo algorithm is presented for inference in this model, which is used to analyze abstracts from PNAS by using Bayesian model selection to establish the number of topics.

Probabilistic Topic Models

  • D. Blei
  • Computer Science
    IEEE Signal Processing Magazine
  • 2010
TLDR
Surveying a suite of algorithms that offer a solution to managing large document archives suggests they are well-suited to handle large amounts of data.

Latent semantic indexing: a probabilistic analysis

TLDR
It is proved that under certain conditions LSI does succeed in capturing the underlying semantics of the corpus and achieves improved retrieval performance.

A vector space model for automatic indexing

TLDR
An approach based on space density computations is used to choose an optimum indexing vocabulary for a collection of documents, demonstrating the usefulness of the model.

Nieme: Large-Scale Energy-Based Models

TLDR
NIEME, a machine learning library for large-scale classification, regression and ranking, relies on the framework of energy-based models, which unifies several learning algorithms ranging from simple perceptrons to recent models such as the Pegasos support vector machine or l1-regularized maximum entropy models.

Modular Toolkit for Data Processing (MDP): A Python Data Processing Framework

TLDR
The Modular toolkit for Data Processing (MDP) is a collection of supervised and unsupervised learning algorithms and other data processing units that can be combined into data processing sequences and more complex feed-forward network architectures.

Pairwise Document Similarity in Large Collections with MapReduce

TLDR
This paper presents a MapReduce algorithm for computing pairwise document similarity in large document collections that exhibits linear growth in running time and space in terms of the number of documents.