Corpus ID: 18101489

jLDADMM: A Java package for the LDA and DMM topic models

@article{Nguyen2018jLDADMMAJ,
  title={jLDADMM: A Java package for the LDA and DMM topic models},
  author={Dat Quoc Nguyen},
  journal={ArXiv},
  year={2018},
  volume={abs/1808.03835}
}
In this technical report, we present jLDADMM---an easy-to-use Java toolkit for conventional topic models. [...] It provides implementations of the Latent Dirichlet Allocation topic model and the one-topic-per-document Dirichlet Multinomial Mixture model (i.e. mixture of unigrams), using collapsed Gibbs sampling. In addition, jLDADMM supplies a document clustering evaluation to compare topic models. jLDADMM is open-source and available to download at: this https URL
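Both models in the toolkit are trained with collapsed Gibbs sampling. As a rough illustration of the one-topic-per-document case, the following is a minimal, self-contained sketch of a collapsed Gibbs sampler for the Dirichlet Multinomial Mixture model. This is not jLDADMM's actual code; the class name, toy corpus, and hyperparameter values (K, V, alpha, beta) are illustrative assumptions.

```java
import java.util.Arrays;
import java.util.Random;

// Sketch of collapsed Gibbs sampling for the Dirichlet Multinomial Mixture
// (mixture-of-unigrams) model: every document d carries a single topic z[d],
// resampled from its full conditional given all other documents' assignments.
public class DmmGibbsSketch {

    static final int K = 2;           // number of topics (toy setting)
    static final int V = 4;           // vocabulary size (toy setting)
    static final double ALPHA = 0.1;  // symmetric topic prior
    static final double BETA = 0.01;  // symmetric word prior

    public static int[] run(int[][] docs, int iterations, long seed) {
        Random rng = new Random(seed);
        int D = docs.length;
        int[] z = new int[D];               // topic of each document
        int[] docsPerTopic = new int[K];    // m_k: documents assigned to topic k
        int[][] wordTopic = new int[K][V];  // n_{k,w}: count of word w in topic k
        int[] tokensPerTopic = new int[K];  // n_k: total tokens in topic k

        // Random initialisation of topic assignments.
        for (int d = 0; d < D; d++) {
            z[d] = rng.nextInt(K);
            addDoc(docs[d], z[d], docsPerTopic, wordTopic, tokensPerTopic);
        }

        double[] p = new double[K];
        int[] seen = new int[V]; // tracks repeated words within the document
        int k;

        for (int it = 0; it < iterations; it++) {
            for (int d = 0; d < D; d++) {
                // Remove this document's counts before resampling its topic.
                removeDoc(docs[d], z[d], docsPerTopic, wordTopic, tokensPerTopic);
                double sum = 0.0;
                for (k = 0; k < K; k++) {
                    Arrays.fill(seen, 0);
                    // P(z_d = k | rest) is proportional to (m_k + alpha)
                    // times prod_i (n_{k,w_i} + beta + seen_i) / (n_k + V*beta + i).
                    double prob = docsPerTopic[k] + ALPHA;
                    int i = 0;
                    for (int w : docs[d]) {
                        prob *= (wordTopic[k][w] + BETA + seen[w])
                              / (tokensPerTopic[k] + V * BETA + i);
                        seen[w]++;
                        i++;
                    }
                    p[k] = prob;
                    sum += prob;
                }
                // Draw the new topic from the unnormalised distribution p.
                double u = rng.nextDouble() * sum;
                for (k = 0; k < K - 1; k++) {
                    u -= p[k];
                    if (u <= 0) break;
                }
                z[d] = k;
                addDoc(docs[d], k, docsPerTopic, wordTopic, tokensPerTopic);
            }
        }
        return z;
    }

    private static void addDoc(int[] doc, int k, int[] m, int[][] n, int[] t) {
        m[k]++;
        for (int w : doc) { n[k][w]++; t[k]++; }
    }

    private static void removeDoc(int[] doc, int k, int[] m, int[][] n, int[] t) {
        m[k]--;
        for (int w : doc) { n[k][w]--; t[k]--; }
    }

    public static void main(String[] args) {
        // Toy corpus: documents 0-1 draw on words {0,1}, documents 2-3 on {2,3}.
        int[][] docs = { {0, 0, 1}, {1, 0, 1}, {2, 3, 3}, {3, 2, 2} };
        int[] z = run(docs, 50, 42L);
        System.out.println(Arrays.toString(z));
    }
}
```

The per-token `seen` counter implements the repeated-word correction in the DMM conditional; jLDADMM's actual implementation, defaults, and command-line interface may differ.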
Citations

Evaluation of the Dirichlet Process Multinomial Mixture Model for Short-Text Topic Modeling
TLDR
It is shown that the Dirichlet Process Multinomial Mixture model is a viable option for short text topic modeling, since on average it performs better than, or nearly as well as, the parametric alternatives, while reducing parameter-setting requirements and thereby eliminating the need for expensive preprocessing.
Review and Implementation of Topic Modeling in Hindi
TLDR
The challenges faced in developing topic models for Hindi are discussed; the results of topic modeling in Hindi appear promising and comparable to some results reported in the literature on English datasets.
Modeling Topic Evolution in Twitter: An Embedding-Based Approach
TLDR
A word embedding-based approach to analyze user-centric tweets and observe how user behavior evolves in terms of the topics discussed over a period of time, along with a word embedding-based proximity measure to monitor temporal transitions between topics.
Semantic text alignment based on topic modeling
TLDR
Experiments with PAN corpora gave much higher recall and approximate plagdet scores compared to the winning system in PAN 2014, which shows that topic modeling is a potential solution for detecting intelligent plagiarism.
Characterizing Twitter Discussions About HPV Vaccines Using Topic Modeling and Community Detection
TLDR
The use of community detection in concert with topic modeling appears to be a useful way to characterize Twitter communities for the purpose of opinion surveillance in public health applications.
Open information extraction as an intermediate semantic structure for Persian text summarization
TLDR
A novel application of Open IE as an intermediate layer for text summarization that is able to break the structure of a sentence and extract the most significant sub-sentential elements, and can be adapted to other languages.
A Machine Learning Approach for Semi-Automated Search and Selection in Literature Studies
TLDR
This work demonstrates, with a proof-of-concept tool, that the proposed automated search and selection approach generates valid search strings and that, for subsets of primary studies, it can reduce the manual work by half.
Arc Summarization of TV Series
In this dissertation, we aim to create a system capable of generating summaries of arcs of TV series. With thousands of hours of video being uploaded and stored in video-sharing websites and online [...]
Topic Modelling for Identification of Vaccine Reactions in Twitter
TLDR
The study compared Gensim LDA, MALLET, and jLDADMM DMM models to determine the most effective model for detecting vaccine safety signals, assisted by an evaluation process that used an adjusted F-Scoring technique over a labelled subset of the documents.
Modeling cancer clinical trials using HL7 FHIR to support downstream applications: A case study with colorectal cancer data
TLDR
CRFs can be considered as a proxy for representing information needs for their respective cancer types and can serve as a valuable resource for expanding existing standards to ensure they can comprehensively represent relevant clinical data without loss of granularity.

References

SHOWING 1-10 OF 18 REFERENCES
Latent Dirichlet Allocation
We propose a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams [6], and [...]
Probabilistic Topic Models
In this article, we review probabilistic topic models: graphical models that can be used to summarize a large collection of documents with a smaller number of distributions over words. [...]
A general framework to expand short text for topic modeling
TLDR
A framework to generate pseudo-documents suitable for topic modeling of short text, by creating larger pseudo-document representations from the original documents, is proposed, and two simple, effective and efficient methods that specialize the general framework to create larger pseudo-documents are presented.
Improving Topic Models with Latent Feature Word Representations
TLDR
Two different Dirichlet multinomial topic models are extended by incorporating latent feature vector representations of words trained on very large corpora to improve the word-topic mapping learnt on a smaller corpus.
Optimizing Semantic Coherence in Topic Models
TLDR
An automated evaluation metric for topic coherence is introduced, together with a novel statistical topic model based on this metric that significantly improves topic quality in a large-scale document collection from the National Institutes of Health (NIH).
Understanding the Limiting Factors of Topic Modeling via Posterior Contraction Analysis
TLDR
This paper presents theorems elucidating the posterior contraction rates of the topics as the amount of data increases, and a thorough supporting empirical study using synthetic and real data sets, including news and web-based articles and tweet messages.
A dirichlet multinomial mixture model-based approach for short text clustering
TLDR
This paper proposes a collapsed Gibbs sampling algorithm (GSDMM) for the Dirichlet Multinomial Mixture model for short text clustering, and finds that GSDMM can infer the number of clusters automatically with a good balance between the completeness and homogeneity of the clustering results, and is fast to converge.
Improving LDA topic models for microblogs via tweet pooling and automatic labeling
TLDR
This paper empirically establishes that a novel method of tweet pooling by hashtags leads to a vast improvement in a variety of measures for topic coherence across three diverse Twitter datasets in comparison to an unmodified LDA baseline and a range of pooling schemes.
Finding scientific topics
  • T. Griffiths, M. Steyvers
  • Computer Science, Medicine
  • Proceedings of the National Academy of Sciences of the United States of America
  • 2004
TLDR
A generative model for documents, introduced by Blei, Ng, and Jordan, is described, and a Markov chain Monte Carlo algorithm is presented for inference in this model; the approach is used to analyze abstracts from PNAS, with Bayesian model selection employed to establish the number of topics.
Investigating task performance of probabilistic topic models: an empirical study of PLSA and LDA
TLDR
This paper conducts a systematic investigation of two representative probabilistic topic models, probabilistic latent semantic analysis (PLSA) and Latent Dirichlet Allocation (LDA), using three representative text mining tasks: document clustering, text categorization, and ad-hoc retrieval.