DOLDA: a regularized supervised topic model for high-dimensional multi-class regression

@article{Magnusson2020DOLDAAR,
  title={DOLDA: a regularized supervised topic model for high-dimensional multi-class regression},
  author={M{\aa}ns Magnusson and Leif Jonsson and Mattias Villani},
  journal={Computational Statistics},
  year={2020},
  volume={35},
  pages={175-201}
}
Generating user interpretable multi-class predictions in data-rich environments with many classes and explanatory covariates is a daunting task. We introduce Diagonal Orthant Latent Dirichlet Allocation (DOLDA), a supervised topic model for multi-class classification that can handle many classes as well as many covariates. To handle many classes we use the recently proposed Diagonal Orthant probit model (Johndrow et al., in: Proceedings of the sixteenth international conference on artificial… 
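
As a rough sketch of the Diagonal Orthant construction the abstract refers to (the notation here is generic and assumed, not taken from the paper): for each document i, let \bar{z}_i denote its topic proportions and x_i its additional covariates. The DO probit model introduces one latent Gaussian utility per class c,

  u_{ic} = \bar{z}_i^{\top}\eta_c + x_i^{\top}\beta_c + \varepsilon_{ic}, \qquad \varepsilon_{ic} \sim \mathrm{N}(0,1),

and assigns y_i = c exactly when u_{ic} > 0 and u_{ik} < 0 for all k \neq c. Conditioning on the event that exactly one utility is positive gives closed-form class probabilities,

  \Pr(y_i = c) = \frac{\Phi(\mu_{ic}) \prod_{k \neq c} \Phi(-\mu_{ik})}{\sum_{l} \Phi(\mu_{il}) \prod_{k \neq l} \Phi(-\mu_{ik})}, \qquad \mu_{ic} = \bar{z}_i^{\top}\eta_c + x_i^{\top}\beta_c,

following the description in Johndrow et al.; the exact parameterization used in DOLDA may differ.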
Twin labeled LDA: a supervised topic model for document classification
TLDR
This paper proposes a new supervised topic model for document classification problems, Twin Labeled LDA (TL-LDA), which runs two parallel topic modeling processes: one incorporates the prior label information through hierarchical Dirichlet distributions, while the other models the grouping tags, which carry prior knowledge about label correlations.
Easy Variational Inference for Categorical Observations via a New View of Diagonal Orthant Probit Models
In pursuit of tractable Bayesian analysis of categorical data, auxiliary variable methods hold promise, but impose asymmetries on the truly unordered categories or spoil scalability via strong…
Horseshoe Regularisation for Machine Learning in Complex and Deep Models
TLDR
The purpose of the current article is to demonstrate that the horseshoe regularization is useful far more broadly, by reviewing both methodological and computational developments in complex models that are more relevant to machine learning applications.
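
For context, the horseshoe prior discussed in this line of work (and used in DOLDA to regularize the regression coefficients) is commonly written as

  \beta_j \mid \lambda_j, \tau \sim \mathrm{N}(0, \lambda_j^{2}\tau^{2}), \qquad \lambda_j \sim \mathrm{C}^{+}(0,1), \qquad \tau \sim \mathrm{C}^{+}(0,1),

where \mathrm{C}^{+}(0,1) is the standard half-Cauchy distribution: the local scales \lambda_j let individual coefficients escape the aggressive global shrinkage induced by \tau.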
Bayesian Topic Regression for Causal Inference
TLDR
The Bayesian Topic Regression model uses both text and numerical information to model an outcome variable; it allows estimation of both discrete and continuous treatment effects and supports the inclusion of additional numerical confounding factors alongside the text data.
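
As a minimal sketch of what such a joint model implies (generic notation, not the paper's): the outcome is regressed on the document's topic proportions together with the numerical covariates,

  y_i = \bar{z}_i^{\top}\eta + x_i^{\top}\gamma + \varepsilon_i, \qquad \varepsilon_i \sim \mathrm{N}(0, \sigma^{2}),

with topics and regression coefficients estimated jointly, so that the text representation \bar{z}_i is informed by the outcome rather than fixed in advance.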
Labeled Phrase Latent Dirichlet Allocation and its online learning algorithm
TLDR
This paper proposes a novel topic model, Labeled Phrase Latent Dirichlet Allocation (LPLDA), which regards each document as a mixture of phrases and thereby partly accounts for word order, and develops a batch inference algorithm for LPLDA based on the Gibbs sampling technique.
Lasso Meets Horseshoe: A Survey
The goal of this paper is to contrast and survey the major advances in two of the most commonly used high-dimensional techniques, namely, the Lasso and horseshoe regularization. Lasso is a gold…
Combined Digital Economic-Epidemic Model for the Evaluation of Economic Results of Several Scenarios of Quarantine Measures
TLDR
The proposed model, and computations performed with it, can be applied in any region of Russia to select a candidate set of quarantine measures and to evaluate the possible economic consequences for that region.
Expedience of Investing in the Intellectual Potential of an Enterprise Using the Example of PJSC Gazprom
One of the most important resources of an enterprise, ensuring its competitive advantage in the Russian and international markets, is its personnel, with their knowledge, skills, and smart ideas, effectively…
Machine Learning-Based Bug Handling in Large-Scale Software Development
TLDR
This thesis investigates possibilities for automating parts of the bug handling process in large-scale software development organizations, including the development of knowledge representation models to support such automation.

References

Showing 1-10 of 47 references
Linear Time Samplers for Supervised Topic Models using Compositional Proposals
TLDR
This work extends recent sampling advances for unsupervised LDA models to supervised tasks, focusing on the Gibbs MedLDA model, which simultaneously discovers latent structure and makes accurate predictions; the resulting sampler is believed to be the first linear-time sampling algorithm for supervised topic models.
Improved Bayesian Logistic Supervised Topic Models with Data Augmentation
TLDR
This work addresses supervised topic models with a logistic likelihood by introducing a regularization constant, motivated by an optimization formulation of Bayesian inference, to better balance the two parts of the model, and by developing a simple Gibbs sampling algorithm that introduces auxiliary Pólya-Gamma variables and collapses out the Dirichlet variables.
Sparse Partially Collapsed MCMC for Parallel Inference in Topic Models
TLDR
A parallel, sparse, partially collapsed Gibbs sampler is proposed and evaluated, and it is shown that partially collapsed samplers scale well with corpus size and can be used in more modeling situations than the ordinary collapsed sampler.
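
A minimal runnable sketch of the partially collapsed idea (a toy NumPy illustration, not the paper's parallel, sparse implementation): the topic-word matrix Phi is kept instantiated and drawn from its Dirichlet full conditional, while the per-document topic proportions are integrated out, so the per-document updates of the topic indicators depend only on Phi and that document's counts and could be run in parallel across documents.

import numpy as np

rng = np.random.default_rng(0)

# Toy corpus: each document is a list of word ids from a vocabulary of size V.
docs = [[0, 1, 2, 2, 3], [3, 4, 4, 5], [0, 0, 1, 5, 5]]
V, K = 6, 2                    # vocabulary size, number of topics
alpha, beta = 0.5, 0.1         # Dirichlet hyperparameters

# Random initialization of topic indicators and count matrices.
z = [rng.integers(K, size=len(doc)) for doc in docs]
ndk = np.zeros((len(docs), K))             # document-topic counts
nkv = np.zeros((K, V))                     # topic-word counts
for d, (doc, zd) in enumerate(zip(docs, z)):
    for w, k in zip(doc, zd):
        ndk[d, k] += 1
        nkv[k, w] += 1

Phi = rng.dirichlet(np.full(V, beta), size=K)   # explicit topic-word matrix

for it in range(100):
    # Step 1 (parallelizable over documents): resample topic indicators
    # given Phi, with the document-topic proportions integrated out.
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k_old = z[d][n]
            ndk[d, k_old] -= 1
            nkv[k_old, w] -= 1
            p = (ndk[d] + alpha) * Phi[:, w]
            k_new = rng.choice(K, p=p / p.sum())
            z[d][n] = k_new
            ndk[d, k_new] += 1
            nkv[k_new, w] += 1
    # Step 2 (global): draw Phi | z from its Dirichlet full conditional.
    Phi = np.vstack([rng.dirichlet(nkv[k] + beta) for k in range(K)])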
DiscLDA: Discriminative Learning for Dimensionality Reduction and Classification
TLDR
This paper presents DiscLDA, a discriminative variation on Latent Dirichlet Allocation in which a class-dependent linear transformation is applied to the topic mixture proportions, yielding a supervised dimensionality reduction algorithm that uncovers the latent structure in a document collection while preserving predictive power for the task of classification.
MedLDA: maximum margin supervised topic models
TLDR
The maximum entropy discrimination latent Dirichlet allocation (MedLDA) model is proposed, which integrates the mechanism behind max-margin prediction models with the mechanism behind hierarchical Bayesian topic models under a unified constrained optimization framework, and yields latent topical representations that are more discriminative and more suitable for prediction tasks such as document classification or regression.
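
As a hedged sketch of the kind of objective MedLDA solves (classification variant; notation is generic and details vary across formulations): it seeks a posterior q over topic assignments z and prediction weights \eta that stays close to the Bayesian posterior while satisfying expected max-margin constraints,

  \min_{q,\,\xi \ge 0} \; \mathrm{KL}\big(q(\mathbf{z}, \boldsymbol{\eta}) \,\|\, p(\mathbf{z}, \boldsymbol{\eta} \mid \mathbf{w})\big) + C \sum_{d} \xi_d
  \quad \text{s.t.} \quad \mathbb{E}_q\!\big[\boldsymbol{\eta}^{\top} \Delta \mathbf{f}_d(y)\big] \ge \ell_d(y) - \xi_d \;\; \forall\, y \ne y_d,

where \Delta \mathbf{f}_d(y) is the difference between the (expected) topic-assignment features for the true label y_d and an alternative label y, and \ell_d(y) is the margin demanded against that alternative.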
Optimizing Semantic Coherence in Topic Models
TLDR
A novel automated evaluation metric for semantic coherence is presented, together with a statistical topic model based on this metric that significantly improves topic quality in a large-scale document collection from the National Institutes of Health (NIH).
Bayesian Inference for Logistic Models Using Pólya–Gamma Latent Variables
We propose a new data-augmentation strategy for fully Bayesian inference in models with binomial likelihoods. The approach appeals to a new class of Pólya–Gamma distributions, which are constructed…
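
The core identity behind this augmentation (a sketch following Polson, Scott and Windle, with b > 0 and \kappa = a - b/2) is

  \frac{(e^{\psi})^{a}}{(1 + e^{\psi})^{b}} \;=\; 2^{-b} e^{\kappa \psi} \int_{0}^{\infty} e^{-\omega \psi^{2}/2}\, p(\omega \mid b, 0)\, d\omega,

where p(\omega \mid b, 0) is the \mathrm{PG}(b, 0) density. Conditioned on \omega, the likelihood is Gaussian in the linear predictor \psi, so with a Gaussian prior the regression coefficients have a Gaussian full conditional, and \omega itself is updated as \omega \mid \psi \sim \mathrm{PG}(b, \psi).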
Latent Dirichlet Allocation
Diagonal Orthant Multinomial Probit Models
TLDR
A new class of diagonal orthant (DO) multinomial probit models is proposed, with conditional independence of the latent variables given model parameters, avoidance of arbitrary identifiability restrictions, and simple expressions for category probabilities.
Monte Carlo Methods for Maximum Margin Supervised Topic Models
TLDR
This paper develops two efficient Monte Carlo methods for max-margin supervised topic models under much weaker assumptions, based on an importance sampler and a collapsed Gibbs sampler in a convex dual formulation.