Corpus ID: 88522995

Clustering and Model Selection via Penalized Likelihood for Different-sized Categorical Data Vectors

@article{Derman2017ClusteringAM,
  title={Clustering and Model Selection via Penalized Likelihood for Different-sized Categorical Data Vectors},
  author={Esther Derman and Erwan Le Pennec},
  journal={arXiv: Statistics Theory},
  year={2017}
}
In this study, we consider unsupervised clustering of categorical vectors of possibly different sizes using mixture models. We use likelihood maximization to estimate the parameters of the underlying mixture model and a penalization technique to select the number of mixture components. Regardless of the true distribution that generated the data, we show that an explicit penalty, known up to a multiplicative constant, leads to a non-asymptotic oracle inequality with the Kullback-Leibler divergence on…

Spatio-temporal mixture process estimation to detect dynamical changes in population
Pénalités minimales et heuristique de pente
In 2001, Birgé and Massart proposed the slope heuristic to determine, from the data, an optimal multiplicative constant in front of a penalty in model selection. This heuristic…
Minimal penalties and the slope heuristics: a survey
The theoretical results obtained for minimal-penalty algorithms are reviewed, with a self-contained proof in the simplest framework, precise proof ideas for further generalizations, and a few new results.

References

Variable selection in model-based clustering for high-dimensional data
A variable selection procedure for clustering suited to high-dimensional contexts that provides l1-oracle inequalities for the Lasso in the regression framework and establishes a model selection theorem for maximum likelihood estimators in a density estimation framework with a random model collection.
Clustering and variable selection for categorical multivariate data
This article investigates unsupervised classification techniques for categorical multivariate data. The study employs multivariate multinomial mixture modeling, which is a type of model particularly
Slope heuristics for variable selection and clustering via Gaussian mixtures
A "slope heuristics" method is proposed and tested to deal with this practical problem in this context; numerical experiments on simulated datasets, a curve clustering example, and a genomics application highlight the interest of the proposed heuristic.
Identifying the number of clusters in discrete mixture models
A new approach in which the clustering of categorical data and the estimation of the number of clusters are carried out simultaneously; the proposed EM-MML approach seamlessly integrates estimation and model selection in a single algorithm.
Partition-based conditional density estimation
A general partition-based strategy to estimate conditional density with candidate densities that are piecewise constant with respect to the covariate is proposed and it is proved that the penalty of each model can be chosen roughly proportional to its dimension.
A non asymptotic penalized criterion for Gaussian mixture model selection
The ordered and non-ordered variable selection cases are both addressed in this paper and a general model selection theorem for MLE is used to obtain the penalty function form.
Inference and evaluation of the multinomial mixture model for text clustering
Efficient semiparametric estimation and model selection for multidimensional mixtures
In this paper, we consider nonparametric multidimensional finite mixture models and we are interested in the semiparametric estimation of the population weights. Here, the i.i.d. observations are
Unsupervised Learning of Finite Mixture Models
The novelty of the approach is that it does not use a model selection criterion to choose one among a set of preestimated candidate models; instead, it seamlessly integrates estimation and model selection in a single algorithm.
Poisson Random Fields for Dynamic Feature Models
A new framework for generating dependent Indian buffet processes is established, where the Poisson random field model from population genetics is used as a way of constructing dependent beta processes.