• Corpus ID: 88522995

Clustering and Model Selection via Penalized Likelihood for Different-sized Categorical Data Vectors

Esther Derman and Erwan Le Pennec
arXiv: Statistics Theory
In this study, we consider unsupervised clustering of categorical vectors that can be of different sizes using mixture models. We use likelihood maximization to estimate the parameters of the underlying mixture model and a penalization technique to select the number of mixture components. Regardless of the true distribution that generated the data, we show that an explicit penalty, known up to a multiplicative constant, leads to a non-asymptotic oracle inequality with the Kullback-Leibler divergence on… 
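
The selection step described in the abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, a multinomial mixture over a vocabulary of `vocab_size` categories and a penalty equal to a constant `kappa` times the number of free parameters; the abstract does not spell out the exact penalty shape, and `mixture_dimension`, `select_num_components`, `kappa`, and the log-likelihood values are all hypothetical names and toy numbers, not taken from the paper.

```python
def mixture_dimension(k, vocab_size):
    """Free parameters of a k-component multinomial mixture (assumed model):
    k - 1 mixing weights plus vocab_size - 1 probabilities per component."""
    return (k - 1) + k * (vocab_size - 1)


def select_num_components(log_liks, vocab_size, kappa):
    """Return the k minimizing -log-likelihood + kappa * dimension.

    log_liks maps each candidate k to the maximized log-likelihood of the
    k-component model (e.g. obtained by running EM for each k).
    kappa is the multiplicative constant of the penalty (an assumption here).
    """
    def criterion(k):
        return -log_liks[k] + kappa * mixture_dimension(k, vocab_size)

    return min(sorted(log_liks), key=criterion)


# Toy maximized log-likelihoods for k = 1..4 (made-up numbers).
log_liks = {1: -1200.0, 2: -1100.0, 3: -1090.0, 4: -1085.0}
best_k = select_num_components(log_liks, vocab_size=5, kappa=4.0)
print(best_k)
```

With these toy numbers the gain in log-likelihood beyond two components is smaller than the added penalty, so the criterion stops at two. In practice the constant `kappa` is unknown and has to be calibrated from the data, for example with the slope heuristic discussed in the related work listed on this page.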

Minimal penalties and the slope heuristics: a survey
The theoretical results obtained for minimal-penalty algorithms are reviewed, with a self-contained proof in the simplest framework, precise proof ideas for further generalizations, and a few new results.
Spatio-temporal mixture process estimation to detect dynamical changes in population
Pénalités minimales et heuristique de pente (Minimal penalties and the slope heuristic)
In 2001, Birgé and Massart proposed the slope heuristic for using the data to determine an optimal multiplicative constant in front of a penalty in model selection. This heuristic


Variable selection in model-based clustering for high-dimensional data
A variable selection procedure for clustering suited to high-dimensional contexts that provides ℓ1-oracle inequalities for the Lasso in the regression framework and establishes a model selection theorem for maximum likelihood estimators in a density estimation framework with a random model collection.
Clustering and variable selection for categorical multivariate data
This article investigates unsupervised classification techniques for categorical multivariate data. The study employs multivariate multinomial mixture modeling, which is a type of model particularly
Slope heuristics for variable selection and clustering via Gaussian mixtures
A "slope heuristics" method is proposed and tested to deal with this practical problem in this context; numerical experiments on simulated datasets, a curve clustering example, and a genomics application highlight the interest of the proposed heuristic.
Identifying the number of clusters in discrete mixture models
A new approach in which clustering of categorical data and the estimation of the number of clusters is carried out simultaneously, and the proposed EM-MML approach seamlessly integrates estimation and model selection in a single algorithm.
Conditional Density Estimation by Penalized Likelihood Model Selection and Applications
In this technical report, we consider conditional density estimation with a maximum likelihood approach. Under weak assumptions, we obtain a theoretical bound for a Kullback-Leibler type loss for a
Partition-based conditional density estimation
A general partition-based strategy to estimate conditional density with candidate densities that are piecewise constant with respect to the covariate is proposed and it is proved that the penalty of each model can be chosen roughly proportional to its dimension.
A non asymptotic penalized criterion for Gaussian mixture model selection
The ordered and non-ordered variable selection cases are both addressed in this paper and a general model selection theorem for MLE is used to obtain the penalty function form.
Mixture of Gaussian regressions model with logistic weights, a penalized maximum likelihood approach
A lower bound on the penalty that ensures an oracle inequality for the authors' estimator is provided, which aims at estimating the number of components of this mixture of Gaussian regressions by a penalized maximum likelihood approach.
Inference and evaluation of the multinomial mixture model for text clustering
Efficient semiparametric estimation and model selection for multidimensional mixtures
In this paper, we consider nonparametric multidimensional finite mixture models and we are interested in the semiparametric estimation of the population weights. Here, the i.i.d. observations are