• Corpus ID: 65153410

Efficient Bayesian Nonparametric Inference for Categorical Data with General High Missingness

Chaojie Wang, Linghao Shen, Han Li, Xiaodan Fan
arXiv: Methodology
Missingness in categorical data is a common problem in various real applications. Traditional approaches either utilize only the complete observations or impute the missing data by some ad hoc methods rather than the true conditional distribution of the missing data, thus losing or distorting the rich information in the partial observations. In this paper, we develop a Bayesian nonparametric approach, the Dirichlet Process Mixture of Collapsed Product-Multinomials (DPMCPM), to model the full… 

Nonparametric Bayesian Multiple Imputation for Incomplete Categorical Variables in Large-Scale Assessment Surveys

This work presents a fully Bayesian, joint modeling approach to multiple imputation for categorical data based on Dirichlet process mixtures of multinomial distributions, which automatically models complex dependencies while being computationally expedient.

Multiple Imputation of Missing Categorical and Continuous Values via Bayesian Mixture Models With Local Dependence

A nonparametric Bayesian joint model for multivariate continuous and categorical variables is presented; imputations based on the proposed model tend to have better repeated sampling properties than the default application of chained equations in this realistic setting.

Bayesian Multilevel Latent Class Models for the Multiple Imputation of Nested Categorical Data

  • D. Vidotto, J. Vermunt, Katrijn Van Deun
  • Computer Science
    Journal of educational and behavioral statistics : a quarterly publication sponsored by the American Educational Research Association and the American Statistical Association
  • 2018
Results indicate that the BMLC model is able to recover unbiased parameter estimates of the analysis models considered in the authors' studies, as well as to correctly reflect the uncertainty due to missing data, outperforming the competing methods.

Bayesian Simultaneous Edit and Imputation for Multivariate Categorical Data

A Bayesian hierarchical model is used that couples a stochastic model for the measurement error process with a Dirichlet process mixture of multinomial distributions for the underlying, error-free values and is restricted to have support only on the set of theoretically possible combinations.

MIMCA: multiple imputation for categorical variables with multiple correspondence analysis

The proposed method provides a good point estimate of the parameters of the analysis model considered, such as the coefficients of a main effects logistic regression model, and reliable estimates of the variability of the estimators.


The proposed multiple imputation method, which is implemented in Latent GOLD software for latent class analysis, is illustrated with two examples and compared with well-established methods such as maximum likelihood.

Nonparametric Bayes Modeling of Multivariate Categorical Data

  • D. Dunson, C. Xing
  • Computer Science, Mathematics
    Journal of the American Statistical Association
  • 2012
This article develops a nonparametric Bayes approach, which defines a prior with full support on the space of distributions for multiple unordered categorical variables, and shows this can be accomplished through a Dirichlet process mixture of product multinomial distributions, which is also a convenient form for posterior computation.

Mixture analysis of multivariate categorical data with covariates and missing entries

Dirichlet Process Mixture Models for Modeling and Generating Synthetic Versions of Nested Categorical Data

We present a Bayesian model for estimating the joint distribution of multivariate categorical data when units are nested within groups. Such data arise frequently in social science settings, for

Review: a gentle introduction to imputation of missing values.