Dirichlet Process Mixture Models for Modeling and Generating Synthetic Versions of Nested Categorical Data

Jingchen Hu, Jerome P. Reiter, Quanli Wang. arXiv: Methodology.

We present a Bayesian model for estimating the joint distribution of multivariate categorical data when units are nested within groups. Such data arise frequently in social science settings, for example, people living in households. The model assumes that (i) each group is a member of a group-level latent class, and (ii) each unit is a member of a unit-level latent class nested within its group-level latent class. This structure allows the model to capture dependence among units in the same… 
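
The two-level class structure in points (i) and (ii) can be read as a generative process, sketched below in Python. This is an illustration only, not the authors' implementation: the function name, the fixed household size, the parameter values, and the assumption that variables are drawn independently given the two class labels are all hypothetical choices made for this example.

```python
import random

def sample_household(rng, group_weights, unit_weights, unit_probs, size):
    """Draw one household from a toy two-level latent class model."""
    # (i) One group-level latent class for the whole household.
    g = rng.choices(range(len(group_weights)), weights=group_weights)[0]
    members = []
    for _ in range(size):
        # (ii) A unit-level latent class whose weights depend on g.
        m = rng.choices(range(len(unit_weights[g])), weights=unit_weights[g])[0]
        # Assumption: given (g, m), each categorical variable is drawn independently.
        record = [rng.choices(range(len(p)), weights=p)[0]
                  for p in unit_probs[g][m]]
        members.append(record)
    return g, members

# Hypothetical parameters: 2 group classes, 2 unit classes per group class,
# and 2 variables per person (with 2 and 3 levels).
GROUP_WEIGHTS = [0.6, 0.4]
UNIT_WEIGHTS = [[0.7, 0.3], [0.2, 0.8]]
UNIT_PROBS = [
    [[[0.9, 0.1], [0.5, 0.3, 0.2]], [[0.2, 0.8], [0.1, 0.1, 0.8]]],
    [[[0.5, 0.5], [0.3, 0.4, 0.3]], [[0.1, 0.9], [0.6, 0.2, 0.2]]],
]

rng = random.Random(0)
group_class, household = sample_household(
    rng, GROUP_WEIGHTS, UNIT_WEIGHTS, UNIT_PROBS, size=3)
print(group_class, household)
```

Because every member of a household shares the same group-level class, their unit-level classes (and hence their records) are dependent, which is the within-group dependence the abstract refers to.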


Simultaneous Edit and Imputation For Household Data with Structural Zeros

A model-based engine for editing and imputation of household data based on a Bayesian hierarchical model that propagates uncertainty due to unknown locations of errors and missing values, generates plausible datasets that satisfy all edit constraints, and can preserve multivariate relationships within and across individuals in the same household is presented.


A procedure for specifying which variables with low rates of missingness to include in the focus set is presented, and the performance of the imputation procedure is examined using simulation studies based on artificial data and on data from the American Community Survey.

Efficient Bayesian Nonparametric Inference for Categorical Data with General High Missingness

A Bayesian nonparametric approach, the Dirichlet Process Mixture of Collapsed Product-Multinomials (DPMCPM), is developed. It can model general missing mechanisms by creating an extra category to denote missingness, which implicitly integrates out the missing part with regard to its true conditional distribution.
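
The "extra category" idea can be illustrated with a small recoding step. The sketch below is hypothetical (the function name and data are made up for this example) and shows only how missing items might be mapped to an additional level; it does not implement the DPMCPM model itself.

```python
def add_missing_category(records, num_levels):
    """Recode None in column j as the extra level index num_levels[j]."""
    return [
        [num_levels[j] if value is None else value
         for j, value in enumerate(row)]
        for row in records
    ]

# Two variables with 2 and 3 observed levels; None marks a missing item.
data = [[0, None], [None, 2], [1, 1]]
encoded = add_missing_category(data, num_levels=[2, 3])
print(encoded)  # [[0, 3], [2, 2], [1, 1]]
```

After recoding, variable j has `num_levels[j] + 1` levels, so a mixture of multinomials can be fit to the data without first imputing the missing entries.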

Nonparametric statistical inference and imputation for incomplete categorical data

Under the framework of latent class analysis, DPMCPM can model general missing mechanisms by creating an extra category to denote missingness, which implicitly integrates out the missing part with regard to its true conditional distribution.

Bayesian latent class models for the multiple imputation of cross-sectional, multilevel and longitudinal categorical data

Novel models for generating imputed values to replace missing data are proposed. Results show that latent class models are among the best-performing methods for multiple imputation and are therefore recommended for applied researchers who need to analyze categorical data with missing values.

A Comparative Study of Imputation Methods for Multivariate Ordinal Data.

An empirical evaluation of several multiple imputation (MI) methods using simulation studies based on ordinal variables selected from the 2018 American Community Survey (ACS), suggesting that MI using proportional odds logistic regression models, classification and regression trees, and DP mixtures of multinomial distributions generally outperforms the other methods.

Multiple Imputation of Missing Values in Household Data with Structural Zeros

We present an approach for imputation of missing items in multivariate categorical data nested within households. The approach relies on a latent class model that (i) allows for household level and

MCMC Sampling Estimation of Poisson-Dirichlet Process Mixture Models

In this article, we aim to estimate the parameters of a Poisson-Dirichlet mixture model with a multigroup data structure by empirical Bayes. The number of mixture components with Bayesian nonparametric

Bayesian Data Synthesis and Disclosure Risk Quantification: An Application to the Consumer Expenditure Surveys

This work develops disclosure risk measures to quantify the inherent risks in the original CE data and how those risks are ameliorated by two nonparametric Bayesian models used as data synthesizers for the county identifier of each data record.

A hierarchical mixture modeling framework for population synthesis



Latent class and finite mixture models for multilevel data sets

J. Vermunt. Mathematics. Statistical Methods in Medical Research. 2008.
An extension of latent class (LC) and finite mixture models for the analysis of hierarchical data sets is described, along with an adapted version of the expectation-maximization algorithm for maximum likelihood estimation.

Multiple Imputation of Missing Categorical and Continuous Values via Bayesian Mixture Models With Local Dependence

A nonparametric Bayesian joint model for multivariate continuous and categorical variables is presented; imputations based on the proposed model tend to have better repeated sampling properties than the default application of chained equations in a realistic setting.

Bayesian Estimation of Discrete Multivariate Latent Structure Models With Structural Zeros

An approach for estimating posterior distributions in Bayesian latent structure models with potentially many structural zeros is presented, along with an algorithm for collapsing a large set of structural zero combinations into a much smaller set of disjoint marginal conditions, which speeds up computation.

Incorporating Marginal Prior Information in Latent Class Models

An approach to incorporating informative prior beliefs about marginal probabilities into Bayesian latent class models for categorical data is presented and examined using a variety of simulations based on data from the American Community Survey.

Exploratory latent structure analysis using both identifiable and unidentifiable models

This paper considers a wide class of latent structure models. These models can serve as possible explanations of the observed relationships among a set of m manifest polytomous variables. The

Nonparametric Bayesian Multiple Imputation for Incomplete Categorical Variables in Large-Scale Assessment Surveys

This work presents a fully Bayesian, joint modeling approach to multiple imputation for categorical data based on Dirichlet process mixtures of multinomial distributions, which automatically models complex dependencies while being computationally expedient.
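
A common way to represent the mixture weights of a Dirichlet process mixture is the truncated stick-breaking construction. The sketch below assumes that representation for illustration; the summary above does not say which representation or sampler the cited work uses, and the function name, concentration value, and truncation level are hypothetical.

```python
import random

def stick_breaking_weights(rng, alpha, num_components):
    """Truncated stick-breaking draw of mixture weights."""
    weights = []
    remaining = 1.0
    for h in range(num_components):
        # The last break takes the whole remaining stick so the weights sum to 1.
        v = 1.0 if h == num_components - 1 else rng.betavariate(1.0, alpha)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights

weights = stick_breaking_weights(random.Random(0), alpha=1.0, num_components=10)
print(sum(weights))
```

Smaller values of `alpha` concentrate most of the mass on the first few components, which is how such a prior can favor using only as many mixture components as the data need.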

Disclosure Risk Evaluation for Fully Synthetic Categorical Data

This work uses a “worst-case” scenario of an intruder knowing all but one of the records in the confidential data to compute probability distributions of unknown confidential data values given the synthetic data and assumptions about intruder knowledge.

Micro–macro multilevel latent class models with multiple discrete individual-level variables

An existing micro–macro method for a single individual-level variable is extended to the multivariate situation by presenting two multilevel latent class models in which multiple discrete

Nonparametric Bayes modeling with sample survey weights.

Domain-Level Covariance Analysis for Multilevel Survey Data With Structured Nonresponse

This work analyzes relationships among quality measures at the domain level using generalized variance–covariance functions and compares ML estimates of this factor structure with those from several Bayesian models with different prior distributions for the between-domain covariance.