Evaluating the Effects of Missing Values and Mixed Data Types on Social Sequence Clustering Using t-SNE Visualization

Alina Lazar, Ling Jin, C. Anna Spurlock, Kesheng Wu, Alex Sim, Annika Todd
Journal of Data and Information Quality (JDIQ), pp. 1-22
The goal of this work is to investigate the impact of missing values in clustering joint categorical social sequences. Identifying patterns in sociodemographic longitudinal data is important in a number of social science settings. However, performing analytical operations, such as clustering on life course trajectories, is challenging due to the categorical and multidimensional nature of the data, their mixed data types, and corruption by missing and inconsistent values. Data quality issues…
Context-Based Evaluation of Dimensionality Reduction Algorithms—Experiments and Statistical Significance Analysis
This article compiles 12 state-of-the-art quality metrics, categorizes them into 5 analytical contexts, and presents practitioners’ guidelines for selecting an appropriate dimensionality reduction algorithm in each of these contexts.
Clustering Life Course to Understand the Heterogeneous Effects of Life Events, Gender, and Generation on Habitual Travel Modes
It is found that events occurring relatively early in life are more strongly associated with changes in mode-use behavior, and that mode use can also be affected by the relative order of events.
Machine Learning for Prediction of Mid to Long Term Habitual Transportation Mode Use
This paper combines sequence clustering and tree-based machine learning methods, coupled with TreeExplainer, to predict and interpret habitual travel modes using mid- to long-term predictors, and demonstrates a promising step toward interpretable machine learning applications to mid- to long-term prediction of travel modes for transportation planning.
Sharing (mis) information on social networking sites. An exploration of the norms for distributing content authored by others
This article explores the norms that govern regular users’ acts of sharing content on social networking sites. Many debates on how to counteract misinformation on social networking sites focus on the…


Multichannel Sequence Analysis Applied to Social Science Data
Applications of optimal matching analysis in the social sciences are typically based on sequences of specific social statuses that model the residential, family, or occupational trajectories of…
This article proposes multichannel sequence analysis (MCSA), which extends the usual optimal matching analysis (OMA) to multiple life spheres simultaneously, and finds that MCSA offers an alternative to the sole use of an ex-post sum of distance matrices by locally aligning distinct life trajectories simultaneously.
Comparing latent class and dissimilarity based clustering for mixed type variables with application to social stratification
Data with mixed type (metric/ordinal/nominal) variables can be clustered by a latent class mixture model approach, which assumes local independence. Such data are typical in social stratification,…
What matters in differences between life trajectories: a comparative review of sequence dissimilarity measures
The study shows that there is no universally optimal distance index and that the choice of a measure depends on which aspect the authors want to focus on. It also introduces novel ways of measuring dissimilarities that overcome some flaws in existing measures.
Multiple Imputation for Life-Course Sequence Data
As holistic analysis of life-course sequences becomes more common, using optimal matching (OM) and other approaches, the problem of missing data becomes more serious. Longitudinal data is prone to…
Discrepancy Analysis of State Sequences
In this article, the authors define a methodological framework for analyzing the relationship between state sequences and covariates. Inspired by the principles of analysis of variance, this approach…
Clustering work and family trajectories by using a divisive algorithm
Summary. We present an approach to the construction of clusters of life course trajectories and use it to obtain ideal types of trajectories that can be interpreted and analysed meaningfully. We…
WeightedCluster Library Manual: A practical guide to creating typologies of trajectories in the social sciences with R
This manual presents the WeightedCluster library, offers a step-by-step guide to creating typologies of sequences for the social sciences, and shows that these methods offer an important descriptive point of view on sequences by bringing recurrent patterns to light.
Estimating the Relationship between Time-varying Covariates and Trajectories: The Sequence Analysis Multistate Model Procedure
The relationship between processes and time-varying covariates is of central theoretical interest in addressing many social science research questions. On the one hand, event history analysis (EHA)…
Exploring sequences: a graphical tool based on multi-dimensional scaling
A new tool for the graphical exploratory analysis of sequences is introduced, and the article describes how these plots can be used to gain insights into the main features of sequences and the relationships between sequences and external information.