Nonparametric cluster significance testing with reference to a unimodal null distribution

@article{Helgeson2020NonparametricCS,
  title={Nonparametric cluster significance testing with reference to a unimodal null distribution},
  author={Erika S. Helgeson and David M. Vock and Eric Bair},
  journal={Biometrics},
  year={2020},
  volume={77},
  pages={1215--1226}
}
Cluster analysis is an unsupervised learning strategy that is exceptionally useful for identifying homogeneous subgroups of observations in data sets of unknown structure. However, it is challenging to determine if the identified clusters represent truly distinct subgroups rather than noise. Existing approaches for addressing this problem tend to define clusters based on distributional assumptions, ignore the inherent correlation structure in the data, or are not suited for high‐dimension low… 
2 Citations
Clustering inference in multiple groups
TLDR
This work presents a U-statistics-based approach, specially tailored for high-dimensional data, that clusters the data into three groups while assessing the significance of such partitions, and develops its asymptotic theory.

References

Statistical Significance of Clustering Using Soft Thresholding
  • Hanwen Huang, Yufeng Liu, M. Yuan, J. Marron
  • Computer Science
    Journal of Computational and Graphical Statistics
  • 2015
TLDR
It is shown that the original eigenvalue estimation can lead to a test that suffers from severe inflation of Type I error, in the important case where there are a few very large eigenvalues, which leads to a much improved SigClust.
Statistical Significance of Clustering for High-Dimension, Low–Sample Size Data
TLDR
Simulated examples, as well as an application to a real cancer microarray data set, show that the proposed SigClust method works remarkably well for assessing significance of clustering.
Detecting the Presence of Mixing with Multiscale Maximum Likelihood
A test of homogeneity tries to decide whether observations come from a single distribution or from a mixture of several distributions. A powerful theory has been developed for the case where the
A semiparametric method for clustering mixed data
TLDR
KAMILA (KAy-means for MIxed LArge data) is developed, a clustering method that addresses this fundamental problem directly and is shown to be effective in a series of Monte Carlo simulation studies and a set of real-world applications.
Unimodal Density Estimation Using Kernel Methods
TLDR
It is proposed that the amount of tilting be chosen in order to minimise, subject to unimodality, the integrated squared distance between a conventional density estimator and its tilted version, and it is shown that in classes of densities that are of practical interest, the method enhances performance without suffering any deleterious first-order impact on asymptotic performance.
Clustering and classification problems in genetics through U-statistics
TLDR
A statistical test to assess group homogeneity that takes multiple testing issues into account, and a clustering algorithm based on dissimilarities within and between groups that greatly speeds up the homogeneity test, are proposed.
Identification of relevant subtypes via preweighted sparse clustering
Bootstrapping for Significance of Compact Clusters in Multidimensional Datasets
TLDR
A bootstrap approach for assessing significance in the clustering of multidimensional datasets, and a viable approach for determining the minimal and optimal numbers of colors needed to display an image without significant loss in resolution, are proposed.
Are clusters found in one dataset present in another dataset?
TLDR
The connection between reproducibility and prediction accuracy is exploited to develop a validation procedure for clusters found in datasets independent of the one in which they were characterized, and the in-group proportion (IGP) is the best measure of prediction accuracy.
New Insights and Faster Computations for the Graphical Lasso
TLDR
A very simple necessary and sufficient condition can be employed to determine whether the estimated inverse covariance matrix will be block diagonal, and if so, then to identify the blocks in the graphical lasso solution.