Penalized and weighted K-means for clustering with scattered objects and prior information in high-throughput biological data
@article{Tseng2007PenalizedAW,
title={Penalized and weighted K-means for clustering with scattered objects and prior information in high-throughput biological data},
author={George C. Tseng},
journal={Bioinformatics},
year={2007},
volume={23},
number={17},
pages={2247--2255}
}
MOTIVATION
Cluster analysis is one of the most important data mining tools for investigating high-throughput biological data. […] Two major extensions of K-means are involved: penalization and weighting. The additive penalty term allows a set of scattered objects to remain unclustered. Weights are introduced to account for prior information about preferred or prohibited cluster patterns. Their relationship with the classification likelihood of Gaussian mixture models is…
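The two extensions named in the abstract can be illustrated with a small sketch. It is not the paper's implementation: the abstract is truncated, so the objective used here (weighted within-cluster squared distance plus an additive penalty `lam` for every object left unclustered) is an assumed reading, and the function name, the per-object `weights` argument, and the `-1` noise label are all hypothetical choices made for illustration. Under that assumption, an object joins its nearest cluster only when its weighted squared distance is below `lam`; otherwise it is cheaper to pay the penalty and leave it scattered.

```python
import numpy as np

def penalized_weighted_kmeans(X, init_centers, lam, weights=None, n_iter=50):
    """Illustrative sketch (assumed objective, not the paper's code).

    Minimizes  sum_i w_i * ||x_i - c_{label(i)}||^2  over clustered objects
    plus lam * (number of unclustered objects). Label -1 marks noise.
    Weights are assumed to be strictly positive.
    """
    X = np.asarray(X, dtype=float)
    centers = np.asarray(init_centers, dtype=float).copy()
    w = np.ones(len(X)) if weights is None else np.asarray(weights, dtype=float)
    labels = np.full(len(X), -1)

    for _ in range(n_iter):
        # Weighted squared distance of every object to every center: shape (n, k).
        d2 = w[:, None] * ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        nearest = d2.argmin(axis=1)
        cost = d2[np.arange(len(X)), nearest]

        # Assignment step: cluster the object only if that is cheaper than the penalty.
        new_labels = np.where(cost < lam, nearest, -1)

        # Update step: the weighted mean minimizes the weighted squared distance.
        for j in range(len(centers)):
            members = new_labels == j
            if members.any():
                centers[j] = np.average(X[members], axis=0, weights=w[members])

        if np.array_equal(new_labels, labels):
            break
        labels = new_labels

    return labels, centers


# Toy data: two tight groups and one far-away object.
X = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10], [5, 50]]
labels, centers = penalized_weighted_kmeans(X, [[0, 0], [10, 10]], lam=4.0)
print(labels)
```

In this toy run the far-away object is expected to end up in the noise set, because its distance to either center far exceeds `lam`. A larger `lam` forces more objects into clusters, and an unbounded `lam` would reduce the sketch to ordinary (weighted) K-means.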
106 Citations
Object Weighting: A New Clustering Approach to Deal with Outliers and Cluster Overlap in Computational Biology
- Computer Science | IEEE/ACM Transactions on Computational Biology and Bioinformatics
- 2021
A new general data partitioning method that includes an object-weighting step to assign higher weights to outliers and objects that cause cluster overlap, which largely outperforms X-means, DAPC and Prediction Strength as well as the K-means algorithm based on feature weighting.
Simultaneous Estimation of Number of Clusters and Feature Sparsity in Clustering High-Dimensional Data
- Computer Science
- 2019
A resampling method that achieves better clustering accuracy with fewer selected predictive genes in almost all real applications and performs among the best over classical methods in estimating K in low-dimensional data.
Penalized model-based clustering with unconstrained covariance matrices.
- Computer Science | Electronic Journal of Statistics
- 2009
This article proposes a regularized Gaussian mixture model permitting a treatment of general covariance matrices, taking various dependencies into account, and derives an E-M algorithm utilizing the graphical lasso for parameter estimation, achieving better clustering and variable selection.
Finding reproducible cluster partitions for the k-means algorithm
- Computer Science, Mathematics | BMC Bioinformatics
- 2013
Stability measures previously presented in the context of finding optimal values of cluster number are extended into a component of a 2-d map of the local minima found by the k-means algorithm, from which not only can values of k be identified for further analysis but, more importantly, it is made clear whether the best SSQ is a suitable solution or whether obtaining a consistently good partition requires further application of the stability index.
Ensemble Clustering for Biological Datasets
- Computer Science
- 2012
Clustering is an unsupervised learning technique used in diverse domains, including bioinformatics, to obtain biologically meaningful partitions; there is no single best clustering approach for the problem at hand, and clustering algorithms are biased towards certain criteria.
Dynamically weighted clustering with noise set
- Biology | Bioinform.
- 2010
A new clustering algorithm, Dynamically Weighted Clustering with Noise set (DWCN), which makes use of gene annotation information and allows for a set of scattered genes, the noise set, to be left out of the main clusters.
Normalized EM algorithm for tumor clustering using gene expression data
- Computer Science | 2008 8th IEEE International Conference on BioInformatics and BioEngineering
- 2008
A novel normalized Expectation-Maximization (EM) algorithm is proposed that is stable even with random initializations for its EM iterative procedure and is the first mixture model-based clustering algorithm that is shown to be stable when working directly with very high dimensional microarray data sets in the sample clustering problem.
A sparse negative binomial mixture model for clustering RNA-seq count data
- Computer Science | Biostatistics
- 2021
A negative binomial mixture model with lasso or fused lasso gene regularization is developed to cluster samples with high-dimensional gene features, showing superior performance in clustering accuracy, feature selection, and biological interpretation in pathways.
Solution path clustering with adaptive concave penalty
- Computer Science
- 2014
A new clustering methodology that introduces the idea of a regularization path into unsupervised learning and is capable of simultaneously separating irrelevant or noisy observations that show no grouping pattern, which can greatly improve data interpretation.
Solution Path Clustering with Minimax Concave Penalty and Its Applications to Noisy Big Data
- Computer Science
- 2014
A new clustering methodology that introduces the idea of a regularization path into unsupervised learning, preserves the ability to isolate noise, includes a solution selection mechanism that ultimately provides one clustering solution with an estimated number of clusters, and is shown to be able to extract small tight clusters from noisy data.
References
Model-Based Clustering, Discriminant Analysis, and Density Estimation
- Computer Science
- 2002
This work reviews a general methodology for model-based clustering that provides a principled statistical approach to important practical questions that arise in cluster analysis, such as how many clusters are there, which clustering method should be used, and how should outliers be handled.
Tight clustering: a resampling-based approach for identifying stable and tight patterns in data.
- Computer Science | Biometrics
- 2005
A method for clustering that produces tight and stable clusters without forcing all points into clusters is proposed and applied to analyze a set of expression profiles in the study of embryonic stem cells.
Model-based clustering and data transformations for gene expression data
- Computer Science | Bioinform.
- 2001
The model-based approach has superior performance on synthetic data sets, consistently selecting the correct model and the number of clusters, and the validity of the Gaussian mixture assumption on different transformations of real data is explored.
Evaluation and comparison of gene clustering methods in microarray analysis
- Computer Science | Bioinform.
- 2006
The results show that tight clustering and model-based clustering consistently outperform other clustering methods on both simulated and real data, while hierarchical clustering and SOM perform among the worst.
A prediction-based resampling method for estimating the number of clusters in a dataset
- Biology | Genome Biology
- 2002
A new prediction-based resampling method, Clest, is developed, to estimate the number of clusters in a dataset, and was generally found to be more accurate and robust than the six existing methods considered in the study.
Incorporating biological knowledge into distance-based clustering analysis of microarray gene expression data
- Computer Science | Bioinform.
- 2006
This work proposes incorporating known gene functions into a new distance metric, which shrinks a gene expression-based distance towards 0 if and only if the two genes share a common gene function.
Bayesian infinite mixture model based clustering of gene expression profiles
- Computer Science | Bioinform.
- 2002
A clustering procedure based on the Bayesian infinite mixture model is applied to clustering gene expression profiles; it allows uncertainties involved in the model selection to be incorporated into the final assessment of confidence in similarities of expression profiles.
CLICK and EXPANDER: a system for clustering and visualizing gene expression data
- Computer Science | Bioinform.
- 2003
A novel clustering algorithm, called CLICK, is presented, which utilizes graph-theoretic and statistical techniques to identify tight groups (kernels) of highly similar elements, which are likely to belong to the same true cluster.
A probabilistic framework for semi-supervised clustering
- Computer Science | KDD
- 2004
A probabilistic model for semi-supervised clustering based on Hidden Markov Random Fields (HMRFs) that provides a principled framework for incorporating supervision into prototype-based clustering; experimental results demonstrate the advantages of the proposed framework.
A mixture model-based approach to the clustering of microarray expression data
- Computer Science | Bioinform.
- 2002
The usefulness of the EMMIX-GENE approach for the clustering of tissue samples is demonstrated on two well-known data sets on colon and leukaemia tissues, and relevant subsets of the genes are able to be selected that reveal interesting clusterings of the tissues that are either consistent with the external classified tissues or with background and biological knowledge of these sets.







