Corpus ID: 222124965

Regularized K-means through hard-thresholding

@article{Raymaekers2020RegularizedKT,
  title={Regularized K-means through hard-thresholding},
  author={Jakob Raymaekers and Ruben H. Zamar},
  journal={ArXiv},
  year={2020},
  volume={abs/2010.00950}
}
We study a framework of regularized $K$-means methods based on direct penalization of the size of the cluster centers. Different penalization strategies are considered and compared through simulation and theoretical analysis. Based on the results, we propose HT $K$-means, which uses an $\ell_0$ penalty to induce sparsity in the variables. Different techniques for selecting the tuning parameter are discussed and compared. The proposed method stacks up favorably with the most popular regularized… 
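To make the hard-thresholding mechanism concrete, here is a minimal Python sketch of Lloyd-style $K$-means in which the updated cluster centers are hard-thresholded entrywise on standardized data. The function name, constants, and the per-entry threshold rule are illustrative assumptions, not the authors' implementation; the paper's exact penalty and tuning-parameter selection differ.

```python
# Minimal, illustrative sketch of K-means with hard-thresholding of the
# cluster centers (an l0-type penalty on the center entries).
# Hypothetical helper, not the authors' implementation: the threshold rule
# below is a simplification (the paper's rule also involves cluster sizes).
import numpy as np

def ht_kmeans(X, k, lam, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    X = (X - X.mean(axis=0)) / X.std(axis=0)          # standardize variables
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # assignment step: nearest (possibly sparse) center
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # update step: cluster means, then hard-threshold small entries to zero
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
        centers[centers ** 2 < lam] = 0.0             # hard threshold (illustrative)
    return labels, centers
```

On standardized data, a center entry thresholded to zero in every cluster means that variable no longer influences the assignments, which is how sparsity in the variables arises.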

References

SHOWING 1-10 OF 52 REFERENCES

Penalized Model-Based Clustering with Application to Variable Selection

TLDR
A penalized likelihood approach with an L1 penalty function is proposed, automatically realizing variable selection via thresholding and delivering a sparse solution in model-based clustering analysis with a common diagonal covariance matrix.
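Schematically (as an assumption about the general form, not a quotation from the reference), the L1 penalty turns the M-step update of each cluster mean into a soft-thresholding operation,
$$\hat{\mu}_{kj} = \operatorname{sign}(\tilde{\mu}_{kj})\,\big(|\tilde{\mu}_{kj}| - \gamma\big)_+,$$
where $\tilde{\mu}_{kj}$ is the unpenalized mean of variable $j$ in cluster $k$ and $\gamma$ is a threshold determined by the penalty parameter; a mean shrunk exactly to zero in every cluster drops the corresponding variable.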

Performance of Johnson-Lindenstrauss transform for k-means and k-medians clustering

TLDR
This work shows that the cost of the optimal solution is preserved up to a factor of $(1+\varepsilon)$ under a projection onto a random $O(\log(k/\varepsilon)/\varepsilon^{2})$-dimensional subspace, and that this bound on the dimension is nearly optimal.
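A minimal sketch of how such a projection is typically used (the constant in the target dimension, the Gaussian projection matrix, and the use of scikit-learn's KMeans are assumptions for illustration, not details taken from the reference):

```python
# Sketch of the Johnson-Lindenstrauss approach to k-means: project onto a
# random low-dimensional subspace, cluster there, and reuse the labels for
# the original points. Target dimension follows the O(log(k/eps)/eps^2) rate.
import numpy as np
from sklearn.cluster import KMeans

def jl_kmeans(X, k, eps=0.5, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = min(d, int(np.ceil(np.log(k / eps) / eps ** 2)))  # illustrative constant = 1
    G = rng.standard_normal((d, m)) / np.sqrt(m)           # random Gaussian projection
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X @ G)
```

The labels found in the projected space are used as the clustering of the original points; the $(1+\varepsilon)$ guarantee refers to the k-means cost of that partition in the original space.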

Explainable k-Means and k-Medians Clustering

TLDR
It is shown that popular top-down decision tree algorithms may lead to clusterings with arbitrarily large cost, and that any tree-induced clustering must in general incur an $\Omega(\log k)$ approximation factor compared to the optimal clustering.

Degrees of Freedom and Model Selection for k-means Clustering

A LASSO-penalized BIC for mixture model selection

TLDR
A LASSO-penalized BIC (LPBIC) is introduced to overcome the problem of overestimating or underestimating the number of components in higher dimensions and is shown to match or outperform the BIC in several situations.

Unsupervised Feature Selection for the $k$-means Clustering Problem

TLDR
It is proved that running any $\gamma$-approximate k-means algorithm on the features selected by the presented algorithm yields a $(1 + (1+\epsilon)\gamma)$-approximate partition with high probability.

Consistent selection of the number of clusters via cross-validation

TLDR
A novel selection criterion is proposed that is applicable to all kinds of clustering algorithms, including both distance-based and non-distance-based algorithms; it measures the robustness of any given clustering algorithm against the randomness in sampling.
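As a rough illustration of the stability idea (the three-way split, the use of scikit-learn's KMeans, and the pairwise disagreement measure below are simplifying assumptions, not the exact criterion of the reference):

```python
# Stability-style choice of k: cluster two disjoint subsets of the data,
# compare the partitions they induce on held-out points, and prefer the k
# with the smallest disagreement.
import numpy as np
from sklearn.cluster import KMeans

def instability(X, k, n_splits=20, seed=0):
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_splits):
        idx = rng.permutation(len(X))
        a, b, test = np.array_split(idx, 3)
        la = KMeans(n_clusters=k, n_init=10).fit(X[a]).predict(X[test])
        lb = KMeans(n_clusters=k, n_init=10).fit(X[b]).predict(X[test])
        # disagreement: pairs of test points grouped together by one
        # clustering but separated by the other
        same_a = la[:, None] == la[None, :]
        same_b = lb[:, None] == lb[None, :]
        scores.append(np.mean(same_a != same_b))
    return np.mean(scores)
```

Candidate values of k are then compared by this instability score, the smallest score indicating the most reproducible clustering.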

Smoothly Clipped Absolute Deviation on High Dimensions

TLDR
An efficient optimization algorithm is developed that is fast and always converges to a local minimum, and it is proved that the SCAD estimator retains the oracle property in high-dimensional problems.
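For reference, the SCAD penalty of Fan and Li, which this work studies in high dimensions, is the piecewise function
$$
p_\lambda(\theta) =
\begin{cases}
\lambda |\theta|, & |\theta| \le \lambda,\\
\dfrac{2a\lambda|\theta| - \theta^2 - \lambda^2}{2(a-1)}, & \lambda < |\theta| \le a\lambda,\\
\dfrac{(a+1)\lambda^2}{2}, & |\theta| > a\lambda,
\end{cases}
$$
with $a > 2$ (a common default is $a = 3.7$); it shrinks small coefficients like the L1 penalty but leaves large coefficients essentially unpenalized, which is what underlies the oracle property.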

Randomized Dimensionality Reduction for $k$-Means Clustering

TLDR
The first provably accurate feature selection method for k-means clustering is presented, together with two feature extraction methods that improve on existing results in terms of time complexity and the number of features that need to be extracted.

Model Selection for Correlated Data with Diverging Number of Parameters

TLDR
The penalized quadratic inference function is proposed to perform model selection and estimation in the framework of a diverging number of regression parameters and is shown to enjoy the oracle property: it identifies the non-zero components consistently with probability tending to 1, and any finite linear combination of the estimated non-zero components has an asymptotic normal distribution.
...