# Efficient sparse spherical k-means for document clustering

```bibtex
@article{Knittel2021EfficientSS,
  title   = {Efficient sparse spherical k-means for document clustering},
  author  = {Johannes Knittel and Steffen Koch and Thomas Ertl},
  journal = {Proceedings of the 21st ACM Symposium on Document Engineering},
  year    = {2021}
}
```

Spherical k-Means is frequently used to cluster document collections because it performs reasonably well in many settings and is computationally efficient. However, the time complexity increases linearly with the number of clusters k, which limits the suitability of the algorithm for larger values of k depending on the size of the collection. Optimizations targeted at the Euclidean k-Means algorithm largely do not apply because the cosine distance is not a metric. We therefore propose an…
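To make the linear dependence on k concrete, here is a minimal dense sketch of plain spherical k-means (not the paper's optimized sparse variant): documents are normalized to unit length so that a dot product equals cosine similarity, and the assignment step costs O(n·k) similarity computations per iteration. All names are illustrative.

```python
import numpy as np

def spherical_kmeans(X, k, iters=20, seed=0):
    """Minimal sketch of spherical k-means on dense data.

    Rows of X are unit-normalized so that X @ centroids.T gives cosine
    similarities; centroids are re-projected onto the unit sphere after
    each update. Illustrative only, not the paper's sparse algorithm.
    """
    rng = np.random.default_rng(seed)
    # Normalize documents so the dot product equals cosine similarity.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: O(n * k) similarity computations -- the cost
        # that grows linearly with the number of clusters k.
        labels = (X @ centroids.T).argmax(axis=1)
        # Update step: mean direction of each cluster, re-normalized.
        for j in range(k):
            members = X[labels == j]
            if len(members):
                c = members.sum(axis=0)
                centroids[j] = c / np.linalg.norm(c)
    return labels, centroids
```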

## One Citation

### Document clustering

- Computer Science, WIREs Computational Statistics
- 2022

The main clustering algorithms used for text data are critically analyzed, considering prototype-based, graph-based, hierarchical, and model-based approaches.

## References

Showing 1-10 of 12 references

### Concept Decompositions for Large Sparse Text Data Using Clustering

- Computer Science, Machine Learning
- 2004

The concept vectors produced by the spherical k-means algorithm constitute a powerful “basis” for text data sets: they are localized in the word space, are sparse, and tend towards orthonormality.

### Using the Triangle Inequality to Accelerate k-Means

- Computer Science, ICML
- 2003

Shows how to accelerate k-means dramatically while still always computing exactly the same result as the standard algorithm; the method is effective for datasets with up to 1000 dimensions and becomes increasingly effective as the number of clusters k increases.
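The core lemma in this acceleration is that if d(c, c') ≥ 2·d(x, c) for a point x, its current best centroid c, and another centroid c', then d(x, c') ≥ d(x, c), so the distance to c' never needs to be computed. A small sketch of this pruning rule (function and variable names are hypothetical; the full method maintains additional per-point bounds):

```python
import numpy as np

def assign_with_triangle_bound(X, centroids):
    """Assignment step with triangle-inequality pruning.

    Skips computing d(x, c_j) whenever d(c_best, c_j) >= 2 * d(x, c_best),
    because the triangle inequality then guarantees d(x, c_j) >= d(x, c_best).
    Returns the same labels as brute force, plus the number of skipped
    distance computations. Illustrative sketch only.
    """
    k = len(centroids)
    # Precompute pairwise centroid distances once per assignment pass.
    cc = np.linalg.norm(centroids[:, None] - centroids[None, :], axis=2)
    labels = np.empty(len(X), dtype=int)
    skipped = 0
    for i, x in enumerate(X):
        best = 0
        best_d = np.linalg.norm(x - centroids[0])
        for j in range(1, k):
            if cc[best, j] >= 2 * best_d:
                skipped += 1  # bound proves c_j cannot be closer than c_best
                continue
            d = np.linalg.norm(x - centroids[j])
            if d < best_d:
                best, best_d = j, d
        labels[i] = best
    return labels, skipped
```

Note that this pruning relies on the triangle inequality holding for the distance function, which is exactly why (as the abstract points out) such optimizations do not carry over to the cosine distance.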

### Streaming k-means approximation

- Computer Science, NIPS
- 2009

A clustering algorithm that approximately optimizes the k-means objective in the one-pass streaming setting, applicable to unsupervised learning on massive data sets or on resource-constrained devices.

### Exact and Approximate Maximum Inner Product Search with LEMP

- Computer Science, ACM Trans. Database Syst.
- 2017

An extensive experimental study provides insight into the performance of many state-of-the-art techniques, including LEMP, on multiple real-world datasets; LEMP was often significantly faster or more accurate than alternative methods.

### Streaming k-means on well-clusterable data

- Computer Science, SODA '11
- 2011

A near-optimal streaming approximation algorithm for k-means in high-dimensional Euclidean space with sublinear memory and a single pass is shown, under the very natural assumption of data separability.

### Evolutionary clustering

- Computer Science, KDD '06
- 2006

This work presents a generic framework for clustering data over time, and discusses evolutionary versions of two widely-used clustering algorithms within this framework: k-means and agglomerative hierarchical clustering.

### Evaluation of Text Clustering Methods and Their Dataspace Embeddings: An Exploration

- Computer Science, Data Analysis and Rationality in a Complex World
- 2021

A dozen well-known methods and variants are compared in a protocol crossing three contrasting open-access corpora and a few tens of transformed dataspaces, evaluating the results against their supposed “ground-truth” classes by means of four common indices.

### Term-Weighting Approaches in Automatic Text Retrieval

- Computer Science, Inf. Process. Manag.
- 1988

### Billion-Scale Similarity Search with GPUs

- Computer Science, IEEE Transactions on Big Data
- 2021

This paper proposes a novel design for k-selection that enables the construction of a high-accuracy, brute-force, approximate, and compressed-domain search based on product quantization, and applies it in different similarity search scenarios.

### Least squares quantization in PCM

- Computer Science, IEEE Trans. Inf. Theory
- 1982

The corresponding result for any finite number of quanta is derived; that is, necessary conditions are found that the quanta and associated quantization intervals of an optimum finite quantization scheme must satisfy.
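In the scalar case, the two necessary conditions are usually stated as follows: each interval boundary lies midway between adjacent quanta (nearest-neighbor condition), and each quantum is the conditional mean of its interval under the source density p(x) (centroid condition) — the same two conditions that k-means alternates between in its assignment and update steps:

```latex
% Nearest-neighbor condition: boundaries are midpoints of adjacent quanta
x_i = \frac{q_i + q_{i+1}}{2}, \qquad i = 1, \dots, N-1,

% Centroid condition: each quantum is the conditional mean of its interval
q_i = \frac{\int_{x_{i-1}}^{x_i} x \, p(x) \, dx}{\int_{x_{i-1}}^{x_i} p(x) \, dx}, \qquad i = 1, \dots, N.
```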