Efficient sparse spherical k-means for document clustering

@inproceedings{Knittel2021EfficientSS,
  title={Efficient sparse spherical k-means for document clustering},
  author={Johannes Knittel and Steffen Koch and Thomas Ertl},
  booktitle={Proceedings of the 21st ACM Symposium on Document Engineering},
  year={2021}
}
Spherical k-Means is frequently used to cluster document collections because it performs reasonably well in many settings and is computationally efficient. However, the time complexity increases linearly with the number of clusters k, which limits the suitability of the algorithm for larger values of k depending on the size of the collection. Optimizations targeted at the Euclidean k-Means algorithm largely do not apply because the cosine distance is not a metric. We therefore propose an… 
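For orientation, the following is a minimal sketch of the plain (unoptimized) spherical k-means baseline that the abstract refers to, not the optimized method the paper proposes. It assumes documents are given as sparse term-weight dictionaries; the function names `normalize`, `dot`, and `spherical_kmeans`, as well as the toy documents, are illustrative choices and do not come from the paper. Each iteration computes one similarity per document-centroid pair, which is where the linear dependence on k comes from.

```python
import math
import random


def normalize(vec):
    """Scale a sparse vector (dict term -> weight) to unit L2 norm."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm > 0 else dict(vec)


def dot(a, b):
    """Dot product of two sparse vectors; iterates over the smaller one."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def spherical_kmeans(docs, k, iterations=20, seed=0):
    """Plain spherical k-means on sparse vectors (hypothetical helper).

    Returns (assignments, centroids). Every iteration evaluates
    len(docs) * k similarities.
    """
    rng = random.Random(seed)
    docs = [normalize(d) for d in docs]
    # Initialize centroids with k randomly chosen documents.
    centroids = [dict(d) for d in rng.sample(docs, k)]
    assignments = [-1] * len(docs)

    for _ in range(iterations):
        # Assignment step: on unit vectors, the dot product is the cosine
        # similarity, so pick the centroid with the largest dot product.
        new_assignments = [
            max(range(k), key=lambda c: dot(d, centroids[c])) for d in docs
        ]
        if new_assignments == assignments:
            break  # no document changed its cluster
        assignments = new_assignments

        # Update step: sum the members of each cluster, then re-normalize.
        sums = [{} for _ in range(k)]
        for d, c in zip(docs, assignments):
            for t, w in d.items():
                sums[c][t] = sums[c].get(t, 0.0) + w
        # An empty cluster keeps its previous centroid.
        centroids = [normalize(s) if s else centroids[c] for c, s in enumerate(sums)]

    return assignments, centroids


# Toy usage: two documents about pets, two about markets (made-up data).
docs = [
    {"cat": 2.0, "pet": 1.0},
    {"cat": 1.0, "pet": 2.0},
    {"stock": 3.0, "market": 1.0},
    {"stock": 1.0, "market": 2.0},
]
assignments, centroids = spherical_kmeans(docs, k=2)
```

Storing vectors as dictionaries keeps the dot product proportional to the number of non-zero terms in the shorter vector, which is the property sparse document representations rely on; a production implementation would more likely use a sparse matrix library.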

Document clustering

The main clustering algorithms used for text data are critically analyzed, considering prototype-based, graph-based, hierarchical, and model-based approaches.

References

Concept Decompositions for Large Sparse Text Data Using Clustering

The concept vectors produced by the spherical k-means algorithm constitute a powerful sparse and localized “basis” for text data sets: they are localized in the word space, sparse, and tend towards orthonormality.

Using the Triangle Inequality to Accelerate k-Means

It is shown how to accelerate the k-means algorithm dramatically while still always computing exactly the same result as the standard algorithm; the accelerated algorithm is effective for datasets with up to 1000 dimensions and becomes more and more effective as the number k of clusters increases.

Streaming k-means approximation

A clustering algorithm that approximately optimizes the k-means objective, in the one-pass streaming setting, which is applicable to unsupervised learning on massive data sets, or resource-constrained devices.

Exact and Approximate Maximum Inner Product Search with LEMP

An extensive experimental study provides insight into the performance of many state-of-the-art techniques (including LEMP) on multiple real-world datasets and finds that LEMP was often significantly faster or more accurate than alternative methods.

Streaming k-means on well-clusterable data

A near-optimal streaming approximation algorithm for k-means in high-dimensional Euclidean space with sublinear memory and a single pass is shown, under the very natural assumption of data separability.

Evolutionary clustering

This work presents a generic framework for clustering data over time, and discusses evolutionary versions of two widely-used clustering algorithms within this framework: k-means and agglomerative hierarchical clustering.

Evaluation of Text Clustering Methods and Their Dataspace Embeddings: An Exploration

  • A. Lelu, M. Cadot
  • Computer Science
    Data Analysis and Rationality in a Complex World
  • 2021
A dozen well-known methods and variants are compared to their supposed "ground-truth" classes by means of four usual indices, in a protocol crossing three contrasted open-access corpora with a few tens of transformed dataspaces.

Term-Weighting Approaches in Automatic Text Retrieval

Billion-Scale Similarity Search with GPUs

This paper proposes a novel design for k-selection that enables the construction of high-accuracy brute-force, approximate, and compressed-domain search based on product quantization, and applies it in different similarity search scenarios.

Least squares quantization in PCM

The corresponding result for any finite number of quanta is derived; that is, necessary conditions are found that the quanta and associated quantization intervals of an optimum finite quantization scheme must satisfy.