Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering

@article{Feldman2013TurningBD,
  title={Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering},
  author={Dan Feldman and Melanie Schmidt and Christian Sohler},
  journal={ArXiv},
  year={2013},
  volume={abs/1807.04518}
}
We prove that the sum of the squared Euclidean distances from the n rows of an n × d matrix A to any compact set that is spanned by k vectors in ℝ^d can be approximated up to a (1 + ε)-factor, for arbitrarily small ε > 0, using the O(k/ε²)-rank approximation of A and a constant. This implies, for example, that the optimal k-means clustering of the rows of A is (1 + ε)-approximated by an optimal k-means clustering of their projection onto the O(k/ε²) first right singular vectors (principal components) of A. A (j, k)-coreset for projective clustering is a small set of points that yields a (1 + ε)-approximation to the sum of squared…
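As a rough illustration of the claim above (a sketch under illustrative assumptions, not the paper's coreset construction), the following Python snippet projects the rows of a synthetic matrix A onto its first m = ⌈k/ε²⌉ right singular vectors, runs k-means on the projected rows, and adds back the constant ∥A∥_F² − ∥A_proj∥_F²; the synthetic data, the choice of m, and the use of scikit-learn's KMeans are assumptions made only for this demonstration.

# Illustrative sketch: the k-means cost on the projection onto the first O(k/eps^2)
# right singular vectors, plus the constant mass outside that subspace, should be
# close to the k-means cost on the original rows of A.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n, d, k, eps = 2000, 200, 5, 0.5
m = int(np.ceil(k / eps ** 2))            # O(k/eps^2) projection dimension (illustrative)

# synthetic input: k well-separated Gaussian clusters in R^d
centers = 10.0 * rng.normal(size=(k, d))
A = centers[rng.integers(k, size=n)] + rng.normal(size=(n, d))

# first m right singular vectors (principal components) of A
_, _, Vt = np.linalg.svd(A, full_matrices=False)
A_proj = A @ Vt[:m].T                     # n x m "tiny" representation of the rows

def kmeans_cost(X):
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_

residual = (A ** 2).sum() - (A_proj ** 2).sum()        # constant mass outside the subspace
print(kmeans_cost(A), kmeans_cost(A_proj) + residual)  # close up to a (1 + eps)-type factor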

Deterministic Coresets for k-Means of Big Sparse Data †

The first such coreset whose size is independent of d is presented; it is also the first deterministic coreset construction whose size is not exponential in d.

Tight Sensitivity Bounds For Smaller Coresets

Experimental results on real-world datasets, including the English Wikipedia document-term matrix, show that the provided bounds yield significantly smaller, data-dependent coresets in practice as well.

New Coresets for Projective Clustering and Applications

This paper proposes the first algorithm that returns an L∞ coreset of size polynomial in d, gives the first strong coreset construction for general M-estimator regression, and provides experimental results on real-world datasets showing the efficacy of the approach.

Strong Coresets for k-Median and Subspace Approximation: Goodbye Dimension

  • C. Sohler, David P. Woodruff
  • Mathematics, Computer Science
    2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS)
  • 2018
The first strong coresets for the k-median and subspace approximation problems with the sum-of-distances objective function are obtained, with a number of weighted points that is independent of both n and d; namely, their coresets have size poly(k/ε).

Performance of Johnson-Lindenstrauss transform for k-means and k-medians clustering

This work shows that the cost of the optimal solution is preserved up to a factor of (1 + ε) under a projection onto a random O(log(k/ε)/ε²)-dimensional subspace, and that this bound on the dimension is nearly optimal.
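A minimal sketch of the statement above, under assumptions made only for illustration (synthetic Gaussian clusters, a dense Gaussian projection matrix, scikit-learn's KMeans): project the points onto a random O(log(k/ε)/ε²)-dimensional subspace and compare the resulting k-means costs.

# Illustration only: a random Gaussian projection to t = O(log(k/eps)/eps^2)
# dimensions approximately preserves the optimal k-means cost.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
n, d, k, eps = 3000, 500, 10, 0.5
t = int(np.ceil(np.log(k / eps) / eps ** 2))   # illustrative target dimension

X = 5.0 * rng.normal(size=(k, d))[rng.integers(k, size=n)] + rng.normal(size=(n, d))
G = rng.normal(size=(d, t)) / np.sqrt(t)       # scaled Gaussian JL matrix
Y = X @ G                                      # projected points in R^t

cost = lambda Z: KMeans(n_clusters=k, n_init=10, random_state=0).fit(Z).inertia_
print(cost(X), cost(Y))                        # should agree up to roughly a (1 + eps) factor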

Near-optimal Coresets for Robust Clustering

This work constructs coresets for robust clustering in ℝ^d with m outliers by adapting a recent framework to the outlier setting, overcoming the new challenge that the terms participating in the cost, in particular which m outlier points are excluded, depend on the center set C.

Dimensionality Reduction for k-Means Clustering and Low Rank Approximation

This work shows how to approximate a data matrix A with a much smaller sketch Ã that can be used to solve a general class of constrained k-rank approximation problems to within (1 + ε) error, and gives a simple alternative to known algorithms that has applications in the streaming setting.

Dimensionality Reduction for the Sum-of-Distances Metric

A dimensionality reduction procedure to approximate the sum of distances of a given set of n points in ℝ^d to any “shape” that lies in a k-dimensional subspace of ℝ^d, which can be used to obtain poly(k/ε)-size coresets for the k-median and (k, 1)-subspace approximation problems in polynomial time.

Towards optimal lower bounds for k-median and k-means coresets

This paper achieves tight bounds for k-median in Euclidean spaces up to a factor of O(ε⁻¹ polylog(k/ε)), and is the first construction breaking through the ε⁻² · min(d, ε⁻²) barrier inherent in all previous coreset constructions.

Coresets for clustering in Euclidean spaces: importance sampling is nearly optimal

A unified two-stage importance sampling framework that constructs an ε-coreset for the (k,z)-clustering problem and relies on a new dimensionality reduction technique that connects two well-known shape fitting problems: subspace approximation and clustering, and may be of independent interest.
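The following is a hedged sketch of the kind of importance (sensitivity) sampling that such frameworks build on, not the two-stage construction of the paper itself; the sensitivity upper bound, the coreset size m, and the synthetic data are assumptions chosen only for illustration.

# Sensitivity sampling sketch for a k-means coreset (illustrative, not the paper's method).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
n, d, k, m = 20000, 20, 5, 500               # m = coreset size (illustrative)
P = 8.0 * rng.normal(size=(k, d))[rng.integers(k, size=n)] + rng.normal(size=(n, d))

# rough (bicriteria-style) solution used to bound the sensitivities
km = KMeans(n_clusters=k, n_init=5, random_state=0).fit(P)
d2 = ((P - km.cluster_centers_[km.labels_]) ** 2).sum(axis=1)
cluster_sizes = np.bincount(km.labels_, minlength=k)[km.labels_]
s = d2 / d2.sum() + 1.0 / cluster_sizes      # standard-style sensitivity upper bound

q = s / s.sum()                              # sampling distribution
idx = rng.choice(n, size=m, p=q)
coreset, weights = P[idx], 1.0 / (m * q[idx])

# sanity check: weighted coreset cost vs. true cost for an arbitrary center set C
C = 8.0 * rng.normal(size=(k, d))
full = ((P[:, None, :] - C[None]) ** 2).sum(-1).min(1).sum()
core = (weights * ((coreset[:, None, :] - C[None]) ** 2).sum(-1).min(1)).sum()
print(full, core)                            # should agree up to a small relative error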
...

References

Showing 1-10 of 97 references

Dimensionality Reduction for k-Means Clustering and Low Rank Approximation

This work shows how to approximate a data matrix A with a much smaller sketch Ã that can be used to solve a general class of constrained k-rank approximation problems to within (1 + ε) error, and gives a simple alternative to known algorithms that has applications in the streaming setting.

Improved Approximation Algorithms for Large Matrices via Random Projections

  • Tamás Sarlós
  • Computer Science
    2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06)
  • 2006
The key idea is that low-dimensional embeddings can be used to eliminate data dependence and provide more versatile, linear-time, pass-efficient matrix computation.

Random Projections for $k$-means Clustering

It is proved that any set of n points in d dimensions can be projected into t = Ω(k/ε²) dimensions, for any ε ∈ (0, 1/3), in O(nd⌈ε⁻²k/log(d)⌉) time, such that with constant probability the optimal k-partition of the point set is preserved within a factor of 2 + ε.

Low-Rank Approximation and Regression in Input Sparsity Time

We design a new distribution over m × n matrices S so that, for any fixed n × d matrix A of rank r, with probability at least 9/10, ∥SAx∥₂ = (1 ± ε)∥Ax∥₂ simultaneously for all x ∈ ℝ^d. Here, m is …
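The construction referred to here is a sparse, input-sparsity-time sketch; the toy check below instead uses a dense Gaussian sketching matrix (an assumption made purely to keep the example short) to illustrate the subspace-embedding guarantee being stated.

# Illustration of an oblivious subspace embedding: ||S A x||_2 should be within
# (1 +/- eps) of ||A x||_2 for all x (checked here on a few random directions).
import numpy as np

rng = np.random.default_rng(2)
n, d, eps = 5000, 10, 0.25
m = int(np.ceil(d / eps ** 2))               # sketch size ~ d/eps^2 (illustrative)

A = rng.normal(size=(n, d))
S = rng.normal(size=(m, n)) / np.sqrt(m)     # dense Gaussian sketching matrix

for _ in range(5):
    x = rng.normal(size=d)
    ratio = np.linalg.norm(S @ (A @ x)) / np.linalg.norm(A @ x)
    print(round(ratio, 3))                   # should lie in [1 - eps, 1 + eps]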

On Approximate Geometric K-clustering

  • J. Matoušek
  • Computer Science, Mathematics
  • 1999
A deterministic algorithm is presented that finds a 2-clustering with cost no worse than (1 + ε) times the minimum cost in time O(n log n); the constant of proportionality depends polynomially on ε.

A near-linear algorithm for projective clustering integer points

The main result is a randomized algorithm that for any ε > 0 runs in time O(mn polylog(mn)) and outputs a solution that with high probability is within (1 + ε) of the optimal solution.

Clustering Large Graphs via the Singular Value Decomposition

This paper considers the problem of partitioning a set of m points in n-dimensional Euclidean space into k clusters, and studies a continuous relaxation of this discrete problem: find the k-dimensional subspace V that minimizes the sum of squared distances of the m points to V. It is argued that the relaxation provides a generalized clustering which is useful in its own right.
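A small numeric check of the relaxation just described, under assumptions made only for illustration (random data, a random comparison subspace): by the Eckart-Young theorem, the subspace spanned by the top-k right singular vectors minimizes the sum of squared distances, so its residual is never larger than that of any other k-dimensional subspace.

# The SVD solves the continuous relaxation: its rank-k residual lower-bounds the
# residual of any other k-dimensional subspace (and hence the discrete k-means cost).
import numpy as np

rng = np.random.default_rng(4)
m, n, k = 1000, 50, 4
A = rng.normal(size=(m, n))

_, s, _ = np.linalg.svd(A, full_matrices=False)
svd_cost = (s[k:] ** 2).sum()                 # residual of the best k-dimensional subspace

Q, _ = np.linalg.qr(rng.normal(size=(n, k)))  # an arbitrary (random) k-dimensional subspace
rand_cost = ((A - A @ Q @ Q.T) ** 2).sum()
print(svd_cost, rand_cost)                    # svd_cost <= rand_cost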

A unified framework for approximating and clustering data

A unified framework is given for constructing coresets and approximate clustering for general sets of functions, and it is shown how to generalize the results of the framework to squared distances, distances to the q-th power, and deterministic constructions.

Approximate clustering via core-sets

It is shown that for several clustering problems one can extract a small set of points (a core-set) such that approximate clustering can be performed efficiently using only those points, a substantial improvement over what was previously known.

Coresets and approximate clustering for Bregman divergences

The first coreset construction for this problem is given for a large subclass of Bregman divergences, including important dissimilarity measures such as the Kullback-Leibler divergence and the Itakura-Saito divergence.
...