Corpus ID: 239998683

Learning-Augmented k-means Clustering

@article{Ergun2021LearningAugmentedKC,
  title={Learning-Augmented k-means Clustering},
  author={Jon Ergun and Zhili Feng and Sandeep Silwal and David P. Woodruff and Samson Zhou},
  journal={arXiv preprint arXiv:2110.14094},
  year={2021}
}
k-means clustering is a well-studied problem due to its wide applicability. Unfortunately, there exist strong theoretical limits on the performance of any algorithm for the k-means problem on worst-case inputs. To overcome this barrier, we consider a scenario where “advice” is provided to help perform clustering. Specifically, we consider the k-means problem augmented with a predictor that, given any point, returns its cluster label in an approximately optimal clustering up to some, possibly… 
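The abstract's setup can be illustrated with a minimal sketch: seed k-means from a predictor's (possibly noisy) cluster labels, then refine with a few Lloyd iterations. The function names and the fixed iteration count are hypothetical simplifications; the paper's actual algorithm additionally corrects predictor errors before using the labels.

```python
import numpy as np

def kmeans_cost(X, centers, labels):
    """Sum of squared distances from each point to its assigned center."""
    return float(((X - centers[labels]) ** 2).sum())

def predictor_augmented_kmeans(X, predicted_labels, k, iters=5):
    """Hypothetical sketch of learning-augmented k-means: initialize each
    center as the centroid of a predicted cluster, then run a few Lloyd
    refinement steps. Assumes every predicted label in 0..k-1 is non-empty;
    the paper's method also handles erroneous predictions."""
    centers = np.array([X[predicted_labels == j].mean(axis=0) for j in range(k)])
    labels = predicted_labels
    for _ in range(iters):
        # Assign each point to its nearest current center.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Recompute centers as centroids of the new assignment.
        for j in range(k):
            pts = X[labels == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    return centers, labels
```

With an accurate predictor the initialization already sits near the optimal centers, so the refinement converges quickly; the paper quantifies how much predictor error this tolerates.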


References

Showing 1–10 of 63 references
Performance of Johnson-Lindenstrauss transform for k-means and k-medians clustering
This work shows that the cost of the optimal solution is preserved up to a factor of (1+ε) under a projection onto a random O(log(k/ε)/ε²)-dimensional subspace, and that this bound on the dimension is nearly optimal.
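The dimension-reduction step this reference describes is the standard Johnson-Lindenstrauss construction, which can be sketched with a random Gaussian projection (a generic illustration, not the paper-specific analysis):

```python
import numpy as np

def jl_project(X, m, seed=0):
    """Project the rows of X into m dimensions using a random Gaussian
    matrix scaled by 1/sqrt(m) -- the classic Johnson-Lindenstrauss map.
    Pairwise distances (and hence k-means costs) are preserved up to
    (1 + eps) with high probability when m is large enough."""
    rng = np.random.default_rng(seed)
    G = rng.normal(size=(X.shape[1], m)) / np.sqrt(m)
    return X @ G
```

For clustering, the cited result says m = O(log(k/ε)/ε²) already suffices to preserve the optimal k-means cost, independent of the number of points.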
Streaming k-means on well-clusterable data
A near-optimal streaming approximation algorithm for k-means in high-dimensional Euclidean space with sublinear memory and a single pass is shown, under the very natural assumption of data separability.
Better Guarantees for k-Means and Euclidean k-Median by Primal-Dual Algorithms
A new primal-dual approach is presented that exploits the geometric structure of k-means and satisfies the hard constraint that at most k clusters are selected, without deteriorating the approximation guarantee.
Approximate Clustering with Same-Cluster Queries
This paper extends the work of Ashtiani et al. to the approximation setting by showing that a few such same-cluster queries enable a polynomial-time (1+ε)-approximation algorithm for the k-means problem without any margin assumption on the input dataset.
Clustering with Same-Cluster Queries
A probabilistic polynomial-time (BPP) algorithm is provided for clustering in a setting where the expert conforms to a center-based clustering with a notion of margin, and a lower bound on the number of queries needed to have a computationally efficient clustering algorithm in this setting is proved.
Clustering under approximation stability
It is shown that for any constant c > 1, (c,ε)-approximation-stability of k-median or k-means objectives can be used to efficiently produce a clustering of error O(ε) with respect to the target clustering, as can stability of the min-sum objective if the target clusters are sufficiently large.
Robust Communication-Optimal Distributed Clustering Algorithms
This work gives a matching $\Omega(sk+z)$ lower bound on the communication required both for approximating the optimal k-median or k-means objective value up to any constant, and for returning a clustering that is close to the target clustering in Hamming distance.
Spreading vectors for similarity search
This work designs and trains a neural net whose last layer forms a fixed parameter-free quantizer, such as pre-defined points on a hypersphere, and proposes a new regularizer derived from the Kozachenko–Leonenko differential entropy estimator to enforce uniformity, combining it with a locality-aware triplet loss.
Improved Clustering with Augmented k-means
Identifying a set of homogeneous clusters in a heterogeneous dataset is one of the most important classes of problems in statistical modeling. In the realm of unsupervised partitional clustering, …
Clustering with Noisy Queries
This paper provides the first information-theoretic lower bound on the number of queries for clustering with a noisy oracle in both situations, and designs novel algorithms that closely match this query-complexity lower bound, even when the number of clusters is unknown.