Corpus ID: 239998683

Learning-Augmented k-means Clustering

Jon Ergun, Zhili Feng, Sandeep Silwal, David P. Woodruff, Samson Zhou
k-means clustering is a well-studied problem due to its wide applicability. Unfortunately, there exist strong theoretical limits on the performance of any algorithm for the k-means problem on worst-case inputs. To overcome this barrier, we consider a scenario where “advice” is provided to help perform clustering. Specifically, we consider the k-means problem augmented with a predictor that, given any point, returns its cluster label in an approximately optimal clustering up to some, possibly… 
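The predictor-augmented setting described above can be sketched minimally as follows: group points by their predicted labels and estimate each center robustly, so that a bounded fraction of mislabeled points does not ruin the result. This is an illustrative sketch, not the paper's algorithm; the coordinate-wise median and the `predictor` callable are assumptions chosen here for robustness to label noise.

```python
import numpy as np

def kmeans_with_predictor(points, predictor, k):
    """Cluster `points` given a (possibly noisy) label predictor.

    Illustrative sketch of the learning-augmented setting: the
    predictor returns, for each point, its cluster label in an
    approximately optimal clustering. We group points by predicted
    label and take the coordinate-wise median of each group, which
    tolerates a fraction of mislabeled points better than the mean.
    """
    labels = np.array([predictor(p) for p in points])
    centers = []
    for c in range(k):
        members = points[labels == c]
        if len(members) == 0:
            # Fall back to an arbitrary point if a predicted cluster is empty.
            members = points[:1]
        centers.append(np.median(members, axis=0))
    return np.array(centers), labels
```

With a reasonably accurate predictor this recovers good centers in a single pass over the data, with no iterative Lloyd-style refinement.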

Performance of Johnson-Lindenstrauss transform for k-means and k-medians clustering
This work shows that the cost of the optimal solution is preserved up to a factor of (1+ε) under a projection onto a random O(log(k/ε)/ε²)-dimensional subspace, and that the bound on the dimension is nearly optimal.
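The dimension-reduction result summarized above can be sketched with a random Gaussian projection; the target dimension follows the O(log(k/ε)/ε²) form, with the leading constant 4 chosen here purely for illustration (it is not from the paper).

```python
import numpy as np

def jl_project(X, k, eps, rng=None):
    """Project points onto a random low-dimensional subspace.

    Sketch of the Johnson-Lindenstrauss approach: a random Gaussian
    projection to d = O(log(k/eps)/eps^2) dimensions preserves the
    k-means/k-medians cost up to a (1+eps) factor. The constant 4
    below is an illustrative choice.
    """
    rng = np.random.default_rng(rng)
    n, dim = X.shape
    d = int(np.ceil(4 * np.log(k / eps) / eps**2))
    d = min(d, dim)  # never project up
    # Scale by 1/sqrt(d) so squared distances are preserved in expectation.
    G = rng.normal(size=(dim, d)) / np.sqrt(d)
    return X @ G
```

Notably, the target dimension depends on k and ε but not on the number of points or the ambient dimension, which is what makes the projection useful as a preprocessing step for clustering.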
Streaming k-means on well-clusterable data
A near-optimal streaming approximation algorithm for k-means in high-dimensional Euclidean space with sublinear memory and a single pass is shown, under the very natural assumption of data separability.
Better Guarantees for k-Means and Euclidean k-Median by Primal-Dual Algorithms
A new primal-dual approach is presented that exploits the geometric structure of k-means and satisfies the hard constraint that at most k clusters are selected, without deteriorating the approximation guarantee.
Approximate Clustering with Same-Cluster Queries
This paper extends the work of Ashtiani et al. to the approximation setting by showing that a few such same-cluster queries enable a polynomial-time (1+ε)-approximation algorithm for the k-means problem without any margin assumption on the input dataset.
Clustering with Same-Cluster Queries
A probabilistic polynomial-time (BPP) algorithm is provided for clustering in a setting where the expert conforms to a center-based clustering with a notion of margin, and a lower bound on the number of queries needed to have a computationally efficient clustering algorithm in this setting is proved.
Clustering under approximation stability
It is shown that for any constant c > 1, (c,ε)-approximation-stability of k-median or k-means objectives can be used to efficiently produce a clustering of error O(ε) with respect to the target clustering, as can stability of the min-sum objective if the target clusters are sufficiently large.
Robust Communication-Optimal Distributed Clustering Algorithms
This work gives a matching Ω(sk+z) lower bound on the communication required both for approximating the optimal k-median or k-means objective value up to any constant, and for returning a clustering that is close to the target clustering in Hamming distance.
Spreading vectors for similarity search
This work designs and trains a neural net whose last layer forms a fixed parameter-free quantizer, such as pre-defined points of a hypersphere, and proposes a new regularizer derived from the Kozachenko-Leonenko differential entropy estimator to enforce uniformity, combining it with a locality-aware triplet loss.
Improved Clustering with Augmented k-means
Identifying a set of homogeneous clusters in a heterogeneous dataset is one of the most important classes of problems in statistical modeling. In the realm of unsupervised partitional clustering, …
Clustering with Noisy Queries
This paper provides the first information-theoretic lower bound on the number of queries for clustering with a noisy oracle in both situations, and designs novel algorithms that closely match this query complexity lower bound, even when the number of clusters is unknown.