Clustering with Queries under Semi-Random Noise

@inproceedings{Pia2022ClusteringWQ,
  title={Clustering with Queries under Semi-Random Noise},
  author={Alberto Del Pia and Mingchen Ma and Christos Tzamos},
  booktitle={COLT},
  year={2022}
}
The seminal paper by Mazumdar and Saha [MS17a] introduced an extensive line of work on clustering with noisy queries. Yet, despite significant progress on the problem, the proposed methods depend crucially on knowing the exact error probabilities of the underlying fully-random oracle. In this work, we develop robust learning methods that tolerate general semi-random noise, obtaining qualitatively the same guarantees as the best possible methods in the fully-random model. More specifically…
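To make the query model concrete, here is a minimal illustrative sketch, not the paper's algorithm: a fully-random same-cluster oracle that flips each answer independently with probability at most p, plus the naive repeat-and-majority-vote fix whose query blow-up robust methods try to avoid. All names (NoisyOracle, majority_same_cluster) and parameters are assumptions made for illustration only.

import random

class NoisyOracle:
    """Answers same-cluster(u, v), flipping each answer independently with
    probability p (a fully-random noise model; the paper's semi-random model
    allows adversarial flip rates that are only bounded above by p)."""
    def __init__(self, labels, p, seed=0):
        self.labels = labels          # ground-truth cluster label per point
        self.p = p                    # upper bound on the flip probability
        self.rng = random.Random(seed)

    def query(self, u, v):
        truth = self.labels[u] == self.labels[v]
        flip = self.rng.random() < self.p
        return truth != flip          # XOR: answer is flipped with prob. p

def majority_same_cluster(oracle, u, v, repeats=15):
    """Naive denoising: repeat the query and take a majority vote.
    With repeats = O(log(1/delta)) the majority is correct with probability
    1 - delta when p < 1/2, at the cost of extra queries per pair."""
    yes_votes = sum(oracle.query(u, v) for _ in range(repeats))
    return yes_votes > repeats / 2

if __name__ == "__main__":
    labels = [0, 0, 1, 1, 0]
    oracle = NoisyOracle(labels, p=0.3)
    print(majority_same_cluster(oracle, 0, 1))  # likely True  (same cluster)
    print(majority_same_cluster(oracle, 0, 2))  # likely False (different clusters)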

References

Exact Recovery of Mangled Clusters with Same-Cluster Queries

TLDR
An algorithm is designed that can reconstruct the latent clustering exactly while using only a small number of oracle queries, and can also learn the clusters using low-stretch separators, a class of ellipsoids with additional theoretical guarantees.

Clustering with a faulty oracle

TLDR
This work provides a polynomial-time algorithm that recovers all signs correctly with high probability in the presence of noise, improving on the current state-of-the-art due to Mazumdar and Saha.

Towards a Query-Optimal and Time-Efficient Algorithm for Clustering with a Faulty Oracle

TLDR
A time-efficient algorithm with nearly optimal query complexity is provided for all constant k and any δ in the regime where information-theoretic recovery is possible; the algorithm builds on a connection to the stochastic block model.

Clustering with Noisy Queries

TLDR
This paper provides the first information-theoretic lower bound on the number of queries for clustering with a noisy oracle in both situations, and designs novel algorithms that closely match this query complexity lower bound, even when the number of clusters is unknown.

Correlation Clustering with Same-Cluster Queries Bounded by Optimal Cost

TLDR
This paper presents an efficient algorithm that recovers an exact optimal clustering using at most $2C_{OPT}$ queries and an efficient algorithm that outputs a $2$-approximation, both of which are evaluated against several known correlation clustering algorithms.

Approximate Clustering with Same-Cluster Queries

TLDR
This paper extends the work of Ashtiani et al. to the approximation setting by showing that a few such same-cluster queries enable a polynomial-time $(1+\epsilon)$-approximation algorithm for the k-means problem without any margin assumption on the input dataset.

A note on: No need to choose: How to get both a PTAS and Sublinear Query Complexity

TLDR
This work revisits various PTASs (polynomial-time approximation schemes) for minimization versions of dense problems and shows that they can be implemented with sublinear query complexity, yielding PTASs with efficient query complexity.

Learning to Cluster via Same-Cluster Queries

TLDR
This work proposes two algorithms with provable theoretical guarantees and verifies their effectiveness via an extensive set of experiments on both synthetic and real-world data.

Same-Cluster Querying for Overlapping Clusters

TLDR
This paper provides upper bounds (with algorithms) on the number of queries sufficient in the more practical scenario of overlapping clusters, with algorithmic results under both arbitrary (worst-case) and statistical modeling assumptions.

Classification Under Misspecification: Halfspaces, Generalized Linear Models, and Evolvability

TLDR
A much simpler algorithm is given for distribution-independently learning halfspaces under Massart noise with rate η, and a black-box knowledge distillation procedure is developed to convert an arbitrarily complex classifier to an equally good proper classifier.
...