• Corpus ID: 24906133

Clustering with Noisy Queries

@article{Mazumdar2017ClusteringWN,
  title={Clustering with Noisy Queries},
  author={Arya Mazumdar and Barna Saha},
  journal={ArXiv},
  year={2017},
  volume={abs/1706.07510}
}
In this paper, we initiate a rigorous theoretical study of clustering with noisy queries (or a faulty oracle). Given a set of $n$ elements, our goal is to recover the true clustering by asking the minimum number of pairwise queries to an oracle. The oracle can answer queries of the form "do elements $u$ and $v$ belong to the same cluster?". The queries can be asked interactively (adaptive queries) or non-adaptively up front, but each answer can be erroneous with probability $p$. In this paper, we… 
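The query model described in the abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the paper's algorithm: the class `FaultyOracle`, the helper `query_all_pairs`, the ground-truth `labels` list, and the choice to cache answers (so that repeating a query returns the same, possibly wrong, answer) are all hypothetical and introduced here for illustration.

```python
import random


class FaultyOracle:
    """Answers same-cluster queries, flipping each answer with probability p."""

    def __init__(self, labels, p, seed=0):
        self.labels = labels      # hypothetical ground truth: labels[u] is u's cluster id
        self.p = p                # probability that a single answer is wrong
        self.rng = random.Random(seed)
        self.cache = {}           # assumption: a repeated query returns the same answer
        self.num_queries = 0      # number of distinct pairs asked so far

    def same_cluster(self, u, v):
        key = (min(u, v), max(u, v))
        if key not in self.cache:
            self.num_queries += 1
            truth = self.labels[u] == self.labels[v]
            flip = self.rng.random() < self.p
            self.cache[key] = truth != flip   # XOR: flip the true answer when flip is set
        return self.cache[key]


def query_all_pairs(oracle, n):
    """Non-adaptive strategy: ask about every pair up front."""
    return {(u, v): oracle.same_cluster(u, v)
            for u in range(n) for v in range(u + 1, n)}
```

Asking all pairs costs $n(n-1)/2$ queries, which is the trivial baseline; an adaptive strategy would instead choose each next pair based on the answers received so far.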

Query Complexity of Clustering with Side Information

The dramatic power of side information (a similarity matrix) in reducing the query complexity of clustering is shown, along with an intriguing connection to popular community detection models such as the stochastic block model; the work significantly generalizes these models and opens up many avenues for interesting future research.

Top-m Clustering with a Noisy Oracle

The goal is to identify the top-$m$ clusters in terms of size using the noisy answers from the oracle; the paper provides an upper bound that is a function of the number of recovered clusters $m$ and the sizes of the top clusters.

Same-Cluster Querying for Overlapping Clusters

This paper provides upper bounds (with algorithms) on the sufficient number of queries for the more practical scenario of overlapping clusters, with algorithmic results under both arbitrary (worst-case) and statistical modeling assumptions.

Optimal Clustering with Noisy Queries via Multi-Armed Bandit

An interesting connection between the problem and multi-armed bandit might provide useful insights for other similar problems, and a new polynomial time algorithm with $O\left(\frac{n(k+\log n)}{\delta^2} + \mathrm{poly}\left(k, \frac{1}{\delta}, \log n\right)\right)$ queries is proposed.

Correlation Clustering with Same-Cluster Queries Bounded by Optimal Cost

This paper presents an efficient algorithm that recovers an exact optimal clustering using at most $2C_{OPT}$ queries and an efficient algorithm that outputs a $2$-approximation using at least two queries, both of which are efficient against several known correlation clustering algorithms.

Same-Cluster Queries Bounded by Optimal Cost

This paper presents two efficient algorithms for correlation clustering whose error and query bounds are parameterized by $C_{OPT}$ rather than by the number of clusters, and shows that under a plausible complexity assumption, there does not exist any polynomial time algorithm that has an approximation ratio better than $1 + \alpha$ for an absolute constant $\alpha > 0$ with $o(C_{OPT})$ queries.

Clustering with a faulty oracle

This work provides a polynomial time algorithm that recovers all signs correctly with high probability in the presence of noise with queries, improving on the current state-of-the-art due to Mazumdar and Saha.

Towards a Query-Optimal and Time-Efficient Algorithm for Clustering with a Faulty Oracle

A time-efficient algorithm is provided with nearly-optimal query complexity for all constant k and any δ in the regime when information-theoretic recovery is possible and is built on a connection to the stochastic block model.

On Margin-Based Cluster Recovery with Oracle Queries

We study an active cluster recovery problem where, given a set of n points and an oracle answering queries like “are these two points in the same cluster?”, the task is to recover exactly all

Optimal Clustering in Stable Instances Using Combinations of Exact and Noisy Ordinal Queries

This work studies clustering algorithms that operate with ordinal or comparison-based queries (operations) and provides several variants of these algorithms using ordinal operations and, in particular, non-trivial trade-offs between the number of high-cost and low-cost operations that are used.

References

Query Complexity of Clustering with Side Information

The dramatic power of side information (a similarity matrix) in reducing the query complexity of clustering is shown, along with an intriguing connection to popular community detection models such as the stochastic block model; the work significantly generalizes these models and opens up many avenues for interesting future research.

Clustering with Same-Cluster Queries

A probabilistic polynomial-time (BPP) algorithm is provided for clustering in a setting where the expert conforms to a center-based clustering with a notion of margin, and a lower bound on the number of queries needed to have a computationally efficient clustering algorithm in this setting is proved.

Crowdsourced Clustering: Querying Edges vs Triangles

Through several simulations and experiments on two real data sets on Amazon Mechanical Turk, it is empirically demonstrated that, for a fixed budget, triangle queries uniformly outperform edge queries.

Correlation clustering with noisy input

This work uses the natural semi-definite programming relaxation followed by an interesting rounding phase, and uses SDP duality and spectral properties of random matrices to analyse correlation clustering, a type of clustering whose input is a basic form of data: similarity/dissimilarity information.

Sorting from Noisy Information

This paper presents polynomial time algorithms for solving noisy comparisons and noisy orders and shows that for both models the maximum likelihood solution $\pi^{\ast}$ is close to the original permutation $\pi$.

Clustering Via Crowdsourcing

A major contribution of this paper is to reduce the query complexity to linear or even sublinear in $n$ when mild side information is provided by a machine, even in the presence of crowd errors which are not correctable via resampling.

Crowdsourcing Algorithms for Entity Resolution

This paper considers the problem of designing optimal strategies for asking questions to humans that minimize the expected number of questions asked, and analyzes several strategies that were claimed to be "optimal" for this problem in a recent work but can perform arbitrarily badly in theory.

Correlation Clustering with Noisy Partial Information

A semi-random model for the Correlation Clustering problem on arbitrary graphs $G$ is proposed and two approximation algorithms for Correlation Clustering instances from this model are given.

Fault-Tolerant Entity Resolution with the Crowd

This paper establishes how to deduce a consistent ER solution from noisy worker answers as part of the data interpretation problem, and focuses on the next-crowdsource problem which is to find the next task that maximizes the information gain of the ER result for the minimal additional cost.

Aggregating crowdsourced binary ratings

This paper obtains bounds on the error rate of the algorithm and shows it is governed by the expansion of the graph, and demonstrates, using several synthetic and real datasets, that the algorithm outperforms the state of the art.