How to Design Robust Algorithms using Noisy Comparison Oracle

Raghavendra Addanki, Sainyam Galhotra, Barna Saha. Proc. VLDB Endow.
Metric-based comparison operations such as finding the maximum, nearest neighbor, and farthest neighbor are fundamental to studying various clustering techniques such as k-center clustering and agglomerative hierarchical clustering. These techniques crucially rely on accurate estimation of pairwise distances between records. However, computing exact features of the records and their pairwise distances is often challenging, and sometimes not possible. We circumvent this challenge by leveraging weak…


Partitioned K-nearest neighbor local depth for scalable comparison-based learning

Partitioned Nearest Neighbors Local Depth (PaNNLD) is introduced, a computationally tractable variant of PaLD that leverages the K-nearest neighbors digraph on S, and the probability of randomization-induced error δ in PaNNLD is shown to be no more than 2e−δ K.

A Revenue Function for Comparison-Based Hierarchical Clustering

This paper proposes a new revenue function that allows one to measure the goodness of dendrograms using only comparisons and shows that this function is closely related to Dasgupta's cost for hierarchical clustering that uses pairwise similarities.

Approximation Algorithms for Large Scale Data Analysis

New facets of fast algorithm design for large scale data analysis are covered, with emphasis on the role of approximation algorithms in achieving better polynomial time/query complexity.

Hierarchical Entity Resolution using an Oracle

HierER is developed, a querying strategy that uses record pair similarities to minimize the number of oracle queries while maximizing the identified hierarchical structure; it is shown theoretically and empirically that HierER is effective under different similarity noise models and can scale up to million-size datasets.

Optimal Clustering in Stable Instances Using Combinations of Exact and Noisy Ordinal Queries

This work studies clustering algorithms that operate with ordinal or comparison-based queries (operations) and provides several variants of these algorithms using ordinal operations and, in particular, non-trivial trade-offs between the number of high-cost and low-cost operations that are used.

Greedy $k$-Center from Noisy Distance Samples

Active algorithms are proposed, based on ideas such as UCB and Thompson sampling developed in the closely related Multi-Armed Bandit problem, which adaptively decide which queries to send to the oracle and are able to solve the canonical $k$-center problem within an approximation ratio of two with high probability.
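
For orientation, the approximation ratio of two mentioned above is the guarantee of the classical greedy (farthest-first) procedure when exact distances are available. The following is a minimal sketch of that noise-free baseline only, not the active algorithm the paper proposes; the function name `greedy_k_center`, the use of Euclidean distance, and the choice of index 0 as the first center are assumptions made for illustration.

```python
import math

def greedy_k_center(points, k):
    """Farthest-first traversal with exact Euclidean distances (assumed)."""
    centers = [0]  # arbitrary first center; index 0 is an illustrative choice
    # dist[i] = distance from point i to its closest chosen center so far
    dist = [math.dist(points[0], p) for p in points]
    while len(centers) < k:
        # Pick the point that is currently farthest from every chosen center.
        nxt = max(range(len(points)), key=lambda i: dist[i])
        centers.append(nxt)
        for i, p in enumerate(points):
            dist[i] = min(dist[i], math.dist(points[nxt], p))
    # The clustering radius is the largest remaining point-to-center distance.
    return centers, max(dist)

centers, radius = greedy_k_center([(0, 0), (1, 0), (10, 0), (11, 0)], 2)
```

In a noisy-sample setting, each exact `math.dist` call would have to be replaced by an estimate built from repeated queries, which is where adaptive query allocation becomes relevant.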



Learning Nearest Neighbor Graphs from Noisy Distance Samples

This paper proposes an active algorithm to find the nearest neighbor graph of a dataset of n items and demonstrates efficiency of the method empirically and theoretically, needing only $O(n \log(n) \Delta^{-2})$ queries in favorable settings, where $\Delta^{-2}$ accounts for the effect of noise.

Comparison Based Learning from Weak Oracles

This paper introduces a new weak oracle model, where a non-malicious user responds to a pairwise comparison query only when she is quite sure about the answer, and proposes two algorithms which provably locate the target object in a number of comparisons close to the entropy of the target distribution.

Clustering with a faulty oracle

This work provides a polynomial time algorithm that recovers all signs correctly with high probability in the presence of noise with queries, improving on the current state-of-the-art due to Mazumdar and Saha.

Clustering with Noisy Queries

This paper provides the first information theoretic lower bound on the number of queries for clustering with a noisy oracle in both situations, and designs novel algorithms that closely match this query complexity lower bound, even when the number of clusters is unknown.

Semi-Supervised Active Clustering with Weak Oracles

The influence of allowing "not-sure" answers from a weak oracle is studied, algorithms are proposed to handle the resulting uncertainties efficiently, and the approach is shown to perform effectively in overcoming them.

Top-k and Clustering with Noisy Comparisons

Efficient algorithms that are guaranteed to achieve correct results with high probability are given, the cost of these algorithms is analyzed in terms of the total number of comparisons, and it is shown that they are essentially the best possible.
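
As a rough illustration of why comparison counts matter in this setting, the sketch below finds a maximum (the top-1 case) by repeating every comparison and taking a majority vote. It is a simple baseline, not the algorithms of the paper, and it assumes each query is flipped independently with probability p < 1/2; under persistent or adversarial noise, repeating a query would not help. All names here (`make_noisy_oracle`, `majority_less_than`, `noisy_max`) are hypothetical.

```python
import random

def make_noisy_oracle(values, p, rng):
    """Return a comparison oracle that lies independently with probability p."""
    def less_than(i, j):
        truth = values[i] < values[j]
        return truth if rng.random() >= p else not truth
    return less_than

def majority_less_than(oracle, i, j, repeats):
    """Ask the same comparison `repeats` times and return the majority answer."""
    votes = sum(oracle(i, j) for _ in range(repeats))
    return 2 * votes > repeats

def noisy_max(n, oracle, repeats):
    """Linear scan for the maximum, using majority-voted comparisons."""
    best = 0
    for i in range(1, n):
        if majority_less_than(oracle, best, i, repeats):
            best = i
    return best

values = [3, 9, 1, 7]
oracle = make_noisy_oracle(values, p=0.2, rng=random.Random(0))
winner = noisy_max(len(values), oracle, repeats=31)  # uses (n - 1) * repeats queries
```

The scan spends `(n - 1) * repeats` queries in total; an odd `repeats` avoids ties in the vote.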

Clustering with Same-Cluster Queries

A probabilistic polynomial-time (BPP) algorithm is provided for clustering in a setting where the expert conforms to a center-based clustering with a notion of margin, and a lower bound on the number of queries needed to have a computationally efficient clustering algorithm in this setting is proved.

Approximate Clustering with Same-Cluster Queries

This paper extends the work of Ashtiani et al. to the approximation setting by showing that a few such same-cluster queries enable one to get a polynomial-time $(1+\epsilon)$-approximation algorithm for the k-means problem without any margin assumption on the input dataset.

Query Complexity of Clustering with Side Information

The dramatic power of side information, i.e., a similarity matrix, in reducing the query complexity of clustering is shown; the work reveals an intriguing connection to popular community detection models such as the stochastic block model, significantly generalizes them, and opens up many avenues for interesting future research.

Relaxed Oracles for Semi-Supervised Clustering

It is shown that a small query complexity is adequate for effective clustering with high probability by providing better pairs to the weak oracle and an effective algorithm to handle such uncertainties in query responses is proposed.