• Corpus ID: 25928202

Semi-Supervised Active Clustering with Weak Oracles

Taewan Kim and Joydeep Ghosh
Semi-supervised active clustering (SSAC) utilizes the knowledge of a domain expert to cluster data points by interactively making pairwise "same-cluster" queries. However, it is impractical to ask human oracles to answer every pairwise query. In this paper, we study the influence of allowing "not-sure" answers from a weak oracle and propose algorithms to efficiently handle uncertainties. Different types of model assumptions are analyzed to cover realistic scenarios of oracle abstraction. In the… 
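As a rough illustration of the query model described in the abstract, the sketch below simulates a weak oracle that answers a pairwise same-cluster query with "yes", "no", or "not-sure". The abstention rule (abstain independently with a fixed probability `p_unsure`), the retry helper, and all names (`WeakOracle`, `same_cluster`, `query_until_sure`) are assumptions made for illustration only; they are not the paper's oracle models or algorithms.

```python
import random


class WeakOracle:
    """Hypothetical oracle for pairwise same-cluster queries that may abstain."""

    def __init__(self, labels, p_unsure=0.3, seed=0):
        self.labels = labels        # ground-truth cluster label per point (known only to the oracle)
        self.p_unsure = p_unsure    # assumed probability of a "not-sure" answer
        self.rng = random.Random(seed)
        self.num_queries = 0        # running count of queries asked

    def same_cluster(self, i, j):
        """Return "yes", "no", or "not-sure" for the pair (i, j)."""
        self.num_queries += 1
        if self.rng.random() < self.p_unsure:
            return "not-sure"
        return "yes" if self.labels[i] == self.labels[j] else "no"


def query_until_sure(oracle, i, candidates):
    """Pair point i with each candidate in turn until the oracle commits to an answer."""
    for c in candidates:
        answer = oracle.same_cluster(i, c)
        if answer != "not-sure":
            return c, answer
    return None, "not-sure"
```

Counting `num_queries` makes it possible to compare how many queries different pair-selection strategies need under a given abstention rate.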
Relaxed Oracles for Semi-Supervised Clustering
It is shown that, with high probability, a small query complexity suffices for effective clustering when better pairs are provided to the weak oracle, and an effective algorithm for handling uncertainty in query responses is proposed.
Same-Cluster Querying for Overlapping Clusters
This paper provides upper bounds (with algorithms) on the number of queries sufficient for the more practical scenario of overlapping clusters, and provides algorithmic results under both arbitrary (worst-case) and statistical modeling assumptions.
A PAC-Theory of Clustering with Advice
The trade-offs between computational and advice complexities of learning are investigated, showing that using a little bit of advice can turn an otherwise computationally hard clustering problem into a tractable one.
How to Design Robust Algorithms using Noisy Comparison Oracle
This paper studies various problems, including finding the maximum and nearest/farthest neighbor search, under two noise models, adversarial and probabilistic, and gives robust algorithms for k-center clustering and agglomerative hierarchical clustering.
Entropy-based active sparse subspace clustering
A novel extension of SSC with an active learning framework is proposed, in which the most informative pairwise constraints are selected to guide SSC toward accurate clustering results.
Query K-means Clustering and the Double Dixie Cup Problem
We consider the problem of approximate $K$-means clustering with outliers and side information provided by same-cluster queries and possibly noisy answers. Our solution shows that, under some mild


A probabilistic framework for semi-supervised clustering
A probabilistic model for semi-supervised clustering based on Hidden Markov Random Fields (HMRFs) is presented that provides a principled framework for incorporating supervision into prototype-based clustering; experimental results demonstrate the advantages of the proposed framework.
Active Semi-Supervision for Pairwise Constrained Clustering
Experimental and theoretical results confirm that this active querying of pairwise constraints significantly improves the accuracy of clustering when given a relatively small amount of supervision.
Clustering under Perturbation Resilience
This paper presents an algorithm that can optimally cluster instances resilient to $(1 + \sqrt{2})$-factor perturbations, solving an open problem of Awasthi et al.
Clustering with Constraints: Feasibility Issues and the k-Means Algorithm
A key finding is that determining whether there is a feasible solution satisfying all constraints is, in general, NP-complete, and this motivates the derivation of a new version of the k-Means algorithm that minimizes the constrained vector quantization error but at each iteration does not attempt to satisfy all constraints.
Representation Learning for Clustering: A Statistical Framework
A formal statistical model for analyzing the sample complexity of learning a clustering representation with this paradigm is provided and a notion of capacity of a class of possible representations is introduced, in the spirit of the VC-dimension, showing that classes of representations that have finite such dimension can be successfully learned with sample size error bounds.
Clustering Via Crowdsourcing
A major contribution of this paper is to reduce the query complexity to linear or even sublinear in $n$ when mild side information is provided by a machine, even in the presence of crowd errors that are not correctable via resampling.
Clustering with Bregman Divergences
This paper proposes and analyzes parametric hard and soft clustering algorithms based on a large class of distortion functions known as Bregman divergences, and shows that there is a bijection between regular exponential families and a large class of Bregman divergences, called regular Bregman divergences.
Semi-Supervised Clustering with User Feedback
This work presents an approach to clustering based on the observation that "it is easier to criticize than to construct" and demonstrates semi-supervised clustering with a system that learns to cluster news stories from a Reuters data set.
A Dimension-Independent Generalization Bound for Kernel Supervised Principal Component Analysis
This work provides a guarantee indicating that KSPCA generalizes well even when the number of parameters is large, as long as they have small norms, which justifies the good performance of KSPCA on high-dimensional data.
User-Friendly Tail Bounds for Sums of Random Matrices
J. Tropp. Mathematics. Found. Comput. Math., 2012.
This paper presents new probability inequalities for sums of independent, random, self-adjoint matrices and provides noncommutative generalizations of the classical bounds associated with the names Azuma, Bennett, Bernstein, Chernoff, Hoeffding, and McDiarmid.