Corpus ID: 246016162

Risk bounds for PU learning under Selected At Random assumption

Olivier Coudray, Christine Keribin, Pascal Massart, Patrick Pamphile

Positive-unlabeled learning (PU learning) is known as a special case of semi-supervised binary classification where only a fraction of positive examples are labeled. The challenge is then to find the correct classifier despite this lack of information. Recently, new methodologies have been introduced to address the case where the probability of being labeled may depend on the covariates. In this paper, we are interested in establishing risk bounds for PU learning under this general assumption… 



Beyond the Selected Completely At Random Assumption for Learning from Positive and Unlabeled Data

The empirical analysis supports the theoretical results and shows that taking into account the possibility of a selection bias, even when the labeling mechanism is unknown, improves the trained classifiers.

Instance-Dependent PU Learning by Bayesian Optimal Relabeling

This paper proposes a probabilistic-gap-based PU learning algorithm that can automatically label a group of positive and negative examples whose labels are identical to those assigned by a Bayesian optimal classifier, with a consistency guarantee.

Class Prior Estimation from Positive and Unlabeled Data

This paper proposes a new method that estimates the class prior by partially matching the class-conditional density of the positive class to the input density, with the partial matching performed in terms of the Pearson divergence.

Risk bounds for statistical learning

This paper proposes a general theorem providing upper bounds for the risk of an empirical risk minimizer (ERM) when the classification rules belong to some VC-class under margin conditions, and discusses the optimality of these bounds in a minimax sense.

Estimating the Class Prior in Positive and Unlabeled Data Through Decision Tree Induction

This paper proposes a simple yet effective method for estimating the class prior by estimating the probability that a positive example is selected to be labeled, and shows that the lower bound it obtains for this probability gets closer to the true value as the ratio of labeled examples increases.

Semi-Supervised Novelty Detection

The paper argues that novelty detection in this semi-supervised setting is naturally solved by a general reduction to a binary classification problem, and it provides a general solution to the two-sample problem, that is, the problem of determining whether two random samples arise from the same distribution.

Estimating the class prior and posterior from noisy positives and unlabeled data

This work develops a classification algorithm for estimating posterior distributions from positive-unlabeled data that is robust to noise in the positive labels and effective for high-dimensional data, and proves that these univariate transforms preserve the class prior.

Mixture Proportion Estimation via Kernel Embeddings of Distributions

This work constructs a provably correct algorithm for mixture proportion estimation (MPE) based on embedding distributions onto an RKHS, derives convergence rates under certain assumptions on the distribution, and demonstrates that it performs comparably to or better than other algorithms on most datasets.

Classification with imperfect training labels

The k-NN and SVM classifiers are robust to imperfect training labels, in the sense that the rate of convergence of the excess risks of these classifiers remains unchanged; in fact, the theoretical and empirical results even show that in some cases, imperfect labels may improve the performance of these methods.

Self-PU: Self Boosted and Calibrated Positive-Unlabeled Training

Self-PU achieves state-of-the-art performance on common PU learning benchmarks, comparing favorably against the latest competitors, and obtains significantly improved results over existing methods on the renowned Alzheimer's Disease Neuroimaging Initiative (ADNI) database.