Building text classifiers using positive and unlabeled examples

@inproceedings{Liu2003BuildingTC,
  title={Building text classifiers using positive and unlabeled examples},
  author={B. Liu and Yang Dai and Xiaoli Li and Wee Sun Lee and Philip S. Yu},
  booktitle={Third IEEE International Conference on Data Mining},
  year={2003},
  pages={179--186}
}
  • B. Liu, Yang Dai, Xiaoli Li, Wee Sun Lee, Philip S. Yu
  • Published 19 November 2003
  • Computer Science
  • Third IEEE International Conference on Data Mining
We study the problem of building text classifiers using positive and unlabeled examples. […] These techniques are based on the same idea, which builds a classifier in two steps; each existing technique uses a different method for each step. We first introduce some new methods for the two steps and perform a comprehensive evaluation of all possible combinations of methods for the two steps. We then propose a more principled approach to solving the problem based on a biased formulation of SVM, and…
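The biased-SVM idea in the abstract can be sketched as follows: treat all unlabeled examples as negative, but penalize errors on the labeled positives more heavily. This is a minimal illustration on synthetic data, not the paper's exact formulation; the class weights, data, and use of scikit-learn's `LinearSVC` with a per-class `class_weight` are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
# Synthetic "documents": positives cluster around +2, negatives around -2.
pos = rng.normal(+2.0, 1.0, size=(50, 5))
neg = rng.normal(-2.0, 1.0, size=(200, 5))
X = np.vstack([pos, neg])

# PU setting: only the first 25 positives are labeled; everything else
# (including the remaining 25 true positives) is "unlabeled", coded as 0.
s = np.zeros(len(X), dtype=int)
s[:25] = 1

# Biased SVM: fit unlabeled-as-negative, but weight positive errors higher
# so the learned boundary favors recovering the positive class.
clf = LinearSVC(class_weight={1: 10.0, 0: 1.0}, max_iter=10000)
clf.fit(X, s)
pred = clf.predict(X)
```

Because the 25 hidden positives look like the labeled ones, the heavy positive weight pushes the boundary to classify the whole positive cluster as positive despite their unlabeled-as-negative training labels.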

Building High-Performance Classifiers Using Positive and Unlabeled Examples for Text Classification
TLDR
An improved iterative classification approach, an extension of Biased-SVM, is proposed and shown to be effective for text classification, outperforming Biased-SVM and other two-step techniques.
A Novel Reliable Negative Method Based on Clustering for Learning from Positive and Unlabeled Examples
TLDR
A novel method for the first step is proposed, which clusters the unlabeled and positive examples to identify reliable negative documents and then runs SVM iteratively; experiments show that it is efficient and effective.
Semi-Supervised Text Classification Using Positive and Unlabeled Data
TLDR
This method combines graph-based semi-supervised learning with the two-step method to solve the PU-learning problem when the positive set P is small, and experiments indicate that the improved method performs well in this setting.
Tri-Training Based Learning from Positive and Unlabeled Data
TLDR
A new tri-training algorithm for the LPU problem is proposed that combines step 1 of three LPU algorithms to extract a reliable negative example set and build an initial classifier for tri-training, replacing the bootstrap sampling procedure, which has not proved to be a good method here.
An Evaluation of Two-Step Techniques for Positive-Unlabeled Learning in Text Classification
TLDR
Five combinations of techniques for the two-step approach to the positive-unlabeled (PU) learning problem are evaluated, and the Rocchio method in step 1 combined with the Expectation-Maximization method in step 2 is found to be the most effective combination in experiments.
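The Rocchio-style step 1 mentioned in this summary can be sketched roughly as follows: build prototype vectors for the positive set P and the unlabeled set U, then keep as reliable negatives the unlabeled documents closer to the "negative" prototype. This is a simplified illustration on synthetic vectors; the prototype weights `alpha` and `beta` and the data are assumptions, not the evaluated implementation.

```python
import numpy as np

def rocchio_reliable_negatives(P, U, alpha=16.0, beta=4.0):
    """Rocchio-style step 1 sketch: return indices into U of unlabeled
    documents that lie closer to the unlabeled ("negative") prototype
    than to the positive prototype."""
    def norm_rows(M):
        n = np.linalg.norm(M, axis=1, keepdims=True)
        return M / np.where(n == 0, 1, n)
    Pn, Un = norm_rows(P), norm_rows(U)
    c_pos = alpha * Pn.mean(axis=0) - beta * Un.mean(axis=0)
    c_neg = alpha * Un.mean(axis=0) - beta * Pn.mean(axis=0)
    sim_pos = Un @ c_pos   # similarity to the positive prototype
    sim_neg = Un @ c_neg   # similarity to the "negative" prototype
    return np.where(sim_neg > sim_pos)[0]

# Demo: positives point along axis 0; U mixes 10 hidden positives
# with 30 true negatives pointing along axis 1.
rng = np.random.default_rng(1)
P = rng.normal(0, 0.2, (20, 3)); P[:, 0] += 5
U = rng.normal(0, 0.2, (40, 3)); U[:10, 0] += 5; U[10:, 1] += 5
idx = rocchio_reliable_negatives(P, U)
```

In the demo, the extracted indices should fall in the true-negative block (10 onward) while the hidden positives are left in U for step 2.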
A More Accurate Text Classifier for Positive and Unlabeled data
TLDR
Comprehensive experiments demonstrate that the proposed CoTrain-Active approach is superior to Biased-SVM, the previous best method, and is especially suitable for situations where the given positive dataset P is extremely insufficient.
Reliable Negative Extracting Based on kNN for Learning from Positive and Unlabeled Examples
TLDR
A new reliable-negative extraction algorithm for step 1 is proposed: a kNN algorithm ranks unlabeled examples by their similarity to the k nearest positive examples, and those whose similarity falls below a threshold are labeled as reliable negative examples, in contrast to the common approach of labeling positive examples.
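A rough sketch of the kNN-style extraction described in this summary: score each unlabeled example by its mean cosine similarity to its k nearest positives and keep low scorers as reliable negatives. The use of cosine similarity, the value of k, and the threshold are illustrative assumptions, not the paper's exact parameters.

```python
import numpy as np

def knn_reliable_negatives(P, U, k=3, threshold=0.5):
    """kNN-style step 1 sketch: return indices into U of unlabeled examples
    whose mean cosine similarity to their k nearest positive examples
    falls below the threshold."""
    def norm_rows(M):
        n = np.linalg.norm(M, axis=1, keepdims=True)
        return M / np.where(n == 0, 1, n)
    sims = norm_rows(U) @ norm_rows(P).T     # cosine similarities, |U| x |P|
    topk = np.sort(sims, axis=1)[:, -k:]     # k most similar positives
    scores = topk.mean(axis=1)
    return np.where(scores < threshold)[0]

# Demo: positives point along axis 0; U mixes 10 hidden positives
# with 30 true negatives pointing along axis 1.
rng = np.random.default_rng(2)
P = rng.normal(0, 0.2, (20, 3)); P[:, 0] += 5
U = rng.normal(0, 0.2, (40, 3)); U[:10, 0] += 5; U[10:, 1] += 5
neg_idx = knn_reliable_negatives(P, U)
```

Hidden positives score near 1 against their positive neighbors and survive the threshold, while the orthogonal negatives score near 0 and are extracted.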
Building text classifiers using positive, unlabeled and ‘outdated’ examples
TLDR
The results show that the proposed Transfer-1DNF method extracts more reliable negative examples with lower error rates, and that the resulting classifier outperforms the baseline algorithms.
Co-EM Support Vector Machine Based Text Classification from Positive and Unlabeled Examples
  • Bang-zuo Zhang, W. Zuo
  • Computer Science
    2008 First International Conference on Intelligent Networks and Intelligent Systems
  • 2008
TLDR
This paper proposes a novel multi-view method for learning from positive and unlabeled examples (LPU) based on the co-EM SVM algorithm, which was previously used for semi-supervised learning.
A New PU Learning Algorithm for Text Classification
TLDR
This paper adopts the traditional two-step approach, making use of both positive and unlabeled examples, and improves the 1-DNF algorithm to identify many more reliable negative documents with a very low error rate.
...

References

SHOWING 1-10 OF 47 REFERENCES
Partially Supervised Classification of Text Documents
TLDR
This paper studies the problem of identifying documents of a particular topic or class given a set P of positive documents and a large set M of mixed documents, and shows that under appropriate conditions, solutions to a constrained optimization problem give good solutions to the partially supervised classification problem.
Text Classification from Labeled and Unlabeled Documents using EM
TLDR
This paper shows that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents, and presents two extensions to the algorithm that improve classification accuracy under these conditions.
Combining labeled and unlabeled data with co-training
TLDR
A PAC-style analysis is provided for a problem setting motivated by the task of learning to classify web pages, in which the description of each example can be partitioned into two distinct views, allowing inexpensive unlabeled data to augment a much smaller set of labeled examples.
Combining Labeled and Unlabeled Data for MultiClass Text Categorization
TLDR
This paper develops a framework to incorporate unlabeled data in the Error-Correcting Output Coding (ECOC) setup by first decomposing multiclass problems into multiple binary problems and then using Co-Training to learn the individual binary classification problems.
A sequential algorithm for training text classifiers
TLDR
An algorithm for sequential sampling during machine learning of statistical classifiers was developed and tested on a newswire text categorization task and reduced by as much as 500-fold the amount of training data that would have to be manually classified to achieve a given level of effectiveness.
The Value of Unlabeled Data for Classification Problems
TLDR
It is demonstrated that Fisher information matrices can be used to judge the asymptotic value of unlabeled data, and this methodology is applied to both passive partially supervised learning and active learning.
One-Class SVMs for Document Classification
TLDR
The SVM approach as represented by Schoelkopf was superior to all the methods except the neural network one, to which it was essentially comparable, although occasionally worse.
Enhancing Supervised Learning with Unlabeled Data
TLDR
A new semi-supervised learning method called co-learning is presented that uses unlabeled data to enhance standard supervised learning algorithms, leveraging the fact that the algorithms have different representations of the hypotheses and are likely to detect different patterns in the labeled data.
A re-examination of text categorization methods
TLDR
The results show that SVM, kNN and LLSF significantly outperform NNet and NB when the number of positive training instances per category is small, and that all the methods perform comparably when the categories have over 300 instances.
...