• Corpus ID: 2748729

Partially Supervised Classification of Text Documents

B. Liu, Wee Sun Lee, Philip S. Yu, Xiaoli Li
We investigate the following problem: given a set of documents of a particular topic or class P, and a large set M of mixed documents that contains documents from class P as well as other types of documents, identify the documents from class P in M. The key feature of this problem is that there are no labeled non-P documents, which makes traditional machine learning techniques inapplicable, as they all need labeled documents of both classes. We call this problem partially supervised classification. In…
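Many methods in this PU-learning family share a two-step shape: first extract "reliable negatives" from the mixed set M, then train a conventional classifier on P versus those negatives. A minimal sketch with toy token lists and Rocchio-style centroids (a stand-in for the paper's own spy-based EM technique; all documents and the half-split threshold below are illustrative assumptions):

```python
# Two-step PU-learning heuristic on toy data (illustrative only; the
# paper's method, S-EM, uses spy documents and EM with naive Bayes).
from collections import Counter
import math

def centroid(docs):
    """Average length-normalized term-frequency vector over token lists."""
    c = Counter()
    for d in docs:
        for t in d:
            c[t] += 1 / len(d)
    n = len(docs)
    return {t: v / n for t, v in c.items()}

def cosine(vec, doc):
    """Cosine similarity between a centroid vector and one document."""
    tf = Counter(doc)
    dot = sum(vec.get(t, 0.0) * f for t, f in tf.items())
    nv = math.sqrt(sum(v * v for v in vec.values()))
    nd = math.sqrt(sum(f * f for f in tf.values()))
    return dot / (nv * nd) if nv and nd else 0.0

P = [["ball", "goal", "team"], ["goal", "match", "team"]]          # positive docs
M = [["ball", "team", "goal"],                                      # hidden positive
     ["stock", "market", "price"], ["market", "price", "trade"]]    # hidden negatives

# Step 1: score unlabeled docs against the positive centroid; the
# least-similar half become "reliable negatives" RN.
cp = centroid(P)
scored = sorted(M, key=lambda d: cosine(cp, d))
RN = scored[: len(M) // 2]

# Step 2: classify each doc in M by nearest centroid (P vs RN).
cn = centroid(RN)
predicted_P = [d for d in M if cosine(cp, d) > cosine(cn, d)]
print(predicted_P)
```

The half-split in step 1 is the crudest possible choice; the methods surveyed below differ mainly in how carefully they pick the reliable-negative set and which classifier they train in step 2.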


Text classification without labeled negative documents
A partition-based heuristic that aims at extracting both positive and negative documents from U, which outperforms all existing approaches significantly, especially when the size of P is extremely small.
Classifying Documents Without Labels
This paper focuses on the classification of an unlabelled set of documents into two classes, relevant and irrelevant, given a topic of interest, and shows experimentally that this method is capable of accurately classifying a set of documents into relevant and irrelevant classes.
Semi-supervised text categorization with only a few positive and unlabeled documents
  • Fang Lu, Qingyuan Bai
  • Computer Science
    2010 3rd International Conference on Biomedical Engineering and Informatics
  • 2010
This paper proposes a refined method for PU-learning based on a known technique combining the Rocchio and K-means algorithms, and shows that the refined method performs better when the set P is very small.
Text classification from positive and unlabeled documents
This paper explores an efficient extension of the standard Support Vector Machine approach, called SVMC (Support Vector Mapping Convergence) for the TC-WON tasks, and shows that when the positive training data is not too under-sampled, SVMC significantly outperforms other methods.
Classification from Positive and Unlabeled Documents
This paper explores an efficient extension of the standard Support Vector Machine (SVM) approach, called SVMC (Support Vector Mapping Convergence) for the TC-WON tasks, and shows that when the positive training data is not too under-sampled, SVMC significantly outperforms other methods.
Text Classification by Labeling Words
This paper proposes a method combining clustering and feature selection that labels a set of representative words for each class and can effectively rank the words in the unlabeled set according to their importance.
Building a Text Classifier by a Keyword and Unlabeled Documents
This paper studies the problem of building a text classifier from a keyword and unlabeled documents, so as to avoid labeling documents manually, and shows that the proposed approach could help to build excellent text classifiers.
Corpus Based Unsupervised Labeling of Documents
This work explores a novel method of assigning labels to documents without using any training data, which uses clustering to build semantically related sets that serve as candidate labels for documents.
Building text classifiers using positive and unlabeled examples
A more principled approach to solving the problem of building text classifiers using positive and unlabeled examples based on a biased formulation of SVM is proposed, and it is shown experimentally that it is more accurate than the existing techniques.
Semi-Supervised Text Classification Using Positive and Unlabeled Data
This method combines graph-based semi-supervised learning with the two-step method for solving the PU-learning problem, and experiments indicate that the improved method performs well when the size of P is small.


Learning to Classify Text from Labeled and Unlabeled Documents
It is shown that the accuracy of text classifiers trained with a small number of labeled documents can be improved by augmenting this small training set with a large pool of unlabeled documents, and an algorithm is introduced based on the combination of Expectation-Maximization with a naive Bayes classifier.
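The EM-plus-naive-Bayes combination described above can be sketched on toy word-count data. Everything below (documents, vocabulary, five EM iterations) is an illustrative assumption, not the cited paper's setup:

```python
# EM with multinomial naive Bayes: start from a few labeled documents,
# then alternately soft-label the unlabeled pool (E-step) and retrain
# (M-step). Toy synthetic data for illustration only.
import math
from collections import Counter

labeled = [(["win", "team"], 0), (["stock", "price"], 1)]   # (tokens, class)
unlabeled = [["win", "goal", "team"], ["price", "market", "stock"]]
vocab = sorted({t for d, _ in labeled for t in d} | {t for d in unlabeled for t in d})

def train(doc_weights):
    """doc_weights: list of (tokens, [w_class0, w_class1]) soft labels."""
    prior = [1.0, 1.0]                      # Laplace-smoothed class priors
    counts = [Counter(), Counter()]
    for tokens, w in doc_weights:
        for c in (0, 1):
            prior[c] += w[c]
            for t in tokens:
                counts[c][t] += w[c]
    total = sum(prior)
    probs = []
    for c in (0, 1):
        denom = sum(counts[c].values()) + len(vocab)   # add-one smoothing
        probs.append({t: (counts[c][t] + 1) / denom for t in vocab})
    return [p / total for p in prior], probs

def posterior(tokens, prior, probs):
    """Class posterior P(c | tokens) under the multinomial event model."""
    logp = [math.log(prior[c]) + sum(math.log(probs[c][t]) for t in tokens)
            for c in (0, 1)]
    m = max(logp)
    e = [math.exp(l - m) for l in logp]
    s = sum(e)
    return [x / s for x in e]

# Initialize from the labeled documents only, then run EM.
weights = [(d, [1.0 - c, float(c)]) for d, c in labeled]
prior, probs = train(weights)
for _ in range(5):
    # E-step: soft-label the unlabeled docs with the current model.
    soft = [(d, posterior(d, prior, probs)) for d in unlabeled]
    # M-step: retrain on labeled plus soft-labeled documents.
    prior, probs = train(weights + soft)

print([posterior(d, prior, probs) for d in unlabeled])
```

The unlabeled pool pulls the per-class word estimates toward words ("goal", "market") that never appear in the labeled set, which is exactly the augmentation effect the abstract describes.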
An Evaluation of Statistical Approaches to Text Categorization
Analysis and empirical evidence suggest that the evaluation results on some versions of Reuters were significantly affected by the inclusion of a large portion of unlabelled documents, making those results difficult to interpret and leading to considerable confusion in the literature.
A comparison of two learning algorithms for text categorization
It is shown that both algorithms achieve reasonable performance and allow controlled tradeoffs between false positives and false negatives, and the stepwise feature selection in the decision tree algorithm is particularly effective in dealing with the large feature sets common in text categorization.
NewsWeeder: Learning to Filter Netnews
Text Categorization with Support Vector Machines: Learning with Many Relevant Features
This paper explores the use of Support Vector Machines (SVMs) for learning text classifiers from examples. It analyzes the particular properties of learning with text data and identifies why SVMs are
Learning from positive and unlabeled examples
A comparison of event models for naive bayes text classification
It is found that the multi-variate Bernoulli model performs well with small vocabulary sizes, but that the multinomial model usually performs even better at larger vocabulary sizes, providing on average a 27% reduction in error over the multi-variate Bernoulli model at any vocabulary size.
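The difference between the two event models can be made concrete on a single toy document (the per-class word probabilities below are assumed purely for illustration):

```python
# Multinomial vs multi-variate Bernoulli naive Bayes likelihoods for one
# toy document, under assumed per-class word probabilities.
import math

vocab = ["good", "bad", "ok"]
p_word = {"good": 0.6, "bad": 0.1, "ok": 0.3}        # multinomial P(w | class)
p_present = {"good": 0.8, "bad": 0.2, "ok": 0.5}     # Bernoulli P(w present | class)

doc = ["good", "good", "ok"]   # token sequence

# Multinomial: one factor per token, so word frequency matters
# ("good" contributes twice).
log_multinomial = sum(math.log(p_word[t]) for t in doc)

# Multi-variate Bernoulli: one factor per vocabulary word (present or
# absent), so frequency is ignored and absent words ("bad") also count.
present = set(doc)
log_bernoulli = sum(
    math.log(p_present[w]) if w in present else math.log(1 - p_present[w])
    for w in vocab
)
print(log_multinomial, log_bernoulli)
```

With a large vocabulary the Bernoulli model's product over every absent word dominates the score, which is one intuition for why the multinomial model tends to win as vocabulary size grows.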
PAC Learning from Positive Statistical Queries
It is shown that k-DNF and k-decision lists are learnable in both models, i.e., with far less information than is assumed in previously used algorithms.
Efficient noise-tolerant learning from statistical queries
This paper formalizes a new but related model of learning from statistical queries, and demonstrates the generality of the statistical query model, showing that practically every class learnable in Valiant's model and its variants can also be learned in the new model (and thus can be learned in the presence of noise).
Learning from Positive Data
  • S. Muggleton
  • Computer Science
    Inductive Logic Programming Workshop
  • 1996
New results are presented which show that, within a Bayesian framework, not only grammars but also logic programs are learnable with arbitrarily low expected error from positive examples only, and that the upper bound on the expected error of a learner which maximises the Bayes posterior probability is within a small additive term of that of one which learns from a mixture of positive and negative examples.