Text Classification from Labeled and Unlabeled Documents using EM

@article{Nigam2000TextCF,
  title={Text Classification from Labeled and Unlabeled Documents using EM},
  author={Kamal Nigam and Andrew McCallum and Sebastian Thrun and Tom Michael Mitchell},
  journal={Machine Learning},
  year={2000},
  volume={39},
  pages={103-134}
}
This paper shows that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents. This is important because in many text classification problems obtaining training labels is expensive, while large quantities of unlabeled documents are readily available. We introduce an algorithm for learning from labeled and unlabeled documents based on the combination of Expectation-Maximization (EM) and a naive… 
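As a rough illustration of the scheme the abstract describes (a minimal sketch, not the authors' implementation), the Python snippet below combines EM with scikit-learn's multinomial Naive Bayes. The fractional-count M-step is approximated with sample weights, and the toy documents, labels, and fixed iteration count are invented for illustration.

import numpy as np
from scipy.sparse import vstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def em_naive_bayes(X_labeled, y_labeled, X_unlabeled, n_iters=10):
    # Initial M-step: train Naive Bayes on the labeled documents alone.
    clf = MultinomialNB().fit(X_labeled, y_labeled)
    n_u = X_unlabeled.shape[0]
    for _ in range(n_iters):
        # E-step: probabilistically label the unlabeled documents.
        probs = clf.predict_proba(X_unlabeled)
        # M-step: retrain on all documents; each unlabeled document counts
        # toward every class, weighted by its posterior probability.
        X_all = vstack([X_labeled] + [X_unlabeled] * len(clf.classes_))
        y_all = np.concatenate(
            [y_labeled] + [np.full(n_u, c) for c in clf.classes_])
        w_all = np.concatenate(
            [np.ones(X_labeled.shape[0])]
            + [probs[:, i] for i in range(len(clf.classes_))])
        clf = MultinomialNB().fit(X_all, y_all, sample_weight=w_all)
    return clf

# Hypothetical two-class toy data (spam = 1, ham = 0).
docs_l = ["cheap pills buy now", "meeting agenda attached"]
docs_u = ["buy cheap meds now", "agenda for the meeting", "cheap pills"]
vec = CountVectorizer()
X_l, X_u = vec.fit_transform(docs_l), vec.transform(docs_u)
model = em_naive_bayes(X_l, np.array([1, 0]), X_u, n_iters=5)
print(model.predict(vec.transform(["buy pills now"])))  # expected: [1]

In the paper, EM iterates until the expected log-likelihood converges; a fixed n_iters is used here only to keep the sketch short.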
A Two Step Data Mining Approach for Amharic Text Classification
TLDR
This paper implements an algorithm for learning from labeled and unlabeled documents based on the combination of Expectation-Maximization (EM) and two classifiers: Naive Bayes (NB) and locally weighted learning (LWL).
Improving Probabilistic Models in Text Classification via Active Learning
TLDR
This work proposes a fast new model for text classification that combines information from both labeled and unlabeled data with an active learning component, where a human iteratively labels documents that the algorithm is least certain about.
Semi-supervised text classification from unlabeled documents using class associated words
TLDR
A learning algorithm based on the combination of Expectation-Maximization and a Naïve Bayes classifier is introduced to classify fully unlabeled documents using class-associated words; it shows good classification capability with high accuracy.
Text classification from positive and unlabeled documents
TLDR
This paper explores an efficient extension of the standard Support Vector Machine approach, called SVMC (Support Vector Mapping Convergence) for the TC-WON tasks, and shows that when the positive training data is not too under-sampled, SVMC significantly outperforms other methods.
Semi-supervised Text Classification Using Partitioned EM
TLDR
This paper proposes a clustering based partitioning technique that first partitions the training documents in a hierarchical fashion using hard clustering, and prunes the tree using the labeled data after running the expectation maximization algorithm in each partition.
Active Learning with Labeled and Unlabeled Documents in Text Categorization
TLDR
An initial naïve Bayesian classifier is built; an active learning method called Uncertainty Sampling, with a new similarity idea for batch selection, is used to select more informative documents for learning; and a boosting committee is built based on the derived naïve Bayesian classifier.
A model for handling approximate, noisy or incomplete labeling in text classification
TLDR
A Bayesian model, BayesANIL, is capable of estimating uncertainties associated with the labeling process and provides an intuitive modification to the EM iterations by re-estimating the empirical distribution.
Text Classification by Labeling Words
TLDR
This paper proposes a method that combines clustering and feature selection: it labels a set of representative words for each class and can effectively rank the words in the unlabeled set according to their importance.
Automatic Text Classification from Labeled and Unlabeled Data
TLDR
This chapter presents a semi-supervised text classification framework that is based on the radial basis function (RBF) neural networks and can learn for classification effectively from a very small quantity of labeled training samples and a large pool of additional unlabeled documents.
...

References

Showing 1-10 of 79 references
Employing EM and Pool-Based Active Learning for Text Classification
This paper shows how a text classifier's need for labeled training documents can be reduced by taking advantage of a large pool of unlabeled documents. We modify the Query-by-Committee (QBC) method…
Employing EM in Pool-based Active Learning for Text Classification
TLDR
This paper shows how a text classifier's need for labeled training data can be reduced by a combination of active learning and Expectation-Maximization on a pool of unlabeled data, and presents a metric for better measuring disagreement among committee members.
Combining labeled and unlabeled data with co-training
TLDR
A PAC-style analysis is provided for a problem setting motivated by the task of learning to classify web pages, in which the description of each example can be partitioned into two distinct views, to allow inexpensive unlabeled data to augment a much smaller set of labeled examples.
Expert network: effective and efficient learning from human decisions in text categorization and retrieval
TLDR
The simplicity of the model, the high recall-precision rates, and the efficient computation together make ExpNet preferable as a practical solution for real-world applications.
Improving Text Classification by Shrinkage in a Hierarchy of Classes
TLDR
This paper shows that the accuracy of a naive Bayes text classifier can be improved by taking advantage of a hierarchy of classes, and adopts an established statistical technique called shrinkage that smooths parameter estimates of a data-sparse child with its parent in order to obtain more robust parameter estimates.
Committee-Based Sampling For Training Probabilistic Classifiers
Active Learning with Committees for Text Categorization
TLDR
This paper reports on experiments using a committee of Winnow-based learners and demonstrates that this approach can reduce the number of labeled training examples required over that used by a single Winnow learner by 1-2 orders of magnitude.
A Mixture of Experts Classifier with Learning Based on Both Labelled and Unlabelled Data
TLDR
A classifier structure and learning algorithm that make effective use of unlabelled data to improve performance; the structure is a "mixture of experts" equivalent to the radial basis function (RBF) classifier but, unlike RBFs, is amenable to likelihood-based training.
A comparison of event models for naive bayes text classification
TLDR
It is found that the multi-variate Bernoulli model performs well with small vocabulary sizes, but that the multinomial model usually performs even better at larger vocabulary sizes, providing on average a 27% reduction in error over the multi-variate Bernoulli model at any vocabulary size.
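As a side note on this reference, both event models it compares are available in scikit-learn, which makes the count-versus-presence distinction easy to see in miniature (an illustrative sketch with invented toy data, not the paper's experimental setup):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

docs = ["free money now now now", "free money", "project meeting notes"]
y = [1, 1, 0]  # toy labels: 1 = spam, 0 = ham
X = CountVectorizer().fit_transform(docs)

# Multinomial event model: word counts matter.
multinomial = MultinomialNB().fit(X, y)
# Multi-variate Bernoulli event model: only word presence/absence matters
# (BernoulliNB binarizes the counts internally).
bernoulli = BernoulliNB().fit(X, y)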
Context-sensitive learning methods for text categorization
TLDR
RIPPER and sleeping-experts perform extremely well across a wide variety of categorization problems, generally outperforming previously applied learning methods; these results are viewed as confirmation of the usefulness of classifiers that represent contextual information.
...