Text Classification from Labeled and Unlabeled Documents using EM
@article{Nigam2004TextCF,
  title   = {Text Classification from Labeled and Unlabeled Documents using EM},
  author  = {Kamal Nigam and Andrew McCallum and Sebastian Thrun and Tom Michael Mitchell},
  journal = {Machine Learning},
  year    = {2000},
  volume  = {39},
  pages   = {103--134}
}
This paper shows that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents. This is important because in many text classification problems obtaining training labels is expensive, while large quantities of unlabeled documents are readily available. We introduce an algorithm for learning from labeled and unlabeled documents based on the combination of Expectation-Maximization (EM) and a naive…
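The EM-plus-naive-Bayes procedure the abstract describes can be sketched as follows: fit a multinomial naive Bayes model on the labeled documents, then alternate between computing class posteriors for the unlabeled documents (E-step) and refitting the model on all documents weighted by those posteriors (M-step). This is a minimal NumPy sketch under stated assumptions; the function name, smoothing constant, and toy data layout are illustrative, not the paper's exact implementation.

```python
import numpy as np

def em_naive_bayes(X_lab, y_lab, X_unlab, n_classes, n_iter=10, alpha=1.0):
    """Semi-supervised multinomial naive Bayes trained with EM.

    X_lab, X_unlab: document-term count matrices (2-D numpy arrays).
    y_lab: integer class labels for the labeled documents.
    Returns (log_prior, log_likelihood) model parameters.
    """
    # Responsibilities for labeled docs are fixed one-hot vectors.
    R_lab = np.eye(n_classes)[y_lab]

    def m_step(X, R):
        # Class priors and per-class word probabilities, Laplace-smoothed.
        prior = R.sum(axis=0) + alpha
        prior /= prior.sum()
        counts = R.T @ X + alpha                       # (n_classes, n_features)
        like = counts / counts.sum(axis=1, keepdims=True)
        return np.log(prior), np.log(like)

    def e_step(X, log_prior, log_like):
        # Posterior class responsibilities for each document (softmax of log joint).
        log_post = X @ log_like.T + log_prior
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        return post / post.sum(axis=1, keepdims=True)

    # Initial model from the labeled documents only.
    log_prior, log_like = m_step(X_lab, R_lab)
    X_all = np.vstack([X_lab, X_unlab])
    for _ in range(n_iter):
        R_unlab = e_step(X_unlab, log_prior, log_like)  # E-step on unlabeled pool
        R_all = np.vstack([R_lab, R_unlab])
        log_prior, log_like = m_step(X_all, R_all)      # M-step on all documents
    return log_prior, log_like
```

Prediction for a new count vector `x` is then `argmax(x @ log_like.T + log_prior)`; with few labeled documents and a large unlabeled pool, the EM-refit parameters are typically more robust than the labeled-only estimates.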
3,093 Citations
A Two Step Data Mining Approach for Amharic Text Classification
- Computer Science
- 2014
This paper implements an algorithm for learning from labeled and unlabeled documents based on the combination of Expectation-Maximization (EM) and two classifiers: Naive Bayes (NB) and locally weighted learning (LWL).
Improving Probabilistic Models in Text Classification via Active Learning
- Computer Science, ArXiv
- 2022
This work proposes a fast new model for text classification that combines information from both labeled and unlabeled data with an active learning component, where a human iteratively labels documents that the algorithm is least certain about.
Semi-supervised text classification from unlabeled documents using class associated words
- Computer Science, 2009 International Conference on Computers & Industrial Engineering
- 2009
A learning algorithm, based on the combination of Expectation-Maximization and a Naïve Bayes classifier, is introduced to classify documents using only unlabeled documents and class-associated words; it shows good classification capability with high accuracy.
Text classification from positive and unlabeled documents
- Computer Science, CIKM '03
- 2003
This paper explores an efficient extension of the standard Support Vector Machine approach, called SVMC (Support Vector Mapping Convergence) for the TC-WON tasks, and shows that when the positive training data is not too under-sampled, SVMC significantly outperforms other methods.
Semi-supervised Text Classification Using Partitioned EM
- Computer Science, DASFAA
- 2004
This paper proposes a clustering based partitioning technique that first partitions the training documents in a hierarchical fashion using hard clustering, and prunes the tree using the labeled data after running the expectation maximization algorithm in each partition.
Active Learning with Labeled and Unlabeled Documents in Text Categorization
- Computer Science
- 2002
An initial naïve Bayesian classifier is built; an active learning method called Uncertainty Sampling, with a new similarity-based idea for batch selection, is used to select more informative documents for learning; and a boosting committee is then built based on the derived naïve Bayesian classifiers.
Classification from Positive and Unlabeled Documents
- Computer Science
- 2010
This paper explores an efficient extension of the standard Support Vector Machine (SVM) approach, called SVMC (Support Vector Mapping Convergence) for the TC-WON tasks, and shows that when the positive training data is not too under-sampled, SVMC significantly outperforms other methods.
A model for handling approximate, noisy or incomplete labeling in text classification
- Computer Science, ICML
- 2005
A Bayesian model, BayesANIL, is proposed that is capable of estimating uncertainties associated with the labeling process, and provides an intuitive modification to the EM iterations by re-estimating the empirical…
Text Classification by Labeling Words
- Computer Science, AAAI
- 2004
This paper proposes a method that combines clustering and feature selection: it labels a set of representative words for each class and can effectively rank the words in the unlabeled set according to their importance.
Automatic Text Classification from Labeled and Unlabeled Data
- Computer Science
- 2012
This chapter presents a semi-supervised text classification framework that is based on the radial basis function (RBF) neural networks and can learn for classification effectively from a very small quantity of labeled training samples and a large pool of additional unlabeled documents.
References
Showing 1-10 of 79 references
Employing EM and Pool-Based Active Learning for Text Classification
- Computer Science, ICML
- 1998
This paper shows how a text classifier’s need for labeled training documents can be reduced by taking advantage of a large pool of unlabeled documents. We modify the Query-by-Committee (QBC) method…
Employing EM in Pool-based Active Learning for Text Classification
- Computer Science
- 1998
This paper shows how a text classifier's need for labeled training data can be reduced by a combination of active learning and Expectation-Maximization on a pool of unlabeled data, and presents a metric for better measuring disagreement among committee members.
Combining labeled and unlabeled data with co-training
- Computer Science, COLT '98
- 1998
A PAC-style analysis is provided for a problem setting motivated by the task of learning to classify web pages, in which the description of each example can be partitioned into two distinct views, allowing inexpensive unlabeled data to augment a much smaller set of labeled examples.
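The co-training loop summarized above can be sketched with a deliberately simple per-view learner (nearest class mean): each view's classifier pseudo-labels the unlabeled examples it is most confident about, growing the shared labeled pool for the other view. The learner, the margin-based confidence score, and the two-class restriction are illustrative simplifications, not Blum and Mitchell's exact procedure.

```python
import numpy as np

def class_means_predict(X_train, y_train, X):
    """Minimal one-view learner: nearest class mean, with a margin confidence."""
    means = np.stack([X_train[y_train == c].mean(axis=0) for c in (0, 1)])
    d = np.stack([np.linalg.norm(X - m, axis=1) for m in means], axis=1)
    pred = d.argmin(axis=1)
    conf = np.abs(d[:, 0] - d[:, 1])   # margin between the two class distances
    return pred, conf

def co_train(X1, X2, y, n_rounds=10, per_round=4):
    """Co-training over two views X1, X2; y uses -1 for unlabeled examples."""
    y = y.copy()
    for _ in range(n_rounds):
        for X in (X1, X2):                 # each view labels for the shared pool
            idx = np.flatnonzero(y >= 0)   # currently labeled (incl. pseudo-labels)
            unlab = np.flatnonzero(y < 0)
            if len(unlab) == 0:
                return y
            pred, conf = class_means_predict(X[idx], y[idx], X[unlab])
            order = np.argsort(-conf)[:per_round]   # most confident predictions
            y[unlab[order]] = pred[order]
    return y
```

The PAC analysis motivates why this helps: when the two views are individually sufficient and conditionally independent given the class, confident predictions from one view act as informative labels for the other.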
Expert network: effective and efficient learning from human decisions in text categorization and retrieval
- Computer Science, SIGIR '94
- 1994
The simplicity of the model, the high recall-precision rates, and the efficient computation together make ExpNet preferable as a practical solution for real-world applications.
Improving Text Classification by Shrinkage in a Hierarchy of Classes
- Computer Science, ICML
- 1998
This paper shows that the accuracy of a naive Bayes text classifier can be improved by taking advantage of a hierarchy of classes, and adopts an established statistical technique called shrinkage that smooths the parameter estimates of a data-sparse child with those of its parent in order to obtain more robust parameter estimates.
Active Learning with Committees for Text Categorization
- Computer Science, AAAI/IAAI
- 1997
This paper reports on experiments using a committee of Winnow-based learners and demonstrates that this approach can reduce the number of labeled training examples required by 1-2 orders of magnitude relative to a single Winnow learner.
A Mixture of Experts Classifier with Learning Based on Both Labelled and Unlabelled Data
- Computer Science, NIPS
- 1996
A classifier structure and learning algorithm that make effective use of unlabelled data to improve performance and is a "mixture of experts" structure that is equivalent to the radial basis function (RBF) classifier, but unlike RBFs, is amenable to likelihood-based training.
A comparison of event models for naive bayes text classification
- Computer Science, AAAI 1998
- 1998
It is found that the multi-variate Bernoulli model performs well with small vocabulary sizes, but that the multinomial model usually performs even better at larger vocabulary sizes, providing on average a 27% reduction in error over the multi-variate Bernoulli model at any vocabulary size.
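The two event models compared above differ in how a document's class-conditional likelihood is computed: the multinomial model multiplies over word occurrences (using counts), while the multi-variate Bernoulli multiplies over the whole vocabulary, including explicit terms for absent words. A minimal sketch, with illustrative probability values:

```python
import numpy as np

# Per-class word parameters for a 4-word vocabulary (illustrative values only).
theta = np.array([0.5, 0.3, 0.1, 0.1])   # multinomial: P(word | class), sums to 1
phi = np.array([0.9, 0.6, 0.2, 0.1])     # Bernoulli: P(word appears | class)

counts = np.array([3, 1, 0, 2])          # document as word counts
binary = (counts > 0).astype(float)      # document as presence/absence

# Multinomial event model: log-likelihood sums over word *occurrences*
# (the count-independent multinomial coefficient is dropped, as in NB scoring).
log_multinomial = counts @ np.log(theta)

# Multi-variate Bernoulli: sums over the *vocabulary*, penalizing absent words too.
log_bernoulli = binary @ np.log(phi) + (1 - binary) @ np.log(1 - phi)
```

Because the multinomial term scales with counts while the Bernoulli term is fixed per vocabulary word, the two models diverge as vocabulary size grows, which is consistent with the comparison reported above.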
Context-sensitive learning methods for text categorization
- Computer Science, SIGIR '96
- 1996
RIPPER and sleeping-experts perform extremely well across a wide variety of categorization problems, generally outperforming previously applied learning methods; the results are viewed as a confirmation of the usefulness of classifiers that represent contextual information.