A Comparison of Event Models for Naive Bayes Text Classification
It is found that the multi-variate Bernoulli model performs well with small vocabulary sizes, but that the multinomial model usually performs even better at larger vocabulary sizes, providing on average a 27% reduction in error over the multi-variate Bernoulli model at any vocabulary size.
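The two event models compared above are both available in scikit-learn; a minimal sketch (toy data and labels are illustrative, not the paper's corpus) of the contrast, counts for the multinomial model versus word presence/absence for the multi-variate Bernoulli:

```python
# Toy comparison of the two naive Bayes event models with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

docs = ["cheap pills buy now", "meeting agenda attached",
        "buy cheap cheap pills", "agenda for the team meeting"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham (illustrative)

X = CountVectorizer().fit_transform(docs)  # word-count matrix

# Multinomial event model: uses the word counts directly.
multi = MultinomialNB().fit(X, labels)
# Multi-variate Bernoulli event model: binarizes counts to presence/absence.
bern = BernoulliNB().fit(X, labels)

print(multi.predict(X))
print(bern.predict(X))
```

The paper's 27% figure comes from averaging over real corpora and vocabulary sizes; this sketch only shows where the two models diverge in how they represent a document.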
Text Classification from Labeled and Unlabeled Documents using EM
- K. Nigam, A. McCallum, S. Thrun, T. Mitchell
- Computer Science, Machine-mediated learning
- 1 May 2000
This paper shows that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents, and presents two extensions to the algorithm that improve classification accuracy under these conditions.
Automating the Construction of Internet Portals with Machine Learning
- A. McCallum, K. Nigam, Jason D. M. Rennie, K. Seymore
- Computer Science, Information retrieval (Boston)
- 21 July 2000
New research in reinforcement learning, information extraction and text classification that enables efficient spidering, the identification of informative text segments, and the population of topic hierarchies are described.
Analyzing the effectiveness and applicability of co-training
It is demonstrated that, when learning from labeled and unlabeled data, algorithms explicitly leveraging a natural independent split of the features outperform algorithms that do not use such a split.
Efficient clustering of high-dimensional data sets with application to reference matching
This work presents a new technique for clustering large datasets, using a cheap, approximate distance measure to efficiently divide the data into overlapping subsets the authors call canopies, and presents experimental results on grouping bibliographic citations from the reference sections of research papers.
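The canopy idea described above can be sketched in a few lines; the function name, thresholds, and toy data below are illustrative, not taken from the paper. A cheap distance and two thresholds (loose > tight) carve the data into overlapping subsets before any expensive clustering runs:

```python
# Sketch of canopy formation: points within the loose threshold of a
# center join its canopy (possibly several canopies); points within the
# tight threshold are removed from the pool of future centers.
def canopies(points, cheap_dist, t_loose, t_tight):
    """Return a list of overlapping canopies (sets of indices); t_loose > t_tight."""
    remaining = set(range(len(points)))
    result = []
    while remaining:
        center = remaining.pop()
        canopy = {center}
        for i in list(remaining):
            d = cheap_dist(points[center], points[i])
            if d < t_loose:
                canopy.add(i)          # loose threshold: joins this canopy
            if d < t_tight:
                remaining.discard(i)   # tight threshold: never a new center
        result.append(canopy)
    return result

# Toy 1-D example with absolute difference as the cheap distance.
pts = [0.0, 0.1, 0.2, 5.0, 5.1, 9.0]
print(canopies(pts, lambda a, b: abs(a - b), t_loose=1.0, t_tight=0.5))
```

Expensive, exact clustering (or, in the paper's application, pairwise reference matching) then needs to compare only points that share a canopy.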
Using Maximum Entropy for Text Classification
This paper uses maximum entropy techniques for text classification, estimating the conditional distribution of the class variable given the document; comparing accuracy to naive Bayes shows that maximum entropy is sometimes significantly better, but also sometimes worse.
Employing EM and Pool-Based Active Learning for Text Classification
This paper shows how a text classifier’s need for labeled training documents can be reduced by taking advantage of a large pool of unlabeled documents. We modify the Query-by-Committee (QBC) method…
Learning to Extract Symbolic Knowledge from the World Wide Web
The goal of the research described here is to automatically create a computer understandable world wide knowledge base whose content mirrors that of the World Wide Web, and several machine learning algorithms for this task are described.
Learning to construct knowledge bases from the World Wide Web
Learning to Classify Text from Labeled and Unlabeled Documents
It is shown that the accuracy of text classifiers trained with a small number of labeled documents can be improved by augmenting this small training set with a large pool of unlabeled documents, and an algorithm is introduced based on the combination of Expectation-Maximization with a naive Bayes classifier.
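The EM-plus-naive-Bayes combination described above can be sketched as a simple loop; this is a hard-assignment variant on toy data (corpus, labels, and the 5-iteration cap are all illustrative, not the paper's setup), where the E-step labels the unlabeled pool with the current model and the M-step retrains on labeled and pool documents together:

```python
# Hard-EM sketch: naive Bayes trained on labeled docs, then iteratively
# retrained on its own labels for the unlabeled pool.
import numpy as np
from scipy.sparse import vstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

labeled = ["buy cheap pills", "team meeting agenda"]
labels = np.array([1, 0])  # 1 = spam, 0 = ham (illustrative)
unlabeled = ["cheap pills now", "agenda for the meeting",
             "buy now cheap", "meeting notes attached"]

vec = CountVectorizer().fit(labeled + unlabeled)
Xl, Xu = vec.transform(labeled), vec.transform(unlabeled)

clf = MultinomialNB().fit(Xl, labels)          # initial model: labeled only
for _ in range(5):
    pseudo = clf.predict(Xu)                   # E-step: label the pool
    X_all = vstack([Xl, Xu])
    y_all = np.concatenate([labels, pseudo])
    clf = MultinomialNB().fit(X_all, y_all)    # M-step: retrain on everything

print(clf.predict(Xu))
```

The paper proper uses the expected (soft) class posteriors rather than hard labels in the M-step, and weights the unlabeled documents; the loop structure is the same.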