Publications
A comparison of event models for naive bayes text classification
TLDR
It is found that the multi-variate Bernoulli model performs well with small vocabulary sizes, but that the multinomial model usually performs even better at larger vocabulary sizes, providing on average a 27% reduction in error over the multi-variate Bernoulli model at any vocabulary size.
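A minimal sketch of the two event models, using scikit-learn as a stand-in for the paper's own implementation; the toy corpus, labels, and vocabulary sizes below are assumptions for illustration.

```python
# Contrast the two naive Bayes event models compared in the paper:
# multinomial (word counts) vs. multi-variate Bernoulli (word presence).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

train_docs = ["cheap meds now", "meeting at noon", "buy cheap meds", "lunch meeting today"]
train_labels = [1, 0, 1, 0]  # toy labels: 1 = spam, 0 = ham
test_docs = ["cheap meeting now"]

for vocab_size in (2, 4, 8):  # the paper varies vocabulary size; these values are illustrative
    # Multinomial event model: documents as word-count vectors.
    count_vec = CountVectorizer(max_features=vocab_size)
    multinomial = MultinomialNB().fit(count_vec.fit_transform(train_docs), train_labels)

    # Multi-variate Bernoulli event model: documents as binary presence vectors.
    binary_vec = CountVectorizer(max_features=vocab_size, binary=True)
    bernoulli = BernoulliNB().fit(binary_vec.fit_transform(train_docs), train_labels)

    print(vocab_size,
          multinomial.predict(count_vec.transform(test_docs)),
          bernoulli.predict(binary_vec.transform(test_docs)))
```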
Text Classification from Labeled and Unlabeled Documents using EM
TLDR
This paper shows that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents, and presents two extensions to the algorithm that improve classification accuracy under these conditions.
Automating the Construction of Internet Portals with Machine Learning
TLDR
New research in reinforcement learning, information extraction, and text classification that enables efficient spidering, the identification of informative text segments, and the population of topic hierarchies is described.
Analyzing the effectiveness and applicability of co-training
TLDR
It is demonstrated that, when learning from labeled and unlabeled data, algorithms that explicitly leverage a natural independent split of the features outperform algorithms that do not, and that when no natural split exists, manufacturing a feature split may still out-perform using no split at all.
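A minimal co-training sketch, assuming a toy dataset whose columns split naturally into two views; the classifiers, data, and round count are illustrative, not the paper's setup.

```python
# Co-training sketch: one classifier per feature view; each round, every view
# pseudo-labels the unlabeled example it is most confident about.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(40, 4))        # toy count features
y = np.array([0, 1] * 20)                   # toy labels (an assumption)
labeled = list(range(6))                    # small labeled pool
unlabeled = list(range(6, 40))              # large unlabeled pool
view_a, view_b = X[:, :2], X[:, 2:]         # an assumed natural feature split

for _ in range(5):  # co-training rounds (count is an assumption)
    clf_a = MultinomialNB().fit(view_a[labeled], y[labeled])
    clf_b = MultinomialNB().fit(view_b[labeled], y[labeled])
    for clf, view in ((clf_a, view_a), (clf_b, view_b)):
        probs = clf.predict_proba(view[unlabeled])
        best = int(np.argmax(probs.max(axis=1)))            # most confident example
        idx = unlabeled.pop(best)
        y[idx] = int(clf.classes_[probs[best].argmax()])    # pseudo-label it...
        labeled.append(idx)                                 # ...for the other view too

print(len(labeled), "labeled examples after co-training")
```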
Efficient clustering of high-dimensional data sets with application to reference matching
TLDR
This work presents a new technique for clustering large datasets, using a cheap, approximate distance measure to efficiently divide the data into overlapping subsets the authors call canopies, and presents experimental results on grouping bibliographic citations from the reference sections of research papers.
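A minimal sketch of the canopy idea, assuming Euclidean distance as the cheap measure and made-up thresholds; in the paper the canopies then scope an expensive clustering step, which is omitted here.

```python
# Canopy construction: a cheap distance divides points into overlapping
# "canopies"; an expensive clusterer would then run only within each canopy.
import numpy as np

def canopies(points, t_loose=3.0, t_tight=1.0):
    """Greedy canopy construction. Points within t_loose of a center join its
    canopy (and may join several); points within t_tight are removed from the
    candidate pool. Requires t_tight <= t_loose."""
    remaining = set(range(len(points)))
    result = []
    while remaining:
        center = remaining.pop()  # arbitrary next canopy center
        dists = np.linalg.norm(points - points[center], axis=1)
        canopy = {i for i in remaining if dists[i] <= t_loose} | {center}
        remaining -= {i for i in canopy if dists[i] <= t_tight}
        result.append(canopy)
    return result

pts = np.array([[0, 0], [0.5, 0], [4, 4], [4.2, 4.1], [10, 10]])
print(canopies(pts))  # overlapping index sets, one per canopy
```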
Using Maximum Entropy for Text Classification
TLDR
This paper uses maximum entropy techniques for text classification, estimating the conditional distribution of the class variable given the document; comparing accuracy to naive Bayes shows that maximum entropy is sometimes significantly better, but also sometimes worse.
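A minimal sketch, assuming the common reading of maximum entropy classification as multinomial logistic regression over word features; the corpus and scikit-learn setup are illustrative, not the paper's implementation.

```python
# Maximum entropy text classification, fit here as logistic regression:
# P(class | doc) is proportional to exp(weighted sum of word features).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["stock market rises", "team wins the game", "market crash fears", "game goes to overtime"]
labels = ["finance", "sports", "finance", "sports"]

vec = CountVectorizer()
X = vec.fit_transform(docs)
maxent = LogisticRegression(max_iter=1000).fit(X, labels)  # default regularization assumed
print(maxent.predict(vec.transform(["market overtime report"])))
```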
Learning to Extract Symbolic Knowledge from the World Wide Web
TLDR
The goal of the research described here is to automatically create a computer-understandable world-wide knowledge base whose content mirrors that of the World Wide Web, and several machine learning algorithms for this task are described.
Employing EM and Pool-Based Active Learning for Text Classification
This paper shows how a text classifier's need for labeled training documents can be reduced by taking advantage of a large pool of unlabeled documents. We modify the Query-by-Committee (QBC) method …
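A hedged sketch of pool-based Query-by-Committee; the paper's variant combines QBC with EM over the unlabeled pool, which is omitted here, and the committee construction below (bootstrap resamples) is an assumption.

```python
# Pool-based QBC sketch: a committee votes on every unlabeled document, and
# the document with the greatest vote disagreement is queried for a label.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(1)
X = rng.integers(0, 3, size=(30, 5))       # toy documents as count vectors
y = np.array([0, 1] * 15)                  # toy labels (an assumption)
labeled = list(range(8))
pool = list(range(8, 30))

committee = []
for _ in range(3):  # committee size is an assumption
    sample = rng.choice(labeled, size=len(labeled), replace=True)
    sample = np.append(sample, [0, 1])     # keep both classes in every resample
    committee.append(MultinomialNB().fit(X[sample], y[sample]))

votes = np.array([m.predict(X[pool]) for m in committee])  # members x pool
p1 = votes.mean(axis=0)                     # fraction of committee voting class 1
disagreement = np.minimum(p1, 1 - p1)       # 0 when unanimous, max when split
query = pool[int(np.argmax(disagreement))]  # next document to label
print("query document:", query)
```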
Learning to construct knowledge bases from the World Wide Web
TLDR
The goal of the research described here is to automatically create a computer-understandable knowledge base whose content mirrors that of the World Wide Web; several machine learning algorithms for this task are described, along with promising initial results from a prototype system that has created a knowledge base describing university people, courses, and research projects.
Learning to Classify Text from Labeled and Unlabeled Documents
TLDR
It is shown that the accuracy of text classifiers trained with a small number of labeled documents can be improved by augmenting this small training set with a large pool of unlabeled documents, and an algorithm is introduced based on the combination of Expectation-Maximization with a naive Bayes classifier.
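A minimal sketch of the EM-plus-naive-Bayes combination, assuming scikit-learn and a toy corpus; the weighting scheme below (duplicating each unlabeled document once per class, weighted by its posterior) is one straightforward rendering of the E/M steps, not the paper's exact code.

```python
# EM with naive Bayes: fit on labeled docs, then iterate an E-step that
# probabilistically labels the unlabeled pool and an M-step that refits on all.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

labeled_docs = ["good film", "bad movie", "great acting", "terrible plot"]
labels = np.array([1, 0, 1, 0])            # toy sentiment labels
unlabeled_docs = ["great film", "bad plot", "good acting", "terrible movie"]

vec = CountVectorizer().fit(labeled_docs + unlabeled_docs)
Xl = vec.transform(labeled_docs).toarray()
Xu = vec.transform(unlabeled_docs).toarray()

nb = MultinomialNB().fit(Xl, labels)       # initial classifier from labeled data
for _ in range(5):                         # EM iterations (count is an assumption)
    probs = nb.predict_proba(Xu)           # E-step: posterior P(class | doc)
    # M-step: refit with unlabeled docs weighted by their class posteriors.
    X_all = np.vstack([Xl, Xu, Xu])
    y_all = np.concatenate([labels, np.zeros(len(Xu)), np.ones(len(Xu))])
    w_all = np.concatenate([np.ones(len(labels)), probs[:, 0], probs[:, 1]])
    nb = MultinomialNB().fit(X_all, y_all, sample_weight=w_all)

print(nb.predict(vec.transform(["great movie"])))
```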