Employing EM and Pool-Based Active Learning for Text Classification


This paper shows how a text classifier’s need for labeled training documents can be reduced by taking advantage of a large pool of unlabeled documents. We modify the Query-by-Committee (QBC) method of active learning to use the unlabeled pool for explicitly estimating document density when selecting examples for labeling. Active learning is then combined with Expectation-Maximization (EM) to “fill in” the class labels of the documents that remain unlabeled. Experimental results show that the improved active learning requires less than two-thirds as many labeled training examples as previous QBC approaches, and that the combination of EM and active learning requires only slightly more than half as many labeled training examples to achieve the same accuracy as either the improved active learning or EM alone.
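The density-weighted selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that committee disagreement is measured as the mean KL divergence of each member's class distribution from the committee average, and that a per-document density estimate is supplied by the caller. The function names `kl_to_mean` and `select_query` are hypothetical.

```python
import numpy as np


def kl_to_mean(committee_probs):
    """Mean KL divergence of each member's prediction from the committee mean.

    committee_probs: array of shape (members, docs, classes); each
    class distribution is assumed to sum to 1.
    Returns one disagreement score per document, shape (docs,).
    """
    eps = 1e-12  # avoid log(0) for zero-probability classes
    p = np.clip(np.asarray(committee_probs, dtype=float), eps, 1.0)
    mean = p.mean(axis=0, keepdims=True)
    kl = np.sum(p * (np.log(p) - np.log(mean)), axis=2)  # (members, docs)
    return kl.mean(axis=0)


def select_query(committee_probs, density):
    """Index of the pool document with the highest density-weighted disagreement.

    density: per-document density estimates (assumed given, shape (docs,)).
    """
    score = kl_to_mean(committee_probs) * np.asarray(density, dtype=float)
    return int(np.argmax(score))


# Two committee members, three pool documents, two classes.
probs = np.array([
    [[0.9, 0.1], [0.7, 0.3], [0.8, 0.2]],
    [[0.9, 0.1], [0.3, 0.7], [0.2, 0.8]],
])
uniform = select_query(probs, [1.0, 1.0, 1.0])
weighted = select_query(probs, [1.0, 1.0, 0.1])
```

With uniform density the sketch picks the document the committee disagrees on most (index 2). When that document lies in a sparse region (density 0.1), the choice shifts to index 1, which is the behavior the density weighting is meant to produce.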

Semantic Scholar estimates that this publication has 855 citations based on the available data.

Cite this paper

@inproceedings{McCallum1998EmployingEA,
  title={Employing EM and Pool-Based Active Learning for Text Classification},
  author={Andrew McCallum and Kamal Nigam},
  booktitle={ICML},
  year={1998}
}