Corpus ID: 246634534

Improving Probabilistic Models in Text Classification via Active Learning

@article{Bosley2022ImprovingPM,
  title={Improving Probabilistic Models in Text Classification via Active Learning},
  author={M. J. Bosley and Saki Kuzushima and Ted Enamorado and Y. Shiraito},
  journal={ArXiv},
  year={2022},
  volume={abs/2202.02629}
}
When using text data, social scientists often classify documents in order to use the resulting document labels as an outcome or predictor. Since it is prohibitively costly to label a large number of documents manually, automated text classification has become a standard tool. However, current approaches for text classification do not take advantage of all the data at one’s disposal. We propose a fast new model for text classification that combines information from both labeled and unlabeled… 
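The core active-learning step the abstract alludes to — fitting a probabilistic classifier on the labeled pool and then asking a human to label the documents the model is least certain about — can be illustrated with a minimal sketch. This is not the authors' model; it is a generic uncertainty-sampling loop over a hand-rolled multinomial Naive Bayes, with hypothetical toy labels ("econ", "security") and toy documents.

```python
import math
from collections import Counter

def train_nb(docs, labels, vocab, alpha=1.0):
    """Fit a multinomial Naive Bayes with Laplace smoothing.

    docs: list of token lists; labels: parallel list of class labels.
    """
    classes = sorted(set(labels))
    prior = {c: labels.count(c) / len(labels) for c in classes}
    counts = {c: Counter() for c in classes}
    for d, y in zip(docs, labels):
        counts[y].update(d)
    loglik = {}
    for c in classes:
        total = sum(counts[c][w] for w in vocab) + alpha * len(vocab)
        loglik[c] = {w: math.log((counts[c][w] + alpha) / total) for w in vocab}
    return classes, prior, loglik

def posterior(doc, classes, prior, loglik):
    """Class posterior p(c | doc), computed in log space for stability."""
    scores = {c: math.log(prior[c]) + sum(loglik[c].get(w, 0.0) for w in doc)
              for c in classes}
    m = max(scores.values())
    exps = {c: math.exp(s - m) for c, s in scores.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}

def most_uncertain(unlabeled, classes, prior, loglik):
    """Entropy-based uncertainty sampling: index of the doc to label next."""
    def entropy(doc):
        p = posterior(doc, classes, prior, loglik)
        return -sum(v * math.log(v) for v in p.values() if v > 0)
    return max(range(len(unlabeled)), key=lambda i: entropy(unlabeled[i]))

# Toy usage (hypothetical labels and documents):
labeled_docs = [["tax", "budget", "spending"], ["war", "army", "troops"]]
labels = ["econ", "security"]
unlabeled = [["tax", "tax", "budget"], ["vote"]]
vocab = {w for d in labeled_docs + unlabeled for w in d}
classes, prior, loglik = train_nb(labeled_docs, labels, vocab)
# "vote" gives the model no signal either way, so it is the most
# uncertain document and would be sent to a human coder next.
pick = most_uncertain(unlabeled, classes, prior, loglik)
```

In each iteration of a full active-learning loop, the selected document would be labeled by a coder, moved into the labeled pool, and the classifier retrained.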
1 Citation

Aktif Öğrenme Yöntemi Kullanarak Nesne Tespiti [Object Detection Using Active Learning]

The experiments show that labeling a smaller amount of data within the active learning framework achieved almost the same level of success as labeling and using all the data, leading to lower labeling costs.

References

Showing 1–10 of 49 references

Text Classification from Labeled and Unlabeled Documents using EM

This paper shows that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents, and presents two extensions to the algorithm that improve classification accuracy under these conditions.
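The augmentation this reference describes — letting unlabeled documents sharpen a classifier trained on a few labeled ones — is typically done with EM: soft-label the unlabeled pool with the current model, re-estimate parameters from the combined (fractional) counts, and repeat. Below is a minimal self-contained sketch of that idea for multinomial Naive Bayes, with hypothetical toy classes and documents; it is an illustration of the EM pattern, not the paper's exact algorithm.

```python
import math
from collections import defaultdict

def em_nb(labeled, labels, unlabeled, vocab, classes, alpha=1.0, iters=5):
    """Semi-supervised multinomial Naive Bayes via EM (sketch).

    Labeled docs keep hard (0/1) class responsibilities; unlabeled docs
    start uniform and are re-estimated each E-step.
    """
    resp = [{c: 1.0 if c == y else 0.0 for c in classes} for y in labels]
    resp += [{c: 1.0 / len(classes) for c in classes} for _ in unlabeled]
    docs = labeled + unlabeled
    for _ in range(iters):
        # M-step: priors and smoothed word log-probabilities from soft counts.
        prior = {c: (sum(r[c] for r in resp) + alpha)
                    / (len(docs) + alpha * len(classes)) for c in classes}
        counts = {c: defaultdict(float) for c in classes}
        for d, r in zip(docs, resp):
            for w in d:
                for c in classes:
                    counts[c][w] += r[c]
        logp = {}
        for c in classes:
            total = sum(counts[c].values()) + alpha * len(vocab)
            logp[c] = {w: math.log((counts[c][w] + alpha) / total)
                       for w in vocab}
        # E-step: new responsibilities for the unlabeled docs only.
        for i, d in enumerate(unlabeled, start=len(labeled)):
            scores = {c: math.log(prior[c]) + sum(logp[c][w] for w in d)
                      for c in classes}
            m = max(scores.values())
            exps = {c: math.exp(s - m) for c, s in scores.items()}
            z = sum(exps.values())
            resp[i] = {c: e / z for c, e in exps.items()}
    return prior, logp

# Toy usage (hypothetical data): two labeled docs, two unlabeled docs.
labeled = [["tax", "budget"], ["war", "troops"]]
labels = ["econ", "sec"]
unlabeled = [["tax", "spending"], ["war", "army"]]
vocab = {w for d in labeled + unlabeled for w in d}
prior, logp = em_nb(labeled, labels, unlabeled, vocab, ["econ", "sec"])
```

Here the unlabeled documents pull the estimated word distributions toward the classes they most resemble, which is exactly the accuracy gain the reference reports.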

Active Learning Approaches for Labeling Text: Review and Assessment of the Performance of Active Learning Approaches

This paper introduces active learning, a framework in which data to be labeled by human coders are not chosen at random but rather targeted in such a way that the required amount of data to train a machine learning model can be minimized.

Large-scale text categorization by batch mode active learning

A novel active learning algorithm that selects a batch of text documents for manual labeling in each iteration, using the Fisher information matrix as the measure of model uncertainty and choosing the set of documents that effectively maximizes the Fisher information of the classification model.

A Method of Automated Nonparametric Content Analysis for Social Science

This work develops a method that gives approximately unbiased estimates of category proportions even when the optimal classifier performs poorly, and illustrates with diverse data sets, including the daily expressed opinions of thousands of people about the U.S. presidency.

A sequential algorithm for training text classifiers

An algorithm for sequential sampling during machine learning of statistical classifiers was developed and tested on a newswire text categorization task, reducing by as much as 500-fold the amount of training data that would have to be manually classified to achieve a given level of effectiveness.

Keyword Assisted Topic Models

It is empirically demonstrated that providing topic models with a small number of keywords can substantially improve their performance; the proposed keyword assisted topic model (keyATM) provides more interpretable results, has better document classification performance, and is less sensitive to the number of topics than standard topic models.

Theory of Disagreement-Based Active Learning

This work describes recent advances in the theoretical understanding of the benefits of active learning and their implications for the design of effective active learning algorithms.

A Mixture of Experts Classifier with Learning Based on Both Labelled and Unlabelled Data

A classifier structure and learning algorithm that make effective use of unlabelled data to improve performance; the structure is a "mixture of experts" that is equivalent to the radial basis function (RBF) classifier but, unlike RBFs, is amenable to likelihood-based training.

Stopping Active Learning Based on Predicted Change of F Measure for Text Classification

A new stopping method called Predicted Change of F Measure is introduced that attempts to provide users with an estimate of how much the model's performance is changing at each iteration.

Text Preprocessing For Unsupervised Learning: Why It Matters, When It Misleads, And What To Do About It

To aid researchers working in unsupervised settings, a statistical procedure and software are introduced that examine the sensitivity of findings under alternate preprocessing regimes and characterize the variability that changes in preprocessing choices may induce when analyzing a particular dataset.