# An Improved Hierarchical Bayesian Model of Language for Document Classification

```bibtex
@inproceedings{Allison2008AnIH,
  title     = {An Improved Hierarchical Bayesian Model of Language for Document Classification},
  author    = {Ben Allison},
  booktitle = {COLING},
  year      = {2008}
}
```

This paper addresses the fundamental problem of document classification, focusing on problems where the classes are mutually exclusive. We advocate an approximate sampling distribution for word counts in documents, and demonstrate the model's capacity to outperform both the simple multinomial and more recently proposed extensions on the classification task. We also compare the classifiers to a linear SVM, and show that provided certain…
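The abstract contrasts the simple multinomial with hierarchical Bayesian extensions of it. As a hedged illustration only (the paper's exact sampling distribution may differ), the sketch below implements one common hierarchical extension, the Dirichlet-compound-multinomial (Pólya) classifier, in which per-class Dirichlet parameters are integrated out to give a closed-form document likelihood. The function names, the scale parameter `s`, and the crude moment-style fit are all assumptions made for this sketch, not details from the paper:

```python
import math
from collections import Counter

def dcm_loglik(counts, alpha):
    """Log-likelihood of a bag-of-words count vector under the
    Dirichlet-compound-multinomial (Polya) distribution:
    log P(x | alpha) = lgamma(A) - lgamma(A + n)
                     + sum_w [lgamma(alpha_w + x_w) - lgamma(alpha_w)]
    where A = sum(alpha) and n = sum(x)."""
    A = sum(alpha.values())
    n = sum(counts.values())
    ll = math.lgamma(A) - math.lgamma(A + n)
    for w, c in counts.items():
        a = alpha.get(w, 1e-9)  # out-of-vocabulary words get a tiny pseudo-mass
        ll += math.lgamma(a + c) - math.lgamma(a)
    return ll

def train_dcm(docs_by_class, s=0.1, prior=0.01):
    """Crude moment-style fit (an assumption of this sketch): set alpha_w
    proportional to the class's smoothed word totals, scaled by s.
    Smaller s concentrates the Dirichlet less, modelling burstier words."""
    vocab = {w for docs in docs_by_class.values() for d in docs for w in d}
    model = {}
    for cls, docs in docs_by_class.items():
        totals = Counter()
        for d in docs:
            totals.update(d)
        model[cls] = {w: s * (totals[w] + prior) for w in vocab}
    return model

def classify(counts, model):
    """Pick the class maximising the DCM likelihood
    (uniform class priors assumed for simplicity)."""
    return max(model, key=lambda cls: dcm_loglik(counts, model[cls]))

# Toy usage: two tiny classes of bag-of-words documents.
model = train_dcm({
    "spam": [Counter("buy cheap pills now".split()),
             Counter("cheap pills buy".split())],
    "ham":  [Counter("meeting agenda notes".split()),
             Counter("notes for the meeting".split())],
})
```

Because repeated occurrences of a word share the same `lgamma(a + c) - lgamma(a)` term rather than multiplying an independent probability each time, the DCM penalises repeated words less than the plain multinomial does, which is the usual motivation for such hierarchical extensions.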

## 6 Citations

A Technique for Improving the Performance of Naive Bayes Text Classification

- Computer Science, WISM
- 2011

This paper introduces a conditional probability that takes into account information from both the whole corpus and each category. The method performs well on standard benchmark collections, competing with state-of-the-art text classifiers, especially on skewed data.

Gamma-Poisson Distribution Model for Text Categorization

- Computer Science
- 2013

A new model for describing word frequency distributions in documents for automatic text classification, based on the gamma-Poisson probability distribution, is introduced. The results show that the model achieves performance comparable to the support vector machine and clearly exceeds the multinomial and Dirichlet-multinomial models.

Classifying Documents with Poisson Mixtures

- Computer Science
- 2014

The results show that the performance of the generative probabilistic text classifiers built with the Poisson distribution is much better than that of the standard multinomial naive Bayes classifier if the normalization of document length is appropriately taken into account.

A Parallel Algorithm for Bayesian Text Classification Based on Noise Elimination and Dimension Reduction in Spark Computing Environment

- Computer Science, CLOUD
- 2019

An improved Bayesian algorithm, INBCS, is proposed for text classification in the Spark computing environment. It improves the naive Bayesian algorithm based on a polynomial model, and obtains higher accuracy and efficiency than some current improvements and implementations of naive Bayesian algorithms in the Spark ML library.

Generating Exact- and Ranked Partially-Matched Answers to Questions in Advertisements

- Computer Science, Proc. VLDB Endow.
- 2011

A QA system for ads, called CQAds, which allows users to post a natural-language question Q to retrieve relevant ads. It is equipped with a Boolean model to evaluate Boolean operators that are either explicitly or implicitly specified in Q.

## References

Showing 1–10 of 22 references

A re-examination of text categorization methods

- Computer Science, SIGIR '99
- 1999

The results show that SVM, kNN and LLSF significantly outperform NNet and NB when the number of positive training instances per category is small, and that all the methods perform comparably when the categories have over 300 instances.

Improving Multiclass Text Classification with the Support Vector Machine

- Computer Science
- 2001

A new indicator of binary performance is developed to show that the SVM's lower multiclass error is a result of its improved binary performance. The paper also presents the surprising result that one-vs-all classification performs favorably compared to other approaches even though it has no error-correcting properties.

A comparison of event models for naive bayes text classification

- Computer Science, AAAI 1998
- 1998

It is found that the multi-variate Bernoulli model performs well with small vocabulary sizes, but that the multinomial model usually performs even better at larger vocabulary sizes, providing on average a 27% reduction in error over the multi-variate Bernoulli model at any vocabulary size.

Document Classification by Machine: Theory and Practice

- Mathematics, Computer Science, COLING
- 1994

A mathematical model of classification schemes is described, along with the one scheme that can be proved optimal among all those based on word frequencies; an experiment illustrates the efficacy of this classification method.

Tackling the Poor Assumptions of Naive Bayes Text Classifiers

- Computer Science, ICML
- 2003

This paper proposes simple, heuristic solutions to some of the problems with Naive Bayes classifiers, addressing both systemic issues as well as problems that arise because text is not actually generated according to a multinomial model.

Parametric Models of Linguistic Count Data

- Computer Science, ACL
- 2003

This work proposes using zero-inflated models for dealing with occurrence counts of words in documents, and evaluates competing models on a Naive Bayes text classification task.

Text Categorization with Support Vector Machines: Learning with Many Relevant Features

- Computer Science, ECML
- 1998

This paper explores the use of Support Vector Machines (SVMs) for learning text classifiers from examples. It analyzes the particular properties of learning with text data and identifies why SVMs are…

Authorship Attribution of E-Mail: Comparing Classifiers over a New Corpus for Evaluation

- Computer Science, LREC
- 2008

A newly created subcorpus of the Enron emails is described, which is suggested as a test bed for authorship-attribution techniques; three different classification methods are applied to this task to establish baseline results.

The Enron Corpus: A New Dataset for Email Classification Research

- Computer Science, ECML
- 2004

The Enron corpus is introduced as a new test bed for email folder prediction, and the baseline results of a state-of-the-art classifier (Support Vector Machines) are provided under various conditions.

The Beta-Binomial Mixture Model and Its Application to TDT Tracking and Detection

- Computer Science
- 1999

A continuous-mixture statistical model for word occurrence frequencies in documents is described, along with its application to the TDT topic identification tasks and the Detection Task.