An Improved Hierarchical Bayesian Model of Language for Document Classification

  • Ben Allison
  • Published in COLING, 18 August 2008
  • Computer Science
This paper addresses the fundamental problem of document classification, and we focus attention on classification problems where the classes are mutually exclusive. In the course of the paper we advocate an approximate sampling distribution for word counts in documents, and demonstrate the model's capacity to outperform both the simple multinomial and more recently proposed extensions on the classification task. We also compare the classifiers to a linear SVM, and show that provided certain… 


A Technique for Improving the Performance of Naive Bayes Text Classification
This paper introduces a conditional probability that takes into account information from both the whole corpus and each category, and that performs well on the standard benchmark collections, competing with state-of-the-art text classifiers, especially on skewed data.
Gamma-Poisson Distribution Model for Text Categorization
A new model for describing word frequency distributions in documents for automatic text classification tasks using the gamma-Poisson probability distribution is introduced, and the results show that the model allows performance comparable to that of the support vector machine and clearly exceeding the multinomial model and the Dirichlet-multinomial models.
Classifying Documents with Poisson Mixtures
The results show that the performance of the generative probabilistic text classifiers built with the Poisson distribution is much better than that of the standard multinomial naive Bayes classifier if the normalization of document length is appropriately taken into account.
A Parallel Algorithm for Bayesian Text Classification Based on Noise Elimination and Dimension Reduction in Spark Computing Environment
An improved Bayesian algorithm, INBCS, is proposed for text classification in the Spark computing environment. It improves the Naive Bayesian algorithm based on a polynomial model, and can obtain higher accuracy and efficiency than some current improvements and implementations of Naive Bayesian algorithms in the Spark ML library.
Generating Exact- and Ranked Partially-Matched Answers to Questions in Advertisements
CQAds, a QA system for ads, allows users to post a natural-language question Q to retrieve relevant ads, and is equipped with a Boolean model to evaluate Boolean operators that are either explicitly or implicitly specified in Q.


A re-examination of text categorization methods
The results show that SVM, kNN and LLSF significantly outperform NNet and NB when the number of positive training instances per category is small, and that all the methods perform comparably when the categories have over 300 instances.
Improving Multiclass Text Classification with the Support Vector Machine
A new indicator of binary performance is developed to show that the SVM’s lower multiclass error is a result of its improved binary performance and the surprising result that one-vs-all classification performs favorably compared to other approaches even though it has no error-correcting properties.
A comparison of event models for naive bayes text classification
It is found that the multi-variate Bernoulli performs well with small vocabulary sizes, but that the multinomial usually performs even better at larger vocabulary sizes, providing on average a 27% reduction in error over the multi-variate Bernoulli model at any vocabulary size.
Document Classification by Machine: Theory and Practice
A mathematical model of classification schemes is described, along with the one scheme that can be proved optimal among all those based on word frequencies, and an experiment illustrates the efficacy of this classification method.
Tackling the Poor Assumptions of Naive Bayes Text Classifiers
This paper proposes simple, heuristic solutions to some of the problems with Naive Bayes classifiers, addressing both systemic issues as well as problems that arise because text is not actually generated according to a multinomial model.
Parametric Models of Linguistic Count Data
This work proposes using zero-inflated models for dealing with occurrence counts of words in documents, and evaluates competing models on a Naive Bayes text classification task.
Text Categorization with Support Vector Machines: Learning with Many Relevant Features
This paper explores the use of Support Vector Machines (SVMs) for learning text classifiers from examples. It analyzes the particular properties of learning with text data and identifies why SVMs are
Authorship Attribution of E-Mail: Comparing Classifiers over a New Corpus for Evaluation
A newly created subcorpus of the Enron emails is described and suggested as a test set for authorship attribution techniques, and three different classification methods are applied to this task to present baseline results.
The Enron Corpus: A New Dataset for Email Classification Research
The Enron corpus is introduced as a new test bed for email folder prediction, and the baseline results of a state-of-the-art classifier (Support Vector Machines) are provided under various conditions.
The Beta-Binomial Mixture Model and Its Application to TDT Tracking and Detection
A continuous-mixture statistical model for word occurrence frequencies in documents is described, together with its application to the TDT topic identification tasks and to the Detection Task.