An Improved Random Forest Classifier for Text Categorization

Baoxun Xu, Xiufeng Guo, Yunming Ye, Jiefeng Cheng. J. Comput.
This paper proposes an improved random forest algorithm for classifying text data. The algorithm is designed for very high-dimensional, multi-class data, of which text corpora are a well-known example. A novel feature weighting method and a tree selection method are developed and combined to make the random forest framework well suited to categorizing text documents with dozens of topics. With the new feature weighting method for subspace sampling…

A Semantics Aware Random Forest for Text Classification

SARF extracts the features used by trees to generate predictions and selects the subset of predictions for which those features are relevant to the predicted classes. Its classification performance is evaluated on real-world text datasets and compared with state-of-the-art ensemble selection methods.

An ensemble based NLP feature assessment in binary classification

This paper examines the impact of NLP features (stop words, stemming, and the combination of both) on the predictive performance of base classifiers and ensembles of the Naive Bayesian category, and finds that the ensemble outperforms the base classifier on the entire NLP categorical dataset.

Standard measure and SVM measure for feature selection and their performance effect for text classification

It is confirmed that feature selection based on the SVM-score proposed by Sakai and Hirokawa (2012) outperforms the standard measures when only a small number of features is used.

Multilayer classification of web pages using random forest and semi-supervised latent dirichlet allocation

This paper proposes a new methodology for a multilayer soft classification based on the connection between the semi-supervised Latent Dirichlet Allocation (LDA) and the Random Forest classifier.

Text Classification Algorithms: A Survey

An overview of text classification algorithms is given, covering text feature extraction, dimensionality reduction methods, existing algorithms and techniques, and evaluation methods.

Adapting Sequence Alignments for Text Classification

This paper addresses text classification with a novel method based on sequence alignment and simple fuzzy concepts, and shows the expected performance relative to other conventional natural-language classification methods.

A modified multi-class association rule for text mining

Experimental results indicate that the proposed Association Classifier, mMCAR, produced high accuracy with a smaller number of classification rules, contributing to the text mining domain as automatic classification of huge and widely distributed textual data could facilitate the text representation and retrieval processes.

Weighted Random Forest Algorithm Based on Bayesian Algorithm

The main idea underlying the proposed model is to replace the supermajority voting of random forests with weighted voting, using the Bayesian formula to dynamically update the weight of each tree, so that stronger classifiers have more voting power and the overall classification performance improves.
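The idea of accuracy-weighted voting described above can be illustrated with a minimal Python sketch. The update rule here is an assumption made for illustration, not the formula from the cited paper: each tree's weight is the posterior mean of a Beta-Bernoulli model of its accuracy, starting from a uniform Beta(1, 1) prior. The class name `WeightedVoter` and the toy labels are hypothetical.

```python
from collections import defaultdict


class WeightedVoter:
    """Weighted voting over tree predictions (illustrative, not the paper's rule)."""

    def __init__(self, n_trees):
        # Assumed Beta(1, 1) prior per tree: one pseudo-hit and one pseudo-miss.
        self.hits = [1] * n_trees
        self.misses = [1] * n_trees

    def weight(self, i):
        # Posterior mean accuracy of tree i under the Beta-Bernoulli model.
        return self.hits[i] / (self.hits[i] + self.misses[i])

    def predict(self, tree_votes):
        # Sum the weights of the trees voting for each label; return the top label.
        scores = defaultdict(float)
        for i, label in enumerate(tree_votes):
            scores[label] += self.weight(i)
        return max(scores, key=scores.get)

    def update(self, tree_votes, true_label):
        # Bayesian update: count each tree's vote as a hit or a miss.
        for i, label in enumerate(tree_votes):
            if label == true_label:
                self.hits[i] += 1
            else:
                self.misses[i] += 1


voter = WeightedVoter(n_trees=3)
# Toy history: tree 0 is always right, trees 1 and 2 are always wrong.
for _ in range(5):
    voter.update(["sports", "politics", "politics"], true_label="sports")

# A plain majority vote would pick "politics"; the weighted vote follows tree 0.
prediction = voter.predict(["sports", "politics", "politics"])
```

In this toy run, the single reliable tree ends up outvoting the two unreliable ones, which is the behavior the summary attributes to weighted voting.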

Automatic categorization of web text documents using fuzzy inference rule

A fuzzy rule inference system is presented, which works with newly proposed statistical features for segregating documents that belong to more than one or an undefined category and gets better results than those of reported works, thereby pointing to the language independence of the system.

Comparative Study between Traditional Machine Learning and Deep Learning Approaches for Text Classification

This work builds multiple classifiers for document classification, compares their accuracy on raw and processed data, and also explores a hierarchical classifier for classes and subclasses.



Feature selection for text classification with Naïve Bayes

A re-examination of text categorization methods

The results show that SVM, kNN and LLSF significantly outperform NNet and NB when the number of positive training instances per category is small, and that all the methods perform comparably when the categories have over 300 instances.

Centroid-Based Document Classification: Analysis and Experimental Results

The authors' experiments show that this centroid-based classifier consistently and substantially outperforms other algorithms such as Naive Bayesian, k-nearest-neighbors, and C4.5, on a wide range of datasets.
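A centroid-based classifier of the general kind mentioned above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: documents are assumed to be term-weight vectors, each class is represented by the mean of its length-normalized training vectors, and a new document takes the label of the most cosine-similar centroid. The function names and toy data are hypothetical.

```python
import numpy as np


def fit_centroids(X, y):
    """Return the sorted class labels and one centroid per class."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Normalize each document to unit length so long documents do not dominate.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    labels = np.unique(y)
    centroids = np.array([X[y == c].mean(axis=0) for c in labels])
    return labels, centroids


def predict(x, labels, centroids):
    """Assign x to the class whose centroid has the highest cosine similarity."""
    x = np.asarray(x, dtype=float)
    sims = centroids @ x / (np.linalg.norm(centroids, axis=1) * np.linalg.norm(x))
    return labels[int(np.argmax(sims))]


# Toy term-count vectors over a hypothetical three-word vocabulary.
X_train = [[3, 0, 1], [2, 1, 0], [0, 3, 1], [1, 2, 0]]
y_train = ["sports", "sports", "finance", "finance"]

labels, centroids = fit_centroids(X_train, y_train)
predicted = predict([4, 1, 0], labels, centroids)
```

Training costs one pass over the data and prediction costs one dot product per class, which is part of why this family of methods is attractive for large text collections.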

Text Categorization with Support Vector Machines: Learning with Many Relevant Features

This paper explores the use of Support Vector Machines (SVMs) for learning text classifiers from examples. It analyzes the particular properties of learning with text data and identifies why SVMs are

Expert network: effective and efficient learning from human decisions in text categorization and retrieval

The simplicity of the model, the high recall-precision rates, and the efficient computation together make ExpNet preferable as a practical solution for real-world applications.

A comparison of event models for naive bayes text classification

It is found that the multi-variate Bernoulli model performs well with small vocabulary sizes, but that the multinomial model usually performs even better at larger vocabulary sizes, providing on average a 27% reduction in error over the multi-variate Bernoulli model at any vocabulary size.

Automatic Turkish Text Categorization in Terms of Author, Genre and Gender

A first comprehensive text classification study using an n-gram model has been carried out for Turkish, automatically identifying a document's author, classifying documents by genre, and identifying the author's gender.

Enriched random forests

This work proposes a novel, yet simple, adjustment that has demonstrably superior performance: choose the eligible subsets at each node by weighted random sampling instead of simple random sampling, with the weights tilted in favor of the informative features.
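The weighted subspace sampling described above can be sketched as follows. This is a minimal illustration, assuming the per-feature weights are already available; how they are computed is not specified here, and the values below are made up. The helper name `sample_feature_subspace` is hypothetical.

```python
import numpy as np


def sample_feature_subspace(weights, subspace_size, rng):
    """Draw distinct feature indices with probability proportional to weight."""
    weights = np.asarray(weights, dtype=float)
    probs = weights / weights.sum()
    # replace=False: a feature appears at most once in a node's candidate subset.
    return rng.choice(len(weights), size=subspace_size, replace=False, p=probs)


rng = np.random.default_rng(0)
# Hypothetical informativeness scores: features 0 and 1 are the informative ones.
weights = [5.0, 5.0, 0.1, 0.1, 0.1, 0.1]

# Count how often each feature is offered as a split candidate over many nodes.
counts = np.zeros(len(weights), dtype=int)
for _ in range(2000):
    counts[sample_feature_subspace(weights, 2, rng)] += 1
```

With simple random sampling every feature would be offered about equally often. Here the informative features dominate the candidate subsets, while the uninformative ones still keep a small chance of selection.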

Some Issues in the Automatic Classification of U.S. Patents (Working Notes for the AAAI-98 Workshop on Learning for Text Categorization)

This work uses both k-nearest-neighbor classifiers and Bayesian classifiers to derive a vector of terms and phrases from the most important parts of the patent to represent each document.