An Improved Random Forest Classifier for Text Categorization

@article{Xu2012AnIR,
  title={An Improved Random Forest Classifier for Text Categorization},
  author={Baoxun Xu and Xiufeng Guo and Yunming Ye and Jiefeng Cheng},
  journal={J. Comput.},
  year={2012},
  volume={7},
  pages={2913--2920}
}
This paper proposes an improved random forest algorithm for classifying text data. The algorithm is designed for very high-dimensional data with multiple classes, of which text corpora are a well-known example. A novel feature weighting method and a tree selection method are developed and combined to make the random forest framework well suited to categorizing text documents with dozens of topics. With the new feature weighting method for subspace sampling… 
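The abstract is cut off before the weighting and selection methods are described, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes each feature already has a non-negative informativeness score and each tree already has a quality score; the names `sample_feature_subspace` and `select_top_trees`, and the scores themselves, are hypothetical.

```python
import numpy as np


def sample_feature_subspace(weights, subspace_size, rng):
    """Draw distinct feature indices with probability proportional to weight.

    `weights` are assumed (hypothetical) per-feature informativeness scores;
    how they are computed is not specified in the abstract.
    """
    weights = np.asarray(weights, dtype=float)
    if np.any(weights < 0):
        raise ValueError("feature weights must be non-negative")
    if subspace_size > np.count_nonzero(weights):
        raise ValueError("not enough features with non-zero weight")
    probs = weights / weights.sum()
    # Weighted sampling without replacement: informative features are
    # more likely to enter the subspace, zero-weight features never do.
    return rng.choice(len(weights), size=subspace_size, replace=False, p=probs)


def select_top_trees(tree_scores, keep):
    """Return the indices of the `keep` highest-scoring trees.

    `tree_scores` is an assumed per-tree quality measure (for example an
    out-of-bag accuracy); the paper's actual criterion may differ.
    """
    order = np.argsort(np.asarray(tree_scores, dtype=float))[::-1]
    return order[:keep]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feature_weights = [0.0, 5.0, 1.0, 0.0, 3.0]  # made-up scores
    print(sample_feature_subspace(feature_weights, 2, rng))
    print(select_top_trees([0.6, 0.9, 0.7], 2))
```

In a full forest, a subspace like this would be drawn when growing each tree (or at each node), and only the selected trees would vote at prediction time.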
A Semantics Aware Random Forest for Text Classification
TLDR
SARF extracts the features used by trees to generate the predictions and selects a subset of the predictions for which the features are relevant to the predicted classes. Its classification performance is evaluated on real-world text datasets, and its competitiveness is assessed against state-of-the-art ensemble selection methods.
An ensemble based NLP feature assessment in binary classification
TLDR
This paper examines the impact of NLP features (stop words, stemming, and the combination of both) on the predictive performance of base classifiers and ensembles of the Naive Bayesian category, and finds that the ensemble performs better than the base classifier on the entire NLP categorical dataset.
Standard measure and SVM measure for feature selection and their performance effect for text classification
TLDR
It is confirmed that feature selection based on the SVM-score proposed by Sakai and Hirokawa (2012) outperforms the standard measures when the number of features is small.
Multilayer classification of web pages using random forest and semi-supervised latent dirichlet allocation
TLDR
This paper proposes a new methodology for a multilayer soft classification based on the connection between the semi-supervised Latent Dirichlet Allocation (LDA) and the Random Forest classifier.
Text Classification Algorithms: A Survey
TLDR
An overview of text classification algorithms is given, covering different text feature extraction techniques, dimensionality reduction methods, existing algorithms and techniques, and evaluation methods.
Adapting Sequence Alignments for Text Classification
TLDR
This paper addresses text classification with a novel classification method based on sequence alignment with simple fuzzy concepts, and shows the expected performance compared with other conventional classification methods for natural languages.
A modified multi-class association rule for text mining
TLDR
Experimental results indicate that the proposed Association Classifier, mMCAR, produced high accuracy with a smaller number of classification rules, contributing to the text mining domain as automatic classification of huge and widely distributed textual data could facilitate the text representation and retrieval processes.
Weighted Random Forest Algorithm Based on Bayesian Algorithm
The random forest (RF) algorithm is a very efficient and excellent ensemble classification algorithm. In this paper, we improve the random forest algorithm and propose an algorithm called ‘Bayesian
Automatic categorization of web text documents using fuzzy inference rule
TLDR
A fuzzy rule inference system is presented, which works with newly proposed statistical features for segregating documents that belong to more than one or an undefined category and gets better results than those of reported works, thereby pointing to the language independence of the system.
Comparative Study between Traditional Machine Learning and Deep Learning Approaches for Text Classification
TLDR
This work creates multiple classifiers for document classification and compares their accuracy on raw and processed data; it also explores a hierarchical classifier for classifying classes and subclasses.

References

Feature selection for text classification with Naïve Bayes
TLDR
Two feature evaluation metrics for the Naive Bayesian classifier applied on multi-class text datasets are presented: Multi-class Odds Ratio (MOR), and Class Discriminating Measure (CDM).
A re-examination of text categorization methods
TLDR
The results show that SVM, kNN and LLSF significantly outperform NNet and NB when the number of positive training instances per category is small, and that all the methods perform comparably when the categories have over 300 instances.
Centroid-Based Document Classification: Analysis and Experimental Results
TLDR
The authors' experiments show that this centroid-based classifier consistently and substantially outperforms other algorithms such as Naive Bayesian, k-nearest-neighbors, and C4.5, on a wide range of datasets.
Neighbor-weighted K-nearest neighbor for unbalanced text corpus
TLDR
The neighbor-weighted K-nearest neighbor algorithm, i.e. NWKNN, is proposed, which achieves significant classification performance improvement on imbalanced corpora.
Text Categorization with Support Vector Machines: Learning with Many Relevant Features
This paper explores the use of Support Vector Machines (SVMs) for learning text classifiers from examples. It analyzes the particular properties of learning with text data and identifies why SVMs are
Expert network: effective and efficient learning from human decisions in text categorization and retrieval
TLDR
The simplicity of the model, the high recall-precision rates, and the efficient computation together make ExpNet preferable as a practical solution for real-world applications.
A comparison of event models for naive bayes text classification
TLDR
It is found that the multi-variate Bernoulli performs well with small vocabulary sizes, but that the multinomial usually performs even better at larger vocabulary sizes, providing on average a 27% reduction in error over the multi-variate Bernoulli model at any vocabulary size.
Automatic Turkish Text Categorization in Terms of Author, Genre and Gender
TLDR
A first comprehensive text classification using an n-gram model has been realized for Turkish, automatically identifying a Turkish document's author, classifying documents by genre, and identifying the author's gender.
Enriched random forests
TLDR
This work proposes a novel, yet simple, adjustment that has demonstrably superior performance: choose the eligible subsets at each node by weighted random sampling instead of simple random sampling, with the weights tilted in favor of the informative features.
Some Issues in the Automatic Classification of U.S. Patents Working Notes for the AAAI-98 Workshop on Learning for Text Categorization
TLDR
This work uses both k-nearest-neighbor classifiers and Bayesian classifiers to derive a vector of terms and phrases from the most important parts of the patent to represent each document.