• Corpus ID: 809191

An Extensive Empirical Study of Feature Selection Metrics for Text Classification

  • George Forman
  • Published 1 March 2003
  • Computer Science
  • J. Mach. Learn. Res.
Machine learning for text classification is the cornerstone of document categorization, news filtering, document routing, and personalization. In text domains, effective feature selection is essential to make the learning task efficient and more accurate. This paper presents an empirical comparison of twelve feature selection methods (e.g. Information Gain) evaluated on a benchmark of 229 text classification problem instances that were gathered from Reuters, TREC, OHSUMED, etc. The results are… 
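Information Gain, the metric named above as an example, scores a term by how much observing its presence or absence reduces uncertainty about the class. A minimal sketch for the binary-class, binary-feature case (the contingency-table count names `tp`, `fp`, `fn`, `tn` are illustrative, not from the paper):

```python
import math

def entropy(p):
    """Binary entropy in bits; defined as 0 at the endpoints."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def information_gain(tp, fp, fn, tn):
    """Information gain of a binary term for a binary class, from its
    document-count contingency table:
      tp: positive docs containing the term
      fp: negative docs containing the term
      fn: positive docs lacking the term
      tn: negative docs lacking the term
    """
    n = tp + fp + fn + tn
    p_pos = (tp + fn) / n        # class prior
    p_term = (tp + fp) / n       # P(term present)
    h_term = entropy(tp / (tp + fp)) if tp + fp else 0.0
    h_no_term = entropy(fn / (fn + tn)) if fn + tn else 0.0
    return entropy(p_pos) - p_term * h_term - (1 - p_term) * h_no_term

# A term concentrated in positive documents is informative;
# a term independent of the class scores zero.
print(information_gain(tp=90, fp=10, fn=10, tn=90))   # high
print(information_gain(tp=50, fp=50, fn=50, tn=50))   # ~0
```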
Choose Your Words Carefully: An Empirical Study of Feature Selection Metrics for Text Classification
This study benchmarks the performance of twelve feature selection metrics across 229 text classification problems drawn from Reuters, OHSUMED, TREC, etc., using Support Vector Machines, and reveals an outstanding new feature selection metric, "Bi-Normal Separation" (BNS).
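The paper defines Bi-Normal Separation as |F⁻¹(tpr) − F⁻¹(fpr)|, where F⁻¹ is the inverse cumulative distribution function of the standard normal and tpr/fpr are the term's true- and false-positive rates. A minimal sketch using Python's `statistics.NormalDist` (the clipping constant `eps` is an implementation choice to avoid infinite values, not a detail from the paper):

```python
from statistics import NormalDist

_norm = NormalDist()  # standard normal distribution

def bns(tp, fp, pos, neg, eps=0.0005):
    """Bi-Normal Separation of a term: |F^-1(tpr) - F^-1(fpr)|,
    where F^-1 is the inverse standard-normal CDF.
      tp, fp: positive/negative docs containing the term
      pos, neg: total positive/negative docs
    Rates are clipped into (eps, 1 - eps) so F^-1 stays finite."""
    tpr = min(max(tp / pos, eps), 1 - eps)
    fpr = min(max(fp / neg, eps), 1 - eps)
    return abs(_norm.inv_cdf(tpr) - _norm.inv_cdf(fpr))

# A term whose occurrence rate differs sharply between classes
# scores high; equal rates score zero.
print(bns(tp=90, fp=10, pos=100, neg=100))
print(bns(tp=50, fp=50, pos=100, neg=100))
```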
Class-dependent feature selection algorithm for text categorization
A common approach in text categorization is to represent each word as a feature; however, many of these features are irrelevant, so dimensionality reduction is an important step to diminish the…
Clustering based feature selection using Extreme Learning Machines for text classification
A new clustering based feature selection technique that reduces the feature size is proposed that is carried out on 20-Newsgroups and DMOZ datasets to demonstrate the efficiency of the approach using ELM and ML-ELM as the classifiers over the state-of-the-art classifiers.
Feature Selection with Maximum Information Metric in Text Categorization
A novel feature selection approach for dealing with text categorization, called Maximum Information Metric (MIM), is proposed to get good quality terms of documents, which exploits the weight of term and document frequency to construct the correlation between a term and each class.
Comparison of text feature selection policies and using an adaptive framework
A keyword selection framework called adaptive keyword selection is proposed based on selecting different number of terms for each class and it shows significant improvement on skewed datasets that have a limited number of training instances for some of the classes.
Using Typical Testors for Feature Selection in Text Categorization
A feature selection method based on Testor Theory that takes into account inter-feature relationships is proposed, which consistently outperformed information gain for both classifiers and both data collections, especially when aggressive feature selection is carried out.
Given the massive amount of data available, text categorization is an important issue. With the help of a previously organized set of documents and classes, data can be classified automatically.
Avoidance of Model Re-Induction in SVM-Based Feature Selection for Text Categorization
This work proposes alternatives to exact re-induction of SVM models during the search for the optimum feature subset and demonstrates that no significant compromises in terms of model quality are made and, moreover, in some cases gains in accuracy can be achieved.
A New Text Categorization Technique Using Distributional Clustering and Learning Logic
A new text categorization method is presented that combines the distributional clustering of words and a learning logic technique, called Lsquare, for constructing text classifiers, that achieves higher or comparable classification accuracy and F1 results compared with SVM on exact experimental settings with a small number of training documents.
Feature Selection for Text Categorisation
The classical supervised methods had the best performance, including Chi Square, Information Gain and Mutual Information, and the Chi Square variant GSS coefficient was also among the top performers.
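For reference, the Chi Square statistic mentioned above can be computed directly from a term/class contingency table; a minimal sketch (the count names are illustrative, matching the usual 2x2 layout):

```python
def chi_square(tp, fp, fn, tn):
    """Chi-square statistic of term/class independence for a binary
    class, from the document-count contingency table:
      tp: positive docs with the term    fp: negative docs with the term
      fn: positive docs without it       tn: negative docs without it"""
    n = tp + fp + fn + tn
    num = n * (tp * tn - fp * fn) ** 2
    den = (tp + fp) * (fn + tn) * (tp + fn) * (fp + tn)
    return num / den if den else 0.0

# Stronger term/class association yields a larger statistic;
# independence yields zero.
print(chi_square(90, 10, 10, 90))
print(chi_square(50, 50, 50, 50))
```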


A Comparative Study on Feature Selection in Text Categorization
This paper finds strong correlations between the DF, IG and CHI values of a term, and suggests that DF thresholding, the simplest method with the lowest computational cost, can be reliably used instead of IG or CHI when the computation of these measures is too expensive.
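Document-frequency (DF) thresholding simply keeps terms that appear in at least a minimum number of training documents. A minimal sketch of this baseline (the whitespace tokenizer and the threshold value are illustrative simplifications):

```python
from collections import Counter

def df_threshold(docs, min_df=2):
    """Keep terms whose document frequency (number of documents
    containing the term at least once) is >= min_df."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))  # count each term once per doc
    return {t for t, c in df.items() if c >= min_df}

docs = ["the cat sat", "the dog sat", "a cat ran"]
print(sorted(df_threshold(docs, min_df=2)))  # ['cat', 'sat', 'the']
```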
A re-examination of text categorization methods
The results show that SVM, kNN and LLSF significantly outperform NNet and NB when the number of positive training instances per category is small, and that all the methods perform comparably when each category has over 300 training instances.
Feature Selection for Unbalanced Class Distribution and Naive Bayes
This paper describes an approach to feature subset selection that takes into account problem specifics and learning algorithm characteristics, and shows that considering domain and algorithm characteristics significantly improves the results of classification.
A comparison of event models for naive bayes text classification
It is found that the multi-variate Bernoulli model performs well with small vocabulary sizes, but that the multinomial model usually performs even better at larger vocabulary sizes, providing on average a 27% reduction in error over the multi-variate Bernoulli model at any vocabulary size.
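The difference between the two event models lies in how a document's likelihood is scored: the multi-variate Bernoulli sums over every vocabulary term, present or absent, while the multinomial sums only over occurring terms, weighted by their counts. A minimal sketch of the two log-likelihoods (assuming per-class term probabilities `p_term` are already estimated; smoothing and priors are omitted, and the names are illustrative):

```python
import math

def bernoulli_loglik(doc_terms, vocab, p_term):
    """Multi-variate Bernoulli event model: every vocabulary term
    contributes, via p if present in the document, 1 - p if absent."""
    return sum(math.log(p_term[t]) if t in doc_terms
               else math.log(1.0 - p_term[t])
               for t in vocab)

def multinomial_loglik(doc_counts, p_term):
    """Multinomial event model: only occurring terms contribute,
    weighted by their within-document counts."""
    return sum(c * math.log(p_term[t]) for t, c in doc_counts.items())

# Illustrative per-class term probabilities:
p = {"ball": 0.9, "game": 0.1}
print(bernoulli_loglik({"ball"}, {"ball", "game"}, p))  # absence of "game" counts
print(multinomial_loglik({"ball": 3}, p))               # repeats of "ball" count
```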
Inductive learning algorithms and representations for text categorization
The effectiveness of five different automatic learning algorithms for text categorization is compared in terms of learning speed, real-time classification speed, and classification accuracy.
Text Categorization with Support Vector Machines: Learning with Many Relevant Features
This paper explores the use of Support Vector Machines (SVMs) for learning text classifiers from examples. It analyzes the particular properties of learning with text data and identifies why SVMs are…
Wrappers for Feature Subset Selection
The wrapper method searches for an optimal feature subset tailored to a particular algorithm and domain, and compares the wrapper approach to induction without feature subset selection and to Relief, a filter approach to feature subset selection.
Centroid-Based Document Classification: Analysis and Experimental Results
The authors' experiments show that this centroid-based classifier consistently and substantially outperforms other algorithms such as Naive Bayesian, k-nearest-neighbors, and C4.5, on a wide range of datasets.
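A centroid-based classifier averages the normalized term-frequency vectors of each class's training documents and assigns a new document to the class whose centroid is most similar under cosine similarity. A minimal sketch with hypothetical toy data (whitespace tokenization is a simplification):

```python
import math
from collections import Counter, defaultdict

def unit(vec):
    """L2-normalize a sparse term-weight dict."""
    n = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / n for t, v in vec.items()} if n else dict(vec)

def train_centroids(labeled_docs):
    """Average the normalized term-frequency vectors per class."""
    sums, counts = defaultdict(Counter), Counter()
    for text, label in labeled_docs:
        sums[label].update(unit(Counter(text.lower().split())))
        counts[label] += 1
    return {c: unit({t: v / counts[c] for t, v in s.items()})
            for c, s in sums.items()}

def classify(text, centroids):
    """Pick the class whose centroid has the highest cosine similarity."""
    q = unit(Counter(text.lower().split()))
    return max(centroids,
               key=lambda c: sum(q.get(t, 0.0) * w
                                 for t, w in centroids[c].items()))

docs = [("stock market rally", "finance"),
        ("market trading stocks", "finance"),
        ("football match goal", "sport"),
        ("goal scored in the match", "sport")]
cents = train_centroids(docs)
print(classify("the market rally continues", cents))  # finance
```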
Benchmarking attribute selection techniques for data mining
The inclusion of irrelevant, redundant and noisy attributes in the model-building phase can result in poor predictive performance and increased uncertainty in the development of data mining applications.
What is the best index of detectability?
Various indices which have been proposed as measures of detectability (for unequal-variance normal distributions of signal and nonsignal) are discussed. It is argued that the best measure is an…