
An Extensive Empirical Study of Feature Selection Metrics for Text Classification

  • George Forman
  • Published 1 March 2003
  • Computer Science
  • J. Mach. Learn. Res.
Machine learning for text classification is the cornerstone of document categorization, news filtering, document routing, and personalization. In text domains, effective feature selection is essential to make the learning task efficient and more accurate. This paper presents an empirical comparison of twelve feature selection methods (e.g. Information Gain) evaluated on a benchmark of 229 text classification problem instances that were gathered from Reuters, TREC, OHSUMED, etc. The results are… 
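Information Gain, one of the twelve metrics compared in the paper, can be computed for a term from its 2x2 term/class contingency counts. A minimal sketch (the function names and example counts below are illustrative, not from the paper):

```python
from math import log2

def entropy(p):
    # Binary entropy; taken as 0 at the endpoints.
    return 0.0 if p in (0.0, 1.0) else -(p * log2(p) + (1 - p) * log2(1 - p))

def info_gain(tp, fp, fn, tn):
    """Information gain of a term for a binary class, from its
    2x2 contingency counts: tp/fp = positive/negative documents
    containing the term, fn/tn = those without it."""
    n = tp + fp + fn + tn
    pos = tp + fn                     # positive documents overall
    p_term = (tp + fp) / n            # P(term present)
    h_class = entropy(pos / n)
    h_present = entropy(tp / (tp + fp)) if tp + fp else 0.0
    h_absent = entropy(fn / (fn + tn)) if fn + tn else 0.0
    return h_class - (p_term * h_present + (1 - p_term) * h_absent)

# A perfectly predictive term removes all class uncertainty (IG = 1 bit here)
ig = info_gain(50, 0, 0, 50)
```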

Choose Your Words Carefully: An Empirical Study of Feature Selection Metrics for Text Classification

This study benchmarks the performance of twelve feature selection metrics across 229 text classification problems drawn from Reuters, OHSUMED, TREC, etc., using Support Vector Machines, and reveals an outstanding new feature selection metric, "Bi-Normal Separation" (BNS).
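BNS scores a term as the absolute difference between the inverse standard normal CDF applied to its true-positive rate and to its false-positive rate. A minimal sketch using Python's `statistics.NormalDist`; the clipping epsilon and the example counts are illustrative choices, not values from the paper:

```python
from statistics import NormalDist

def bns(tp, fp, pos, neg, eps=0.0005):
    """Bi-Normal Separation: |F^-1(tpr) - F^-1(fpr)|, where F^-1 is
    the inverse standard normal CDF. Rates are clipped away from
    0 and 1 because F^-1 is undefined at the extremes."""
    tpr = min(max(tp / pos, eps), 1 - eps)
    fpr = min(max(fp / neg, eps), 1 - eps)
    inv = NormalDist().inv_cdf
    return abs(inv(tpr) - inv(fpr))

# A term in 80 of 100 positive docs but only 5 of 900 negatives
# scores much higher than one in just 10 of the positives.
score = bns(80, 5, 100, 900)
```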

Class-dependent feature selection algorithm for text categorization

A common approach in text categorization is to represent each word as a feature; however, many of these features are irrelevant, so dimensionality reduction is an important step to diminish the size of the feature space.

Clustering based feature selection using Extreme Learning Machines for text classification

A new clustering-based feature selection technique that reduces the feature set size is proposed; experiments on the 20-Newsgroups and DMOZ datasets demonstrate the efficiency of the approach using ELM and ML-ELM as classifiers compared with state-of-the-art classifiers.

Comparison of text feature selection policies and using an adaptive framework

A novel feature selection technique for enhancing performance of unbalanced text classification problem

A Modified Chi-Square (ModCHI) based feature selection technique is proposed for enhancing the classification performance on multi-labeled text documents with unbalanced class distributions, evaluated on the unbalanced Reuters dataset.
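ModCHI modifies the standard Chi-Square statistic; the unmodified baseline it starts from can be sketched directly from a term's 2x2 contingency counts (the paper's specific modification is not reproduced here):

```python
def chi_square(tp, fp, fn, tn):
    """Standard chi-square term/class association statistic from the
    2x2 contingency counts: tp/fp = positive/negative documents
    containing the term, fn/tn = those without it."""
    n = tp + fp + fn + tn
    num = n * (tp * tn - fp * fn) ** 2
    den = (tp + fp) * (fn + tn) * (tp + fn) * (fp + tn)
    return num / den if den else 0.0

# An uninformative term scores 0; a perfectly predictive one scores n.
uninformative = chi_square(25, 25, 25, 25)
predictive = chi_square(50, 0, 0, 50)
```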

Using Typical Testors for Feature Selection in Text Categorization

A feature selection method based on Testor Theory that takes into account inter-feature relationships is proposed, which consistently outperformed information gain for both classifiers and both data collections, especially when aggressive feature selection is carried out.

Avoidance of Model Re-Induction in SVM-Based Feature Selection for Text Categorization

This work proposes alternatives to exact re-induction of SVM models during the search for the optimum feature subset and demonstrates that no significant compromises in terms of model quality are made and, moreover, in some cases gains in accuracy can be achieved.

A New Text Categorization Technique Using Distributional Clustering and Learning Logic

A new text categorization method is presented that combines distributional clustering of words with a learning logic technique, called Lsquare, to construct text classifiers; it achieves higher or comparable classification accuracy and F1 results compared with SVM under identical experimental settings with a small number of training documents.

Feature Selection for Text Categorisation

The classical supervised methods had the best performance, including Chi Square, Information Gain and Mutual Information, and the Chi Square variant GSS coefficient was also among the top performers.

An evaluation of existing and new feature selection metrics in text categorization

  • S. Tasci, T. Gungor
  • Computer Science, Economics
    2008 23rd International Symposium on Computer and Information Sciences
  • 2008
An extensive evaluation of the feature selection metrics used in text categorization under both local and global policies; the proposed new metrics show high success rates, especially on datasets with a low number of keywords.

A Comparative Study on Feature Selection in Text Categorization

This paper finds strong correlations between the DF, IG, and CHI values of a term and suggests that DF thresholding, the simplest method with the lowest computational cost, can be reliably used instead of IG or CHI when the computation of these measures is too expensive.
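DF thresholding is straightforward to sketch: keep only terms whose document frequency meets a cutoff. The function name, cutoff value, and toy corpus below are illustrative:

```python
from collections import Counter

def df_threshold(docs, min_df=2):
    """Document-frequency thresholding: keep terms occurring in at
    least min_df documents (each doc is an iterable of tokens)."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))   # count each term once per document
    return {t for t, c in df.items() if c >= min_df}

docs = [["cat", "dog"], ["cat", "fish"], ["cat", "dog", "bird"]]
# "cat" appears in 3 docs, "dog" in 2, the rest in 1 each
vocab = df_threshold(docs, min_df=2)
```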

A re-examination of text categorization methods

The results show that SVM, kNN, and LLSF significantly outperform NNet and NB when the number of positive training instances per category is small, and that all the methods perform comparably when the categories have over 300 instances.

Feature Selection for Unbalanced Class Distribution and Naive Bayes

This paper describes an approach to feature subset selection that takes into account problem specifics and learning algorithm characteristics, and shows that considering domain and algorithm characteristics significantly improves the results of classification.

A comparison of event models for naive bayes text classification

It is found that the multi-variate Bernoulli model performs well with small vocabulary sizes, but that the multinomial model usually performs even better at larger vocabulary sizes, providing on average a 27% reduction in error over the multi-variate Bernoulli model at any vocabulary size.
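The two event models differ in how a document's likelihood is computed: the multi-variate Bernoulli multiplies one probability per vocabulary word (present or absent), while the multinomial multiplies probabilities only for the words that occur, once per occurrence. A schematic comparison using toy, hand-picked word probabilities (not a trained model):

```python
from math import log

vocab = ["ball", "game", "election", "vote"]
# Hypothetical per-class P(word | class) estimates for illustration
p_word = {"sports":   {"ball": 0.4, "game": 0.4, "election": 0.05, "vote": 0.05},
          "politics": {"ball": 0.05, "game": 0.05, "election": 0.4, "vote": 0.4}}

def bernoulli_loglik(doc, cls):
    # Every vocabulary word contributes: present -> p, absent -> 1 - p.
    present = set(doc)
    return sum(log(p_word[cls][w]) if w in present else log(1 - p_word[cls][w])
               for w in vocab)

def multinomial_loglik(doc, cls):
    # Only occurring words contribute, once per occurrence.
    return sum(log(p_word[cls][w]) for w in doc)

doc = ["ball", "game", "game"]   # note the repeated word counts twice
```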

Inductive learning algorithms and representations for text categorization

The effectiveness of five different automatic learning algorithms for text categorization is compared in terms of learning speed, real-time classification speed, and classification accuracy.

Text Categorization with Support Vector Machines: Learning with Many Relevant Features

This paper explores the use of Support Vector Machines (SVMs) for learning text classifiers from examples. It analyzes the particular properties of learning with text data and identifies why SVMs are appropriate for this task.

Wrappers for Feature Subset Selection

Centroid-Based Document Classification: Analysis and Experimental Results

This paper focuses on a simple linear-time centroid-based document classification algorithm that, despite its simplicity and robust performance, has not been extensively studied and analyzed. The authors' experiments show that this centroid-based classifier consistently and substantially outperforms other algorithms such as Naive Bayesian, k-nearest-neighbors, and C4.5 on a wide range of datasets.
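The centroid method can be sketched as: average each class's document vectors into a centroid, then assign a new document to the class whose centroid is most cosine-similar. The vectors and class labels below are toy data for illustration:

```python
from math import sqrt

def centroid(vectors):
    """Component-wise mean of equal-length term-weight vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = sqrt(sum(x * x for x in a)), sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def classify(doc_vec, centroids):
    # Assign the class whose centroid is most similar to the document.
    return max(centroids, key=lambda c: cosine(doc_vec, centroids[c]))

# Toy term-frequency vectors over a 3-term vocabulary
train = {"sports": [[2, 0, 0], [3, 1, 0]], "politics": [[0, 2, 3], [0, 1, 2]]}
cents = {c: centroid(vs) for c, vs in train.items()}
label = classify([1, 0, 0], cents)
```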

Benchmarking attribute selection techniques for data mining

The inclusion of irrelevant, redundant, and noisy attributes in the model-building phase can result in poor predictive performance and increased uncertainty in the development of data mining applications.