Machine learning in automated text categorization

Fabrizio Sebastiani
The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last 10 years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories… 

A novel method for the automatic induction of rule-based text classifiers using a hypothesis language of the form "if T_1, … or T_n occurs in document d, and none of T_{n+1}, …, T_{n+m} occurs in d, then classify d under category c," where each T_i is a conjunction of terms.
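
As a rough illustration of the hypothesis language quoted above, the following Python sketch evaluates one such rule against a document. It shows only how a rule of this form fires, not how the paper induces rules. The names `conjunction_occurs` and `rule_fires`, the representation of each conjunction as a frozenset of terms, and the example terms are all assumptions made for illustration.

```python
from typing import FrozenSet, List, Set

# Assumption: a conjunction T_i is modelled as a frozenset of terms,
# and it "occurs" in a document when all of its terms are present.
Conjunction = FrozenSet[str]


def conjunction_occurs(conj: Conjunction, doc_terms: Set[str]) -> bool:
    """Return True when every term of the conjunction appears in the document."""
    return conj <= doc_terms


def rule_fires(positive: List[Conjunction],
               negative: List[Conjunction],
               doc_terms: Set[str]) -> bool:
    """Return True when at least one positive conjunction (T_1..T_n) occurs
    and none of the negative conjunctions (T_{n+1}..T_{n+m}) occurs."""
    has_positive = any(conjunction_occurs(c, doc_terms) for c in positive)
    has_negative = any(conjunction_occurs(c, doc_terms) for c in negative)
    return has_positive and not has_negative


# Hypothetical rule for a category "grain".
positive = [frozenset({"wheat"}), frozenset({"grain", "export"})]
negative = [frozenset({"soft", "drink"})]

doc_a = {"wheat", "prices", "rose"}
doc_b = {"wheat", "soft", "drink"}
doc_c = {"grain", "prices"}

result_a = rule_fires(positive, negative, doc_a)
result_b = rule_fires(positive, negative, doc_b)
result_c = rule_fires(positive, negative, doc_c)
```

Here `doc_a` satisfies the rule, `doc_b` is blocked by the negative conjunction, and `doc_c` matches no positive conjunction because "export" is missing.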

Automated Text Categorization with Machine Learning and its Application in Multilingual Text Categorization

The Naïve Bayes, Rocchio and kNN methods within the machine learning paradigm are discussed for the automated categorization of documents into predefined categories and for multilingual text categorization, which consists of classifying documents in different languages according to the same classification tree.
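
To make the idea of a learned classifier concrete, here is a minimal Python sketch of a simplified Rocchio-style classifier. It builds one centroid per category from labelled documents and assigns a new document to the category with the most similar centroid. This is a simplification: it uses raw term frequencies and positive prototypes only, with no negative-example term and no term weighting such as tf-idf. The function names and the toy training data are assumptions for illustration and are not taken from the paper summarized above.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def tf_vector(text: str) -> Counter:
    """Bag-of-words term-frequency vector from whitespace tokens."""
    return Counter(text.lower().split())


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def train_centroids(labelled_docs: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, float]]:
    """Average the term-frequency vectors of the documents in each category."""
    sums: Dict[str, Counter] = defaultdict(Counter)
    counts: Counter = Counter()
    for text, label in labelled_docs:
        sums[label].update(tf_vector(text))
        counts[label] += 1
    return {
        label: {term: weight / counts[label] for term, weight in vec.items()}
        for label, vec in sums.items()
    }


def classify_by_centroid(text: str, centroids: Dict[str, Dict[str, float]]) -> str:
    """Return the category whose centroid is most similar to the document."""
    vec = tf_vector(text)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Hypothetical preclassified training documents.
training = [
    ("wheat corn harvest", "grain"),
    ("corn prices wheat", "grain"),
    ("bank interest rates", "money"),
    ("interest loan bank", "money"),
]
centroids = train_centroids(training)
label_grain = classify_by_centroid("wheat harvest report", centroids)
label_money = classify_by_centroid("bank loan", centroids)
```

A multilingual variant in the spirit of the summary above would train the same procedure on documents in several languages sharing one category set; that step is not shown here.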

A Survey on Machine Learning Based Text Categorization

The aim here is to work on text document classification, comparing and constructing various available classifiers against a few benchmarks such as time complexity and performance.

Automated text document categorization

  • R. Yasotha, E. Charles
  • Computer Science
  • 2015 IEEE Seventh International Conference on Intelligent Computing and Information Systems (ICICIS)
  • 2015
The experimental results show that the proposed approach is effective for classifying text documents and is applicable to a domain with a large number of categories at multiple levels.

Text Representation for Automatic Text Categorization

Today’s learning-based ATC systems are able to reach near-human performance; the basic idea of the model is to induce an automatic classification function by learning the properties of categories from manually labelled documents, instead of codifying classification rules by hand.

An Introduction to Text Classification

This essay provides an account of the most prominent features of the machine learning approach to text classification, including data preparation, attribute extraction and selection, learning algorithms and kernel methods, performance measures, and availability of training corpora.

A multi-classifier system for text categorization

  • S. Dey
  • Computer Science
  • 2011
The work presented in this paper proposes the construction of a classification model for each of the (pre-defined) categories or themes present in a corpus using a term-frequency based 'keyword' identification and document scoring technique.

Feature Subset Selection in SOM Based Text Categorization

A class of applications, automatic indexing with controlled vocabularies, that is of direct concern to organizing digital libraries is discussed, aimed at classifying scientific papers about computer science with respect to the ACM Classification Scheme.

Categorization and Machine Learning Methods: Current State of the Art, by Durga Bhavani Dasari

The paper examines the main approaches to text categorization comparing the machine learning paradigm and present state of the art as well as various issues pertaining to three different text similarity problems.

A Review of Machine Learning Algorithms for Text-Documents Classification

This paper reviews the theory and methods of document classification and text mining, focusing mainly on existing text representation and machine learning techniques.



Automated learning of decision rules for text categorization

It is shown that machine-generated decision rules appear comparable to human performance while using the identical rule-based representation; the approach is also compared with other machine-learning techniques.

A comparison of two learning algorithms for text categorization

It is shown that both algorithms achieve reasonable performance and allow controlled tradeoffs between false positives and false negatives, and the stepwise feature selection in the decision tree algorithm is particularly effective in dealing with the large feature sets common in text categorization.

Experiments on the Use of Feature Selection and Negative Evidence in Automated Text Categorization

This work proposes a novel variant, based on the exploitation of negative evidence, of the well-known k-NN method, and reports the results of systematic experimentation of these two methods performed on the standard REUTERS-21578 benchmark.

Applying an existing machine learning algorithm to text categorization

This paper describes how an existing similarity-based learning algorithm, Charade, is applied to the text categorization problem and compares the results with those obtained using decision tree construction algorithms.

N-gram-based text categorization

An N-gram-based approach to text categorization that is tolerant of textual errors is described, which worked very well for language classification and worked reasonably well for classifying articles from a number of different computer-oriented newsgroups according to subject.
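
One common way to realize an N-gram-based categorizer of this kind is to compare ranked character n-gram frequency profiles with an "out-of-place" rank distance. The Python sketch below follows that idea under stated assumptions: the padding character, the maximum n-gram length, the profile size, the function names and the toy category texts are all illustrative choices, not details confirmed by the summary above.

```python
from collections import Counter
from typing import Dict, List


def ngram_profile(text: str, n_max: int = 3, top_k: int = 300) -> List[str]:
    """Return the top_k character n-grams (n = 1..n_max), most frequent first.

    Each whitespace token is padded with '_' so that word boundaries
    contribute n-grams too. Ties are broken alphabetically for determinism.
    """
    counts: Counter = Counter()
    for token in text.lower().split():
        padded = f"_{token}_"
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    ranked = sorted(counts, key=lambda gram: (-counts[gram], gram))
    return ranked[:top_k]


def out_of_place(doc_profile: List[str], cat_profile: List[str]) -> int:
    """Sum of rank differences; n-grams missing from the category profile
    receive a fixed maximum penalty equal to the category profile length."""
    rank = {gram: r for r, gram in enumerate(cat_profile)}
    max_penalty = len(cat_profile)
    return sum(
        abs(r - rank[gram]) if gram in rank else max_penalty
        for r, gram in enumerate(doc_profile)
    )


def classify_by_ngrams(text: str, category_profiles: Dict[str, List[str]]) -> str:
    """Return the category whose profile is closest to the document profile."""
    doc_profile = ngram_profile(text)
    return min(
        category_profiles,
        key=lambda cat: out_of_place(doc_profile, category_profiles[cat]),
    )


# Hypothetical, tiny category samples; real profiles need far more text.
profiles = {
    "en": ngram_profile("the cat sat on the mat"),
    "de": ngram_profile("die katze sitzt auf der matte"),
}
predicted = classify_by_ngrams("the cat sat on the mat", profiles)
```

Because a single misspelled word changes only a few n-grams, most of the document profile is unaffected, which is the intuition behind the tolerance to textual errors mentioned above.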

Using a generalized instance set for automatic text categorization

This work proposes a new technique known as the generalized instance set (GIS) algorithm by unifying the strengths of k-NN and linear classifiers and adapting to characteristics of text categorization problems.

ACTION: automatic classification for full-text documents

The key idea of ACTION is a scheme for measuring the significance of each keyword in a given document that takes into account not only the occurrence frequency of a keyword, but also the logical relationships between the available classes.

Feature Engineering for Text Classification

More sophisticated Natural Language Processing techniques need to be developed before better text representations can be produced for classification.

Context-sensitive learning methods for text categorization

RIPPER and sleeping-experts perform extremely well across a wide variety of categorization problems, generally outperforming previously applied learning methods; the results are viewed as a confirmation of the usefulness of classifiers that represent contextual information.


The present chapter investigates various methods of Textual Feature-Finding, i.e. methods of choosing textual features or attributes that do not depend on subjective judgement; do not need knowledge sources external to the texts being analyzed; and do not assume that the word is the only possible textual unit.