Text Categorization with Support Vector Machines: Learning with Many Relevant Features

Abstract

This paper explores the use of Support Vector Machines (SVMs) for learning text classifiers from examples. It analyzes the particular properties of learning with text data and identifies why SVMs are appropriate for this task. Empirical results support the theoretical findings. SVMs achieve substantial improvements over the currently best performing methods and behave robustly over a variety of different learning tasks. Furthermore, they are fully automatic, eliminating the need for manual parameter tuning.

1 Introduction

With the rapid growth of online information, text categorization has become one of the key techniques for handling and organizing text data. Text categorization techniques are used to classify news stories, to find interesting information on the WWW, and to guide a user's search through hypertext. Since building text classifiers by hand is difficult and time-consuming, it is advantageous to learn classifiers from examples.

In this paper I will explore and identify the benefits of Support Vector Machines (SVMs) for text categorization. SVMs are a new learning method introduced by V. Vapnik et al. [9] [1]. They are well-founded in terms of computational learning theory and very open to theoretical understanding and analysis. After reviewing the standard feature vector representation of text, I will identify the particular properties of text in this representation in section 4. I will argue that SVMs are very well suited for learning in this setting. The empirical results in section 5 will support this claim. Compared to state-of-the-art methods, SVMs show substantial performance gains. Moreover, in contrast to conventional text classification methods, SVMs will prove to be very robust, eliminating the need for expensive parameter tuning.

2 Text Categorization

The goal of text categorization is the classification of documents into a fixed number of predefined categories. Each document can belong to multiple categories, exactly one, or none at all. Using machine learning, the objective is to learn classifiers
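The labeling scheme described above, where a document may fall into several categories, exactly one, or none, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the category names, document ids, and the helper `binary_targets` are hypothetical, and splitting the task into one +1/-1 target list per category is one common encoding, not something this excerpt prescribes.

```python
# Hypothetical category names and document labels, for illustration only.
CATEGORIES = ["earn", "grain", "trade"]

# Each document maps to a set of categories: several, exactly one, or none.
doc_labels = {
    "doc1": {"grain", "trade"},  # multiple categories
    "doc2": {"earn"},            # exactly one category
    "doc3": set(),               # no category at all
}


def binary_targets(doc_labels, categories):
    """Turn per-document label sets into one +1/-1 target list per category.

    Documents are ordered by sorted id so the lists line up across categories.
    """
    doc_ids = sorted(doc_labels)
    return {
        cat: [1 if cat in doc_labels[d] else -1 for d in doc_ids]
        for cat in categories
    }


targets = binary_targets(doc_labels, CATEGORIES)
for cat in CATEGORIES:
    print(cat, targets[cat])
```

With this encoding, each category yields its own list of targets, so any binary learner could be trained on one category at a time.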

DOI: 10.1007/BFb0026683

Cite this paper

@inproceedings{Joachims1998TextCW,
  title={Text Categorization with Support Vector Machines: Learning with Many Relevant Features},
  author={Thorsten Joachims},
  booktitle={ECML},
  year={1998}
}