Supporting Text Retrieval by Typographical Term Weighting

Abstract

Text documents stored in information systems usually contain more information than the plain concatenation of their words: they also carry typographic information. Because conventional text retrieval methods evaluate only word frequency, they miss what typography conveys, e.g., the importance of certain terms. To overcome this weakness, we present an approach that uses the typographic information of text documents, and we show how it improves the efficiency of text retrieval methods. Our approach weights typographic information in addition to term frequencies to separate the relevant information in text documents from noise. We evaluated the approach using automated text classification algorithms. The results show that our weighting approach achieves very competitive classification results while using at most 30% of the terms used by conventional approaches, which makes it significantly more efficient.
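
The idea of combining typographic weights with term frequencies can be illustrated with a minimal sketch. This is not the authors' implementation: the style categories, the numeric weights in `STYLE_WEIGHTS`, and the helper names `weighted_term_scores` and `select_top_terms` are all assumptions made for illustration. Only the 30% cut-off echoes the figure quoted in the abstract.

```python
from collections import defaultdict

# Hypothetical style weights, chosen for illustration only;
# the abstract does not state the values used in the paper.
STYLE_WEIGHTS = {"plain": 1.0, "italic": 1.5, "bold": 2.0, "heading": 3.0}


def weighted_term_scores(tokens):
    """Sum a style-dependent weight for every occurrence of a term.

    tokens: iterable of (term, style) pairs.
    Returns a dict mapping each lower-cased term to its weighted frequency.
    """
    scores = defaultdict(float)
    for term, style in tokens:
        # Unknown styles fall back to the plain-text weight.
        scores[term.lower()] += STYLE_WEIGHTS.get(style, 1.0)
    return dict(scores)


def select_top_terms(scores, fraction=0.3):
    """Keep the highest-scoring fraction of distinct terms (at least one)."""
    keep = max(1, int(len(scores) * fraction))
    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:keep]]


# A toy document given as (term, style) pairs.
doc = [
    ("Retrieval", "heading"), ("the", "plain"), ("retrieval", "plain"),
    ("of", "plain"), ("typography", "bold"), ("helps", "plain"),
    ("weighting", "italic"), ("terms", "plain"), ("in", "plain"),
    ("documents", "plain"),
]
scores = weighted_term_scores(doc)
top_terms = select_top_terms(scores)
```

In this toy document, a term that appears once in a heading and once in plain text outranks every term that appears only in plain text, so a classifier fed only `top_terms` would see a much smaller vocabulary that favours typographically emphasised words.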

DOI: 10.4018/jiit.2007040101

Cite this paper

@article{Werner2007SupportingTR,
  title={Supporting Text Retrieval by Typographical Term Weighting},
  author={Lars Werner and Stefan B{\"o}ttcher},
  journal={IJIIT},
  year={2007},
  volume={3},
  pages={1-16}
}