Jan O. Pedersen

This paper is a comparative study of feature selection methods in statistical learning of text categorization. The focus is on aggressive dimensionality reduction. Five methods were evaluated, including term selection based on document frequency (DF), information gain (IG), mutual information (MI), a χ² test (CHI), and term strength (TS). We found IG and CHI most effective in…
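As an illustration of two of the criteria named in the abstract above, the following is a minimal sketch of document frequency and the χ² statistic computed from a 2x2 term/category contingency table. The function names and the toy corpus are hypothetical, and the χ² formula is the standard textbook form, not necessarily the exact variant used in the paper.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each term, how many documents contain it at least once."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # set() so a term counts once per document
    return df


def chi_square(docs, labels, term, category):
    """Chi-square statistic for one (term, category) pair.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_category = label == category
        if has_term and in_category:
            a += 1
        elif has_term:
            b += 1
        elif in_category:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0  # term or category is degenerate in this sample
    return n * (a * d - c * b) ** 2 / denominator


# Toy corpus (made up for illustration only).
docs = [
    ["ball", "goal", "team"],
    ["team", "score"],
    ["stock", "market"],
    ["market", "team"],
]
labels = ["sports", "sports", "finance", "finance"]

df = document_frequency(docs)
market_finance = chi_square(docs, labels, "market", "finance")
team_sports = chi_square(docs, labels, "team", "sports")
```

In this toy corpus "team" has the highest document frequency, while "market" is the term most strongly associated with a single category, which is why the two criteria can rank terms differently.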
Web spam pages use various techniques to achieve higher-than-deserved rankings in a search engine’s results. While human experts can identify spam, it is too expensive to manually evaluate a large number of pages. Instead, we propose techniques to semi-automatically separate reputable, good pages from spam. We first select a small set of seed pages to be…
Document clustering has not been well received as an information retrieval tool. Objections to its use fall into two main categories: first, that clustering is too slow for large corpora (with running time often quadratic in the number of documents); and second, that clustering does not appreciably improve retrieval. We argue that these problems arise only…
We present Scatter/Gather, a cluster-based document browsing method, as an alternative to ranked titles for the organization and viewing of retrieval results. We systematically evaluate Scatter/Gather in this context and find significant improvements over similarity search ranking alone. This result provides evidence validating the cluster hypothesis which…
We present an implementation of a part-of-speech tagger based on a hidden Markov model. The methodology enables robust and accurate tagging with few resource requirements. Only a lexicon and some unlabeled training text are required. Accuracy exceeds 96%. We describe implementation strategies and optimizations which result in high-speed operation. Three…
This paper proposes an algorithm for word sense disambiguation based on a vector representation of word similarity derived from lexical co-occurrence. It differs from standard approaches by allowing for as fine-grained distinctions as is warranted by the information at hand, rather than supposing a fixed number of senses per word, and by allowing for more than…
This paper presents a new method for computing a thesaurus from a text corpus. Each word is represented as a vector in a multi-dimensional space that captures cooccurrence information. Words are defined to be similar if they have similar cooccurrence patterns. Two different methods for using these thesaurus vectors in information retrieval are shown to…
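The idea in the abstract above, that words with similar cooccurrence patterns are similar, can be sketched as follows. This is a minimal illustration under stated assumptions: the symmetric window, the raw counts, and the cosine measure are choices made here for clarity, and the function names and sentences are hypothetical rather than taken from the paper.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a sparse vector of counts of its neighbouring words.

    `window` (an assumed parameter) is how many positions to the left and
    right of a word count as its context.
    """
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[term] for term, count in u.items() if term in v)
    norm_u = math.sqrt(sum(count * count for count in u.values()))
    norm_v = math.sqrt(sum(count * count for count in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy sentences (made up for illustration only).
sentences = [
    ["drink", "hot", "coffee"],
    ["drink", "hot", "tea"],
    ["drive", "fast", "car"],
]

vectors = cooccurrence_vectors(sentences)
coffee_tea = cosine(vectors["coffee"], vectors["tea"])
coffee_car = cosine(vectors["coffee"], vectors["car"])
```

Here "coffee" and "tea" never appear together, yet they share the same neighbours and so score as highly similar, while "coffee" and "car" share none and score zero.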
The Scatter/Gather document browsing method uses fast document clustering to produce table-of-contents-like outlines of large document collections. Previous work [1] developed linear-time document clustering algorithms to establish the feasibility of this method over moderately large collections. However, even linear-time algorithms are too slow to support…