Learn More
We report on the results of our experiments on query evaluation in the presence of noisy data. In particular, an OCR generated database and its corresponding 99.8% correct version are used to process a set of queries to determine the effect the degraded version will have on retrieval. It is shown that with the set of scientific documents we use in our(More)
We report on the performance of the vector space model in the presence of OCR errors. We show that average precision and recall is not affected for our full text document collection when the OCR version is compared to its corresponding corrected set. We do see divergence though between the relevant document rankings of the OCR and corrected collections with(More)
We report on two types of experiments with respect to manually-assigned keywords to documents in a collection. The first type of experiment examines the usefulness of manually-assigned keywords to automatic feedback. The second type of experiment considers the potential benefits of these keywords to the user as an interactive tool. Several experiments were(More)
We give a comprehensive report on our experiments with retrieval from OCR-generated text using systems based on standard models of retrieval. More specifically, we show that average precision and recall is not affected by OCR errors across systems for several collections. The collections used in these experiments include both actual OCR-generated text and(More)
This paper describes a new automatic spelling correction program to deal with OCR generated errors. The method used here is based on three principles: 1. Approximate string matching between the misspellings and the terms occuring in the database as opposed to the entire dictionary 2. Local information obtained from the individual documents 3. The use of a(More)
In this paper we describe experiments that investigate the effects of OCR errors on text categorization. In particular, we show that in our environment, OCR errors have no effect on categorization when we use a classifier based on the naive Bayes model. We also observe that dimensionality reduction techniques eliminate a large number of OCR errors and(More)
We report on the design and implementation of a system which automates the process of capturing structured documents from the optically recognized form of printed materials. The system is intended to be used to convert printed collections into their corresponding tagged electronic versions with little or no manual interventon. This conversion process has(More)