Learn More
We report on the performance of the vector space model in the presence of OCR errors. We show that average precision and recall is not affected for our full text document collection when the OCR version is compared to its corresponding corrected set. We do see divergence though between the relevant document rankings of the OCR and corrected collections with(More)
We report on the results of our experiments on query evaluation in the presence of noisy data. In particular, an OCR generated database and its corresponding 99.8% correct version are used to process a set of queries to determine the effect the degraded version will have on retrieval. It is shown that with the set of scientific documents we use in our(More)
We report on two types of experiments with respect to manually-assigned keywords to documents in a collection. The first type of experiment examines the usefulness of manually-assigned keywords to automatic feedback. The second type of experiment considers the potential benefits of these keywords to the user as an interactive tool. Several experiments were(More)
We give a comprehensive report on our experiments with retrieval from OCR-generated text using systems based on standard models of retrieval. More specifically, we show that average precision and recall is not affected by OCR errors across systems for several collections. The collections used in these experiments include both actual OCR-generated text and(More)
Optical Character Recognition (OCR) is a critical part of many text-based applications. Although some commercial systems use the output from OCR devices to index documents without editing, there is very little quantitative data on the impact of OCR errors on the accuracy of a text retrieval system. Because of the diiculty of constructing test collections to(More)
This paper describes a new automatic spelling correction program to deal with OCR generated errors. The method used here is based on three principles: 1. Approximate string matching between the misspellings and the terms occuring in the database as opposed to the entire dictionary 2. Local information obtained from the individual documents 3. The use of a(More)