Learn More
—This paper aims to evaluate the accuracy of optical character recognition (OCR) systems on real scanned books. The ground truth e-texts are obtained from the Project Guten-berg website and aligned with their corresponding OCR output using a fast recursive text alignment scheme (RETAS). First, unique words in the vocabulary of the book are aligned with(More)
—An efficient word spotting framework is proposed to search text in scanned books. The proposed method allows one to search for words when optical character recognition (OCR) fails due to noise or for languages where there is no OCR. Given a query word image, the aim is to retrieve matching words in the book sorted by the similarity. In the offline stage,(More)
—Spectral information alone is often not sufficient to distinguish certain terrain classes such as permanent crops like orchards, vineyards, and olive groves from other types of vegetation. However, instances of these classes possess distinctive spatial structures that can be observable in detail in very high spatial resolution images. This paper proposes a(More)
A framework is presented for discovering partial duplicates in large collections of scanned books with optical character recognition (OCR) errors. Each book in the collection is represented by the sequence of words (in the order they appear in the text) which appear only once in the book. These words are referred to as "unique words" and they constitute a(More)
This article presents Ottoman Archives Explorer, a Content-Based Retrieval (CBR) system based on character recognition for printed and handwritten historical documents. Several methods for character segmentation and recognition stages are investigated. In particular, sliding-window and histogram segmentation methods are coupled with recognition approaches(More)
The main goal of existing approaches for structural texture analysis has been the identification of repeating texture primitives and their placement patterns in images containing a single type of texture. We describe a novel unsupervised method for simultaneous detection and localization of multiple structural texture areas along with estimates of their(More)
Conventional retrieval systems view documents as a unit and look at different retrieval types within a document. We introduce Proteus, a frame-work for seamlessly navigating books as dynamic collections which are defined on the fly. Proteus allows us to search various retrieval types. Navigable types include pages, books, named persons, locations, and(More)
—This paper evaluates an automated scheme for aligning and combining optical character recognition (OCR) output from three scans of a book to generate a composite version with fewer OCR errors. While there has been some previous work on aligning multiple OCR versions of the same scan, the scheme introduced in this paper does not require that scans be from(More)
  • 1