Efficient Search in Hidden Text of Large DjVu Documents

@inproceedings{Bie2009EfficientSI,
  title={Efficient Search in Hidden Text of Large DjVu Documents},
  author={Janusz S. Bień},
  booktitle={NLP4DL/AT4DL},
  year={2009}
}
The paper describes an open-source tool that makes it possible to present end users with the results of advanced language technologies. It relies on the DjVu format, which for some applications is still superior to other modern formats, including PDF/A. The GPLed DjVu tools are not limited to the DjVuLibre library, but are being supplemented by various new programs, such as pdf2djvu developed by Jakub Wilk. In particular, it makes it possible to convert to DjVu the PDF output of popular OCR programs like FineReader…
Describing Linde’s Dictionary of Polish for Digitalisation Purposes
TLDR
The attempts at digitalising the so-called Linde's dictionary of Polish, published in 6 volumes between 1807 and 1814 by Samuel Bogumił Linde, are described, and work is under way on a formal description of the dictionary's structure, whose purpose is to allow programmers to design a tool for automatic tagging of the text.
Connecting Data for Digital Libraries: The Library, the Dictionary and the Corpus
TLDR
Two experiments related to enhancing the content of a digital library with data from external repositories are presented, demonstrating how the results of automated OCR obtained with open source tools can be replaced with transcribed content from the corpus.
The IMPACT project Polish Ground-Truth texts as a Djvu corpus
TLDR
The paper describes the already implemented idea of DjVu corpora, i.e. corpora which consist of both scanned images and a transcription of the texts, with the words associated with their occurrences in the scans.
THE IMPACT PROJECT POLISH GROUND-TRUTH TEXTS
The purpose of the paper is twofold. First, to describe the already implemented idea of DjVu corpora, i.e. corpora which consist of both scanned images and a transcription of the texts with the words
Korpusomat : a Tool for Creating Searchable Morphosyntactically Tagged Corpora
TLDR
Korpusomat combines existing tools, such as morphological analyser, tagger and corpus search engine, and provides an easy-to-use environment for building corpora technically compatible with the National Corpus of Polish from almost any text, including texts in binary formats.
Digital Library 2.0: Source of Knowledge and Research Collaboration Platform
TLDR
The article presents the idea of transforming a conventional digital library into knowledge source and research collaboration platform, facilitating content augmentation, interpretation and co-operation of geographically distributed researchers representing different academic fields.
Digital Libraries at the Crossroads of Digital Information for the Future: 21st International Conference on Asia-Pacific Digital Libraries, ICADL 2019, Kuala Lumpur, Malaysia, November 4–7, 2019, Proceedings
TLDR
This paper consolidates the approach by training an all-in-one model that is able to classify even noisy characters by progressively training a classifier generative adversarial network on the characters from low to high resolution.

References

Facilitating access to digitalized dictionaries in DjVu format
TLDR
This work intends to adapt Poliqarp (Polyinterpretation Indexing Query and Retrieval Processor), a GPLed corpus query tool developed at the Institute of Computer Science of the Polish Academy of Sciences, to efficiently search the text layer in large multi-volume works.
The hOCR Microformat for OCR Workflow and Results
  • T. Breuel
  • Computer Science
    Ninth International Conference on Document Analysis and Recognition (ICDAR 2007)
  • 2007
TLDR
A new format for representing both intermediate and final OCR results is described, developed in response to the needs of a newly developed OCR system and ground truth data release, which embeds OCR information invisibly inside the HTML and CSS standards.
DjVu document browsing with on-demand loading and rendering of image components
TLDR
This work describes the image structure and software architecture that allows the DjVu system to load and render the required components on demand while minimizing the bandwidth requirements, and the memory requirements in the client.
TEI P5 as an XML Standard for Treebank Encoding
The aim of the paper is to show that a subset of Text Encoding Initiative Guidelines is a reasonable choice as a standard for stand-off XML encoding of syntactically annotated corpora. The proposed
Digitalizing dictionaries of Polish
The author has been actively involved in digitalization of several dictionaries, including the 17th century Knapski's dictionary, the 18th century dictionary of Troc, the 20th century so called
The PAGE (Page Analysis and Ground-Truth Elements) Format Framework
TLDR
PAGE is described, a new XML-based page image representation framework that records information on image characteristics (image borders, geometric distortions and corresponding corrections, binarisation etc.) in addition to layout structure and page content.
Digitalizing dictionaries of Polish Methods of Lexical Analysis: Theoretical assumption and practical applications
  • Wydawnictwo Uniwersytetu w Białymstoku
  • 2009