Enabling information retrieval on historical document collections: the role of matching procedures and special lexica
Due to the large number of spelling variants found in historical texts, standard methods of Information Retrieval (IR) fail to produce satisfactory results on historical document collections. In…
  • 44
  • 4
Orthographic Errors in Web Pages: Toward Cleaner Web Corpora
Since the Web by far represents the largest public repository of natural language texts, recent experiments, methods, and tools in the area of corpus linguistics often use the Web as a corpus. For…
  • 50
  • 3
Application of the Tightness Continuum Measure to Chinese Information Retrieval
Most word segmentation methods employed in Chinese Information Retrieval systems are based on a static dictionary or a model trained against a manually segmented corpus. These general segmentation…
  • 11
  • 3
Towards information retrieval on historical document collections: the role of matching procedures and special lexica
Due to the large number of spelling variants found in historical texts, standard methods of Information Retrieval (IR) fail to produce satisfactory results on historical document collections. In…
  • 36
  • 2
PoCoTo - an open source system for efficient interactive postcorrection of OCRed historical texts
When applied to historical texts, OCR engines often produce a non-negligible number of OCR errors. For research in the Humanities, text mining and retrieval, the option is important to improve the…
  • 22
  • 2
Genre as noise: noise in genre
Given a specific information need, documents of the wrong genre can be considered as noise. From this perspective, genre classification helps to separate relevant documents from noise. Orthographic…
  • 21
  • 2
Unsupervised profiling of OCRed historical documents
In search engines and digital libraries, more and more OCRed historical documents become available. Still, access to these texts is often not satisfactory due to two problems: first, the quality of…
  • 22
  • 2
On lexical resources for digitization of historical documents
Many European libraries are currently engaged in mass digitization projects that aim to make historical documents and corpora online available in the Internet. In this context, appropriate lexical…
  • 19
  • 2
Lexical postcorrection of OCR-results: the web as a dynamic secondary dictionary?
Postcorrection of OCR-results for text documents is usually based on electronic dictionaries. When scanning texts from a specific thematic area, conventional dictionaries often miss a considerable…
  • 33
  • 1
A corpus for comparative evaluation of OCR software and postcorrection techniques
We describe a new corpus collected for comparative evaluation of OCR-software and postcorrection techniques. The corpus is freely available for academic groups and use. The major part of the corpus…
  • 16
  • 1