Impact of Crowdsourcing OCR Improvements on Retrievability Bias

Authors: Myriam C. Traub, Thaer Samar, Jacco van Ossenbruggen, Lynda Hardman
Published in: Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries
Digitized document collections often suffer from OCR errors that may impact a document's readability and retrievability. We studied the effects of correcting OCR errors on the retrievability of documents in a historic newspaper corpus of a digital library. We computed retrievability scores for the uncorrected documents using queries from the library's search log, and found that the document OCR character error rate and retrievability score are strongly correlated. We computed retrievability… 
Assessing the Impact of OCR Quality on Downstream NLP Tasks
A series of extrinsic assessment tasks is performed using popular, out-of-the-box tools to quantify the impact of OCR quality on these tasks; the results show a consistent impact of OCR errors on downstream tasks, with some tasks more irredeemably harmed by OCR errors.
Music in newspapers: interdisciplinary opportunities and data-related challenges
This paper discusses how considering music-related mentions in newspapers can enable potential new research directions and questions, and discusses open syntactic and semantic data-related technical challenges when analyzing music-related mentions in digitized historical newspaper collections.
The influences of social value orientation and domain knowledge on crowdsourcing manuscript transcription
The analysis confirmed that in crowdsourced manuscript transcription, social value orientation has a significant effect on participants’ cooperation level and transcription quality; domain knowledge has a significant effect on participants’ transcription quality, but not on their cooperation level.


Querylog-based assessment of retrievability bias in a large newspaper corpus
The effectiveness of the retrievability measure is investigated using a large digitized newspaper corpus, featuring two characteristics that distinguish its experiments from previous studies: (1) compared to TREC collections, this collection contains noise originating from OCR processing, historical spelling and use of language; and (2) instead of simulated queries, the collection comes with real user query logs including click data.
User-driven correction of OCR errors: combining crowdsourcing and information retrieval technology
A new approach to the correction of noisy OCR text is presented that combines the power of crowdsourcing with information retrieval technology and is based on standard technology (Java, Lucene, Ajax).
Impact of OCR Errors on the Use of Digital Libraries: Towards a Better Access to Information
The impact of OCR errors on the use of a major online platform, the Gallica digital library of the National Library of France, is estimated, underlining the critical extent to which OCR quality affects digital library access.
Efficiently Estimating Retrievability Bias
This paper examines how many queries are needed to obtain a reliable and useful approximation of the retrievability bias imposed by the system, and an estimate of the individual retrievability of documents in the collection.
Effects of OCR Errors on Ranking and Feedback Using the Vector Space Model
Quantifying retrieval bias in Web archive search
It is shown that the retrievability of documents can vary for different versions of the same document and that retrieval systems induce biases to different extents, and the effect of bias across time is studied using the retrievability measure.
Estimating retrievability ranks of documents using document features
Impact Analysis of OCR Quality on Research Tasks in Digital Archives
The current knowledge on both the users’ and the data providers’ side is insufficient and needs to be improved; a classification of scholarly research tasks is provided that accounts for their susceptibility to specific OCR-induced biases and the data required for uncertainty estimations.
Retrieval Models Versus Retrievability
This chapter explains the concept of retrievability in information retrieval, how it can be estimated, how it can be used to analyse the retrieval bias of retrieval models, and how the retrievability measure can be used to improve effectiveness.
Retrievability and Retrieval Bias: A Comparison of Inequality Measures
This work suggests that the standard inequality measure, the Gini Coefficient, provides similar information regarding the bias, but the Palma index and the 20:20 Ratio show the greatest differences and may be useful to provide a different perspective when ranking systems according to bias.
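
The three inequality measures named in the last summary can be illustrated with a minimal Python sketch. It assumes the common textbook definitions (Gini coefficient in its rank-weighted form; Palma index as the share of the top 10% divided by the share of the bottom 40%; 20:20 Ratio as the share of the top 20% divided by the share of the bottom 20%), which may differ in detail from the formulation used in the cited work. The function names and the example retrievability scores are hypothetical and chosen only for illustration.

```python
def gini(scores):
    """Gini coefficient of non-negative scores (0.0 means perfectly equal)."""
    xs = sorted(scores)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted form over the ascending-sorted values.
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)


def share_ratio(scores, top_frac, bottom_frac):
    """Total held by the top fraction divided by total held by the bottom fraction."""
    xs = sorted(scores)
    n = len(xs)
    top_n = max(1, round(n * top_frac))
    bottom_n = max(1, round(n * bottom_frac))
    top = sum(xs[n - top_n:])
    bottom = sum(xs[:bottom_n])
    # An empty bottom share makes the ratio unbounded.
    return float("inf") if bottom == 0 else top / bottom


def palma_index(scores):
    """Assumed definition: top 10% share over bottom 40% share."""
    return share_ratio(scores, 0.10, 0.40)


def ratio_20_20(scores):
    """Assumed definition: top 20% share over bottom 20% share."""
    return share_ratio(scores, 0.20, 0.20)


if __name__ == "__main__":
    # Hypothetical retrievability scores for ten documents.
    r = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    print(gini(r), palma_index(r), ratio_20_20(r))
```

In this sketch a collection in which every document is equally retrievable yields a Gini coefficient of 0, while the two ratio measures react only to the tails of the distribution, which is one plausible reason they can rank systems differently.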