Impact of Crowdsourcing OCR Improvements on Retrievability Bias

@article{Traub2018ImpactOC,
  title={Impact of Crowdsourcing OCR Improvements on Retrievability Bias},
  author={Myriam C. Traub and Thaer Samar and Jacco van Ossenbruggen and Lynda Hardman},
  journal={Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries},
  year={2018}
}
Digitized document collections often suffer from OCR errors that may impact a document's readability and retrievability. We studied the effects of correcting OCR errors on the retrievability of documents in a historic newspaper corpus of a digital library. We computed retrievability scores for the uncorrected documents using queries from the library's search log, and found that the document OCR character error rate and retrievability score are strongly correlated. We computed retrievability…
Assessing the Impact of OCR Quality on Downstream NLP Tasks
TLDR
A series of extrinsic assessment tasks is performed using popular, out-of-the-box tools to quantify the impact of OCR quality on these tasks; the results show a consistent impact of OCR errors on downstream tasks, with some tasks more irredeemably harmed by OCR errors than others.
Music in newspapers: interdisciplinary opportunities and data-related challenges
TLDR
This paper discusses how considering music-related mentions in newspapers can enable new research directions and questions, and discusses open syntactic and semantic data-related technical challenges in analyzing music-related mentions in digitized historical newspaper collections.
The influences of social value orientation and domain knowledge on crowdsourcing manuscript transcription
TLDR
The analysis confirmed that in crowdsourced manuscript transcription, social value orientation has a significant effect on participants’ cooperation level and transcription quality; domain knowledge has a significant effect on participants’ transcription quality, but not on their cooperation level.

References

Querylog-based assessment of retrievability bias in a large newspaper corpus
TLDR
The effectiveness of the retrievability measure is investigated using a large digitized newspaper corpus, featuring two characteristics that distinguish its experiments from previous studies: (1) compared to TREC collections, this collection contains noise originating from OCR processing, historical spelling and use of language; and (2) instead of simulated queries, the collection comes with real user query logs including click data.
User-driven correction of OCR errors: combining crowdsourcing and information retrieval technology
TLDR
A new approach to the correction of noisy OCR text that combines the power of crowdsourcing with information retrieval technology and is based on standard technology (Java, Lucene, Ajax).
Impact of OCR Errors on the Use of Digital Libraries: Towards a Better Access to Information
TLDR
The impact of OCR errors on the use of a major online platform, the Gallica digital library of the National Library of France, is estimated, underlining the critical extent to which OCR quality affects digital library access.
Efficiently Estimating Retrievability Bias
TLDR
This paper examines how many queries are needed to obtain a reliable and useful approximation of the retrievability bias imposed by the system, and an estimate of the individual retrievability of documents in the collection.
Effects of OCR Errors on Ranking and Feedback Using the Vector Space Model
TLDR
It is shown that average precision and recall are not affected for the full-text document collection when the OCR version is compared to its corresponding corrected set, and that even though feedback improves retrieval for both collections, it cannot be used to compensate for OCR errors caused by badly degraded documents.
Quantifying retrieval bias in Web archive search
TLDR
It is shown that the retrievability of documents can vary for different versions of the same document and that retrieval systems induce biases to different extents, and the effect of bias across time is studied using the retrievability measure.
Estimating retrievability ranks of documents using document features
TLDR
This paper uses a document-feature-based approach to estimate the retrievability ranks of documents, and finds that this approach requires fewer resources and can be computed more quickly than a query-based approach.
Impact Analysis of OCR Quality on Research Tasks in Digital Archives
TLDR
The current state of knowledge on both the users’ and the data providers’ side is insufficient and needs to be improved; a classification of scholarly research tasks is provided that accounts for their susceptibility to specific OCR-induced biases and the data required for uncertainty estimations.
Retrieval Models Versus Retrievability
TLDR
This chapter explains the concept of retrievability in information retrieval, how it can be estimated, how it can be used for analysing the retrieval bias of retrieval models, and how the retrievability measure can be used to improve effectiveness.
Retrievability and Retrieval Bias: A Comparison of Inequality Measures
TLDR
This work suggests that the standard inequality measure, the Gini Coefficient, provides similar information regarding the bias, but the Palma index and 20:20 Ratio show the greatest differences and may be useful to provide a different perspective when ranking systems according to bias.