Estimating the Date of First Publication in a Large-Scale Digital Library


One prerequisite for cultural analysis in large-scale digital libraries is an accurate estimate of the date of composition of the text (as distinct from the date of publication of an edition) for the works they contain. In this work, we present a manually annotated dataset of dates of first publication for three samples of books from the HathiTrust Digital Library (uniform random, uniform fiction, and stratified by decade), and empirically evaluate the disparity between these gold-standard labels and several approximations used in practice: using the date of publication as provided in metadata, several deduplication methods, and automatically predicting the date of composition from the text of the book. We find that a simple heuristic of metadata-based deduplication works best in practice, and that text-based composition dating is accurate enough to inform the analysis of apparent time.
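As a rough illustration of what metadata-based deduplication can look like, the sketch below groups records by a normalized author and title and assigns each record the earliest publication year found in its group. This is an assumption made for illustration, not the paper's actual heuristic, which the abstract does not spell out. The field names (`author`, `title`, `year`), the helper names, and the sample records are all hypothetical.

```python
import re


def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def record_key(record):
    """Build a grouping key from hypothetical 'author' and 'title' fields."""
    return (normalize(record["author"]), normalize(record["title"]))


def earliest_years(records):
    """Map each (author, title) key to the earliest publication year seen."""
    earliest = {}
    for record in records:
        key = record_key(record)
        year = record["year"]
        if key not in earliest or year < earliest[key]:
            earliest[key] = year
    return earliest


def estimate_first_publication(record, earliest):
    """Return the earliest year among duplicates, or the record's own year."""
    return earliest.get(record_key(record), record["year"])


# Toy records invented for illustration; not drawn from HathiTrust.
records = [
    {"author": "Austen, Jane", "title": "Pride and Prejudice", "year": 1813},
    {"author": "Austen, Jane.", "title": "Pride and prejudice.", "year": 1894},
    {"author": "Melville, Herman", "title": "Moby-Dick", "year": 1922},
]

earliest = earliest_years(records)
estimates = [estimate_first_publication(r, earliest) for r in records]
```

The third record shows the limit of this kind of approach: when a library holds only a later edition, deduplication has nothing earlier to fall back on, and the estimate remains the edition's own date.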

DOI: 10.1109/JCDL.2017.7991569

Cite this paper

@article{Bamman2017EstimatingTD, title={Estimating the Date of First Publication in a Large-Scale Digital Library}, author={David Bamman and Michelle Carney and Jon Gillick and Cody Hennesy and Vijitha Sridhar}, journal={2017 ACM/IEEE Joint Conference on Digital Libraries (JCDL)}, year={2017}, pages={1-10} }