Mapping mutable genres in structurally complex volumes

Authors: Ted Underwood, Michael L. Black, Loretta Auvil, and Boris Capitanu
Published in: 2013 IEEE International Conference on Big Data
To mine large digital libraries in humanistically meaningful ways, we need to divide them by genre. [...] We describe a multilayered solution that trains hidden Markov models to segment volumes and uses ensembles of overlapping classifiers to address historical change. We demonstrate this on a collection of 469,200 volumes drawn from HathiTrust Digital Library.
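
The abstract says only that hidden Markov models are trained to segment volumes; it gives no implementation details. As a rough illustration of the general idea, the sketch below uses Viterbi decoding to smooth noisy page-level genre guesses into contiguous segments. The genre labels, the probabilities, and the function name `viterbi` are all hypothetical and are not taken from the paper.

```python
import math

def viterbi(emission_probs, transition, initial):
    """Return the most likely genre label for each page of one volume.

    emission_probs: list of dicts, one per page, genre -> P(page | genre)
    transition: transition[a][b] = P(next page is b | current page is a)
    initial: genre -> P(first page is genre)
    All numbers passed in below are illustrative assumptions.
    """
    genres = list(initial)
    # Work in log space to avoid underflow on long volumes.
    scores = {g: math.log(initial[g]) + math.log(emission_probs[0][g])
              for g in genres}
    backpointers = []
    for page in emission_probs[1:]:
        new_scores, pointers = {}, {}
        for g in genres:
            # Best previous genre for arriving at genre g on this page.
            best_prev = max(
                genres, key=lambda p: scores[p] + math.log(transition[p][g]))
            new_scores[g] = (scores[best_prev]
                             + math.log(transition[best_prev][g])
                             + math.log(page[g]))
            pointers[g] = best_prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace the best path back from the final page.
    path = [max(genres, key=lambda g: scores[g])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path

# Hypothetical per-page classifier outputs for a six-page volume.
pages = [
    {"paratext": 0.9, "fiction": 0.1},
    {"paratext": 0.8, "fiction": 0.2},
    {"paratext": 0.1, "fiction": 0.9},
    {"paratext": 0.6, "fiction": 0.4},  # noisy page inside the fiction run
    {"paratext": 0.1, "fiction": 0.9},
    {"paratext": 0.2, "fiction": 0.8},
]
initial = {"paratext": 0.5, "fiction": 0.5}
# "Sticky" transitions: adjacent pages usually share a genre.
transition = {
    "paratext": {"paratext": 0.9, "fiction": 0.1},
    "fiction": {"paratext": 0.1, "fiction": 0.9},
}

segments = viterbi(pages, transition, initial)
print(segments)
```

With these made-up numbers, the fourth page, which a page-by-page classifier would label as paratext, is absorbed into the surrounding fiction segment because switching genre twice costs more than one weak emission.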


Incremental Dataset Definition for Large Scale Musicological Research
The results show that the method makes effective classifier training possible while greatly reducing labelling effort where a residual error rate is acceptable; the trade-off between accuracy and the required number of annotated samples is also evaluated.
Genre Classification on German Novels
This paper addresses the issue of genre classification in the context of a large set of novels using machine learning methods in order to achieve a better understanding of the genre of novels.
Analyses of Characters in Dramatic Works by Using Document Embeddings
According to this work, Deus-ex-Machina characters can be detected, and plays that strongly follow the unity-of-action principle can be identified, as can distinct characters.
The fictionality of topic modeling: Machine reading Anthony Trollope's Barsetshire series
This essay describes how using unsupervised topic modeling (specifically the latent Dirichlet allocation topic modeling algorithm in MALLET) on relatively small corpuses can help scholars of
Multi-perspective Event Detection in Texts Documenting the 1944 Battle of Arnhem
A proof-of-concept workflow is presented for the semi-automatic detection and linking of narratives that refer to the same event, based on references to location names. The workflow cannot rely on standard named-entity recognition and instead needs fine-grained detection of street names to capture the scenes that connect multi-perspective narratives.
Library Collections as Humanities Data: The Facet Effect
In what follows the authors work through a high level discussion of relevant literature on concepts of information and data to arrive at a definition of Humanities data.
Analyses of Literary Texts by Using Statistical Inference Methods
Although the results of the classification of the side characters in the plays are not always what one would have expected based on a reading of the plays, there are observations on dramatic fiction that are also verified by literary theory.
A Bayesian Mixed Effects Model of Literary Character
A model that employs multiple effects to account for the influence of extra-linguistic information (such as author) is introduced and it is found that this method leads to improved agreement with the preregistered judgments of a literary scholar, complementing the results of alternative models.
Six Degrees of Francis Bacon: A Statistical Method for Reconstructing Large Historical Social Networks
The results of this process, a global visualization of Britain’s early modern social network, will be useful to scholars and students of the period, and the pipeline developed can be reused by other scholars to generate networks for other historical or contemporary societies from biographical documents.


Semi-Supervised Text Classification Using EM
Deterministic annealing, a variant of EM, can help overcome the problem of local maxima and increase classification accuracy further when the generative model is appropriate.
Quantitative Analysis of Culture Using Millions of Digitized Books
The article, published in Science, on one of the first analytical uses of Google Books, based on n-grams (Google Ngrams). We constructed a corpus of digitized texts containing about 4%
Evolution of the Novel in the United States
This article examines the evolution of the novel in the United States using a remarkable new source, the Ngram database. This database, which spans several centuries, draws on the 15 million books
Mauvais Genres
What is genre theory a theory of? This paper argues that genre theory is an inquiry into the formation of fundamental categories of interpretation in history, where even the question of whether
Feature selection, L1 vs. L2 regularization, and rotational invariance
  • A. Ng
  • Computer Science
    Twenty-first international conference on Machine learning - ICML '04
  • 2004
A lower-bound is given showing that any rotationally invariant algorithm---including logistic regression with L1 regularization, SVMs, and neural networks trained by backpropagation---has a worst case sample complexity that grows at least linearly in the number of irrelevant features.
Grammar-Based Recognition of Documentary Forms and Extraction of Metadata
In an experiment, the document type recognizer successfully recognized the documentary form and extracted the metadata of two-thirds of the records in a series of Presidential e-records containing twenty-one document types.
An evaluation of text classification methods for literary study
  • Bei Yu
  • Computer Science
    Lit. Linguistic Comput.
  • 2008
The experimental results provide new insights into the relation between classification methods, feature engineering options, and non-topic document properties, and they also provide guidance for selecting classification methods in literary text classification applications.
An Evaluation of Text Classification Methods for Literary Study
Text classification methods have been evaluated on topic classification tasks. This thesis extends the empirical evaluation to emotion classification tasks in the literary domain. This study selects
The WEKA data mining software: an update
This paper provides an introduction to the WEKA workbench, reviews the history of the project, and, in light of the recent 3.6 stable release, briefly discusses what has been added since the last stable version (Weka 3.4) released in 2003.
Anomalies of Genre: The Utility of Theory and History for the Study of Literary Genres
In this commentary I will consider a topical thread that runs through most of the essays comprising this issue of New Literary History and the one before it. The topic is the relation between history