Extended nearest shrunken centroid classification: A new method for open-set authorship attribution of texts of varying sizes

  • G. Bruce Schaalje, Paul J. Fields, Matthew Roper, Gregory L. Snow
    Lit. Linguistic Comput.
The nearest shrunken centroid (NSC) methodology, originally developed for high-dimensional genomics problems, was recently applied in a stylometric study. Although NSC has many advantages, stylometric problems usually differ from genomics problems in several important ways: texts are of a wide range of sizes, a large series of texts are often the subjects for classification, and most importantly the set of candidate authors cannot usually be assumed to be closed. Consequently, naïve…
Do birds of a feather really flock together, or how to choose training samples for authorship attribution
A bootstrap-like approach, inspired by k-fold cross-validation procedures, is used to choose the samples for the training and test sets at random over 500 iterations; the results show considerable resistance of the English corpus to permutations, while the other corpora turned out to be more dependent on the choice of samples.
Profile-based authorship analysis
The article shows that this profile-based authorship analysis method is more accurate than existing approaches given a data set with hundreds of authors and makes new types of analysis possible by looking at types of individuals as well as at specific individuals.
An open-set size-adjusted Bayesian classifier for authorship attribution
A customized Bayesian logit‐normal‐beta‐binomial classification model for supervised authorship attribution that uses Markov Chain Monte Carlo to produce distributions of posterior authorship probabilities instead of point estimates is proposed.
Modelling the Interpretation of Literary Allusion with Machine Learning Techniques
This work begins with a large set of textual parallels, and then attempts to model which of these instances of text reuse are meaningful literary allusions and which are not, according to a group of human readers.
Examining a Misapplication of Nearest Shrunken Centroid Classification to Investigate Book of Mormon Authorship
Review of Matthew L. Jockers, Daniela M. Witten, and Craig S. Criddle. "Reassessing authorship of the Book of Mormon using delta and nearest shrunken centroid classification."
Authorship Attribution for Social Media Forensics
It is argued that there is a significant need in forensics for new authorship attribution algorithms that can exploit context, can process multi-modal data, and are tolerant to incomplete knowledge of the space of all possible authors at training time.
Literary Detective Work on the Computer
The theoretical background to authorship attribution is presented in a step by step manner, and comprehensive reviews of the field are given in two specialist areas, the writings of William Shakespeare and his contemporaries, and the various writing styles seen in religious texts.
Overcoming the challenge for text classification in the open world
  • T. Doan, J. Kalita
  • Computer Science
    2017 IEEE 7th Annual Computing and Communication Workshop and Conference (CCWC)
  • 2017
The Nearest Centroid Class (NCC) classifier is presented; it learns incrementally, can detect unknown classes during testing, and yields promising results in document classification across text domains among current state-of-the-art models.
Breaking a challenge for text classification in open world recognition
This work introduces the Nearest Centroid Class classifier, which is able to detect and learn unknown classes incrementally, and tests the model on document classification in different domains, where the proposed algorithm yields promising results.


Reassessing authorship of the Book of Mormon using delta and nearest shrunken centroid classification
The findings support the hypothesis that Rigdon was the main architect of the Book of Mormon and are consistent with historical evidence suggesting that he fabricated the book by adding theology to the unpublished writings of Spalding (then deceased).
A comparative study of machine learning methods for authorship attribution
Each of the methods tested performed well, but nearest shrunken centroids and regularized discriminant analysis had the best overall performances with 0/70 cross-validation errors.
Open-Set Nearest Shrunken Centroid Classification
Nearest Shrunken Centroid (NSC) classification has proven successful in ultra-high-dimensional classification problems involving thousands of features measured on relatively few individuals, such as
Testing Authorship in the Personal Writings of Joseph Smith Using NSC Classification
The work presented here reevaluates the decision to exclude Joseph Smith and employs both supervised classification and unsupervised clustering in order to explore the stylistic consistency between documents attributed to Smith (but written in the handwriting of Smith's 24 different scribes) and documents in Smith's own hand.
Computational methods in authorship attribution
Three scenarios are considered here for which solutions to the basic attribution problem are inadequate; for each, it is shown how machine learning methods can be adapted to handle the special challenges of that variant.
Authorship Attribution
  • P. Juola
  • Art
    Found. Trends Inf. Retr.
  • 2006
This review shows that the authorship attribution discipline is quite successful, even in difficult cases involving small documents in unfamiliar and less studied languages; it further analyzes the types of analysis and features used and tries to determine characteristics of well-performing systems, finally formulating these in a set of recommendations for best practices.
'Delta': a Measure of Stylistic Difference and a Guide to Likely Authorship
A new way of using the relative frequencies of the very common words for comparing written texts and testing their likely authorship, which offers a simple but comparatively accurate addition to current methods of distinguishing the most likely author of texts exceeding about 1,500 words in length.
Testing Burrows's Delta
  • D. Hoover
  • Linguistics
    Lit. Linguistic Comput.
  • 2004
A test of Delta's effectiveness and accuracy shows that it works nearly as well on prose as it does on poetry, and suggests that combining several texts for each author in the primary set reduces the effect of intra-author variability.
Bayesian Analysis of a Multinomial Sequence and Homogeneity of Literary Style
To help settle the debate around the authorship of Tirant lo Blanc, all words in each chapter are categorized according to their length, and the appearances of certain words are counted, thus forming
Classification of microarrays to nearest centroids
It is shown that the modified t-statistics and shrunken centroids employed by PAM tend to increase misclassification error when compared with their simpler counterparts, and a classification method called 'Classification to Nearest Centroids' (ClaNC) is proposed, which is arguably simpler and easier to interpret than PAM.