• Corpus ID: 1541082

Authorship Attribution of E-Mail: Comparing Classifiers over a New Corpus for Evaluation

@inproceedings{Allison2008AuthorshipAO,
  title={Authorship Attribution of E-Mail: Comparing Classifiers over a New Corpus for Evaluation},
  author={Ben Allison and Louise Guthrie},
  booktitle={LREC},
  year={2008}
}
The release of the Enron corpus provided a unique resource for studying aspects of email use, because it is largely unfiltered, and therefore presents a relatively complete collection of emails for a reasonably large number of correspondents. This paper describes a newly created subcorpus of the Enron emails which we suggest can be used to test techniques for authorship attribution, and further shows the application of three different classification methods to this task to present baseline…

A simple but Powerful E-mail Authorship Attribution System
TLDR
A comprehensive study of the relation between the accuracy of e-mail authorship attribution and e-mail length is presented, and a simple but powerful and robust method based on an ensemble method and a Naïve Bayes classifier is proposed.
E-mail authorship attribution using customized associative classification
Authorship Identification in Large Email Collections: Experiments Using Features that Belong to Different Linguistic Levels - Notebook for PAN at CLEF 2011
The aim of this paper is to explore the usefulness of using features from different linguistic levels to email authorship identification. Using various email datasets provided by PAN'11 lab we tested
Influence of machine learning techniques on Authorship attribution for Telugu text features
TLDR
In this paper character level features and lexical features are considered for feature extraction and dimensionality of the feature space is reduced using chi-square measure.
Authorship Attribution Using Stylometry and Machine Learning Techniques
TLDR
This paper aims at studying the use of stylometric features present in a document in order to verify its authorship, and shows how authorship attribution can be used to identify potential cases of plagiarism in formal writings.
BertAA : BERT fine-tuning for Authorship Attribution
TLDR
BertAA is introduced, a fine-tuning of a pre-trained BERT language model with an additional dense layer and a softmax activation to perform authorship classification to reach competitive performances on Enron Email, Blog Authorship, and IMDb datasets.
An Improved Hierarchical Bayesian Model of Language for Document Classification
TLDR
In the course of the paper, an approximate sampling distribution for word counts in documents is advocated, and the model's capacity to outperform both the simple multinomial and more recently proposed extensions on the classification task is demonstrated.
A comparison of classifiers and features for authorship authentication of social networking messages
TLDR
Algorithms and classifiers to determine the authenticity of short social network postings, an average of 20.6 words, from Facebook are developed and several experiments using a variety of classifiers are discussed, indicating varying degrees of success compared with previous studies.
Authorship attribution of SMS messages using an N-grams approach
TLDR
An N-grams based approach for determining the authorship of SMS messages shows encouraging results in identification of authors despite the scarcity of words in SMS messages and the differences between SMS language and natural language characteristics.
Evaluating authorship distance methods using the positive Silhouette coefficient
TLDR
The Positive Silhouette Coefficient is introduced, given as the proportion of instances with a positive SC value, which is not easily altered by outliers and produces a more robust metric.

References

The Enron Corpus: A New Dataset for Email Classification Research
TLDR
The Enron corpus is introduced as a new test bed for email folder prediction, and the baseline results of a state-of-the-art classifier (Support Vector Machines) are provided under various conditions.
Authorship Attribution with Support Vector Machines
TLDR
The support vector machine (SVM) is applied to the use of text-mining methods for the identification of the author of a text; as it is able to cope with half a million inputs, it requires no feature selection and can process the frequency vector of all words of a text.
A comparison of event models for naive bayes text classification
TLDR
It is found that the multi-variate Bernoulli performs well with small vocabulary sizes, but that the multinomial usually performs even better at larger vocabulary sizes, providing on average a 27% reduction in error over the multi-variate Bernoulli model at any vocabulary size.
Improving Text Classification by Shrinkage in a Hierarchy of Classes
TLDR
This paper shows that the accuracy of a naive Bayes text classifier can be improved by taking advantage of a hierarchy of classes, and adopts an established statistical technique called shrinkage that smoothes parameter estimates of a data-sparse child with its parent in order to obtain more robust parameter estimates.
A re-examination of text categorization methods
TLDR
The results show that SVM, kNN and LLSF significantly outperform NNet and NB when the number of positive training instances per category is small, and that all the methods perform comparably when the categories are over 300 instances.
Parametric Models of Linguistic Count Data
TLDR
This work proposes using zero-inflated models for dealing with occurrence counts of words in documents, and evaluates competing models on a Naive Bayes text classification task.
An algorithm for suffix stripping
TLDR
An algorithm for suffix stripping is described, which has been implemented as a short, fast program in BCPL and performs slightly better than a much more elaborate system with which it has been compared.
Inductive learning algorithms and representations for text categorization
TLDR
A comparison of the effectiveness of five different automatic learning algorithms for text categorization in terms of learning speed, realtime classification speed, and classification accuracy is presented.
Applying authorship analysis to extremist-group Web forum messages
TLDR
A special multilingual model is developed (the set of algorithms and related features) to identify Arabic messages, gearing this model toward the language's unique characteristics and incorporating a complex message extraction component to allow the use of a more comprehensive set of features tailored specifically toward online messages.
Modeling word burstiness using the Dirichlet distribution
TLDR
The Dirichlet compound multinomial model (DCM) is proposed, which has one additional degree of freedom, which allows it to capture burstiness of words in a document, and performance is comparable to that obtained with multiple heuristic changes to the multinomial model.