A framework for authorship identification of online messages: Writing-style features and classification techniques

@article{Zheng2006AFF,
  title={A framework for authorship identification of online messages: Writing-style features and classification techniques},
  author={Rong Zheng and Jiexun Li and Hsinchun Chen and Zan Huang},
  journal={J. Assoc. Inf. Sci. Technol.},
  year={2006},
  volume={57},
  pages={378--393}
}
With the rapid proliferation of Internet technologies and applications, misuse of online messages for inappropriate or illegal purposes has become a major concern for society. The anonymous nature of online-message distribution makes identity tracing a critical problem. We developed a framework for authorship identification of online messages to address the identity-tracing problem. In this framework, four types of writing-style features (lexical, syntactic, structural, and content-specific… 
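
As a rough illustration of what writing-style features can look like, the sketch below computes a handful of simple statistics for a single message. It is not the feature set from the paper: the function name `style_features`, the specific features chosen, and the category each one is filed under are all assumptions made for this example. Content-specific features, such as counts of domain keywords, are left out because they depend on a keyword list the abstract does not give.

```python
import re
import string


def style_features(message: str) -> dict:
    """Compute a few illustrative writing-style features for one message.

    The features and their grouping are assumptions for this sketch,
    not the feature set used in the paper.
    """
    # Words are runs of letters and apostrophes; sentences are split on . ! ?
    words = re.findall(r"[A-Za-z']+", message)
    sentences = [s for s in re.split(r"[.!?]+", message) if s.strip()]
    # Paragraphs are assumed to be separated by a blank line.
    paragraphs = [p for p in message.split("\n\n") if p.strip()]
    n_chars = len(message)
    n_words = len(words)

    return {
        # Lexical (assumed grouping): word-level statistics
        "avg_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "vocabulary_richness": len({w.lower() for w in words}) / n_words if n_words else 0.0,
        # Syntactic (assumed grouping): punctuation usage
        "punctuation_per_char": (
            sum(c in string.punctuation for c in message) / n_chars if n_chars else 0.0
        ),
        # Structural (assumed grouping): layout of the message
        "num_paragraphs": len(paragraphs),
        "avg_sentence_length_words": n_words / len(sentences) if sentences else 0.0,
    }


if __name__ == "__main__":
    sample = "Hello there. How are you?\n\nFine, thanks!"
    for name, value in style_features(sample).items():
        print(f"{name}: {value:.3f}")
```

A feature vector like this, computed per message, could then be passed to any standard classifier. Which classifier to use, and how many features are needed, are questions this sketch does not address.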

An improved framework for authorship identification in online messages
TLDR
In this work, the C4.5, fuzzy, and AdaBoost classifiers are used for authorship identification, and the effects of these classification techniques on online messages are evaluated.
Applying authorship analysis to extremist-group Web forum messages
TLDR
A special multilingual model (a set of algorithms and related features) is developed to identify Arabic messages. The model is geared toward the language's unique characteristics and incorporates a complex message-extraction component to allow the use of a more comprehensive set of features tailored specifically toward online messages.
Better Features Sets for Authorship Attribution of Short Messages
TLDR
This research studies how to authenticate a user by the writing style of a short text posted on Twitter and evaluates the effects of different feature sets and sample sizes.
Towards an Information Theoretic Model for Online Message Authorship Identification
TLDR
The results show that the proposed model can be used effectively for monitoring and identifying authorship of such documents as emails, chat conversations, web logs, forum posts, and more so for closed sets of users such as a research facility, an enterprise, or an organization.
Design and Implementation of a Machine Learning-Based Authorship Identification Model
TLDR
The proposed LDA-based approach emphasizes instance-based and profile-based classifications of an author’s text that can handle the heterogeneity of the dataset, diversity in writing, and the inherent ambiguity of the Urdu language.
Authorship classification: a syntactic tree mining approach
TLDR
A novel approach to mining discriminative k-embedded-edge subtree patterns from a given set of syntactic trees that reduces the computational burden of using complex syntactic structures as a feature set is proposed and is shown to increase the classification accuracy.
A Machine Learning Framework for Authorship Identification From Texts
TLDR
An approach and a model are presented that learn the differences in writing style between 50 different authors and can predict the author of a new text with high accuracy; the accuracy increases significantly after introducing certain linguistic stylometric features along with text features.

References

Style mining of electronic messages for multiple authorship discrimination: first results
TLDR
The results show that stylistic models can be accurately learned to determine an author's identity, based only on the message text.
Authorship Analysis in Cybercrime Investigation
TLDR
The results indicate that the proposed approach to adopt the authorship analysis framework can discover real identities of authors of both English and Chinese Internet messages with relatively high accuracies.
Authorship Attribution with Support Vector Machines
TLDR
The support vector machine (SVM) is applied to identifying the author of a text with text-mining methods; because it can cope with half a million inputs, it requires no feature selection and can process the frequency vector of all words of a text.
Computer-Based Authorship Attribution Without Lexical Measures
TLDR
This paper presents a fully-automated approach to the identification of the authorship of unrestricted text that excludes any lexical measure and adapts a set of style markers to the analysis of the text performed by an already existing natural language processing tool using three stylometric levels.
Mining e-mail content for author identification forensics
TLDR
An investigation into e-mail content mining for author identification, or authorship attribution, for the purpose of forensic investigation found promising results for both aggregated and multi-topic author categorisation.
An experiment in authorship attribution
TLDR
The results of an experiment in authorship attribution are interpreted as supporting the hypothesis that authors have 'textual fingerprints', at least for texts produced by authors who are not consciously changing their style of writing across texts.
Gender-preferential text mining of e-mail discourse
TLDR
An extended set of predominantly topic content-free e-mail document features such as style markers, structural characteristics and gender-preferential language features together with a support vector machine learning algorithm gave promising results for author gender categorisation.
Mining E-mail Authorship
TLDR
An investigation into the learning of authorship identification or categorisation for the case of e-mail documents using the Support Vector Machine as the learning method is reported.
Feature-Finding for Text Classification
TLDR
Results of a benchmark test on ten representative text-classification problems suggest that the technique here designated Monte-Carlo Feature-Finding has certain advantages that deserve consideration by future workers in this area.
Automatically Categorizing Written Texts by Author Gender
TLDR
It is shown that automated text categorization techniques can exploit combinations of simple lexical and syntactic features to infer the gender of the author of an unseen formal written document with approximately 80 per cent accuracy.