• Corpus ID: 6649691

An evaluation of Naive Bayesian anti-spam filtering

@article{Androutsopoulos2000AnEO,
  title={An evaluation of Naive Bayesian anti-spam filtering},
  author={Ion Androutsopoulos and John Koutsias and Konstantinos V. Chandrinos and Georgios Paliouras and Constantine D. Spyropoulos},
  journal={ArXiv},
  year={2000},
  volume={cs.CL/0006013}
}
It has recently been argued that a Naive Bayesian classifier can be used to filter unsolicited bulk e-mail (“spam”). We conduct a thorough evaluation of this proposal on a corpus that we make publicly available, contributing towards standard benchmarks. At the same time we investigate the effect of attribute-set size, training-corpus size, lemmatization, and stop-lists on the filter’s performance, issues that had not been previously explored. After introducing appropriate cost-sensitive… 

Learning to Filter Spam E-Mail: A Comparison of a Naive Bayesian and a Memory-Based Approach
TLDR
This work thoroughly investigates the performance of the Naive Bayesian filter on a publicly available corpus, contributing towards standard benchmarks, and compares it to an alternative memory-based learning approach after introducing suitable cost-sensitive evaluation measures.
An Evaluation of Naïve Bayesian Anti-Spam Filtering Techniques
TLDR
Based on cost-sensitive measures, it is concluded that additional safety precautions are needed before a Bayesian anti-spam filter can be put into practice; however, the technique can make a positive contribution as a first-pass filter.
Classification for Spam Filtering using Naive Bayes
An efficient anti-spam filter that would block all spam without blocking any legitimate messages is a growing need. To address this problem, we examine the effectiveness of statistically-based
An Anti-spam Filtering System Based on the Naive Bayesian Classifier and Distributed Checksum Clearinghouse
TLDR
An anti-spam filtering system is constructed that identifies spam while still delivering and receiving e-mail normally, and that helps reduce the system load and enhance the system's overall properties.
Improved Bayesian Anti-Spam Filter Implementation and Analysis on Independent Spam Corpuses
TLDR
A spam detection algorithm is proposed and its Java implementation is discussed, along with performance test results on two independent spam corpuses, Ling-Spam and Enron-Spam.
PRIS Kidult Anti-SPAM Solution at the TREC 2005 Spam Track: Improving the Performance of Naive Bayes for Spam Detection
TLDR
This paper reports the solution for the TREC 2005 spam track, in which a Naive Bayes spam filter is considered for its desirable properties (simplicity, low time and memory requirements, etc.).
Adaptive Naïve Bayesian Anti-Spam Engine
TLDR
The algorithm's implementation and performance are re-evaluated from the perspective of more than a year, and it is suggested that this architecture can increase spam recall without affecting classifier precision.
A Memory-Based Approach to Anti-Spam Filtering for Mailing Lists
TLDR
An extensive empirical evaluation of memory-based learning in the context of anti-spam filtering, a novel cost-sensitive application of text categorization that attempts to identify automatically unsolicited commercial messages that flood mailboxes, concludes that memory-based anti-spam filtering for mailing lists is practically feasible, especially when combined with additional safety nets.
Stacking Classifiers for Anti-Spam Filtering of E-Mail
TLDR
It is shown that stacking can improve the efficiency of automatically induced anti-spam filters, and that such filters can be used in real-life applications.
Occam’s razor-based spam filter
TLDR
A novel approach to spam filtering based on the minimum description length principle is presented and the results indicate that the proposed filter is fast to construct, incrementally updateable and clearly outperforms the state-of-the-art spam filters.

References

A Bayesian Approach to Filtering Junk E-Mail
TLDR
This work examines methods for the automated construction of filters to eliminate such unwanted messages from a user’s mail stream, and shows the efficacy of such filters in a real world usage scenario, arguing that this technology is mature enough for deployment.
An Analysis of Bayesian Classifiers
TLDR
This work presents an average-case analysis of the Bayesian classifier, a simple induction algorithm that fares remarkably well on many learning tasks, and explores the behavioral implications of the analysis by presenting predicted learning curves for artificial domains.
Learning Rules that Classify E-Mail
Two methods for learning text classifiers are compared on classification problems that might arise in filtering and filing personal e-mail messages: a "traditional IR" method based on TF-IDF
Automated learning of decision rules for text categorization
TLDR
It is shown that machine-generated decision rules appear comparable to human performance, while using the identical rule-based representation, and compared with other machine-learning techniques.
Threading Electronic Mail - A Preliminary Study
TLDR
It is proposed that threading of electronic messages be treated as a language processing task, and that a significant level of threading effectiveness can be achieved by applying standard text matching methods from information retrieval to the textual portions of messages.
NewsWeeder: Learning to Filter Netnews
TLDR
The results show that a learning algorithm based on the Minimum Description Length (MDL) principle was able to raise the percentage of interesting articles to be shown to users from 14% to 52% on average.
Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier
TLDR
It is shown that the simple Bayesian classifier (SBC) does not in fact assume attribute independence, and can be optimal even when this assumption is violated by a wide margin, and the previously-assumed region of optimality is a second-order infinitesimal fraction of the actual one.
Machine learning in automated text categorization
TLDR
This survey discusses the main approaches to text categorization that fall within the machine learning paradigm and discusses in detail issues pertaining to three different problems, namely, document representation, classifier construction, and classifier evaluation.
Learning to remove Internet advertisements
TLDR
The experiments demonstrate that the inductive learning approach to browsing assistant is practical: the off-line training phase takes less than six minutes; on-line classification takes about 70 msec; and classification accuracy exceeds 97% given a modest set of training data.
Mistake-Driven Learning in Text Categorization
TLDR
This work studies three mistake-driven learning algorithms for a typical task of this nature -- text categorization and presents an algorithm, a variation of Littlestone's Winnow, which performs significantly better than any other algorithm tested on this task using a similar feature set.