• Corpus ID: 18749628

Spam Filtering with Naive Bayes - Which Naive Bayes?

@inproceedings{Metsis2006SpamFW,
  title={Spam Filtering with Naive Bayes - Which Naive Bayes?},
  author={Vangelis Metsis and Ion Androutsopoulos and Georgios Paliouras},
  booktitle={CEAS},
  year={2006}
}
Naive Bayes is very popular in commercial and open-source anti-spam e-mail filters. There are, however, several forms of Naive Bayes, something the anti-spam literature does not always acknowledge. We discuss five different versions of Naive Bayes, and compare them on six new, non-encoded datasets, that contain ham messages of particular Enron users and fresh spam messages. The new datasets, which we make publicly available, are more realistic than previous comparable benchmarks, because they…
Advances in Spam Filtering Techniques
TLDR
This chapter presents and compares seven different versions of Naive Bayes classifiers, the well-known linear Support Vector Machine and a new method based on the Minimum Description Length principle, and conducts an empirical experiment that indicates that the proposed filter is easy to implement, incrementally updateable and clearly outperforms the state-of-the-art spam filters.
Content-based spam filtering
TLDR
This paper discusses seven different versions of Naive Bayes classifiers, and compares them with the well-known Linear Support Vector Machine on six non-encoded datasets, and proposes a new measurement in order to evaluate the quality of anti-spam classifiers.
Bayesian Spam Detection
TLDR
This paper will show that Bayesian filtering can be simply implemented for a reasonably accurate text classifier and that it can be modified to make a significant impact on the accuracy of the filter.
Probabilistic anti-spam filtering with dimensionality reduction
TLDR
This paper compares the performance of most popular methods used as term selection techniques with some variations of the original naive Bayes anti-spam filter.
Personalized Spam Filtering with Natural Language Attributes
TLDR
Comparisons show that the performance of Sentinel surpasses that of a number of state-of-the-art personalized filters proposed in previous studies, and uses attributes related to natural language stylometry.
Better Naive Bayes classification for high-precision spam detection
TLDR
This work addresses the problem of low-FPR classification in the context of naive Bayes, which represents one of the most popular machine learning models applied in the spam filtering domain, with a new term weight aggregation function, which leads to markedly better results than the standard alternatives.
An online subject-based spam filter using natural language features
TLDR
An online subject-based spam filter built upon an extended version of weighted naive Bayesian (WNB) classifier that is immune to the spams with malicious campaigns beyond contemplation is proposed.
Not So Naive Online Bayesian Spam Filter
TLDR
The experiment results show that the NSNB does give state-of-the-art classification on online spam filtering on large benchmark data sets while it is extremely fast and takes up little memory in comparison with other statistical methods.
Evaluation of Approaches for Dimensionality Reduction Applied with Naive Bayes Anti-Spam Filters
TLDR
This paper compares the performance of most popular methods used as term selection techniques, such as document frequency, information gain, mutual information, χ² statistic, and odds ratio used for reducing the dimensionality of the term space with four well-known different versions of naive Bayes spam filter.
Better Naive Bayes classification for high‐precision spam detection
TLDR
This work addresses the problem of low‐FPR classification in the context of naive Bayes, which represents one of the most popular machine learning models applied in the spam filtering domain, and proposes a new term weight aggregation function, which leads to markedly better results than the standard alternatives.

References

An experimental comparison of naive Bayesian and keyword-based anti-spam filtering with personal e-mail messages
TLDR
This work introduces appropriate cost-sensitive measures, and investigates at the same time the effect of attribute-set size, training-corpus size, lemmatization, and stop lists, issues that have not been explored in previous experiments.
Learning to Filter Unsolicited Commercial E-Mail
TLDR
The architecture of a fully implemented learning-based anti-spam filter is described, and an analysis of its behavior in real use over a period of seven months is presented.
A Comparison of Event Models for Naive Bayes Anti-Spam E-Mail Filtering
TLDR
It is found that the multinomial model is less biased towards one class and achieves slightly higher accuracy than the multi-variate Bernoulli model.
Boosting Trees for Anti-Spam Email Filtering
TLDR
The boosting-based methods clearly outperform the baseline learning algorithms on the PU1 corpus, achieving very high levels of the F1 measure and obtaining better "high-precision" classifiers, which is a very important issue when misclassification costs are considered.
A Memory-Based Approach to Anti-Spam Filtering for Mailing Lists
TLDR
An extensive empirical evaluation of memory-based learning in the context of anti-spam filtering, a novel cost-sensitive application of text categorization that attempts to identify automatically unsolicited commercial messages that flood mailboxes, concludes that memory-based anti-spam filtering for mailing lists is practically feasible, especially when combined with additional safety nets.
SpamCop: A Spam Classification & Organisation Program
We present a simple, yet highly accurate, spam filtering program, called SpamCop, which is able to identify about 92% of the spams while misclassifying only about 1.16% of the nonspam e-mails.
Naive Bayes spam filtering using word-position-based attributes and length-sensitive classification thresholds
TLDR
The author’s implementation using word-position-based attribute vectors gave very good results when tested on several publicly available corpora, and an efficient weighting scheme for cost-sensitive classification is introduced.
Stacking Classifiers for Anti-Spam Filtering of E-Mail
TLDR
It is shown that stacking can improve the efficiency of automatically induced anti-spam filters, and that such filters can be used in real-life applications.
"In vivo" spam filtering: a challenge problem for KDD
TLDR
This paper argues that researchers should pursue in vivo spam filtering as an accessible domain for investigating them and demonstrates some of the characteristics that make it a rich and challenging domain for data mining.
Automatic Categorization of Email into Folders: Benchmark Experiments on Enron and SRI Corpora
TLDR
An extensive benchmark study of email foldering using two large corpora of real-world email messages and foldering schemes: one from former Enron employees, another from participants in an SRI research project.