• Corpus ID: 5301578

A Bayesian Approach to Filtering Junk E-Mail

@inproceedings{Sahami1998Bayesian,
  title={A Bayesian Approach to Filtering Junk E-Mail},
  author={Mehran Sahami and Susan T. Dumais and David Heckerman and Eric Horvitz},
  booktitle={AAAI Conference on Artificial Intelligence},
  year={1998}
}
In addressing the growing problem of junk E-mail on the Internet, we examine methods for the automated construction of filters to eliminate such unwanted messages from a user’s mail stream. […] Finally, we show the efficacy of such filters in a real-world usage scenario, arguing that this technology is mature enough for deployment.
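The core technique the abstract refers to, a naive Bayes classifier over word features, can be sketched as follows. This is a minimal illustration under standard assumptions (Laplace smoothing, bag-of-words features), not the authors' implementation; the function names and toy data are invented for the example.

```python
import math
from collections import Counter

def train_nb(messages, labels):
    """Estimate class priors and Laplace-smoothed word likelihoods."""
    counts = {label: Counter() for label in set(labels)}
    for words, label in zip(messages, labels):
        counts[label].update(words)
    vocab = {w for c in counts.values() for w in c}
    priors = {label: labels.count(label) / len(labels) for label in counts}
    likelihoods = {}
    for label, c in counts.items():
        total = sum(c.values())
        # Laplace (add-one) smoothing avoids zero probabilities for unseen pairs.
        likelihoods[label] = {w: (c[w] + 1) / (total + len(vocab)) for w in vocab}
    return {"priors": priors, "likelihoods": likelihoods, "vocab": vocab}

def classify(model, words):
    """Return the class with the highest log-posterior; out-of-vocabulary
    words are skipped rather than smoothed, to keep the sketch short."""
    best_label, best_score = None, float("-inf")
    for label, prior in model["priors"].items():
        score = math.log(prior) + sum(
            math.log(model["likelihoods"][label][w])
            for w in words if w in model["vocab"])
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Working in log space avoids floating-point underflow when many word likelihoods are multiplied together.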

A new approach based on Bayesian classification that can automatically classify e-mail messages as spam or legitimate is explored and its performance for various datasets is studied.

Filtering Junk Mail with a Maximum Entropy Model

This work presents a hybrid approach utilizing a Maximum Entropy Model, shows how to use it in a junk-mail filtering task, and gives an extensive experimental comparison with a Naive Bayes classifier, a widely used classifier in e-mail filtering, showing that the approach performs comparably to or better than the Naive Bayes method.

Using Naïve Bayes Method to Classify Text-Based Email

Results are presented from a number of experiments and show that a filtering system, BETSY, could become a useful and valuable part of any e-mail client.

Automatic junk e-mail filtering based on latent content

The underlying framework is latent semantic analysis; experiments show that it is competitive with the state of the art in e-mail classification, and potentially advantageous in real-world applications with high junk-to-legitimate ratios.

A Neural Network Classifier for Junk E-Mail

This preliminary study tests this alternative approach using a neural network (NN) classifier on a corpus of e-mail messages from one user, and it appears that commercial spam detectors are now beginning to use descriptive features as proposed here.

Learning to Filter Unsolicited Commercial E-Mail

The architecture of a fully implemented learning-based anti-spam filter is described, and an analysis of its behavior in real use over a period of seven months is presented.

Ways to Evade Spam Filters and Machine Learning as a Potential Solution

  • V. Chandra, N. Shrivastava
  • Computer Science
    2006 International Symposium on Communications and Information Technologies
  • 2006
A critical analysis of the various ways spammers adopt to dodge spam filters is presented, and the Bayesian noise reduction (BNR) technique is explored, which attempts to solve this problem by identifying and eliminating 'out of context' data (injected by spammers or otherwise) to provide a cleaner classification.


This paper presents a mathematical approach to restricting spam e-mails through the subject and content relevancy of the e-mail; the results of this approach are used to classify an e-mail as spam.

An experimental comparison of naive Bayesian and keyword-based anti-spam filtering with personal e-mail messages

This work introduces appropriate cost-sensitive measures, and investigates at the same time the effect of attribute-set size, training-corpus size, lemmatization, and stop lists, issues that have not been explored in previous experiments.

Spam / Junk E-Mail Filter Technique

The preliminary study tests an alternative approach using a neural network (NN) classifier to overcome drawbacks of the Naïve Bayesian approach, and uses a feature set of descriptive characteristics of words and messages similar to those a user would use to identify spam.



Learning Rules that Classify E-Mail

Two methods for learning text classifiers are compared on classification problems that might arise in filtering and filing personal e-mail messages: a "traditional IR" method based on TF-IDF weighting, and a method for learning keyword-spotting rules.
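The "traditional IR" baseline mentioned here rests on TF-IDF weighting, which can be sketched in a few lines. This is a toy illustration of the weighting scheme itself (raw term frequency, log inverse document frequency), not the paper's implementation; the function name is invented.

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF weights per document: tf is the raw term count in the
    document, idf is log(N / df) over a corpus of N documents."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # document frequency counts each doc once
    return [{w: tf * math.log(n / df[w]) for w, tf in Counter(doc).items()}
            for doc in docs]
```

Terms that appear in every document receive weight zero, while rare, repeated terms dominate a document's representation, which is what makes the scheme useful for keyword-style classification.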

Learning Limited Dependence Bayesian Classifiers

A framework for characterizing Bayesian classification methods is presented, along with a general induction algorithm that traverses this spectrum according to the computational power available for induction; its application in a number of domains with different properties is demonstrated.

Improving Text Classification by Shrinkage in a Hierarchy of Classes

This paper shows that the accuracy of a naive Bayes text classifier can be improved by taking advantage of a hierarchy of classes, adopting an established statistical technique called shrinkage that smooths the parameter estimates of a data-sparse child with those of its parent in order to obtain more robust estimates.
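The shrinkage estimator referred to here takes the standard form of a convex combination of estimates along the path from a leaf class up to the root (notation is illustrative, not taken from the paper):

```latex
\hat{\theta}_j = \sum_{i=0}^{k} \lambda_i \, \theta_j^{(i)},
\qquad \sum_{i=0}^{k} \lambda_i = 1, \quad \lambda_i \ge 0,
```

where $\theta_j^{(i)}$ is the estimate for word $j$ at the $i$-th ancestor of the leaf and the mixture weights $\lambda_i$ are typically fit by EM on held-out data.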

Toward Optimal Feature Selection

An efficient algorithm for feature selection which computes an approximation to the optimal feature selection criterion is given, showing that the algorithm effectively handles datasets with a very large number of features.
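Per-feature selection criteria of this kind are commonly approximated with information gain, i.e. the mutual information between a binary word-presence feature and the class label. The sketch below shows that simpler per-feature score, not the paper's algorithm; the function name and toy data are invented.

```python
import math

def info_gain(docs, labels, word):
    """Information gain I(W; C) = H(C) - H(C | W) between a binary
    word-presence feature W and the class label C."""
    def entropy(classes):
        total = len(classes)
        return -sum((classes.count(c) / total) * math.log2(classes.count(c) / total)
                    for c in set(classes))

    n = len(docs)
    present = [c for d, c in zip(docs, labels) if word in d]
    absent = [c for d, c in zip(docs, labels) if word not in d]
    # Conditional entropy, weighting each split by its probability.
    h_cond = sum(len(part) / n * entropy(part) for part in (present, absent) if part)
    return entropy(labels) - h_cond
```

A word that perfectly separates the classes scores the full class entropy (1 bit for balanced binary labels); an uninformative word scores near zero, so ranking words by this score gives a cheap feature-selection baseline.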

A comparison of two learning algorithms for text categorization

It is shown that both algorithms achieve reasonable performance and allow controlled tradeoffs between false positives and false negatives, and that the stepwise feature selection in the decision tree algorithm is particularly effective in dealing with the large feature sets common in text categorization.

Machine learning

Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis.

Probabilistic reasoning in intelligent systems - networks of plausible inference

  • J. Pearl
  • Computer Science
    Morgan Kaufmann series in representation and reasoning
  • 1989
The author provides a coherent explication of probability as a language for reasoning with partial belief and offers a unifying perspective on other AI approaches to uncertainty, such as the Dempster-Shafer formalism, truth maintenance systems, and nonmonotonic logic.

Text Categorization with Support Vector Machines: Learning with Many Relevant Features

This paper explores the use of Support Vector Machines (SVMs) for learning text classifiers from examples. It analyzes the particular properties of learning with text data and identifies why SVMs are appropriate for this task.

Elements of Information Theory

The author examines the role of entropy, inequality, and randomness in the design and construction of codes in a rapidly changing environment.
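The central quantity of the book, the entropy of a discrete random variable $X$ with distribution $p$, has the standard definition (a textbook formula, not specific to this summary):

```latex
H(X) = -\sum_{x} p(x) \log_2 p(x)
```

Entropy measured in bits is also the quantity underlying the information-gain feature-selection scores used in several of the filtering papers cited above.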

The Nature of Statistical Learning Theory

  • V. Vapnik
  • Computer Science
    Statistics for Engineering and Information Science
  • 2000
Setting of the learning problem; consistency of learning processes; bounds on the rate of convergence of learning processes; controlling the generalization ability of learning processes; constructing learning algorithms.