• Corpus ID: 5515791

"In vivo" spam filtering: A challenge problem for data mining

@article{Fawcett2004InVS,
  title={"In vivo" spam filtering: A challenge problem for data mining},
  author={Tom Fawcett},
  journal={ArXiv},
  year={2004},
  volume={cs.AI/0405007}
}
Spam, also known as Unsolicited Commercial Email (UCE), is the bane of email communication. Many data mining researchers have addressed the problem of detecting spam, generally by treating it as a static text classification problem. True in vivo spam filtering has characteristics that make it a rich and challenging domain for data mining. Indeed, real-world datasets with these characteristics are typically difficult to acquire and to share. This paper demonstrates some of these characteristics… 

Spam Mail Filtering Through Data Mining Approach -A Comparative Performance Analysis
TLDR
In this paper, a spam dataset is analyzed using the CLEMENTINE data mining tool for email spam classification, and the best classifier for email spam is identified based on the training and testing accuracy of various models and on performance measures.
A survey of learning-based techniques of email spam filtering
TLDR
An overview of the state of the art of machine learning applications for spam filtering, and of the ways of evaluation and comparison of different filtering methods.
R-SALSA: A spam filtering technique for social networking sites
  • M. Agrawal, R. L. Velusamy
  • Computer Science
    2016 IEEE Students' Conference on Electrical, Electronics and Computer Science (SCEECS)
  • 2016
TLDR
An unsupervised approach, the Reliability-based Stochastic Approach for Link-Structure Analysis (R-SALSA) algorithm, is proposed in this paper for classifying a message as spam or benign, and it is found to perform better than the previously proposed unsupervised author-reporter model.
Application of Learning Algorithms to Image Spam Evolution
TLDR
This chapter identifies eight features for the detection of computer generated image spam versus ham (non-spam) and uses J48 and J48 with reduced error pruning decision trees to classify the images.
A Fuzzy approach for Spam Mail Detection integrated with wordnet hypernyms key term extraction
TLDR
This paper proposes an efficient yet simple fuzzy-based method, applied to a refined set of key terms extracted from the email using WordNet and the hypernym concept, to filter spam mail.
The Impact of Feature Selection on Signature-Driven Spam Detection
TLDR
This work proposes a technique for increasing signature robustness, targeting the I-Match algorithm but applicable to other single-signature detection schemes, and shows that distributional word clustering is effective in increasing signature robustness.
Survey on Internet Spam : Classification and Analysis
  • Computer Science
  • 2013
TLDR
The impact of various kinds of spam in social networks, email, images, content and links is discussed, the techniques applied to prevent spam in each area are listed, and the considerations for constructing spam-detection algorithms are outlined.
Online supervised spam filter evaluation
TLDR
Eleven variants of six widely used open-source spam filters are tested on a chronological sequence of 49086 e-mail messages received by an individual from August 2003 through March 2004, indicating that content-based filters can eliminate 98% of spam while incurring 0.1% legitimate email loss.
Machine Learning for E-mail Spam Filtering: Review, Techniques and Trends
TLDR
A comprehensive review of the most effective content-based e-mail spam filtering techniques, focusing primarily on Machine Learning-based spam filters and their variants, and exploring the promising offshoots of latest developments.
Trusting spam reporters: A reporter-based reputation system for email filtering
TLDR
This work proposes a reactive spam-filtering system based on reporter reputation for use in conjunction with existing spam-filtering techniques, and reports on the utility of a reputation system for spam filtering that makes use of the feedback of trustworthy users.

References

SHOWING 1-10 OF 53 REFERENCES
A Case-Based Approach to Spam Filtering that Can Track Concept Drift
TLDR
A case-based approach to spam filtering allows for the sharing of cases and thus a sharing of the effort of labeling email as spam.
An evaluation of Naive Bayesian anti-spam filtering
TLDR
It is concluded that additional safety nets are needed for the Naive Bayesian anti-spam filter to be viable in practice.
An experimental comparison of naive Bayesian and keyword-based anti-spam filtering with personal e-mail messages
TLDR
This work introduces appropriate cost-sensitive measures, and investigates at the same time the effect of attribute-set size, training-corpus size, lemmatization, and stop lists, issues that have not been explored in previous experiments.
A Memory-Based Approach to Anti-Spam Filtering for Mailing Lists
TLDR
An extensive empirical evaluation of memory-based learning in the context of anti-spam filtering, a novel cost-sensitive application of text categorization that attempts to identify automatically unsolicited commercial messages that flood mailboxes, concludes that memory-based anti-spam filtering for mailing lists is practically feasible, especially when combined with additional safety nets.
SVM-based Filtering of E-mail Spam with Content-specific Misclassification Costs
We address the problem of separating legitimate emails from unsolicited ones in the context of a large-scale operation, where the diversity of user accounts is very high, while misclassification costs
Evaluating cost-sensitive Unsolicited Bulk Email categorization
TLDR
This paper discusses cost-sensitive Text Categorization methods for UBE filtering, and uses the Receiver Operating Characteristic Convex Hull method for the evaluation, which best suits classification problems in which target conditions are not known.
A Comparison of Event Models for Naive Bayes Anti-Spam E-Mail Filtering
TLDR
It is found that the multinomial model is less biased towards one class and achieves slightly higher accuracy than the multi-variate Bernoulli model.
A Bayesian Approach to Filtering Junk E-Mail
TLDR
This work examines methods for the automated construction of filters to eliminate such unwanted messages from a user’s mail stream, and shows the efficacy of such filters in a real world usage scenario, arguing that this technology is mature enough for deployment.
Naive-Bayes vs. Rule-Learning in Classification of Email
TLDR
Three experiments in automatic mail foldering and spam filtering are presented, comparing Bayesian learning with bag-valued features against the RIPPER rule-learning algorithm and showing that the Bayesian learner outperforms RIPPER in classification accuracy.
Fraud detection
TLDR
This paper discusses general characteristics of fraud detection problems that make them difficult, as well as system integration issues for automatic fraud detection systems.