• Corpus ID: 416685

# Learning to Identify Regular Expressions that Describe Email Campaigns

@article{Prasse2012LearningTI,
title={Learning to Identify Regular Expressions that Describe Email Campaigns},
author={Paul Prasse and Christoph Sawade and Niels Landwehr and Tobias Scheffer},
journal={ArXiv},
year={2012},
volume={abs/1206.4637}
}
• Published 18 June 2012
• Computer Science
• ArXiv
This paper addresses the problem of inferring a regular expression from a given set of strings that resembles, as closely as possible, the regular expression that a human expert would have written to identify the language. This is motivated by our goal of automating the task of postmasters of an email service who use regular expressions to describe and blacklist email spam campaigns. Training data contains batches of messages and corresponding regular expressions that an expert postmaster feels…
17 Citations

### Learning to identify concise regular expressions that describe email campaigns

• Computer Science
J. Mach. Learn. Res.
• 2015
This paper addresses the problem of inferring a regular expression from a given set of strings that resembles, as closely as possible, the regular expression that a human expert would have written to

### Regular Expression Guided Entity Mention Mining from Noisy Web Data

• Computer Science
EMNLP
• 2018
This paper explores ways to combine the expressive power of REs, the ability of deep learning to learn from large data, and a human-in-the-loop approach into a new integrated framework for entity identification from web data.

### Inference of Regular Expressions for Text Extraction from Examples

• Computer Science
IEEE Transactions on Knowledge and Data Engineering
• 2016
This work considers the long-standing problem of synthesizing regular expressions automatically, based solely on examples of the desired behavior, and presents the design and implementation of a system capable of addressing extraction tasks of realistic complexity.

### Detecting Clusters of Fake Accounts in Online Social Networks

• Computer Science
AISec@CCS
• 2015
A scalable approach to finding groups of fake accounts registered by the same actor, using a supervised machine learning pipeline for classifying *an entire cluster* of accounts as malicious or legitimate.

### A Neural Model for Regular Grammar Induction

• Computer Science
ArXiv
• 2022
This work proposes a novel neural approach to the induction of regular grammars from positive and negative examples. The model is fully explainable, its intermediate results are directly interpretable as partial parses, and it can be used to learn arbitrary regular grammars when provided with sufficient data.

### Correction to “Inference of Regular Expressions for Text Extraction from Examples”

• Mathematics
IEEE Transactions on Knowledge and Data Engineering
• 2016
Presents corrections to typographical errors in the paper, “Inference of regular expressions for text extraction from examples,” (Bartoli, A., et al), IEEE Trans. Knowl. Data Eng., vol. 28, no. 5,

### Regular Expression Based Medical Text Classification Using Constructive Heuristic Approach

• Computer Science
IEEE Access
• 2019
Experimental results show that the machine-generated regular expressions can be effectively used in conjunction with machine learning techniques to perform medical text classification tasks.

### Research and applications: Learning regular expressions for clinical text classification

• Computer Science
J. Am. Medical Informatics Assoc.
• 2014
A novel regular expression discovery (RED) algorithm and two text classifiers based on RED can be combined with other classifiers, like SVM, to improve classification performance.

### Learning regular expressions for clinical text classification

• Computer Science
• 2014
To cite: Bui DDA, Zeng-Treitler Q. J Am Med Inform Assoc 2014;21:850–857. Abstract. Objectives: Natural language processing (NLP) applications typically use regular expressions that have been developed

### Trafficking Fraudulent Accounts: The Role of the Underground Market in Twitter Spam and Abuse

• Computer Science
USENIX Security Symposium
• 2013
This work investigates the market for fraudulent Twitter accounts to monitor prices, availability, and fraud perpetrated by 27 merchants over the course of a 10-month period, and develops a classifier to retroactively detect several million fraudulent accounts sold via this marketplace.

## References

### Bayesian clustering for email campaign detection

• Computer Science
ICML '09
• 2009
An optimization problem is derived that produces a mapping into a space of independent binary feature vectors; the features can reflect arbitrary dependencies in the input space and a case study is presented that evaluates Bayesian clustering for this application.

### Learning Deterministic Regular Expressions for the Inference of Schemas from XML Data

• Computer Science
TWEB
• 2008
A probabilistic algorithm that learns k-occurrence regular expressions for increasing values of k, and selects the one that best describes the sample based on a Minimum Description Length argument is provided.

### Efficient identification of regular expressions from representative examples

It is proved that any regular expression which does not contain union operations can be approximately identified up to the rough equivalence from a long enough cuniform example in polynomial run-time.

### Regular Expression Learning for Information Extraction

• Computer Science
EMNLP
• 2008
It is shown that ReLIE, in addition to being an order of magnitude faster, outperforms CRF under conditions of limited training data and cross-domain data, and that the accuracy of CRF can be improved by using features extracted by ReLIE.

### Learning Regular Languages from Simple Positive Examples

• F. Denis
• Computer Science
Machine Learning
• 2004
This work uses a learning model from simple examples, where the notion of simplicity is defined with the help of Kolmogorov complexity, and shows that a general and natural heuristic which allows learning from simple positive examples can be developed in this model.

### Efficiently building a parse tree from a regular expression

• Computer Science
Acta Informatica
• 2000
Abstract. We show in this paper that parsing with regular expressions instead of context-free grammars, when it is possible, is desirable. We present efficient algorithms for performing different

### Language Identification in the Limit

• E. M. Gold
• Linguistics, Computer Science
Inf. Control.
• 1967

### Spamming botnets: signatures and characteristics

• Computer Science
SIGCOMM '08
• 2008
An in-depth analysis of the identified botnets revealed several interesting findings regarding the degree of email obfuscation, properties of botnet IP addresses, sending patterns, and their correlation with network scanning traffic.

### Large Margin Methods for Structured and Interdependent Output Variables

• Computer Science
J. Mach. Learn. Res.
• 2005
This paper proposes to appropriately generalize the well-known notion of a separation margin and to derive a corresponding maximum-margin formulation, and presents a cutting plane algorithm that solves the optimization problem in polynomial time for a large class of problems.