# Learning to Identify Regular Expressions that Describe Email Campaigns

@article{Prasse2012LearningTI, title={Learning to Identify Regular Expressions that Describe Email Campaigns}, author={Paul Prasse and Christoph Sawade and Niels Landwehr and Tobias Scheffer}, journal={ArXiv}, year={2012}, volume={abs/1206.4637} }

This paper addresses the problem of inferring a regular expression from a given set of strings that resembles, as closely as possible, the regular expression that a human expert would have written to identify the language. This is motivated by our goal of automating the task of postmasters of an email service who use regular expressions to describe and blacklist email spam campaigns. Training data contains batches of messages and corresponding regular expressions that an expert postmaster feels…

## 17 Citations

### Learning to identify concise regular expressions that describe email campaigns

- Computer Science
- J. Mach. Learn. Res.
- 2015

This paper addresses the problem of inferring a regular expression from a given set of strings that resembles, as closely as possible, the regular expression that a human expert would have written to…

### Regular Expression Guided Entity Mention Mining from Noisy Web Data

- Computer Science
- EMNLP
- 2018

This paper explores ways to combine the expressive power of REs, the ability of deep learning to learn from large data, and a human-in-the-loop approach into a new integrated framework for entity identification from web data.

### Inference of Regular Expressions for Text Extraction from Examples

- Computer Science
- IEEE Transactions on Knowledge and Data Engineering
- 2016

This work considers the long-standing problem of synthesizing regular expressions automatically, based solely on examples of the desired behavior, and presents the design and implementation of a system capable of addressing extraction tasks of realistic complexity.

### Detecting Clusters of Fake Accounts in Online Social Networks

- Computer Science
- AISec@CCS
- 2015

A scalable approach to finding groups of fake accounts registered by the same actor by using a supervised machine learning pipeline for classifying *an entire cluster* of accounts as malicious or legitimate.

### A Neural Model for Regular Grammar Induction

- Computer Science
- ArXiv
- 2022

This work proposes a novel neural approach to the induction of regular grammars from positive and negative examples. The model is fully explainable, its intermediate results are directly interpretable as partial parses, and it can be used to learn arbitrary regular grammars when provided with sufficient data.

### Correction to “Inference of Regular Expressions for Text Extraction from Examples”

- Mathematics
- IEEE Transactions on Knowledge and Data Engineering
- 2016

Presents corrections to typographical errors in the paper “Inference of regular expressions for text extraction from examples” (Bartoli, A., et al.), IEEE Trans. Knowl. Data Eng., vol. 28, no. 5,…

### Regular Expression Based Medical Text Classification Using Constructive Heuristic Approach

- Computer Science
- IEEE Access
- 2019

Experimental results show that the machine-generated regular expressions can be effectively used in conjunction with machine learning techniques to perform medical text classification tasks.

### Research and applications: Learning regular expressions for clinical text classification

- Computer Science
- J. Am. Medical Informatics Assoc.
- 2014

A novel regular expression discovery (RED) algorithm and two text classifiers based on RED can be combined with other classifiers, like SVM, to improve classification performance.

### Learning regular expressions for clinical text classification

- Computer Science
- 2014

To cite: Bui DDA, Zeng-Treitler Q. J Am Med Inform Assoc 2014;21:850–857.

Objectives: Natural language processing (NLP) applications typically use regular expressions that have been developed…

### Trafficking Fraudulent Accounts: The Role of the Underground Market in Twitter Spam and Abuse

- Computer Science
- USENIX Security Symposium
- 2013

This work investigates the market for fraudulent Twitter accounts to monitor prices, availability, and fraud perpetrated by 27 merchants over the course of a 10-month period, and develops a classifier to retroactively detect several million fraudulent accounts sold via this marketplace.

## References

### Bayesian clustering for email campaign detection

- Computer Science
- ICML '09
- 2009

An optimization problem is derived that produces a mapping into a space of independent binary feature vectors; the features can reflect arbitrary dependencies in the input space and a case study is presented that evaluates Bayesian clustering for this application.

### Learning Deterministic Regular Expressions for the Inference of Schemas from XML Data

- Computer Science
- TWEB
- 2008

This work provides a probabilistic algorithm that learns k-occurrence regular expressions for increasing values of k and selects the one that best describes the sample based on a Minimum Description Length argument.

### Efficient identification of regular expressions from representative examples

- Computer Science
- COLT '93
- 1993

It is proved that any regular expression which does not contain union operations can be approximately identified, up to the rough equivalence, from a long enough c-uniform example in polynomial run-time.

### Regular Expression Learning for Information Extraction

- Computer Science
- EMNLP
- 2008

It is shown that ReLIE, in addition to being an order of magnitude faster, outperforms CRF under conditions of limited training data and cross-domain data, and that the accuracy of CRF can be improved by using features extracted by ReLIE.

### Learning Regular Languages from Simple Positive Examples

- Computer Science
- Machine Learning
- 2004

This work uses a learning model from simple examples, where the notion of simplicity is defined with the help of Kolmogorov complexity, and shows that a general and natural heuristic which allows learning from simple positive examples can be developed in this model.

### Efficiently building a parse tree from a regular expression

- Computer Science
- Acta Informatica
- 2000

We show in this paper that parsing with regular expressions instead of context-free grammars, when it is possible, is desirable. We present efficient algorithms for performing different…

### Spamming botnets: signatures and characteristics

- Computer Science
- SIGCOMM '08
- 2008

An in-depth analysis of the identified botnets revealed several interesting findings regarding the degree of email obfuscation, properties of botnet IP addresses, sending patterns, and their correlation with network scanning traffic.

### Large Margin Methods for Structured and Interdependent Output Variables

- Computer Science
- J. Mach. Learn. Res.
- 2005

This paper proposes to appropriately generalize the well-known notion of a separation margin, derives a corresponding maximum-margin formulation, and presents a cutting plane algorithm that solves the optimization problem in polynomial time for a large class of problems.