• Corpus ID: 416685

Learning to Identify Regular Expressions that Describe Email Campaigns

@article{Prasse2012LearningTI,
  title={Learning to Identify Regular Expressions that Describe Email Campaigns},
  author={Paul Prasse and Christoph Sawade and Niels Landwehr and Tobias Scheffer},
  journal={ArXiv},
  year={2012},
  volume={abs/1206.4637}
}
This paper addresses the problem of inferring a regular expression from a given set of strings that resembles, as closely as possible, the regular expression that a human expert would have written to identify the language. This is motivated by our goal of automating the task of postmasters of an email service who use regular expressions to describe and blacklist email spam campaigns. Training data contains batches of messages and corresponding regular expressions that an expert postmaster feels… 
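To make the task concrete, the following is a minimal sketch of inferring one regular expression from a batch of strings. It is a naive baseline for illustration only, not the learned model the paper describes: the function name `infer_regex`, the sample messages, and the prefix/suffix heuristic are all assumptions introduced here.

```python
import os
import re
import string


def infer_regex(strings):
    """Naive baseline (hypothetical): keep the shared prefix and suffix
    literally and generalize the varying middle to one bounded class."""
    prefix = os.path.commonprefix(strings)
    # Strip the prefix first so that prefix and suffix cannot overlap.
    rests = [s[len(prefix):] for s in strings]
    suffix = os.path.commonprefix([r[::-1] for r in rests])[::-1]
    middles = [r[:len(r) - len(suffix)] for r in rests]

    chars = set("".join(middles))
    if not chars:
        middle = ""  # all strings are identical
    else:
        if chars <= set(string.digits):
            cls = "[0-9]"
        elif chars <= set(string.ascii_letters):
            cls = "[A-Za-z]"
        else:
            cls = r"[\s\S]"  # fall back to "any character"
        lengths = [len(m) for m in middles]
        middle = f"{cls}{{{min(lengths)},{max(lengths)}}}"
    return re.escape(prefix) + middle + re.escape(suffix)


# Made-up batch standing in for messages from one campaign.
batch = [
    "Buy cheap meds at shop123.example now",
    "Buy cheap meds at shop9.example now",
]
pattern = infer_regex(batch)
print(pattern)
```

A heuristic like this matches every string in the batch, but it has no notion of which of the many matching expressions a human postmaster would prefer, which is the gap the paper sets out to address.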

Learning to identify concise regular expressions that describe email campaigns

This paper addresses the problem of inferring a regular expression from a given set of strings that resembles, as closely as possible, the regular expression that a human expert would have written to

Regular Expression Guided Entity Mention Mining from Noisy Web Data

This paper explores ways to combine the expressive power of REs, the ability of deep learning to learn from large data, and a human-in-the-loop approach into a new integrated framework for entity identification from web data.

Inference of Regular Expressions for Text Extraction from Examples

This work considers the long-standing problem of synthesizing regular expressions automatically, based solely on examples of the desired behavior, and presents the design and implementation of a system capable of addressing extraction tasks of realistic complexity.

Detecting Clusters of Fake Accounts in Online Social Networks

A scalable approach to finding groups of fake accounts registered by the same actor by using a supervised machine learning pipeline for classifying an entire cluster of accounts as malicious or legitimate.

A Neural Model for Regular Grammar Induction

This work proposes a novel neural approach to the induction of regular grammars from positive and negative examples. Its model is fully explainable, its intermediate results are directly interpretable as partial parses, and it can be used to learn arbitrary regular grammars when provided with sufficient data.

Correction to “Inference of Regular Expressions for Text Extraction from Examples”

Presents corrections to typographical errors in the paper, “Inference of regular expressions for text extraction from examples,” (Bartoli, A., et al), IEEE Trans. Knowl. Data Eng., vol. 28, no. 5,

Regular Expression Based Medical Text Classification Using Constructive Heuristic Approach

Experimental results show that the machine-generated regular expressions can be effectively used in conjunction with machine learning techniques to perform medical text classification tasks.

Research and applications: Learning regular expressions for clinical text classification

A novel regular expression discovery (RED) algorithm and two text classifiers based on RED can be combined with other classifiers, like SVM, to improve classification performance.

Learning regular expressions for clinical text classification

To cite: Bui DDA, Zeng-Treitler Q. J Am Med Inform Assoc 2014;21:850–857. ABSTRACT Objectives Natural language processing (NLP) applications typically use regular expressions that have been developed

Trafficking Fraudulent Accounts: The Role of the Underground Market in Twitter Spam and Abuse

This work investigates the market for fraudulent Twitter accounts to monitor prices, availability, and fraud perpetrated by 27 merchants over the course of a 10-month period, and develops a classifier to retroactively detect several million fraudulent accounts sold via this marketplace.

References

Bayesian clustering for email campaign detection

An optimization problem is derived that produces a mapping into a space of independent binary feature vectors; the features can reflect arbitrary dependencies in the input space and a case study is presented that evaluates Bayesian clustering for this application.

Learning Deterministic Regular Expressions for the Inference of Schemas from XML Data

A probabilistic algorithm that learns k-occurrence regular expressions for increasing values of k, and selects the one that best describes the sample based on a Minimum Description Length argument is provided.

Efficient identification of regular expressions from representative examples

It is proved that any regular expression which does not contain union operations can be approximately identified up to the rough equivalence from a long enough cuniform example in polynomial run-time.

Regular Expression Learning for Information Extraction

It is shown that ReLIE, in addition to being an order of magnitude faster, outperforms CRF under conditions of limited training data and cross-domain data, and that the accuracy of CRF can be improved by using features extracted by ReLIE.

Learning Regular Languages from Simple Positive Examples

  • F. Denis
  • Computer Science
    Machine Learning
  • 2004
This work uses a learning model from simple examples, where the notion of simplicity is defined with the help of Kolmogorov complexity, and shows that a general and natural heuristic which allows learning from simple positive examples can be developed in this model.

Efficiently building a parse tree from a regular expression

Abstract. We show in this paper that parsing with regular expressions instead of context-free grammars, when it is possible, is desirable. We present efficient algorithms for performing different

Algorithms for learning regular expressions from positive data

Language Identification in the Limit

  • E. M. Gold
  • Linguistics, Computer Science
    Inf. Control.
  • 1967

Spamming botnets: signatures and characteristics

An in-depth analysis of the identified botnets revealed several interesting findings regarding the degree of email obfuscation, properties of botnet IP addresses, sending patterns, and their correlation with network scanning traffic.

Large Margin Methods for Structured and Interdependent Output Variables

This paper proposes to appropriately generalize the well-known notion of a separation margin and derive a corresponding maximum-margin formulation and presents a cutting plane algorithm that solves the optimization problem in polynomial time for a large class of problems.