Corpus ID: 226254450

Training Transformers for Information Security Tasks: A Case Study on Malicious URL Prediction

Ethan M. Rudd and Ahmed Abdallah
Machine Learning (ML) for information security (InfoSec) utilizes distinct data types and formats which require different treatments during optimization/training on raw data. In this paper, we implement a malicious/benign URL predictor based on a transformer architecture that is trained from scratch. We show that in contrast to conventional natural language processing (NLP) transformers, this model requires a different training approach to work well. Specifically, we show that 1) pre-training… 
2 Citations

Investigating the Influence of Feature Sources for Malicious Website Detection

This work observes and evaluates combinations of feature sources that have not been studied in the existing literature, primarily involving embeddings extracted with Transformer-type neural networks, and argues that even a somewhat small improvement can play a significant role in detecting malicious websites.

A Transformer-based Model to Detect Phishing URLs

A transformer-based malicious URL detection model is introduced that achieves 97.3% detection accuracy and outperforms current detection methods.



ALOHA: Auxiliary Loss Optimization for Hypothesis Augmentation

This work fits deep neural networks to multiple additional targets derived from metadata in a threat intelligence feed for Portable Executable malware and benignware, including a multi-source malicious/benign loss, a count loss on multi-source detections, and a semantic malware attribute tag loss.

eXpose: A Character-Level Convolutional Neural Network with Embeddings For Detecting Malicious URLs, File Paths and Registry Keys

The eXpose neural network is proposed, a deep learning approach that takes generic, raw short character strings as input and learns to simultaneously extract features and classify using character-level embeddings and a convolutional neural network.

SMART: Semantic Malware Attribute Relevance Tagging

This work addresses the information gap between ML and signature-based detection methods by introducing an ML-based tagging model that generates human interpretable semantic descriptions of malicious software (e.g. file-infector, coin-miner) and proposes a joint embedding deep neural network architecture that can learn to characterize portable executable files based on static analysis, thus not requiring a dynamic trace to identify behaviors at deployment time.

URLNet: Learning a URL Representation with Deep Learning for Malicious URL Detection

URLNet, an end-to-end deep learning framework that learns a nonlinear URL embedding for malicious URL detection directly from the URL, is proposed; it allows the model to capture several types of semantic information, which was not possible with existing models.

Automated U.S diplomatic cables security classification: Topic model pruning vs. classification based on clusters

This paper compares two recent approaches in the literature for text security classification, evaluating them on actual sensitive text data from the WikiLeaks dataset.

EMBER: An Open Dataset for Training Static PE Malware Machine Learning Models

The authors hope that the dataset, code and baseline model provided by EMBER will help invigorate machine learning research for malware detection, in much the same way that benchmark datasets have advanced computer vision research.

MEADE: Towards a Malicious Email Attachment Detection Engine

A dataset of over 5 million malicious/benign Microsoft Office documents along with a smaller data set is collected to provide more realistic estimates of thresholds for false positive rates on in-the-wild data, and deep neural networks and gradient boosted decision trees are able to obtain ROC curves with > 0.99 AUC on both office document and Zip archive datasets.

Automatic Malware Description via Attribute Tagging and Similarity Embedding.

This work addresses the information gap between machine learning and signature-based detection methods by learning a representation space for malware samples in which files with similar malicious behaviors appear close to each other, and introduces a similarity index between malware files.

Automated big security text pruning and classification

This paper examines labeling document sensitivity, labeling each paragraph in the document with one of three levels of security risk, as well as improving upon the base models using probabilistic topic modeling via Latent Dirichlet Allocation.

I-MAD: A Novel Interpretable Malware Detector Using Hierarchical Transformer

This work proposes an Interpretable MAlware Detector (I-MAD), which achieves state-of-the-art performance on static malware detection with excellent interpretability and integrates a hierarchical Transformer network that can understand assembly code at the basic block, function, and executable level.