Learning the PE Header, Malware Detection with Minimal Domain Knowledge

@article{Raff2017LearningTP,
  title={Learning the PE Header, Malware Detection with Minimal Domain Knowledge},
  author={Edward Raff and Jared Sylvester and Charles K. Nicholas},
  journal={Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security},
  year={2017}
}
Many efforts have been made to use various forms of domain knowledge in malware detection. Currently there exist two common approaches to malware detection without domain knowledge, namely byte n-grams and strings. In this work we explore the feasibility of applying neural networks to malware detection and feature learning. We do this by restricting ourselves to a minimal amount of domain knowledge in order to extract a portion of the Portable Executable (PE) header. By doing this we show that… 
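As a rough illustration of what "extracting a portion of the PE header" can look like, the sketch below locates the PE signature through the DOS header's `e_lfanew` field (the 4-byte little-endian value at offset 0x3C) and returns a fixed number of raw bytes from there. The function names, the 256-byte default, and the byte-to-integer mapping are all assumptions made for this example; the truncated abstract above does not give the authors' exact extraction or encoding.

```python
import struct
from typing import List


def pe_header_bytes(data: bytes, length: int = 256) -> bytes:
    """Return up to `length` raw bytes starting at the PE signature.

    `length` is an illustrative default, not a value taken from the paper.
    """
    # A PE file begins with a DOS header whose first two bytes are "MZ".
    if len(data) < 0x40 or data[:2] != b"MZ":
        raise ValueError("not a PE file: missing MZ magic")
    # e_lfanew (offset 0x3C) stores the file offset of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("not a PE file: missing PE signature")
    return data[pe_offset:pe_offset + length]


def to_model_input(header: bytes, length: int = 256) -> List[int]:
    """Map bytes to integers 1..256 and right-pad with 0 to `length`.

    Reserving 0 for padding is a common convention, assumed here.
    """
    ids = [b + 1 for b in header[:length]]
    return ids + [0] * (length - len(ids))
```

The resulting integer sequence is the kind of input a byte-level neural network could consume, with no parsing of individual header fields beyond finding where the header starts.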

Malware Detection on Byte Streams of PDF Files Using Convolutional Neural Networks

TLDR
A convolutional neural network is designed to interpret high-level patterns among collectable spatial clues, thereby predicting whether the given byte sequence has malicious actions. The proposed network is shown to outperform several representative machine-learning models as well as other networks with different settings.

Malware Detection by Exploiting Deep Learning over Binary Programs

TLDR
This paper proposes an end-to-end malware detection framework consisting of a convolutional neural network, an autoencoder, and neural decision trees, which learns features from multiple domains for malware detection without feature engineering.

An Efficient Approach For Malware Detection Using PE Header Specifications

TLDR
Features extracted from the PE header and file structure are used to train several machine learning models, and the proposed method identifies malware programs with 95.59% accuracy using only nine features.

Malware Detection by Eating a Whole EXE

TLDR
This work introduces malware detection from raw byte sequences as a fruitful research area to the larger machine learning community and presents the initial work in building a solution to tackle this problem, which has linear complexity dependence on the sequence length, and allows for interpretable sub-regions of the binary to be identified.

Neurlux: dynamic malware analysis without feature engineering

TLDR
This paper proposes Neurlux, a neural network for malware detection that learns automatically from dynamic analysis reports that detail behavioral information, and investigates the learned features of the model and shows which components of the reports it tends to give the highest importance.

Instruction Cognitive One-Shot Malware Outbreak Detection

TLDR
A novel method is presented for detecting semantically similar malware variants within a campaign from a single raw binary malware executable, using the Discrete Fourier Transform of an instruction cognitive representation extracted from a self-attention transformer network.

Malware Classification Based on Shallow Neural Network

TLDR
SNNMAC, a malware classification model based on shallow neural networks and static analysis, is proposed. It outperforms most related work with 99.21% classification precision and reduces training time by more than half compared with methods using deep neural networks (DNNs).

Automatic Malware Description via Attribute Tagging and Similarity Embedding

TLDR
This work addresses the information gap between machine learning and signature-based detection methods by learning a representation space for malware samples in which files with similar malicious behaviors appear close to each other, and introduces a similarity index between malware files.

Learning from Context: Exploiting and Interpreting File Path Information for Better Malware Detection

TLDR
A multi-view neural network is proposed, which takes feature vectors from PE file content as well as corresponding file paths as inputs and outputs a detection score, and finds that the model learns useful aspects of the file path for classification, while also learning artifacts from customers testing the vendor's product.

Improved Deep Learning Model for Static PE Files Malware Detection and Classification

TLDR
A model capable of building a feature set from the dataset and classifying static PE files efficiently is proposed, using dense and dropout layers to minimize the resource strain on the model and deliver more accurate results in less time.
...

References

SHOWING 1-10 OF 82 REFERENCES

An investigation of byte n-gram features for malware classification

TLDR
This work discovered a flaw in how previous corpora were created that leads to an over-estimation of classification accuracy, and discovered that most of the information contained in n-grams stems from string features that could be obtained in simpler ways.

Recognizing Functions in Binaries with Neural Networks

TLDR
It is shown that recurrent neural networks can identify functions in binaries with greater accuracy and efficiency than the state-of-the-art machine-learning-based method.

Improving malware detection by applying multi-inducer ensemble

Malware classification with recurrent networks

TLDR
This work proposes a different approach, which, similar to natural language modeling, learns the language of malware spoken through the executed instructions and extracts robust, time domain features.

PE-Miner: Mining Structural Information to Detect Malicious Executables in Realtime

TLDR
The results show that the extracted features are robust to different packing techniques and that PE-Miner is also resilient to the majority of crafty evasion strategies.

Unknown Malcode Detection Using OPCODE Representation

TLDR
This work presents a full methodology for the detection of unknown malicious code, based on text categorization concepts, and indicates that greater than 99% accuracy can be achieved through the use of a training set that has a malicious file percentage lower than 15%, which is higher than in the previous experience with byte sequence n-gram representation.

Unknown malcode detection and the imbalance problem

TLDR
This work presents a methodology for the detection of unknown malicious code, which examines concepts from text categorization, based on n-grams extraction from the binary code and feature selection, and indicates that greater than 95% accuracy can be achieved through the use of a training set that has a malicious file content of less than 33.3%.

Structural analysis of binary executable headers for malware detection optimization

TLDR
Structural analysis tests that have been implemented in the DAVFI/OpenDAVFi project accurately detect malware and therefore greatly reduce the number of malware samples that have to be analyzed by subsequent modules in the detection chain.

McBoost: Boosting Scalability in Malware Collection and Analysis Using Statistical Classification of Executables

TLDR
A fast statistical malware detection tool that is intended to improve the scalability of existing malware collection and analysis approaches, McBoost reduces the overall time of analysis by classifying and filtering out the least suspicious binaries and passing only the most suspicious ones to a detailed binary analysis process for signature extraction.
...