An investigation of byte n-gram features for malware classification

@article{Raff2016AnIO,
  title={An investigation of byte n-gram features for malware classification},
  author={Edward Raff and Richard Zak and Russell Cox and Jared Sylvester and Paul Yacci and Rebecca Ward and Anna Tracy and Mark McLean and Charles K. Nicholas},
  journal={Journal of Computer Virology and Hacking Techniques},
  year={2016},
  volume={14},
  pages={1-20}
}
Malware classification using machine learning algorithms is a difficult task, in part due to the absence of strong natural features in raw executable binary files. We compute a regularization path and analyze novel multi-byte identifiers. Through this process, we discover significant previously unreported issues with byte n-gram features that cause their benefits and practicality to be overestimated. Three primary issues emerged from our work. First, we discovered a flaw in how previous corpora…
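For readers unfamiliar with the term, a byte n-gram is a contiguous window of n bytes taken from the raw file. The following is a minimal sketch of counting how many files contain each n-gram. It assumes document-frequency counting over in-memory byte strings; the function names `byte_ngrams` and `document_frequency` and the sample bytes are hypothetical illustrations, not the authors' pipeline.

```python
from collections import Counter


def byte_ngrams(data: bytes, n: int):
    """Yield every contiguous n-byte window of data (none if data is shorter than n)."""
    for i in range(len(data) - n + 1):
        yield data[i:i + n]


def document_frequency(files, n):
    """Count, for each n-gram, how many files contain it at least once."""
    df = Counter()
    for data in files:
        # set() so a file contributes at most 1 to each n-gram's count
        df.update(set(byte_ngrams(data, n)))
    return df


# Hypothetical toy inputs: two byte strings sharing a 4-byte prefix.
samples = [b"\x4d\x5a\x90\x00\x03", b"\x4d\x5a\x90\x00\xff"]
df = document_frequency(samples, n=4)
```

In this toy input, the shared prefix `4d 5a 90 00` is counted in both files, while the two trailing 4-grams each appear in only one.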
What can N-grams learn for malware detection?
TLDR
It is discovered that byte n-grams can learn from the code regions, but do not necessarily learn any new information, and that disambiguating instructions by their binary opcode, an approach not previously used for malware detection, is critical for model generalization.
Hash-Grams: Faster N-Gram Features for Classification and Malware Detection
TLDR
It is shown that the Hash-Gram approach can be up to three orders of magnitude faster than exact top-k selection algorithms, while dramatically reducing computational requirements.
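The general idea of replacing exact n-gram counting with hashed counting can be sketched as follows. This is an illustration of the hashing trick under stated assumptions, not the Hash-Gram implementation itself: the use of `zlib.crc32` as the hash, the bucket count, and the names `hashed_counts` and `top_k_buckets` are all hypothetical choices.

```python
import heapq
import zlib


def hashed_counts(files, n, num_buckets):
    """Count n-gram occurrences in a fixed-size table indexed by a hash of the n-gram."""
    counts = [0] * num_buckets
    for data in files:
        for i in range(len(data) - n + 1):
            # Distinct n-grams may collide in one bucket; that is the accepted trade-off.
            bucket = zlib.crc32(data[i:i + n]) % num_buckets
            counts[bucket] += 1
    return counts


def top_k_buckets(counts, k):
    """Return the indices of the k buckets with the largest counts."""
    return heapq.nlargest(k, range(len(counts)), key=counts.__getitem__)


# Hypothetical toy inputs.
files = [b"abcabc", b"abc"]
counts = hashed_counts(files, n=3, num_buckets=1024)
best = top_k_buckets(counts, k=1)
```

Memory use is fixed by `num_buckets` rather than by the number of distinct n-grams, which is why an approach of this kind avoids the cost of an exact top-k search.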
KiloGrams: Very Large N-Grams for Malware Classification
TLDR
This work presents a method to find the top-$k$ most frequent $n$-grams that is 60$\times$ faster for small $n$, and can tackle large $n\geq1024$.
Ensemble Malware Classification Using Neural Networks
TLDR
This work combines the approach of the winning solution to the Microsoft Malware Classification Challenge with the neural network approach, and uses a combination of n-grams features for both assembly (asm) and byte code to significantly improve the result.
A Survey of Machine Learning Methods and Challenges for Windows Malware Classification
TLDR
This survey aims to be useful both to cybersecurity practitioners who wish to learn more about how machine learning can be applied to the malware problem, and to give data scientists the necessary background into the challenges in this uniquely complicated space.
Malware Classification and Class Imbalance via Stochastic Hashed LZJD
TLDR
This work develops the new SHWeL feature vector representation, by extending the recently proposed Lempel-Ziv Jaccard Distance, which provides significantly improved accuracy while reducing algorithmic complexity to O(N).
Fusing Feature Engineering and Deep Learning: A Case Study for Malware Classification
TLDR
This paper presents a hybrid approach to address the task of malware classification by fusing multiple types of features defined by experts and features learned through deep learning from raw data that achieves state-of-the-art performance and outperforms gradient boosting and deep learning methods in the literature.
Information gain score computation for N-grams using multiprocessing model
TLDR
A multiprocessing model that computes IG scores rapidly for larger N-Gram datasets for heuristic analysis and is 80% faster than the sequential model of IG score computation.
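For context, the information gain (IG) of a single binary n-gram feature can be computed from four counts. The sketch below uses the standard definition, IG = H(Y) - H(Y | X), for one feature in sequence; it does not reproduce the multiprocessing model, and the function and parameter names are hypothetical.

```python
import math


def entropy(pos, neg):
    """Binary entropy in bits of a class split given by two counts."""
    total = pos + neg
    if total == 0:
        return 0.0
    h = 0.0
    for c in (pos, neg):
        if c:
            p = c / total
            h -= p * math.log2(p)
    return h


def information_gain(present_mal, present_ben, absent_mal, absent_ben):
    """IG of a binary feature: class entropy minus entropy conditioned on the feature."""
    n_present = present_mal + present_ben
    n_absent = absent_mal + absent_ben
    total = n_present + n_absent
    if total == 0:
        return 0.0
    h_y = entropy(present_mal + absent_mal, present_ben + absent_ben)
    h_cond = (n_present / total) * entropy(present_mal, present_ben) \
        + (n_absent / total) * entropy(absent_mal, absent_ben)
    return h_y - h_cond
```

An n-gram that appears in every malicious file and no benign file of a balanced set scores 1 bit, while one that appears equally often in both classes scores 0.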
MALGRA: Machine Learning and N-Gram Malware Feature Extraction and Detection System
TLDR
This paper uses a dynamic analysis technique to extract an Indicator of Compromise (IOC) for malicious files, which are represented using N-grams, and proposes TF-IDF as a novel alternative used to identify the most significant N-Gram features for training a machine learning algorithm.
Orthrus: A Bimodal Learning Architecture for Malware Classification
TLDR
In this work, Orthrus is introduced, a new bimodal approach to categorize malware into families based on deep learning that achieves higher classification performance than deep learning approaches in the literature and n-gram based methods.

References

N-gram analysis for computer virus detection
TLDR
A new feature selection measure, class-wise document frequency of byte n-grams, is proposed, and several classifiers are combined using Dempster-Shafer theory for better classification accuracy than any single classifier.
Learning to Detect and Classify Malicious Executables in the Wild
TLDR
The use of machine learning and data mining to detect and classify malicious executables as they appear in the wild is described, and it is suggested that the methodology could be used as the basis for an operational system for detecting previously undiscovered malicious executables.
Byte Level n–Gram Analysis for Malware Detection
TLDR
Experimental results are promising and show that the proposed approach can be used to effectively classify executables (malware and benign) while minimizing false alarms.
Mal-ID: Automatic Malware Detection Using Common Segment Analysis and Meta-Features
TLDR
This paper proposes several novel methods, based on machine learning, to detect malware in executable files without any need for preprocessing, such as unpacking or disassembling, and introduces two Mal-ID extensions that improve the Mal-ID basic method in various aspects.
Unknown malcode detection and the imbalance problem
TLDR
This work presents a methodology for the detection of unknown malicious code, which examines concepts from text categorization, based on n-grams extraction from the binary code and feature selection, and indicates that greater than 95% accuracy can be achieved through the use of a training set that has a malicious file content of less than 33.3%.
McBoost: Boosting Scalability in Malware Collection and Analysis Using Statistical Classification of Executables
TLDR
A fast statistical malware detection tool that is intended to improve the scalability of existing malware collection and analysis approaches, McBoost reduces the overall time of analysis by classifying and filtering out the least suspicious binaries and passing only the most suspicious ones to a detailed binary analysis process for signature extraction.
PE-Miner: Mining Structural Information to Detect Malicious Executables in Realtime
TLDR
The results show that the extracted features are robust to different packing techniques and that PE-Miner is also resilient to the majority of crafty evasion strategies.
N-grams-based File Signatures for Malware Detection
TLDR
It is shown that n-grams signatures provide an effective way to detect unknown malware while keeping low false positive ratio.
Automatic Generation of String Signatures for Malware Detection
TLDR
Hancock is the first string signature generation system that takes on this challenge on a large scale and features a scalable model that estimates the occurrence probability of arbitrary byte sequences in goodware programs, a set of library code identification techniques, and diversity-based heuristics that ensure the contexts in which a signature is embedded in containing malware files are similar to one another.