What can N-grams learn for malware detection?

Richard Zak, Edward Raff, Charles K. Nicholas
2017 12th International Conference on Malicious and Unwanted Software (MALWARE)
Recent work has shown that byte n-grams learn mostly low entropy features, such as function imports and strings, which has brought into question whether byte n-grams can learn information corresponding to higher entropy levels, such as binary code. We investigate that hypothesis in this work by performing byte n-gram analysis on only specific sub-sections of the binary file, and compare to results obtained by n-gram analysis on assembly code generated from disassembled binaries. We do this by… 
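
As a rough illustration of what byte n-gram analysis restricted to one sub-section of a file can look like, here is a minimal Python sketch. It is not the authors' implementation: the function name `byte_ngrams` and the `start`/`end` offsets are hypothetical, and in practice the offsets of a section would come from parsing the executable's headers.

```python
from collections import Counter
from typing import Optional


def byte_ngrams(data: bytes, n: int, start: int = 0,
                end: Optional[int] = None) -> Counter:
    """Count overlapping byte n-grams in data[start:end].

    `start` and `end` are hypothetical byte offsets that stand in for the
    bounds of one section of a binary (for example, its code section).
    """
    region = data[start:end]
    # Slide a window of width n over the region, one byte at a time.
    # If the region is shorter than n, the range is empty and so is the result.
    return Counter(region[i:i + n] for i in range(len(region) - n + 1))


# Toy input: seven bytes in which the 3-gram 55 8b ec occurs twice.
sample = b"\x55\x8b\xec\x55\x8b\xec\x90"
whole_file = byte_ngrams(sample, 3)
sub_section = byte_ngrams(sample, 3, start=3)
```

Counting over `sample[3:]` alone sees only two 3-grams, while the whole buffer yields five windows, which is the sense in which restricting the region changes the features a model can learn from.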


KiloGrams: Very Large N-Grams for Malware Classification
This work presents a method to find the top-$k$ most frequent $n$-grams that is 60$\times$ faster for small $n$, and can tackle large $n \geq 1024$.
Hash-Grams: Faster N-Gram Features for Classification and Malware Detection
It is shown that the Hash-Gram approach can be up to three orders of magnitude faster than exact top-k selection algorithms, while dramatically reducing computational requirements.
Malware Subspecies Detection Method by Suffix Arrays and Machine Learning
  • Kouhei Kita, R. Uda
  • Computer Science
    2021 55th Annual Conference on Information Sciences and Systems (CISS)
  • 2021
This work proposes a new malware subspecies detection method based on suffix arrays and machine learning that succeeds in classifying them with almost 100% accuracy.
Instruction Cognitive One-Shot Malware Outbreak Detection
This work presents a novel method for detecting semantically similar malware variants within a campaign from a single raw binary malware executable, using the Discrete Fourier Transform of an instruction cognitive representation extracted from a self-attention transformer network.
Ensemble Malware Classification Using Neural Networks
This work combines the approach of the winning solution to the Microsoft Malware Classification Challenge with the neural network approach, and uses a combination of n-gram features for both assembly (asm) and byte code to significantly improve the result.
Static Analysis for Malware Detection
This work compares features extracted from raw byte code, the PE header, and assembly code, selects the best-performing set of features, uses them to train a Gradient Boosting Tree, and shows that the proposed approach provides sufficient results for deployment in real applications.
Malware Detection Using Machine Learning and Deep Learning
The results show that Random Forest outperforms a Deep Neural Network when opcode frequency is used as a feature, that Deep Auto-Encoders are overkill for the dataset, and that elementary functions like Variance Threshold perform better than others.
Using Text Classification Methods to Detect Malware
This paper converts each binary executable to an assembly program, then uses text analytics to classify whether the code is malicious or not, and achieves an F1 accuracy of 86%.
This thesis attempts to use machine-learning techniques to successfully identify previously unknown malware from a set of Windows executable programs by analyzing the histogram of 4-, 8-, and 16-bit-sequence values contained in each program.
Machine-Learning-Based Malware Detection for Virtual Machine by Analyzing Opcode Sequence
This research proposes a novel static analysis method for unknown malware detection based on opcode n-gram features of executable files, which achieves a best accuracy of 98.2%.


An investigation of byte n-gram features for malware classification
This work discovered a flaw in how previous corpora were created that leads to an over-estimation of classification accuracy, and discovered that most of the information contained in n-grams stems from string features that could be obtained in simpler ways.
Unknown Malcode Detection Using OPCODE Representation
This work presents a full methodology for the detection of unknown malicious code, based on text categorization concepts, and indicates that greater than 99% accuracy can be achieved through the use of a training set that has a malicious file percentage lower than 15%, which is higher than in the previous experience with byte sequence n-gram representation.
Exploring Discriminatory Features for Automated Malware Classification
This work conducts a systematic study on the discriminative power of various types of features extracted from malware programs, and experiments with different combinations of feature selection algorithms and classifiers to offer insights into what features most distinguish malware families.
Using File Relationships in Malware Classification
It is shown that since malicious files are often included in multiple malware containers, the system's detection accuracy can be significantly improved, particularly at low false positive rates which are the main operating points for automated malware classifiers.
Detecting unknown malicious code by applying classification techniques on OpCode patterns
The imbalance problem is investigated, referring to several real-life scenarios in which malicious files are expected to be about 10% of the total inspected files, and a chronological evaluation showed a clear trend in which performance improves as the training set is kept more up to date.
Learning to Detect and Classify Malicious Executables in the Wild
The use of machine learning and data mining to detect and classify malicious executables as they appear in the wild is described, and it is suggested that the methodology could be used as the basis for an operational system for detecting previously undiscovered malicious executables.
BYTEWEIGHT: Learning to Recognize Functions in Binary Code
BYTEWEIGHT, a new automatic function identification algorithm that automatically learns key features for recognizing functions and can therefore easily be adapted to different platforms, new compilers, and new optimizations, is proposed.
Towards Stealthy Malware Detection
This work proposes the use of statistical binary content analysis of files in order to detect suspicious anomalous file segments that may suggest insertion of malcode, and performs tests to determine whether known malcode can be easily distinguished from otherwise “normal” Windows executables, and whether self-encrypted files may be easy to spot.
Malware detection using assembly and API call sequences
This paper presents detection algorithms that can help the anti-virus community to ensure a variant of a known malware can still be detected without the need of creating a signature; a similarity analysis is performed to produce a matrix of similarity scores that can be utilized to determine the likelihood that a piece of code under inspection contains a particular malware.
Deep neural network based malware detection using two dimensional binary program features
A deep neural network based malware detection system that Invincea has developed is introduced, which achieves a usable detection rate at an extremely low false positive rate and scales to real world training example volumes on commodity hardware.