KiloGrams: Very Large N-Grams for Malware Classification

@article{Raff2019KiloGramsVL,
  title={KiloGrams: Very Large N-Grams for Malware Classification},
  author={Edward Raff and William Fleming and Richard Zak and H. Anderson and Bill Finlayson and Charles K. Nicholas and Mark McLean},
  journal={ArXiv},
  year={2019},
  volume={abs/1908.00200}
}
N-grams have been a common tool for information retrieval and machine learning applications for decades. In nearly all previous works, only a few values of $n$ are tested, with $n > 6$ being exceedingly rare. Larger values of $n$ are not tested due to computational burden or the fear of overfitting. In this work, we present a method to find the top-$k$ most frequent $n$-grams that is 60$\times$ faster for small $n$, and can tackle large $n\geq1024$. Despite the unprecedented size of $n… 

Automatic Yara Rule Generation Using Biclustering
TLDR
This paper uses large n-grams combined with a new biclustering algorithm to construct simple Yara rules more effectively than currently available software, and demonstrates that AutoYara can help reduce analyst workload by producing rules with useful true-positive rates while maintaining low false-positive rates.
Malware Subspecies Detection Method by Suffix Arrays and Machine Learning
  • Kouhei Kita, R. Uda
  • Computer Science
    2021 55th Annual Conference on Information Sciences and Systems (CISS)
  • 2021
TLDR
This work proposes a new malware subspecies detection method based on suffix arrays and machine learning that classifies subspecies with almost 100% accuracy.
A Survey of Machine Learning Methods and Challenges for Windows Malware Classification
TLDR
This survey aims to be useful both to cybersecurity practitioners who wish to learn more about how machine learning can be applied to the malware problem, and to give data scientists the necessary background into the challenges in this uniquely complicated space.
Leveraging Uncertainty for Improved Static Malware Detection Under Extreme False Positive Constraints
TLDR
This work improves the true positive rate (TPR) at an actual realized FPR of 1e-5 from an expected 0.69 for previous methods to 0.80 on the best performing model class on the Sophos industry scale dataset.
YARA-Signator: Automated Generation of Code-based YARA Rules
TLDR
YARA-Signator is presented, an approach for the automated generation of code-based YARA rules that is based on the isolation of instruction n-grams that on the one hand appear frequently within a malware family and on the other hand are not found in any other family.
Malware Detection Using Frequency Domain-Based Image Visualization and Deep Learning
TLDR
A novel method to detect and visualize malware through image classification that is able to generalize well on larger unseen malware samples and the results compare favorably with state-of-the-art static analysis-based malware detection algorithms.
Malware Detection for Forensic Memory Using Deep Recurrent Neural Networks
TLDR
The bidirectional LSTM with Attention proved to be the best model, used on basic block sequences of size 29, and the differences between the model's ROC curves indicate a strong reliance on the lower level, instructional features, as opposed to metadata or string features.
Training Transformers for Information Security Tasks: A Case Study on Malicious URL Prediction
TLDR
A malicious/benign URL predictor based on a transformer architecture that is trained from scratch is implemented and a method for mixed objective optimization, which dynamically balances contributions from both loss terms so that neither one of them dominates, is introduced.
A Quantum Algorithm To Locate Unknown Hashgrams
TLDR
By loading the table of hashes and n-grams into a quantum computer, this work can speed up the process of mapping n-grams to their hashes and prevent one from having to re-compute hashes for a set of n-grams, which can take on average O(MN) time.
Transformers for End-to-End InfoSec Tasks: A Feasibility Study
TLDR
This paper implements transformer models for two distinct InfoSec data formats in a novel end-to-end approach, and introduces a method for mixed objective optimization, which dynamically balances contributions from both loss terms so that neither one of them dominates.

References

SHOWING 1-10 OF 47 REFERENCES
An investigation of byte n-gram features for malware classification
TLDR
This work discovered a flaw in how previous corpora were created that leads to an over-estimation of classification accuracy, and discovered that most of the information contained in n-grams stems from string features that could be obtained in simpler ways.
What can N-grams learn for malware detection?
TLDR
It is discovered that byte n-grams can learn from the code regions, but do not necessarily learn any new information, and that disambiguating instructions by their binary opcode, an approach not previously used for malware detection, is critical for model generalization.
Hash-Grams: Faster N-Gram Features for Classification and Malware Detection
TLDR
It is shown that the Hash-Gram approach can be up to three orders of magnitude faster than exact top-k selection algorithms, while dramatically reducing computational requirements.
Learning to Detect and Classify Malicious Executables in the Wild
TLDR
The use of machine learning and data mining to detect and classify malicious executables as they appear in the wild is described, and it is suggested that the methodology could be used as the basis for an operational system for detecting previously undiscovered malicious executables.
N-gram analysis for computer virus detection
TLDR
A new feature selection measure, class-wise document frequency of byte n-grams, is proposed, and several classifiers are combined using Dempster-Shafer theory for better classification accuracy than any single classifier.
Malware Classification and Class Imbalance via Stochastic Hashed LZJD
TLDR
This work develops the new SHWeL feature vector representation, by extending the recently proposed Lempel-Ziv Jaccard Distance, which provides significantly improved accuracy while reducing algorithmic complexity to O(N).
Large-Scale Identification of Malicious Singleton Files
TLDR
A large-scale study of the properties, characteristics, and distribution of benign and malicious singleton files, which builds a classifier based purely on static features to identify 92% of the remaining malicious singletons at a 1.4% false positive rate.
BitShred: feature hashing malware for scalable triage and semantic analysis
TLDR
The key idea behind BitShred is using feature hashing to dramatically reduce the high-dimensional feature spaces that are common in malware analysis, and to mine correlated features between malware families and samples using co-clustering techniques.
Detecting unknown malicious code by applying classification techniques on OpCode patterns
TLDR
The imbalance problem is investigated, referring to several real-life scenarios in which malicious files are expected to be about 10% of the total inspected files, and a chronological evaluation showed a clear trend in which the performance improves as the training set is more updated.
AVclass: A Tool for Massive Malware Labeling
TLDR
AVclass is described, an automatic labeling tool that, given the AV labels for a potentially massive number of samples, outputs the most likely family names for each sample, and implements novel automatic techniques to address 3 key challenges: normalization, removal of generic tokens, and alias detection.