Automatic Yara Rule Generation Using Biclustering

  • Edward Raff, Richard Zak, Gary Lopez Munoz, William Fleming, H. Anderson, Bobby Filar, Charles K. Nicholas, James Holt
  • Published 6 September 2020
  • Computer Science
  • Proceedings of the 13th ACM Workshop on Artificial Intelligence and Security
Yara rules are a ubiquitous tool among cybersecurity practitioners and analysts. Developing high-quality Yara rules to detect a malware family of interest can be labor- and time-intensive, even for expert users. Few tools exist and relatively little work has been done on how to automate the generation of Yara rules for specific families. In this paper, we leverage large n-grams (n ≥ 8) combined with a new biclustering algorithm to construct simple Yara rules more effectively than currently… 
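To make the idea of building a Yara rule from large byte n-grams concrete, here is a minimal Python sketch. It is not the paper's biclustering algorithm, which the truncated abstract does not describe. It substitutes a much simpler frequency filter: keep the 8-byte n-grams that occur in at least a chosen fraction of a family's samples and in none of the benign files, then emit them as hex strings in a rule. The function names (`byte_ngrams`, `shared_ngrams`, `to_yara`), the thresholds, and the toy byte strings are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def byte_ngrams(data, n=8):
    """Return the set of distinct n-byte windows found in data."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def shared_ngrams(samples, benign, n=8, min_fraction=0.5, top=5):
    """Pick n-grams seen in many family samples and in no benign file.

    Assumption: a plain frequency filter stands in for the paper's
    biclustering step, which is not reproduced here.
    """
    counts = Counter()
    for sample in samples:
        # Count each n-gram once per sample (document frequency).
        counts.update(byte_ngrams(sample, n))

    benign_grams = set()
    for clean in benign:
        benign_grams |= byte_ngrams(clean, n)

    threshold = min_fraction * len(samples)
    kept = [
        (gram, count)
        for gram, count in counts.items()
        if count >= threshold and gram not in benign_grams
    ]
    # Most frequent first; ties broken by byte order for determinism.
    kept.sort(key=lambda item: (-item[1], item[0]))
    return [gram for gram, _ in kept[:top]]


def to_yara(rule_name, grams, min_matches):
    """Render the n-grams as Yara hex strings with an 'N of them' condition."""
    lines = ["rule " + rule_name, "{", "    strings:"]
    for i, gram in enumerate(grams):
        hex_bytes = " ".join(f"{b:02X}" for b in gram)
        lines.append(f"        $s{i} = {{ {hex_bytes} }}")
    lines += ["    condition:", f"        {min_matches} of them", "}"]
    return "\n".join(lines)


if __name__ == "__main__":
    # Toy stand-ins for malware and benign file contents.
    family = [b"AAAAHELLOWORLDBBBB", b"ZZHELLOWORLDYY"]
    clean = [b"nothing here at all"]
    grams = shared_ngrams(family, clean, n=8, min_fraction=1.0)
    print(to_yara("toy_family", grams, min_matches=2))
```

Counting each n-gram once per sample keeps a single file that repeats a byte pattern many times from dominating the selection. A real pipeline would also need a far larger benign corpus to keep false positives down.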


Marvolo: Programmatic Data Augmentation for Practical ML-Driven Malware Detection
Marvolo is a binary mutator that programmatically grows malware (and benign) datasets in a manner that boosts the accuracy of ML-driven malware detectors, and it embeds several key optimizations that keep costs low for practitioners by maximizing the density of diverse data samples generated within a given time (or resource) budget.
Pattern Matching in YARA: Improved Aho-Corasick Algorithm
This paper identifies several reasons why regular expressions can slow down scanning, all rooted in the nature of the Aho-Corasick algorithm that YARA uses, and proposes a new version of the algorithm, which it implements in the original version of YARA.
Jigsaw Puzzle: Selective Backdoor Attack to Subvert Malware Classifiers
This paper proposes a new attack, Jigsaw Puzzle (JP), based on the key observation that malware authors have little to no incentive to protect any other authors’ malware but their own, which is effective as a backdoor, remains stealthy against state-of-the-art defenses, and is a threat in realistic settings that depart from reasoning about feature-space only attacks.
Threat Intelligence Feed For Mobile Applications
This thesis aims to create a threat intelligence feed composed of IOCs found within YARA rules and metadata associated with the rules, which results in a collection of 9095 IOCs distributed over 78 types of malware.
DEFInit: An Analysis of Exposed Android Init Routines
It was found that custom Init routines added by vendors were substantial and had significant security impact, allowing unprivileged apps to perform sensitive functionalities without user interaction, including disabling SELinux enforcement, sniffing network traffic, reading system logs, among others.
Healthcare Biclustering-Based Prediction on Gene Expression Dataset
The results show that the proposed FCM biclustering method has a higher average match score and a shorter run time than the existing PSO-SA and fuzzy logic healthcare biclustering methods.
Goodness-of-fit Test on the Number of Biclusters in Relational Data Matrix
A new statistical test on the number of biclusters that does not require the regular-grid assumption is proposed, and the asymptotic behavior of the proposed test statistic is derived in both null and alternative cases.
Identifying Authorship Style in Malicious Binaries: Techniques, Challenges & Datasets
The largest and most diverse metainformation dataset, comprising 15,660 malware samples attributed to 164 threat actor groups, is published to mitigate the lack of ground-truth datasets in this domain.


KiloGrams: Very Large N-Grams for Malware Classification
This work presents a method to find the top-$k$ most frequent $n$-grams that is $60\times$ faster for small $n$, and can tackle large $n \geq 1024$.
AVclass: A Tool for Massive Malware Labeling
AVclass is an automatic labeling tool that, given the AV labels for a potentially massive number of samples, outputs the most likely family name for each sample. It implements novel automatic techniques to address three key challenges: normalization, removal of generic tokens, and alias detection.
Prudent Practices for Designing Malware Experiments: Status Quo and Outlook
Study of methodological rigor and prudence in 36 academic publications from 2006-2011 that rely on malware execution finds frequent shortcomings, including problematic assumptions regarding the use of execution-driven datasets, absence of description of security precautions taken during experiments, and oftentimes insufficient description of the experimental setup.
Empirical assessment of machine learning-based malware detectors for Android
The purpose of malware detection is revisited to discuss whether in-the-lab validation scenarios provide reliable indications of how malware detectors perform in real-world settings, that is, in the wild.
Malware Classification and Class Imbalance via Stochastic Hashed LZJD
This work develops the new SHWeL feature vector representation, by extending the recently proposed Lempel-Ziv Jaccard Distance, which provides significantly improved accuracy while reducing algorithmic complexity to O(N).
Automatic Generation of String Signatures for Malware Detection
Hancock is the first string signature generation system that takes on this challenge on a large scale and features a scalable model that estimates the occurrence probability of arbitrary byte sequences in goodware programs, a set of library code identification techniques, and diversity-based heuristics that ensure the contexts in which a signature is embedded in containing malware files are similar to one another.
Misleading worm signature generators using deliberate noise injection
A new and general class of attacks whereby a worm can combine polymorphism and misleading behavior to intentionally pollute the dataset of suspicious flows during its propagation and successfully mislead the automatic signature generation process is described.
DeepSign: Deep learning for automatic malware signature generation and classification
  • O. David, N. Netanyahu
  • Computer Science
  • 2015 International Joint Conference on Neural Networks (IJCNN)
  • 2015
The results presented in this paper show that signatures generated by the DBN allow for an accurate classification of new malware variants, and the presented method achieves 98.6% classification accuracy using the signatures.
Polygraph: automatically generating signatures for polymorphic worms
This paper presents Polygraph, a signature generation system that successfully produces signatures that match polymorphic worms, and proposes classes of signature suited for matching polymorphic worm payloads and presents algorithms for automatic generation of signatures in these classes.
Would a File by Any Other Name Seem as Malicious?
It is demonstrated that file names can contain information predictive of the presence of malware in a file, and the effectiveness of a character-level convolutional neural network at predicting malware status using file names on Endgame’s EMBER malware detection benchmark dataset is shown.