An investigation of byte n-gram features for malware classification
TLDR
This work discovered a flaw in how previous corpora were created that leads to an over-estimation of classification accuracy, and found that most of the information contained in n-grams stems from string features that could be obtained in simpler ways.
Static Malware Detection & Subterfuge: Quantifying the Robustness of Machine Learning and Current Anti-Virus
TLDR
Through these experiments, this work shows in a quantifiable way how purely ML-based systems can be more robust than AV products at detecting malware that attempts evasion through modification, but may be slower to adapt in the face of significantly novel attacks.
Classifying Sequences of Extreme Length with Constant Memory Applied to Malware Detection
TLDR
This work develops a new approach to temporal max pooling that makes the required memory invariant to the sequence length $T$, which makes MalConv more memory efficient and up to $25.8\times$ faster to train on its original dataset, while removing the input length restrictions of MalConv.
KiloGrams: Very Large N-Grams for Malware Classification
TLDR
This work presents a method to find the top-$k$ most frequent $n$-grams that is $60\times$ faster for small $n$, and can tackle large $n \geq 1024$.
Creating Cybersecurity Knowledge Graphs From Malware After Action Reports
TLDR
This paper describes a system to extract information from AARs, aggregate the extracted information by fusing similar entities together, and represent that extracted information in a Cybersecurity Knowledge Graph (CKG).
RelExt: Relation Extraction using Deep Learning approaches for Cybersecurity Knowledge Graph Improvement
TLDR
This work proposes a system to create semantic triples over cybersecurity text, using deep learning approaches to extract possible relationships, and asserts the set of semantic triples generated by this system into a cybersecurity knowledge graph.
What can N-grams learn for malware detection?
TLDR
It is discovered that byte n-grams can learn from the code regions, but do not necessarily learn any new information, and that disambiguating instructions by their binary opcode, an approach not previously used for malware detection, is critical for model generalization.
Automatic Yara Rule Generation Using Biclustering
TLDR
This paper uses large n-grams combined with a new biclustering algorithm to construct simple Yara rules more effectively than currently available software, and demonstrates that AutoYara can help reduce analyst workload by producing rules with useful true-positive rates while maintaining low false-positive rates.
RelExt