• Corpus ID: 219708567

A Survey of Machine Learning Methods and Challenges for Windows Malware Classification

Authors: Edward Raff and Charles K. Nicholas
Malware classification is a difficult problem to which machine learning methods have been applied for decades. Yet progress has often been slow, in part because of a number of unique difficulties that arise at every stage of developing a machine learning system: data collection, labeling, feature creation and selection, model selection, and evaluation. In this survey we will review a number of the current methods and challenges related to malware classification, including…


Backdoor Attacks and Countermeasures on Deep Learning: A Comprehensive Review

This work provides the community with a timely, comprehensive review of backdoor attacks and countermeasures on deep learning, and presents key areas for future research on backdoors, such as empirical security evaluations of physical trigger attacks and more efficient and practical countermeasures.

Automatic Yara Rule Generation Using Biclustering

This paper uses large n-grams combined with a new biclustering algorithm to construct simple Yara rules more effectively than currently available software, and demonstrates that AutoYara can help reduce analyst workload by producing rules with useful true-positive rates while maintaining low false-positive rates.

Practical Cross-modal Manifold Alignment for Grounded Language

A cross-modality manifold alignment procedure that leverages triplet loss to jointly learn consistent, multi-modal embeddings of language-based concepts of real-world items that outperforms four baselines, including a state-of-the-art approach, across five evaluation metrics.

Marvolo: Programmatic Data Augmentation for Practical ML-Driven Malware Detection

MARVOLO is a binary mutator that programmatically grows malware (and benign) datasets in a manner that boosts the accuracy of ML-driven malware detectors and embeds several key optimizations that keep costs low for practitioners by maximizing the density of diverse data samples generated within a given time (or resource) budget.

Malware and Ransomware Detection Models

A novel and flexible ransomware detection model that combines two optimized models is introduced; it demonstrates good accuracy and F1 scores.

Parallel Instance Filtering for Malware Detection

The PIF algorithm outperforms existing instance selection methods used in the experiments in terms of the ratio between average classification accuracy and storage percentage and outperforms several state-of-the-art instance selection algorithms on a large data set of 500,000 malicious and benign samples.

MERLIN - Malware Evasion with Reinforcement LearnINg

This paper proposes a method using reinforcement learning with DQN and REINFORCE algorithms to challenge two state-of-the-art ML-based detection engines and a commercial antivirus (AV) classified by Gartner as a leader AV, and demonstrates that REINFORCE achieves very good evasion rates even on a commercial AV with limited available information.

Toward the Detection of Polyglot Files

These models outperformed existing methods and could be incorporated into a malware detector’s file processing pipeline to filter out potentially malicious polyglots before file type-dependent feature extraction takes place.



DL4MD: A Deep Learning Framework for Intelligent Malware Detection

This paper studies how a deep learning architecture using the stacked AutoEncoders (SAEs) model can be designed for intelligent malware detection based on the Windows Application Programming Interface (API) calls extracted from the Portable Executable (PE) files.

Exploring Discriminatory Features for Automated Malware Classification

This work conducts a systematic study on the discriminative power of various types of features extracted from malware programs, and experiments with different combinations of feature selection algorithms and classifiers to offer insights into what features most distinguish malware families.

An investigation of byte n-gram features for malware classification

This work discovered a flaw in how previous corpora were created that leads to an over-estimation of classification accuracy, and discovered that most of the information contained in n-grams stems from string features that could be obtained in simpler ways.

When Malware is Packin' Heat; Limits of Machine Learning Classifiers Based on Static Analysis Features

It is demonstrated that the signals extracted from packed executables are not rich enough for machine-learning-based models to generalize their knowledge to operate on unseen packers, and be robust against adversarial examples.

HDM-Analyser: a hybrid analysis approach based on data mining techniques for malware detection

A novel hybrid approach, HDM-Analyser, is presented which takes advantage of dynamic and static analysis methods to increase speed while preserving accuracy at a reasonable level, and achieves better overall accuracy and time complexity than static and dynamic analysis methods.

Analysis of Machine learning Techniques Used in Behavior-Based Malware Detection

It can be concluded that a proof-of-concept based on automatic behavior-based malware analysis and the use of machine learning techniques could detect malware quite effectively and efficiently.

Ensemble Models for Data-driven Prediction of Malware Infections

ESM can effectively predict malware infection ratios over time up to 4 times better than several baselines on various metrics, and its performance is stable and robust even when the number of detected infections is low.

Learning and Classification of Malware Behavior

The effectiveness of the proposed method for learning and discrimination of malware behavior is demonstrated, especially in detecting novel instances of malware families previously not recognized by commercial anti-virus software.

Deep Learning for Classification of Malware System Call Sequences

The increase in the number and variety of malware samples amplifies the need for improvement in automatic detection and classification of malware variants, and neural network methodology has matured to a state where it can surpass the limitations of previous machine learning methods.

MtNet: A Multi-Task Neural Network for Dynamic Malware Classification

A new multi-task, deep learning architecture for malware classification is proposed for the binary (i.e., malware versus benign) malware classification task; it achieves a binary classification error rate of 0.358% and, for the first time, sees improvements using multiple layers in a deep neural network architecture for ransomware classification.