Malware Detection in PDF and Office Documents: A survey

@article{Singh2020MalwareDI,
  title={Malware Detection in PDF and Office Documents: A survey},
  author={Priyanshi Singh and Shashikala Tapaswi and Sanchit Gupta},
  journal={Information Security Journal: A Global Perspective},
  year={2020},
  volume={29},
  pages={134--153}
}
ABSTRACT In 2018, with the internet being treated as a utility on equal grounds as clean water or air, the underground malicious software economy is flourishing with an influx of growth and sophistication in the attacks. The use of malicious documents has increased rapidly in the last five years along with a spectrum of attacks. They offer flexibility in document structure with numerous features for attackers to exploit. Despite efforts from industry and research communities, this remains a… 
Detection of macro-based attacks in office documents using Machine Learning
TLDR
A broad classification of macro-based malicious document attacks is provided, along with a detailed description of the attack opportunities available using office documents, and a hybrid malware analysis technique is proposed which thoroughly analyzes the file for any macro attacks.
Analysis and Correlation of Visual Evidence in Campaigns of Malicious Office Documents
TLDR
This article proposes a mechanism to extract and analyse the different components of the files, including these visual elements, and to construct lightweight signatures based on them. The approach is tested and validated on an extensive database of malware samples, obtaining accuracy above 99% in distinguishing between benign and malicious files.
HAPSSA: Holistic Approach to PDF malware detection using Signal and Statistical Analysis
TLDR
This paper derives a simple yet effective holistic approach to PDF malware detection that leverages signal and statistical analysis of malware binaries and shows that this holistic approach maintains a high detection rate of PDF malware and even detects new malicious files created by simple methods.
Toward Robust Classifiers for PDF Malware Detection
TLDR
This study proposes two models for PDF malware detection that can distinguish the different vulnerabilities exploited in malicious files and achieve excellent performance in terms of generalization ability, accuracy, and robustness.
Detecting malicious PDF using CNN
TLDR
This work proposes a novel algorithm that uses an ensemble of Convolutional Neural Networks on the byte level of the file, without any handcrafted features, to maintain a high detection rate of PDF malware, and even detects new malicious files still undetected by most antiviruses.
Invasive weed optimization with stacked long short term memory for PDF malware detection and classification
TLDR
An Invasive Weed Optimization with Stacked Long Short Term Memory (IWO-S-LSTM) technique is presented for PDF malware detection and classification, and the experimental outcomes show the promising performance of the IWO-S-LSTM technique over the other approaches.
An Improved Method of Detecting Macro Malware on an Imbalanced Dataset
TLDR
This paper proposes an improved method of detecting macro malware on an imbalanced dataset that mitigates the class imbalance problem and can detect completely new malware regardless of the family type, and reveals that LSI is more robust than Doc2vec to the class imbalance problem.

References

Identifying Drawbacks in Malicious PDF Detectors
TLDR
A survey of all recent malicious PDF detectors, followed by a comparative evaluation of the available tools, shows that concept drift is a major drawback of the detectors, even though many detectors use machine learning approaches.
BISSAM: Automatic Vulnerability Identification of Office Documents
TLDR
This paper presents a novel approach to detect and identify the actual vulnerability exploited by a malicious document and extract the exploit code itself from a security patch.
PDF Scrutinizer: Detecting JavaScript-based attacks in PDF documents
TLDR
This paper uses static as well as dynamic techniques to detect malicious behavior in an emulated environment, and shows that PDF Scrutinizer reliably detects current malicious documents while keeping a low false-positive rate and reasonable runtime performance.
A survey on malware propagation, analysis, and detection
TLDR
A detailed review has been conducted on the current situation of malware infection and the work done to improve anti-malware or malware detection systems and provides an up-to-date comparative reference for developers of malware detection systems.
Static detection of malicious JavaScript-bearing PDF documents
TLDR
This contribution presents a technique for detection of JavaScript-bearing malicious PDF documents based on static analysis of extracted JavaScript code that has proved to be effective against both known and unknown malware and suitable for large-scale batch processing.
A Survey on Malware and Malware Detection Systems
TLDR
A detailed review has been conducted on the current situation of malware infection and the work done to improve anti-malware or malware detection systems and provides an up-to-date comparative reference for developers of malware detection systems.
Hidost: a static machine-learning-based detector of malicious files
TLDR
Hidost is introduced, the first static machine-learning-based malware detection system designed to operate on multiple file formats; it outperformed all antivirus engines deployed by the website VirusTotal, detecting the highest number of malicious PDF files, and ranked among the best on SWF malware.
A Pattern Recognition System for Malicious PDF Files Detection
TLDR
An innovative technique, which combines a feature extractor module strongly related to the structure of PDF files and an effective classifier, is presented, which has proven to be more effective than other state-of-the-art research tools for malicious PDF detection.