Malicious PDF detection using metadata and structural features

Authors:
Charles Smutz;Angelos Stavrou
Affiliations:
George Mason University, Fairfax, VA;George Mason University, Fairfax, VA
Venue:
Proceedings of the 28th Annual Computer Security Applications Conference
Year:
2012

Abstract

Owing to their versatile functionality and widespread adoption, PDF documents have become a popular avenue for user exploitation, ranging from large-scale phishing attacks to targeted attacks. In this paper, we present a framework for robust detection of malicious documents through machine learning. Our approach is based on features extracted from document metadata and structure. Using real-world datasets, we demonstrate the adequacy of these document properties for malware detection and the durability of these features across new malware variants. Our analysis shows that the Random Forests classification method, an ensemble classifier that randomly selects features for each individual classification tree, yields the best detection rates, even on previously unseen malware. Indeed, using multiple datasets containing an aggregate of over 5,000 unique malicious documents and over 100,000 benign ones, our classification rates remain well above 99% while maintaining low false positive rates of 0.2% or less across different classification parameters and experimental scenarios. Moreover, the classifier can detect documents crafted for targeted attacks and separate them from broadly distributed malicious PDF documents. Remarkably, we also discovered that by artificially reducing the influence of the top features in the classifier, we can still achieve a high rate of detection in an adversarial setting where the attacker is aware of both the top features utilized in the classifier and our normality model. Thus, the classifier is resilient against mimicry attacks even when the attacker knows the document features, classification method, and training set.
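To make the idea of "features extracted from document metadata and structure" concrete, the following is a minimal sketch of how such features might be pulled from raw PDF bytes. The token list, the metadata keys, and the function name `extract_features` are assumptions chosen for illustration; the abstract does not enumerate the actual feature set, and this is not the authors' extractor.

```python
import re

# Hypothetical structural tokens and metadata keys, picked for illustration only.
STRUCTURAL_TOKENS = [b"/JavaScript", b"/JS", b"/OpenAction", b"/Page", b"/Font"]
METADATA_KEYS = [b"/Author", b"/Producer", b"/Title"]


def extract_features(pdf_bytes: bytes) -> dict:
    """Build a simple feature dictionary from raw PDF bytes (no full parsing)."""
    features = {"size_bytes": len(pdf_bytes)}

    # Indirect object headers look like "12 0 obj".
    features["count_obj"] = len(re.findall(rb"\d+\s+\d+\s+obj\b", pdf_bytes))

    for token in STRUCTURAL_TOKENS:
        # The negative lookahead keeps "/Page" from also matching "/Pages".
        pattern = re.escape(token) + rb"(?![A-Za-z])"
        name = "count_" + token.decode("ascii").lstrip("/").lower()
        features[name] = len(re.findall(pattern, pdf_bytes))

    for key in METADATA_KEYS:
        # Presence flag only; metadata values are not interpreted here.
        name = "has_" + key.decode("ascii").lstrip("/").lower()
        features[name] = int(key in pdf_bytes)

    return features
```

Feature dictionaries of this kind could be converted to fixed-order numeric vectors and passed to any Random Forests implementation (for example, scikit-learn's `RandomForestClassifier`). This sketch only scans raw bytes, so it would miss tokens hidden inside compressed or encoded streams.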