A Study of Malcode-Bearing Documents
DIMVA '07 Proceedings of the 4th international conference on Detection of Intrusions and Malware, and Vulnerability Assessment
Embedded Malware Detection Using Markov n-Grams
DIMVA '08 Proceedings of the 5th international conference on Detection of Intrusions and Malware, and Vulnerability Assessment
A look at Portable Document Format vulnerabilities
Information Security Tech. Report
Defending Browsers against Drive-by Downloads: Mitigating Heap-Spraying Code Injection Attacks
DIMVA '09 Proceedings of the 6th International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment
Malware detection using statistical analysis of byte-level file content
Proceedings of the ACM SIGKDD Workshop on CyberSecurity and Intelligence Informatics
Detection and analysis of drive-by-download attacks and malicious JavaScript code
Proceedings of the 19th international conference on World wide web
Scalable web object inspection and malfease collection
HotSec'10 Proceedings of the 5th USENIX conference on Hot topics in security
Malicious PDF Documents Explained
IEEE Security and Privacy
Combining static and dynamic analysis for the detection of malicious documents
Proceedings of the Fourth European Workshop on System Security
Static detection of malicious JavaScript-bearing PDF documents
Proceedings of the 27th Annual Computer Security Applications Conference
Auto-learning of SMTP TCP transport-layer features for spam and abusive message detection
LISA'11 Proceedings of the 25th international conference on Large Installation System Administration
A pattern recognition system for malicious PDF files detection
MLDM'12 Proceedings of the 8th international conference on Machine Learning and Data Mining in Pattern Recognition
Proceedings of the 8th ACM SIGSAC symposium on Information, computer and communications security
Hi-index | 0.00 |
Owed to their versatile functionality and widespread adoption, PDF documents have become a popular avenue for user exploitation ranging from large-scale phishing attacks to targeted attacks. In this paper, we present a framework for robust detection of malicious documents through machine learning. Our approach is based on features extracted from document metadata and structure. Using real-world datasets, we demonstrate the the adequacy of these document properties for malware detection and the durability of these features across new malware variants. Our analysis shows that the Random Forests classification method, an ensemble classifier that randomly selects features for each individual classification tree, yields the best detection rates, even on previously unseen malware. Indeed, using multiple datasets containing an aggregate of over 5,000 unique malicious documents and over 100,000 benign ones, our classification rates remain well above 99% while maintaining low false positives of 0.2% or less for different classification parameters and experimental scenarios. Moreover, the classifier has the ability to detect documents crafted for targeted attacks and separate them from broadly distributed malicious PDF documents. Remarkably, we also discovered that by artificially reducing the influence of the top features in the classifier, we can still achieve a high rate of detection in an adversarial setting where the attacker is aware of both the top features utilized in the classifier and our normality model. Thus, the classifier is resilient against mimicry attacks even with knowledge of the document features, classification method, and training set.