A practical approach on clustering malicious PDF documents

Authors:
Cristina Vatamanu;Dragoş Gavriluţ;Răzvan Benchea
Affiliations:
BitDefender AntiMalware Laboratory, Iaşi, Romania and Gheorghe Asachi University, Iaşi, Romania;BitDefender AntiMalware Laboratory, Iaşi, Romania and Alexandru Ioan Cuza University, Iaşi, Romania;BitDefender AntiMalware Laboratory, Iaşi, Romania and Alexandru Ioan Cuza University, Iaşi, Romania
Venue:
Journal in Computer Virology
Year:
2012

Abstract

Since 2009, the number of advanced persistent threat attacks has increased. In all of the cases studied, this kind of attack uses a zero-day exploit, usually found in a frequently used application. Most of the time, the user has to visit a malicious page or open an infected document sent via e-mail. Even though the attack vector can take many forms, this paper addresses the case in which the attack relies on PDF files to deliver the payload. We chose the PDF format both because of the high number of attacks in which it has been used and because of the key advantages it offers to the attacker. From an attacker's perspective, the advantage is clear: PDF files can be opened by an application on the user's computer or in a browser, as most browsers support plug-ins that can render PDF files. The use of JavaScript inside PDF files offers two further advantages. The first is that code can be executed on the victim's computer while the attack evades various protection methods. The second is that the JavaScript code can be polymorphic, in that two files with the same functionality may look very different. This paper presents a clustering method based on tokenization of the JavaScript code inside PDF files that is resistant to most of the obfuscation techniques used in script-based malware. Our clustering method relies on the fact that most infected PDF files (over 93%) use JavaScript code. By tokenizing the JavaScript code, describing it in an abstract manner and eliminating the different operators used in polymorphism, we obtain classes of syntactically very similar files that can be easily clustered using different methods. Because virus analysts can then analyse classes of files rather than isolated files, their workload is significantly reduced.
The method of abstraction can be taken one step further and used as a detection mechanism, a technique to evaluate prevalent data, or a way to obtain a subset from a large set without losing data variability.
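
The abstraction step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the token classes, which operators are eliminated, or the clustering algorithm, so everything here is an assumption made for illustration. The names `JS_KEYWORDS`, `abstract_tokens`, `cluster_by_signature` and `ngram_jaccard` are hypothetical, the keyword list is deliberately tiny, and the regular expression ignores comments and regex literals. The sketch maps identifiers to `ID`, collapses runs of literals into a single `LIT`, drops operators, and then groups scripts whose abstract token sequences are identical.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately small keyword list (assumption, not from the paper).
JS_KEYWORDS = {"var", "function", "return", "if", "else", "for", "while", "new", "this"}

# Simplified lexer: does not handle comments, regex literals or template strings.
TOKEN_RE = re.compile(r"""
    (?P<STR>"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*')   # string literals
  | (?P<NUM>0[xX][0-9a-fA-F]+|\d+(?:\.\d+)?)       # numeric literals
  | (?P<ID>[A-Za-z_$][A-Za-z0-9_$]*)               # identifiers and keywords
  | (?P<PUNCT>[{}()\[\];,.=])                      # structural punctuation (kept)
  | (?P<OP>[-+*/%<>!&|^~?:])                       # operators (dropped)
""", re.VERBOSE)


def abstract_tokens(js_source):
    """Return an abstract token tuple that ignores names, literal values and operators."""
    tokens = []
    for match in TOKEN_RE.finditer(js_source):
        kind, text = match.lastgroup, match.group()
        if kind == "ID":
            # Keep keywords, replace every other identifier with a placeholder.
            tokens.append(text if text in JS_KEYWORDS else "ID")
        elif kind in ("STR", "NUM"):
            # Collapse adjacent literals so "ev" + "al" and "eval" look the same.
            if not tokens or tokens[-1] != "LIT":
                tokens.append("LIT")
        elif kind == "PUNCT":
            tokens.append(text)
        # OP tokens are skipped entirely (assumed stand-in for the eliminated operators).
    return tuple(tokens)


def cluster_by_signature(samples):
    """Group sample names whose abstract token sequences are identical."""
    clusters = defaultdict(list)
    for name, source in samples.items():
        clusters[abstract_tokens(source)].append(name)
    return list(clusters.values())


def ngram_jaccard(sig_a, sig_b, n=3):
    """Jaccard similarity of token n-gram sets, for near-duplicate comparison."""
    grams_a = {sig_a[i:i + n] for i in range(len(sig_a) - n + 1)}
    grams_b = {sig_b[i:i + n] for i in range(len(sig_b) - n + 1)}
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)
```

With this sketch, `var x = unescape("%u9090" + "%u9090"); eval(x);` and `var qq1 = unescape("%u9090%u9090"); eval(qq1);` yield the same signature and fall into one cluster, while an unrelated function does not. Exact-signature grouping is the simplest choice; `ngram_jaccard` hints at how a similarity threshold could merge near-identical signatures, which is one possible reading of "clustered using different methods".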