Comparing files using structural entropy

Authors:
Ivan Sorokin
Affiliations:
Doctor Web's Virus Lab, Ltd., Saint-Petersburg, Russia 197101
Venue:
Journal in Computer Virology
Year:
2011

Abstract

One of the main trends in the modern anti-virus industry is the development of algorithms that estimate the similarity of files. Because malware writers protect their code with increasingly complex techniques such as obfuscation and polymorphism, anti-virus software vendors face three problems: file scanning is becoming more difficult, anti-virus databases are growing considerably, and file storage is becoming overgrown. Static analysis of files appears to be of interest for solving these problems, because it helps determine the file characteristics needed for comparison without executing malware samples in a protected environment. The solution presented in this article is based on the assumption that different samples of the same malicious program have a similar order of code and data areas. Each such area can be characterized not only by its length but also by its homogeneity; in other words, a file can be characterized by the complexity of its data order. Our approach uses wavelet analysis to split files into segments of different entropy levels and then uses the edit distance between the resulting sequences of segments to determine the similarity of the files. The proposed solution has several advantages that help detect malicious programs efficiently on personal computers. First, the comparison does not take into account the functionality of the analysed files and is based solely on the similarity of code and data area positions, which makes the algorithm effective against many ways of protecting executable code. On the other hand, such a comparison may produce false alarms, so our solution is useful as a preliminary test that triggers additional checks. Second, the method is relatively easy to implement and requires neither code disassembly nor emulation. Third, the method keeps the record for each malicious file compact, which is significant when compiling anti-virus databases.
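
The pipeline described in the abstract (segment a file by entropy level, then compare the segment sequences by edit distance) can be illustrated with a minimal Python sketch. It is not the author's implementation. In particular, the wavelet-based segmentation is replaced here by a much simpler stand-in: fixed-size windows whose entropy is quantized with fixed thresholds. The function names, the window size of 256 bytes, the thresholds, and the edit costs are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def window_entropies(data: bytes, window: int = 256) -> list:
    """Shannon entropy (bits per byte, 0..8) of each non-overlapping window."""
    result = []
    for start in range(0, len(data), window):
        chunk = data[start:start + window]
        total = len(chunk)
        counts = Counter(chunk)
        result.append(sum(-(c / total) * math.log2(c / total)
                          for c in counts.values()))
    return result


def segment(entropies, thresholds=(3.0, 6.5)):
    """Merge consecutive windows of the same coarse entropy level.

    Assumption: fixed thresholds stand in for the wavelet analysis
    mentioned in the abstract. Returns a list of (level, length) pairs,
    where length is measured in windows.
    """
    segments = []
    for h in entropies:
        level = sum(h >= t for t in thresholds)  # 0 = low, 1 = medium, 2 = high
        if segments and segments[-1][0] == level:
            segments[-1] = (level, segments[-1][1] + 1)
        else:
            segments.append((level, 1))
    return segments


def edit_distance(a, b):
    """Weighted edit distance between two segment sequences.

    Assumed costs: inserting or deleting a segment costs its length;
    replacing a segment costs the length difference when the entropy
    levels match, and the sum of both lengths otherwise.
    """
    prev = [0]
    for _, len_b in b:
        prev.append(prev[-1] + len_b)
    for level_a, len_a in a:
        cur = [prev[0] + len_a]
        for j, (level_b, len_b) in enumerate(b, start=1):
            sub = abs(len_a - len_b) if level_a == level_b else len_a + len_b
            cur.append(min(prev[j] + len_a,       # delete a segment of a
                           cur[j - 1] + len_b,    # insert a segment of b
                           prev[j - 1] + sub))    # match or replace
        prev = cur
    return prev[-1]


def similarity(data_a: bytes, data_b: bytes, window: int = 256) -> float:
    """Similarity in [0, 1]; 1.0 means identical segment sequences."""
    seg_a = segment(window_entropies(data_a, window))
    seg_b = segment(window_entropies(data_b, window))
    total = sum(n for _, n in seg_a) + sum(n for _, n in seg_b)
    if total == 0:
        return 1.0
    return 1.0 - edit_distance(seg_a, seg_b) / total
```

Only the short list of `(level, length)` pairs has to be stored for each file, which reflects the compact-record advantage the abstract mentions. Because the comparison ignores what the code does, a high `similarity` value is best treated as a trigger for further checks rather than as a verdict.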